de.tudarmstadt.ukp.jwktl.parser.ru.wikokit.base.wikt.multi.en
Class WPOSEn

java.lang.Object
  extended by de.tudarmstadt.ukp.jwktl.parser.ru.wikokit.base.wikt.multi.en.WPOSEn

public class WPOSEn
extends Object

Splits text into fragments, one per part of speech (POS). In English Wiktionary a POS is a level 3 header, or a level 4 header when the entry has multiple etymologies:

 1)
 ==English==
 ===Etymology===
 ===Noun===
 ===Verb===

 ==Finnish==
 ===Etymology===
 ===Noun===       (level 3 in English Wiktionary: ===Noun===)

 2)
 In the case of multiple etymologies, all subordinate headers need to have
 their levels increased by 1:
 ===Etymology 1===
 ====Pronunciation====
 ====Noun====             POS=noun
 ===Etymology 2===
 ====Pronunciation====
 ====Noun====             POS=noun
 ====Verb====             POS=verb
 (level 4 in English Wiktionary: ====Verb====)
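
The header levels described above can be detected by counting the '=' signs that wrap a header line. The following is a minimal sketch, not the code used by WPOSEn; the class HeaderLevelSketch and its methods headerLevel and headerTitle are hypothetical names introduced only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderLevelSketch {
    // Matches a wiki header line such as "===Noun===".
    // The closing run of '=' must be identical to the opening run (back-reference \1).
    private static final Pattern HEADER =
            Pattern.compile("^(={2,6})\\s*(.+?)\\s*\\1\\s*$");

    /** Returns the header level (number of '=' signs), or 0 if the line is not a header. */
    public static int headerLevel(String line) {
        Matcher m = HEADER.matcher(line);
        return m.matches() ? m.group(1).length() : 0;
    }

    /** Returns the header title without the '=' signs, or null if the line is not a header. */
    public static String headerTitle(String line) {
        Matcher m = HEADER.matcher(line);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(headerLevel("===Noun==="));        // prints 3
        System.out.println(headerLevel("====Verb===="));      // prints 4
        System.out.println(headerTitle("===Etymology 1===")); // prints Etymology 1
    }
}
```

With such a helper, a POS header under a single etymology would report level 3, and a POS header under "Etymology 1" or "Etymology 2" would report level 4.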

See Also:
http://en.wiktionary.org/wiki/Wiktionary:Entry_layout_explained, http://en.wiktionary.org/wiki/Wiktionary:Entry_layout_explained/POS_headers

Constructor Summary
WPOSEn()
           
 
Method Summary
static boolean isSecondLevelHeaderWordNotPOS(String str)
          Returns true if str is a known header that is not a part of speech name.
static POSText[] splitToPOSSections(String page_title, LangText[] etymology_sections)
          Splits each etymology section into POS sections.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WPOSEn

public WPOSEn()
Method Detail

isSecondLevelHeaderWordNotPOS

public static boolean isSecondLevelHeaderWordNotPOS(String str)
Returns true if str is a known header (e.g. "References") that is not a part of speech name (e.g. "Verb").
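
A check of this kind can be pictured as a lookup in a fixed set of header names. The sketch below is an assumption about the general approach, not the actual implementation: the class NonPosHeaderSketch, the method isKnownNonPosHeader, and the contents of the set (other than "References", which the description mentions) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NonPosHeaderSketch {
    // Hypothetical list of headers that are known but do not name a part of speech.
    // Only "References" comes from the description above; the rest are assumed.
    private static final Set<String> NON_POS_HEADERS = new HashSet<>(Arrays.asList(
            "References", "Etymology", "Pronunciation", "See also"));

    /** Returns true if str is in the known non-POS header set. */
    public static boolean isKnownNonPosHeader(String str) {
        return str != null && NON_POS_HEADERS.contains(str.trim());
    }

    public static void main(String[] args) {
        System.out.println(isKnownNonPosHeader("References")); // prints true
        System.out.println(isKnownNonPosHeader("Verb"));       // prints false
    }
}
```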


splitToPOSSections

public static POSText[] splitToPOSSections(String page_title,
                                           LangText[] etymology_sections)
Splits each etymology section into POS sections, then merges all POS sections into one array. page_title - the word described by this article's text.

Parameters:
lt - the .text field will be parsed and split; the .lang field is not used now, but may be in the future.

1) Splits the following text into "Noun" and "Verb" sections.
2) Extracts the parts of speech "noun" and "verb".
 ===Noun===
 {{en-noun}}
 ===Verb===
 
Todo: save info about the link Etymology <-> POS.
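
The two steps in the example above (splitting on POS headers, then deriving the POS name) can be sketched as follows. This is an illustration under stated assumptions, not the behavior of splitToPOSSections: the class PosSplitSketch, its nested Section type, and the split method are hypothetical, and the sketch treats every header of the given level as a POS header without filtering out non-POS headers such as "References".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PosSplitSketch {
    /** Hypothetical holder for one POS section: lower-cased POS name plus section body. */
    public static final class Section {
        public final String pos;
        public final String text;
        Section(String pos, String text) { this.pos = pos; this.text = text; }
    }

    /** Splits text at headers of exactly the given level, e.g. level 3 for ===Noun===. */
    public static List<Section> split(String text, int level) {
        char[] eq = new char[level];
        Arrays.fill(eq, '=');
        String marker = new String(eq);

        List<Section> result = new ArrayList<>();
        String pos = null;                  // POS of the section being collected
        StringBuilder body = new StringBuilder();
        for (String line : text.split("\n")) {
            String t = line.trim();
            // A header of exactly this level: wrapped in 'level' equals signs, no more.
            boolean isHeader = t.length() > 2 * level
                    && t.startsWith(marker) && t.endsWith(marker)
                    && t.charAt(level) != '='
                    && t.charAt(t.length() - level - 1) != '=';
            if (isHeader) {
                if (pos != null) {
                    result.add(new Section(pos, body.toString().trim()));
                }
                pos = t.substring(level, t.length() - level).trim().toLowerCase(Locale.ROOT);
                body.setLength(0);
            } else if (pos != null) {
                body.append(line).append('\n'); // lines before the first header are skipped
            }
        }
        if (pos != null) {
            result.add(new Section(pos, body.toString().trim()));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Section s : split("===Noun===\n{{en-noun}}\n===Verb===\n", 3)) {
            System.out.println(s.pos + " -> " + s.text);
        }
    }
}
```

For an entry with multiple etymologies, the same sketch would be called with level 4 on each etymology section, and the resulting lists concatenated.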


Copyright © 2011-2013 Ubiquitous Knowledge Processing (UKP) Lab. All Rights Reserved.