de.tudarmstadt.ukp.jwktl.parser.ru.wikokit.base.wikt.multi.en
Class WPOSEn
java.lang.Object
de.tudarmstadt.ukp.jwktl.parser.ru.wikokit.base.wikt.multi.en.WPOSEn
public class WPOSEn
- extends Object
Splits text into fragments, one for each part of speech (POS).
In English Wiktionary, a POS is a level 3 or level 4 header:
1)
==English==
===Etymology===
===Noun===
===Verb===
==Finnish==
===Etymology===
===Noun=== (level 3 in English Wiktionary: ===Noun===)
2)
In the case of multiple etymologies, all subordinate headers need to have
their levels increased by 1:
===Etymology 1===
====Pronunciation====
====Noun==== POS=noun
===Etymology 2===
====Pronunciation====
====Noun==== POS=noun
====Verb==== POS=verb
(level 4 in English Wiktionary: ====Verb====)
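The header levels described above can be told apart by counting the '=' signs on each side of the title. The following is a minimal sketch of that idea, not the JWKTL implementation; the class HeaderLevelSketch and its methods headerLevel and headerTitle are hypothetical names introduced only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of JWKTL): inspects a single wiki header line. */
final class HeaderLevelSketch {

    // A header line is N '=' signs, a title, and the same N '=' signs again.
    private static final Pattern HEADER =
            Pattern.compile("(={2,6})\\s*(.+?)\\s*\\1\\s*");

    /** Returns the header level (number of '=' signs), or 0 if the line is not a header. */
    static int headerLevel(String line) {
        Matcher m = HEADER.matcher(line);
        return m.matches() ? m.group(1).length() : 0;
    }

    /** Returns the header title, or null if the line is not a header. */
    static String headerTitle(String line) {
        Matcher m = HEADER.matcher(line);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(headerLevel("===Noun==="));        // prints 3
        System.out.println(headerLevel("====Verb===="));      // prints 4
        System.out.println(headerTitle("===Etymology 2===")); // prints Etymology 2
    }
}
```

With a single etymology the POS headers report level 3; under numbered etymologies they report level 4.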
- See Also:
http://en.wiktionary.org/wiki/Wiktionary:Entry_layout_explained,
http://en.wiktionary.org/wiki/Wiktionary:Entry_layout_explained/POS_headers
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
WPOSEn
public WPOSEn()
isSecondLevelHeaderWordNotPOS
public static boolean isSecondLevelHeaderWordNotPOS(String str)
- Returns true if str is a known header (e.g. "References")
that is not a part-of-speech name (e.g. "Verb").
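One way such a check could work is a lookup in a fixed set of header names. The sketch below is an assumption about the general approach, not the JWKTL code: the class NonPosHeaderSketch, the method isKnownNonPosHeader, and the particular list of headers are all illustrative, and the set the library actually uses may differ.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of JWKTL): recognizes headers that are not POS names. */
final class NonPosHeaderSketch {

    // Illustrative list only; the headers known to the real parser are an assumption here.
    private static final Set<String> NON_POS_HEADERS =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
                    "Etymology", "Pronunciation", "References", "See also",
                    "Synonyms", "Antonyms", "Translations", "Anagrams",
                    "Derived terms", "Related terms", "Usage notes")));

    /** Returns true if the header is in the known non-POS list; false for POS names or null. */
    static boolean isKnownNonPosHeader(String header) {
        return header != null && NON_POS_HEADERS.contains(header.trim());
    }

    public static void main(String[] args) {
        System.out.println(isKnownNonPosHeader("References")); // prints true
        System.out.println(isKnownNonPosHeader("Verb"));       // prints false
    }
}
```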
splitToPOSSections
public static POSText[] splitToPOSSections(String page_title,
LangText[] etymology_sections)
- Splits each etymology section into POS sections,
then merges all POS sections into one array.
- Parameters:
page_title - the word described by this article
etymology_sections - for each element, .text is parsed and split;
.lang is not used at present, but may be in the future
For example, given the text below, the method
1) splits it into the "Noun" and "Verb" sections, and
2) extracts the parts of speech "noun" and "verb":
===Noun===
{{en-noun}}
===Verb===
Todo: save info about the link Etymology <-> POS.
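The two steps in the example above (split at the headers, then extract the lowercased POS name) can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's method: it works on a plain String instead of LangText[], it returns a hypothetical Section type instead of POSText[], and it treats every level 3 or level 4 header as a POS header without filtering out headers such as "References".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch (not part of JWKTL): splits wiki text at level 3 and 4 headers. */
final class PosSplitSketch {

    /** Stand-in for a POS section: the lowercased header name plus the text below it. */
    static final class Section {
        final String pos;
        final StringBuilder body = new StringBuilder();

        Section(String pos) {
            this.pos = pos;
        }
    }

    // Exactly three or four '=' signs on both sides of a title that contains no '='.
    private static final Pattern HEADER =
            Pattern.compile("(={3,4})\\s*([^=]+?)\\s*\\1\\s*");

    /** Starts a new section at each header line; other lines go to the current section. */
    static List<Section> split(String text) {
        List<Section> sections = new ArrayList<>();
        Section current = null;
        for (String line : text.split("\n")) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                current = new Section(m.group(2).toLowerCase(Locale.ROOT));
                sections.add(current);
            } else if (current != null) {
                // Text before the first header is ignored in this sketch.
                current.body.append(line).append('\n');
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        for (Section s : split("===Noun===\n{{en-noun}}\n===Verb===\n")) {
            System.out.println(s.pos + " -> " + s.body.toString().trim());
        }
    }
}
```

A list is used rather than a map because the same POS name can occur more than once on a page, for example a "Noun" section under each of several etymologies.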
Copyright © 2011-2013 Ubiquitous Knowledge Processing (UKP) Lab. All Rights Reserved.