de.tudarmstadt.ukp.jwktl.parser.ru.wikokit.base.wikipedia.text
Class WikiParser

java.lang.Object
  extended by de.tudarmstadt.ukp.jwktl.parser.ru.wikokit.base.wikipedia.text.WikiParser

public class WikiParser
extends Object

Converts wiki text to plain text by removing or expanding wiki markup ([[...]] links, interwiki, categories, templates, etc.). Definitions: [[...]] - wikilink, [http:// site name] - hyperlink.


Constructor Summary
WikiParser()
           
 
Method Summary
static StringBuffer convertWikiToText(StringBuffer wiki_text, LanguageType lang, boolean b_remove_not_expand_iwiki)
          Removes / expands interwiki, removes categories, expands wiki links.
static StringBuffer parseCurlyBrackets(StringBuffer text)
          Removes text within curly brackets, e.g. {{templates}}.
static StringBuffer parseDoubleApostrophe(StringBuffer text)
          Removes double apostrophes used in pairs, e.g. ''italics'' -> italics.
static StringBuffer parseDoubleBrackets(StringBuffer text, LanguageType lang, boolean b_remove_not_expand_iwiki)
          Removes and expands interwiki, categories, and wiki links in wiki texts.
static StringBuffer parseSingleBrackets(StringBuffer text)
          Expands / removes hyperlinks.
static StringBuffer parseTripleApostrophe(StringBuffer text)
          Removes triple apostrophes used in pairs, e.g. '''bold''' -> bold.
static StringBuffer removeAcuteAccent(StringBuffer text, LanguageType wiki_lang)
          Removes the acute accent sign from Russian wiki texts, e.g. '''itálics''' -> '''italics'''.
static StringBuffer removeBracketsInInterwiki(StringBuffer text)
          Expands interwiki by removing interwiki brackets and language code, e.g. "[[et:Talvepalee]] text" -> "Talvepalee text".
static StringBuffer removeBracketsInWikiLink(StringBuffer text)
          Deprecated. Use parseDoubleBrackets()
static StringBuffer removeCategory(StringBuffer text, LanguageType lang)
          Removes categories for the selected language, e.g. "[[Category:Russia]] text" -> " text".
static StringBuffer removeHTMLComments(StringBuffer text)
          Removes all comments: <!-- ... -->.
static StringBuffer removeInterwiki(StringBuffer text)
          Removes interwiki, e.g. "[[et:Talvepalee]] text" -> " text".
static StringBuffer removePreCode(StringBuffer text)
          Removes preformatted code (e.g. xml): <pre> ... </pre>.
static StringBuffer removeSourceCode(StringBuffer text)
          Removes all source code: <source ... </source>.
static StringBuffer removeXMLTag(StringBuffer text, String tag)
          Removes the given XML tag together with its enclosed text, up to the closing tag.
static StringBuffer removeXMLTagCode(StringBuffer text)
          Removes the XML tag <code> together with its enclosed text, up to the next </code>.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WikiParser

public WikiParser()
Method Detail

removeInterwiki

public static StringBuffer removeInterwiki(StringBuffer text)
Removes interwiki, e.g. "[[et:Talvepalee]] text" -> " text", where language code (e.g. 'et') can have two or three letters.


removeBracketsInInterwiki

public static StringBuffer removeBracketsInInterwiki(StringBuffer text)
Expands interwiki by removing interwiki brackets and language code, e.g. "[[et:Talvepalee]] text" -> "Talvepalee text".
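
The two interwiki operations described above can be illustrated with a small regular-expression sketch. This is not the WikiParser implementation: the class name InterwikiSketch, the pattern, and the use of String instead of StringBuffer are assumptions made for illustration. The pattern assumes a lowercase language code of two or three letters, as stated in the removeInterwiki description.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's code.
final class InterwikiSketch {
    // "[[", a 2-3 letter lowercase language code, ":", a title without brackets, "]]".
    // Group 1 captures the title so that it can be kept when expanding.
    private static final Pattern INTERWIKI =
            Pattern.compile("\\[\\[[a-z]{2,3}:([^\\]\\[]*)\\]\\]");

    // Drops the whole interwiki link.
    static String removeInterwiki(String text) {
        return INTERWIKI.matcher(text).replaceAll("");
    }

    // Keeps only the title, dropping the brackets and the language code.
    static String expandInterwiki(String text) {
        return INTERWIKI.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(removeInterwiki("[[et:Talvepalee]] text"));
        System.out.println(expandInterwiki("[[et:Talvepalee]] text"));
    }
}
```

Because the language code is limited to two or three letters, a link such as [[Category:Russia]] is left untouched by this sketch.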


removeCategory

public static StringBuffer removeCategory(StringBuffer text,
                                          LanguageType lang)
Removes categories for the selected language, e.g. English: "[[Category:Russia]] text" -> " text", or Esperanto: "[[Kategorio:Galaksioj]] text" -> " text".
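
A minimal sketch of language-dependent category removal follows. It assumes that the localized namespace name (e.g. "Category" or "Kategorio") is passed directly as a String; in the real API this information comes from the LanguageType argument, whose lookup is not shown here. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's code.
final class CategoryRemovalSketch {
    // namespaceName is the localized category prefix, e.g. "Category" or "Kategorio".
    static String removeCategory(String text, String namespaceName) {
        // Pattern.quote keeps any regex metacharacters in the name literal.
        Pattern category = Pattern.compile(
                "\\[\\[" + Pattern.quote(namespaceName) + ":[^\\]\\[]*\\]\\]");
        return category.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeCategory("[[Category:Russia]] text", "Category"));
        System.out.println(removeCategory("[[Kategorio:Galaksioj]] text", "Kategorio"));
    }
}
```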


removeXMLTag

public static StringBuffer removeXMLTag(StringBuffer text,
                                        String tag)
Removes the given XML tag together with its enclosed text, up to the closing tag.


removeXMLTagCode

public static StringBuffer removeXMLTagCode(StringBuffer text)
Removes the XML tag <code> together with its enclosed text, up to the next </code>, e.g. "a <code>x+y</code> b" -> "a b".
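
Removing a tag together with its content can be sketched as follows. This is an illustration under assumptions, not the library's code: the class XmlTagRemovalSketch and its String-based methods are hypothetical, the opening tag is allowed to carry attributes, and text without a closing tag is left unchanged. Whitespace around the removed span is not collapsed.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's code.
final class XmlTagRemovalSketch {
    // Removes <tag ...> ... </tag> including everything in between.
    static String removeXmlTag(String text, String tag) {
        String name = Pattern.quote(tag);
        // \b prevents "pre" from matching "<prefix>"; DOTALL lets ".*?" span line breaks;
        // the reluctant quantifier stops at the first closing tag.
        Pattern p = Pattern.compile(
                "<" + name + "\\b[^>]*>.*?</" + name + "\\s*>", Pattern.DOTALL);
        return p.matcher(text).replaceAll("");
    }

    // Convenience variant for the "code" tag.
    static String removeXmlTagCode(String text) {
        return removeXmlTag(text, "code");
    }

    public static void main(String[] args) {
        System.out.println(removeXmlTagCode("a <code>x+y</code> b"));
        System.out.println(removeXmlTag("<source lang=\"java\">int x;\n</source>!", "source"));
    }
}
```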


removeHTMLComments

public static StringBuffer removeHTMLComments(StringBuffer text)
Removes all comments: <!-- ... -->.
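
Comment removal can be sketched with a single reluctant pattern. The class name is hypothetical, and the behavior for unterminated comments (left unchanged here) is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's code.
final class HtmlCommentRemovalSketch {
    // DOTALL lets a comment span several lines; ".*?" stops at the first "-->".
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    static String removeHtmlComments(String text) {
        return COMMENT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeHtmlComments("a<!-- x\ny -->b<!--z-->c"));
    }
}
```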


removePreCode

public static StringBuffer removePreCode(StringBuffer text)
Removes preformatted code (e.g. xml): <pre> ... </pre>.


removeSourceCode

public static StringBuffer removeSourceCode(StringBuffer text)
Removes all source code: <source ... </source>.


removeBracketsInWikiLink

public static StringBuffer removeBracketsInWikiLink(StringBuffer text)
Deprecated. Use parseDoubleBrackets()

Expands wiki links by removing brackets. There are two cases: (1) remove brackets, e.g. [[run]] -> run, and (2) (todo) [[run|running]] -> running, or [[Russian language|Russian]] -> Russian, i.e. the words visible to the reader remain.


parseSingleBrackets

public static StringBuffer parseSingleBrackets(StringBuffer text)
Expands / removes hyperlinks. Expands hyperlinks with text, e.g. "[http:site name of site]" -> "name of site". Removes links without text, e.g. [www.site].
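
The hyperlink handling above can be sketched as follows. The rule used to recognize a hyperlink (content starting with "http" or "www.", with the URL ending at the first whitespace) is an assumption for this illustration, as are the class and method names; the actual recognition rules of WikiParser may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's code.
final class SingleBracketSketch {
    // "[", a URL-like token, optionally whitespace plus a label (group 1), "]".
    private static final Pattern HYPERLINK =
            Pattern.compile("\\[(?:http|www\\.)[^\\s\\]]*(?:\\s+([^\\]]*))?\\]");

    static String parseSingleBrackets(String text) {
        Matcher m = HYPERLINK.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String label = m.group(1); // null when the link has no label
            // quoteReplacement keeps '$' and '\' in the label literal.
            m.appendReplacement(out, label == null ? "" : Matcher.quoteReplacement(label));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(parseSingleBrackets("[http://site name of site]"));
        System.out.println(parseSingleBrackets("see [www.site] here"));
    }
}
```

Bracketed text that does not look like a link, such as a footnote marker [1], is left as it is by this sketch.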


parseDoubleBrackets

public static StringBuffer parseDoubleBrackets(StringBuffer text,
                                               LanguageType lang,
                                               boolean b_remove_not_expand_iwiki)
Removes and expands interwiki, categories, and wiki links in wiki texts.

1. Expands links to Wikimedia sister projects, e.g. [[w:Wikipedia:Interwikimedia_links|text to expand]] -> "text to expand".

2. Removes or expands interwiki, depending on b_remove_not_expand_iwiki.

3. Removes categories for the selected language, e.g. English: "[[Category:Russia]] text" -> " text".

4. Expands wiki links by removing brackets. There are two cases: (1) remove brackets, e.g. [[run]] -> run, and (2) [[run|running]] -> running, or [[Russian language|Russian]] -> Russian, i.e. the words visible to the reader remain.

It is recommended to call StringUtil.escapeCharDollarAndBackslash(text) before this function. See also WikiWord.parseDoubleBrackets

Parameters:
b_remove_not_expand_iwiki - if true, removes interwiki, e.g. "[[et:Talvepalee]] text" -> " text"; if false, expands interwiki by removing interwiki brackets and language code, e.g. "[[et:Talvepalee]] text" -> "Talvepalee text".
lang - the language of the parsed wiki; it is needed to remove categories for the selected language, e.g. English (Category) or Esperanto (Kategorio).
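
The wiki-link expansion (step 4) can be sketched as shown below. The class WikiLinkSketch is hypothetical and covers only that step; interwiki and categories would have to be handled beforehand, otherwise this sketch would expand them like ordinary links. The sketch also shows one plausible reason for the recommendation to escape dollar signs and backslashes: in java.util.regex replacement strings, '$' and '\' have a special meaning, so visible text such as "cost $5" must be quoted before it is used as a replacement. Whether WikiParser relies on this mechanism internally is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's code.
final class WikiLinkSketch {
    // "[[", an optional "target|" part, the visible text (group 1), "]]".
    private static final Pattern WIKI_LINK =
            Pattern.compile("\\[\\[(?:[^\\]\\[|]*\\|)?([^\\]\\[|]*)\\]\\]");

    static String expandWikiLinks(String text) {
        Matcher m = WIKI_LINK.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Without quoteReplacement, "$5" would be read as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandWikiLinks("[[run]] and [[run|running]]"));
        System.out.println(expandWikiLinks("[[Russian language|Russian]]"));
        System.out.println(expandWikiLinks("[[price|cost $5]]"));
    }
}
```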

parseCurlyBrackets

public static StringBuffer parseCurlyBrackets(StringBuffer text)
Removes text within curly brackets, e.g. {{templates}}.

Todo: expand templates (optionally).


parseDoubleApostrophe

public static StringBuffer parseDoubleApostrophe(StringBuffer text)
Removes double apostrophes used in pairs, e.g. ''italics'' -> italics. It is recommended to call StringUtil.escapeCharDollarAndBackslash(text) before this function.


parseTripleApostrophe

public static StringBuffer parseTripleApostrophe(StringBuffer text)
Removes triple apostrophes used in pairs, e.g. '''bold''' -> bold. It is recommended to call StringUtil.escapeCharDollarAndBackslash(text) before this function.
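
A minimal sketch of both apostrophe operations is given below, assuming that paired markers do not span line breaks. The class ApostropheSketch and its String-based methods are hypothetical. The sketch processes triple apostrophes before double ones, since a double-apostrophe pattern applied first would consume part of a bold marker.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's code.
final class ApostropheSketch {
    private static final Pattern TRIPLE = Pattern.compile("'''(.+?)'''");
    private static final Pattern DOUBLE = Pattern.compile("''(.+?)''");

    // '''bold''' -> bold
    static String parseTripleApostrophe(String text) {
        return TRIPLE.matcher(text).replaceAll("$1");
    }

    // ''italics'' -> italics
    static String parseDoubleApostrophe(String text) {
        return DOUBLE.matcher(text).replaceAll("$1");
    }

    // Bold markers first, then italics.
    static String stripEmphasis(String text) {
        return parseDoubleApostrophe(parseTripleApostrophe(text));
    }

    public static void main(String[] args) {
        System.out.println(stripEmphasis("'''bold''' and ''italics''"));
    }
}
```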


removeAcuteAccent

public static StringBuffer removeAcuteAccent(StringBuffer text,
                                             LanguageType wiki_lang)
Removes the acute accent sign from Russian wiki texts; it often appears at the beginning of an article, e.g. '''itálics''' -> '''italics'''.
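
One possible way to strip the acute accent is sketched below with java.text.Normalizer. This is an assumption about a workable technique, not the method WikiParser uses: the text is decomposed (NFD), the combining acute accent U+0301 is deleted, and the result is recomposed (NFC) so that letters such as the Cyrillic short i keep their own diacritics. The sketch ignores the wiki_lang parameter, and normalization may also change other characters that were not in NFC form.

```java
import java.text.Normalizer;

// Hypothetical sketch, not the library's code.
final class AcuteAccentSketch {
    private static final String COMBINING_ACUTE = "\u0301";

    static String removeAcuteAccent(String text) {
        // Split precomposed letters into base letter + combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Delete only the combining acute accent.
        String stripped = decomposed.replace(COMBINING_ACUTE, "");
        // Recompose the remaining marks (e.g. the breve of the Cyrillic short i).
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(removeAcuteAccent("'''it\u00e1lics'''"));
    }
}
```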


convertWikiToText

public static StringBuffer convertWikiToText(StringBuffer wiki_text,
                                             LanguageType lang,
                                             boolean b_remove_not_expand_iwiki)
Removes / expands interwiki, removes categories, expands wiki links.

1. Removes or expands interwiki, depending on b_remove_not_expand_iwiki.

2. Removes categories for the selected language, e.g. English: "[[Category:Russia]] text" -> " text".

3. Expands wiki links by removing brackets. There are two cases: (1) remove brackets, e.g. [[run]] -> run, and (2) [[run|running]] -> running, or [[Russian language|Russian]] -> Russian, i.e. the words visible to the reader remain.

Parameters:
b_remove_not_expand_iwiki - if true, removes interwiki, e.g. "[[et:Talvepalee]] text" -> " text"; if false, expands interwiki by removing interwiki brackets and language code, e.g. "[[et:Talvepalee]] text" -> "Talvepalee text".
lang - the language of the parsed wiki; it is needed to remove categories for the selected language, e.g. English (Category) or Esperanto (Kategorio).


Copyright © 2011-2013 Ubiquitous Knowledge Processing (UKP) Lab. All Rights Reserved.