de.tudarmstadt.ukp.jwktl.parser.util
Class WordListProcessor

java.lang.Object
  extended by de.tudarmstadt.ukp.jwktl.parser.util.WordListProcessor

public class WordListProcessor
extends Object

Helper class for segmenting word lists whose items are separated by commas, semicolons, line breaks, and similar delimiters. Such lists occur, for example, in semantic relations, which are often encoded as comma-separated lists.

Author:
Christof Müller, Lizhen Qu

Field Summary
protected static Pattern ESCAPE_DELIMITER1
           
protected static Pattern ESCAPE_DELIMITER2
           
protected static Pattern ESCAPE_DELIMITER3
           
protected static Pattern HTML_REMOVER
           
protected static Pattern REFERENCE_PATTERN
           
protected static Pattern SUPERSCRIPT_PATTERN
           
 
Constructor Summary
WordListProcessor()
           
 
Method Summary
protected  String deWikify(String word)
           
protected  String escapeDelimiters(String text)
           
protected  String removeBrackets(String word)
           
protected  String removeComments(String word)
           
protected  String removeTemplates(String word)
           
 List<String> splitWordList(String text)
          Splits the given text at commas, semicolons, line breaks, and similar delimiters, and removes several types of special characters and affixes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

HTML_REMOVER

protected static final Pattern HTML_REMOVER

ESCAPE_DELIMITER1

protected static final Pattern ESCAPE_DELIMITER1

ESCAPE_DELIMITER2

protected static final Pattern ESCAPE_DELIMITER2

ESCAPE_DELIMITER3

protected static final Pattern ESCAPE_DELIMITER3

REFERENCE_PATTERN

protected static final Pattern REFERENCE_PATTERN

SUPERSCRIPT_PATTERN

protected static final Pattern SUPERSCRIPT_PATTERN

Constructor Detail

WordListProcessor

public WordListProcessor()

Method Detail

escapeDelimiters

protected String escapeDelimiters(String text)

splitWordList

public List<String> splitWordList(String text)
Splits the given text at commas, semicolons, line breaks, and similar delimiters, and removes several types of special characters and affixes. The resulting segments are returned as a list of strings.
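
The splitting step described above can be illustrated with a minimal, self-contained sketch. It is not the JWKTL implementation: the class name WordListSketch, the method split, and the delimiter set (comma, semicolon, line break) are assumptions made for illustration, and the sketch deliberately leaves out the removal of special characters and affixes that the real method performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the splitting behavior; not part of JWKTL.
final class WordListSketch {

    // Assumed delimiter set: one or more commas, semicolons, or line breaks.
    private static final Pattern DELIMITER = Pattern.compile("[,;\\r\\n]+");

    // Splits the text at the assumed delimiters, trims each segment,
    // and drops segments that are empty after trimming.
    static List<String> split(String text) {
        List<String> result = new ArrayList<>();
        if (text == null) {
            return result;
        }
        for (String segment : DELIMITER.split(text)) {
            String word = segment.trim();
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [dog, cat, bird, fish]
        System.out.println(split("dog, cat; bird\nfish"));
    }
}
```

With the library on the classpath, the equivalent call would be new WordListProcessor().splitWordList(text); its exact output may differ from this sketch because of the additional cleanup it applies.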


deWikify

protected String deWikify(String word)

removeBrackets

protected String removeBrackets(String word)

removeComments

protected String removeComments(String word)

removeTemplates

protected String removeTemplates(String word)


Copyright © 2011-2013 Ubiquitous Knowledge Processing (UKP) Lab. All Rights Reserved.