de.tudarmstadt.ukp.jwktl.parser.ru.wikokit.base.wikipedia.text
Class XMLTagsParser

java.lang.Object
  extended by de.tudarmstadt.ukp.jwktl.parser.ru.wikokit.base.wikipedia.text.XMLTagsParser

public class XMLTagsParser
extends Object

Parser of XML (HTML) tags and character entities, e.g. &nbsp;, &quot;, <br />, etc.


Constructor Summary
XMLTagsParser()
           
 
Method Summary
static String escapeCharFromXML(String text)
          Escapes characters in text that will appear as XML data, replacing them with character entities.
protected static String isAmpersandTag(String text, int pos)
          If 'text', starting at position 'pos', begins with one of the entities &lt;, &gt;, &amp;, &quot;, &#039;, &nbsp;, &ndash; or &mdash;, that entity is returned; otherwise an empty string is returned.
protected static String isBRNewlineTag(String text, int pos)
          If 'text', starting at position 'pos', begins with one of the tags <br />, <br/> or <br>, that tag is returned; otherwise an empty string is returned.
static String replaceCharFromXML(String text, char replacement)
          Replaces the characters <, >, &, " and the entities &lt;, &gt;, &amp;, &quot;, &#039;, &nbsp;, &ndash;, &mdash; with the 'replacement' character.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

XMLTagsParser

public XMLTagsParser()
Method Detail

isAmpersandTag

protected static String isAmpersandTag(String text,
                                       int pos)
If 'text', starting at position 'pos', begins with one of the entities &lt;, &gt;, &amp;, &quot;, &#039;, &nbsp;, &ndash; or &mdash;, that entity is returned; otherwise an empty string is returned.

Parameters:
pos - position in 'text' from which the tag will be extracted
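
The lookup described above can be illustrated with a minimal sketch. It is not the library's source: the class name AmpersandTagSketch, the method name entityAt, and the order in which the entities are tried are all assumptions made for illustration.

```java
class AmpersandTagSketch {
    // Entities named in the method description; the order is an assumption.
    private static final String[] ENTITIES = {
        "&lt;", "&gt;", "&amp;", "&quot;", "&#039;", "&nbsp;", "&ndash;", "&mdash;"
    };

    /** Hypothetical helper: returns the entity that starts at pos, or "" if none does. */
    static String entityAt(String text, int pos) {
        if (text == null || pos < 0 || pos >= text.length()) {
            return "";
        }
        for (String entity : ENTITIES) {
            // startsWith(prefix, offset) compares without creating substrings
            if (text.startsWith(entity, pos)) {
                return entity;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String text = "a &amp; b";
        System.out.println("[" + entityAt(text, 2) + "]"); // position of '&'
        System.out.println("[" + entityAt(text, 0) + "]"); // no entity here
    }
}
```

Returning an empty string rather than null lets a caller test the result with isEmpty() and use its length directly to advance the scan position.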

isBRNewlineTag

protected static String isBRNewlineTag(String text,
                                       int pos)
If 'text', starting at position 'pos', begins with one of the tags <br />, <br/> or <br>, that tag is returned; otherwise an empty string is returned.

Parameters:
pos - position in 'text' from which the tag will be extracted
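
Because the matched tag itself is returned, its length tells a caller how far to advance. The following sketch shows one way such a result might drive a scanning loop; the names BrTagSketch, brTagAt and brToNewline are hypothetical and the code is not taken from the library.

```java
class BrTagSketch {
    // The three spellings named in the method description.
    private static final String[] BR_TAGS = {"<br />", "<br/>", "<br>"};

    /** Hypothetical helper: returns the line-break tag that starts at pos, or "". */
    static String brTagAt(String text, int pos) {
        if (text == null || pos < 0 || pos >= text.length()) {
            return "";
        }
        for (String tag : BR_TAGS) {
            if (text.startsWith(tag, pos)) {
                return tag;
            }
        }
        return "";
    }

    /** Copies text, writing '\n' in place of every line-break tag. */
    static String brToNewline(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int pos = 0;
        while (pos < text.length()) {
            String tag = brTagAt(text, pos);
            if (tag.isEmpty()) {
                out.append(text.charAt(pos));
                pos++;
            } else {
                out.append('\n');
                pos += tag.length(); // skip the whole tag
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(brToNewline("first<br />second<br/>third<br>fourth"));
    }
}
```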

replaceCharFromXML

public static String replaceCharFromXML(String text,
                                        char replacement)
Replaces the characters <, >, &, " and the entities &lt;, &gt;, &amp;, &quot;, &#039;, &nbsp;, &ndash;, &mdash; with the 'replacement' character.

Replaces <br />, <br/> and <br> with a newline character.

The apostrophe character ' is left unchanged.

Attention: any other XML (HTML) tags must be parsed before this method is called, because the opening bracket '<' is replaced and the tags can no longer be recognized afterwards.
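
The behavior described above can be approximated as follows. This is a sketch under stated assumptions, not the library's implementation: the class and method names are hypothetical, and the processing order (line-break tags first, then entities, then bare characters) is an assumption chosen so that <br> is seen before '<' is replaced.

```java
class ReplaceCharSketch {
    private static final String[] BR_TAGS = {"<br />", "<br/>", "<br>"};
    private static final String[] ENTITIES = {
        "&lt;", "&gt;", "&amp;", "&quot;", "&#039;", "&nbsp;", "&ndash;", "&mdash;"
    };
    private static final char[] BARE_CHARS = {'<', '>', '&', '"'};

    /** Hypothetical re-creation of the documented behavior. */
    static String replaceSketch(String text, char replacement) {
        String result = text;
        // 1. Line-break tags become newlines (assumed to happen first).
        for (String br : BR_TAGS) {
            result = result.replace(br, "\n");
        }
        // 2. Entities become the replacement character.
        String replacementText = String.valueOf(replacement);
        for (String entity : ENTITIES) {
            result = result.replace(entity, replacementText);
        }
        // 3. Remaining bare characters are replaced; the apostrophe is not in the list.
        for (char c : BARE_CHARS) {
            result = result.replace(c, replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(replaceSketch("A &amp; B<br>it's &quot;C&quot;", '_'));
        // Other tags are damaged, which is why they must be parsed beforehand:
        System.out.println(replaceSketch("<b>x</b>", ' '));
    }
}
```

The second call in main shows the situation the warning refers to: once '<' and '>' have been replaced, a tag such as <b> can no longer be detected.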


escapeCharFromXML

public static String escapeCharFromXML(String text)
Escapes characters in text that will appear as XML data, replacing them with character entities.

The following characters are replaced with the corresponding character entities:

Character   Encoding
<           &lt;
>           &gt;
&           &amp;
"           &quot;
'           &#039;

Note that JSTL's <c:out> escapes exactly the same set of characters as this method. That is, <c:out> is suitable for escaping text to produce valid XML, but not for producing safe HTML. See: http://www.javapractices.com/topic/TopicAction.do?Id=96
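
The mapping in the table above can be sketched as a single pass over the input. The class name EscapeSketch and the method name escape are hypothetical; this illustrates the table and is not the library's code.

```java
class EscapeSketch {
    /** Hypothetical helper: applies the five substitutions from the table. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#039;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<a href=\"x\">Tom & 'Jerry'</a>"));
    }
}
```

Working character by character in one pass avoids escaping the '&' of an entity that was produced by an earlier substitution, which can happen when chained string replacements are applied in the wrong order.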



Copyright © 2011-2013 Ubiquitous Knowledge Processing (UKP) Lab. All Rights Reserved.