de.tudarmstadt.ukp.jwktl.parser
Class WiktionaryDumpParser

java.lang.Object
  extended by de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser
      extended by de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser
All Implemented Interfaces:
IWiktionaryDumpParser

public class WiktionaryDumpParser
extends XMLDumpParser

Extension of the XMLDumpParser that reads the XML tags of the Wiktionary XML dump file format and provides a hotspot for each type of information. Multiple IWiktionaryPageParsers can be registered with this dump parser. The page parsers are called whenever a piece of information has been read. Different page parsers can, for example, handle different page types or namespaces.
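
The registry-and-hotspot design described above can be illustrated with a minimal, self-contained sketch. It does not use the JWKTL classes: PageListener, MiniDumpParser, TitleCollector and emitPage are hypothetical stand-ins chosen for illustration, and the real IWiktionaryPageParser interface may declare different methods and signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a page parser; NOT the real IWiktionaryPageParser.
interface PageListener {
    void onPageStart();
    void setTitle(String title);
    void setText(String text);
    void onPageEnd();
}

// Hypothetical stand-in for the dump parser: keeps a registry of listeners
// and forwards every piece of page information to each of them.
final class MiniDumpParser {
    private final List<PageListener> registry = new ArrayList<>();

    MiniDumpParser(PageListener... listeners) {
        for (PageListener listener : listeners) {
            register(listener);
        }
    }

    void register(PageListener listener) {
        registry.add(listener);
    }

    // Simulates the events raised while reading one page of a dump.
    void emitPage(String title, String text) {
        for (PageListener l : registry) l.onPageStart();
        for (PageListener l : registry) l.setTitle(title);
        for (PageListener l : registry) l.setText(text);
        for (PageListener l : registry) l.onPageEnd();
    }
}

// Example listener that records the title of every completed page.
final class TitleCollector implements PageListener {
    final List<String> titles = new ArrayList<>();
    private String current;

    public void onPageStart() { current = null; }
    public void setTitle(String title) { current = title; }
    public void setText(String text) { /* page text is ignored in this sketch */ }
    public void onPageEnd() { if (current != null) titles.add(current); }
}

class RegistryDemo {
    public static void main(String[] args) {
        TitleCollector collector = new TitleCollector();
        MiniDumpParser parser = new MiniDumpParser(collector);
        parser.emitPage("dog", "==English==");
        parser.emitPage("cat", "==English==");
        System.out.println(collector.titles); // prints [dog, cat]
    }
}
```

Because each listener only sees callbacks, several listeners with different responsibilities (for example, one per namespace) can share a single pass over the dump.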

Author:
Christian M. Meyer

Nested Class Summary
 
Nested classes/interfaces inherited from class de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser
XMLDumpParser.XMLDumpHandler
 
Field Summary
protected  DumpInfo dumpInfo
           
protected  boolean inPage
           
protected  List<IWiktionaryPageParser> parserRegistry
           
protected  DateFormat timestampFormat
           
 
Fields inherited from class de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser
BZ2_FILE_EXTENSION
 
Constructor Summary
WiktionaryDumpParser(IWiktionaryPageParser... pageParsers)
          Initializes the dump parser and registers the given page parsers.
 
Method Summary
protected  void addNamespace(String namespace)
           
 IDumpInfo getDumpInfo()
          Returns information on the current dump file and its parsing progress.
 Iterable<IWiktionaryPageParser> getPageParsers()
          Returns the list of all registered IWiktionaryPageParsers.
protected  void onClose()
           
protected  void onElementEnd(String name, XMLDumpParser.XMLDumpHandler handler)
          Hotspot that is invoked for each closing XML element.
protected  void onElementStart(String name, XMLDumpParser.XMLDumpHandler handler)
          Hotspot that is invoked for each opening XML element.
protected  void onPageEnd()
           
protected  void onPageStart()
           
protected  void onParserEnd()
          Hotspot that is invoked on finishing the parsing.
protected  void onParserStart()
          Hotspot that is invoked on starting the parser.
protected  void onSiteInfoComplete()
           
 void parse(File dumpFile)
          Parses the given XML dump file.
protected  Date parseTimestamp(String dateString)
           
 void register(IWiktionaryPageParser pageParser)
          Registers the given IWiktionaryPageParser.
protected static ILanguage resolveLanguage(String baseURL)
           
protected  void setAuthor(String author)
           
protected  void setBaseURL(String baseURL)
           
protected  void setPageId(long pageId)
           
protected  void setRevision(long revisionId)
           
protected  void setText(String text)
           
protected  void setTimestamp(Date timestamp)
           
protected  void setTitle(String title)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

parserRegistry

protected List<IWiktionaryPageParser> parserRegistry

inPage

protected boolean inPage

dumpInfo

protected DumpInfo dumpInfo

timestampFormat

protected DateFormat timestampFormat

Constructor Detail

WiktionaryDumpParser

public WiktionaryDumpParser(IWiktionaryPageParser... pageParsers)
Initializes the dump parser and registers the given page parsers.

Method Detail

register

public void register(IWiktionaryPageParser pageParser)
Description copied from interface: IWiktionaryDumpParser
Registers the given IWiktionaryPageParser. The registered parser is then notified whenever a Wiktionary-related XML tag has been processed.


getPageParsers

public Iterable<IWiktionaryPageParser> getPageParsers()
Description copied from interface: IWiktionaryDumpParser
Returns the list of all registered IWiktionaryPageParsers.


parse

public void parse(File dumpFile)
           throws WiktionaryException
Description copied from class: XMLDumpParser
Parses the given XML dump file. The file format is automatically detected using the file extension: it can be either bzip2 compressed or uncompressed XML. Internally, a SAX parser is used.

Specified by:
parse in interface IWiktionaryDumpParser
Overrides:
parse in class XMLDumpParser
Throws:
WiktionaryException - in case of any parser errors.
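
As a rough illustration of SAX-based, event-driven reading as mentioned above, the following sketch parses a tiny in-memory XML snippet with the JDK's own SAX parser and collects the content of every title element. The class SaxTitleSketch, the method readTitles and the sample XML are assumptions made for this example. The sketch does not reproduce JWKTL's handler, its bzip2 support or its file extension check.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; not part of JWKTL.
class SaxTitleSketch {

    // Returns the text content of every <title> element in the given XML.
    static List<String> readTitles(String xml) {
        final List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder buffer = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                buffer.setLength(0); // opening tag: start with an empty buffer
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                buffer.append(ch, start, length); // may be called in several chunks
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    titles.add(buffer.toString()); // closing tag: content is complete
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki>"
                + "<page><title>dog</title><text>==English==</text></page>"
                + "<page><title>cat</title><text>==English==</text></page>"
                + "</mediawiki>";
        System.out.println(readTitles(xml)); // prints [dog, cat]
    }
}
```

SAX reports elements as a stream of events instead of building a document tree, which is what makes it suitable for dump files that are too large to hold in memory.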

onParserStart

protected void onParserStart()
Description copied from class: XMLDumpParser
Hotspot that is invoked on starting the parser. Use this hotspot to initialize your data.

Overrides:
onParserStart in class XMLDumpParser

onSiteInfoComplete

protected void onSiteInfoComplete()

onParserEnd

protected void onParserEnd()
Description copied from class: XMLDumpParser
Hotspot that is invoked on finishing the parsing. Use this hotspot for cleaning up and closing resources.

Overrides:
onParserEnd in class XMLDumpParser

onClose

protected void onClose()

onElementStart

protected void onElementStart(String name,
                              XMLDumpParser.XMLDumpHandler handler)
Description copied from class: XMLDumpParser
Hotspot that is invoked for each opening XML element.

Specified by:
onElementStart in class XMLDumpParser

onElementEnd

protected void onElementEnd(String name,
                            XMLDumpParser.XMLDumpHandler handler)
Description copied from class: XMLDumpParser
Hotspot that is invoked for each closing XML element.

Specified by:
onElementEnd in class XMLDumpParser

onPageStart

protected void onPageStart()

onPageEnd

protected void onPageEnd()

setBaseURL

protected void setBaseURL(String baseURL)

resolveLanguage

protected static ILanguage resolveLanguage(String baseURL)

addNamespace

protected void addNamespace(String namespace)

setAuthor

protected void setAuthor(String author)

setRevision

protected void setRevision(long revisionId)

setTimestamp

protected void setTimestamp(Date timestamp)

setPageId

protected void setPageId(long pageId)

setTitle

protected void setTitle(String title)

setText

protected void setText(String text)

parseTimestamp

protected Date parseTimestamp(String dateString)
                       throws ParseException
Throws:
ParseException
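
The page above does not state which pattern timestampFormat uses. The following sketch assumes ISO-8601-style UTC timestamps of the form 2013-01-01T00:00:00Z, as they commonly appear in MediaWiki XML dumps. The class TimestampSketch, the method newTimestampFormat and the pattern string are assumptions and may differ from the actual implementation.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper; not part of JWKTL.
class TimestampSketch {

    // Assumed pattern: a literal 'T' separator and a literal trailing 'Z', read as UTC.
    static DateFormat newTimestampFormat() {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format;
    }

    // Parses the given timestamp; throws ParseException if it does not match the pattern.
    static Date parseTimestamp(String dateString) throws ParseException {
        // A new format is created per call because SimpleDateFormat is not thread-safe.
        return newTimestampFormat().parse(dateString);
    }

    public static void main(String[] args) throws ParseException {
        Date date = parseTimestamp("2013-01-01T00:00:00Z");
        System.out.println(date.getTime()); // prints 1356998400000
    }
}
```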

getDumpInfo

public IDumpInfo getDumpInfo()
Returns information on the current dump file and its parsing progress. The result is null if the parser has not yet been started (i.e., the parse(File) method has not been called).
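
Callers therefore need a null check before reading any progress information. A minimal sketch of that contract follows, using the hypothetical stand-ins ProgressInfo and LazyInfoParser instead of the real IDumpInfo and WiktionaryDumpParser, whose accessors are not listed on this page.

```java
// Hypothetical stand-in for the dump information object; NOT the real IDumpInfo.
class ProgressInfo {
    private int processedPages;

    void incrementProcessedPages() { processedPages++; }
    int getProcessedPages() { return processedPages; }
}

// Hypothetical stand-in for a parser whose info object exists only after parse().
class LazyInfoParser {
    private ProgressInfo info; // remains null until parse() is called

    ProgressInfo getInfo() { return info; }

    // Simulates parsing a dump that contains the given number of pages.
    void parse(int pageCount) {
        info = new ProgressInfo(); // created when parsing starts
        for (int i = 0; i < pageCount; i++) {
            info.incrementProcessedPages();
        }
    }
}

class DumpInfoDemo {
    // Guards against the "not yet started" case before reading progress.
    static String describe(LazyInfoParser parser) {
        ProgressInfo info = parser.getInfo();
        return info == null ? "not started" : info.getProcessedPages() + " pages";
    }

    public static void main(String[] args) {
        LazyInfoParser parser = new LazyInfoParser();
        System.out.println(describe(parser)); // prints not started
        parser.parse(3);
        System.out.println(describe(parser)); // prints 3 pages
    }
}
```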



Copyright © 2011-2013 Ubiquitous Knowledge Processing (UKP) Lab. All Rights Reserved.