de.tudarmstadt.ukp.jwktl.parser
Class XMLDumpParser

java.lang.Object
  extended by de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser
All Implemented Interfaces:
IWiktionaryDumpParser
Direct Known Subclasses:
WiktionaryDumpParser

public abstract class XMLDumpParser
extends Object
implements IWiktionaryDumpParser

Implementation of IWiktionaryDumpParser for processing XML files downloaded from http://download.wikimedia.org/backup-index.html. Specializations of this class can focus on a particular aspect of the dump, e.g., parsing the full text of the article pages and creating an object structure from it, processing some aspects of the user pages, or filtering the article pages. The base class itself is intended to stay generic.

Author:
Christian M. Meyer

Nested Class Summary
protected  class XMLDumpParser.XMLDumpHandler
           
 
Field Summary
static String BZ2_FILE_EXTENSION
          The file extension for bzip2 files that is used for the automatic detection of the file format.
 
Constructor Summary
XMLDumpParser()
           
 
Method Summary
protected abstract  void onElementEnd(String name, XMLDumpParser.XMLDumpHandler handler)
          Hotspot that is invoked for each closing XML element.
protected abstract  void onElementStart(String name, XMLDumpParser.XMLDumpHandler handler)
          Hotspot that is invoked for each opening XML element.
protected  void onParserEnd()
          Hotspot that is invoked when parsing finishes.
protected  void onParserStart()
          Hotspot that is invoked on starting the parser.
 void parse(File dumpFile)
          Parses the given XML dump file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface de.tudarmstadt.ukp.jwktl.parser.IWiktionaryDumpParser
getPageParsers, register
 

Field Detail

BZ2_FILE_EXTENSION

public static final String BZ2_FILE_EXTENSION
The file extension for bzip2 files that is used for the automatic detection of the file format.

See Also:
Constant Field Values
Constructor Detail

XMLDumpParser

public XMLDumpParser()
Method Detail

parse

public void parse(File dumpFile)
           throws WiktionaryException
Parses the given XML dump file. The file format is detected automatically from the file extension: the file can be either bzip2-compressed or uncompressed XML. Internally, a SAX parser is used.

Specified by:
parse in interface IWiktionaryDumpParser
Throws:
WiktionaryException - in case of any parser errors.
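
The extension-based detection described above can be illustrated with a minimal sketch. It is not the JWKTL implementation: the class DumpFormatSketch, the method looksLikeBzip2, and the constant value "bz2" are assumptions made for illustration only. The actual value of BZ2_FILE_EXTENSION is listed under Constant Field Values.

```java
import java.io.File;

/** Hypothetical helper for illustration; not part of JWKTL. */
final class DumpFormatSketch {
    /** Assumed value; the real BZ2_FILE_EXTENSION may differ. */
    static final String ASSUMED_BZ2_EXTENSION = "bz2";

    private DumpFormatSketch() {
        // Static helper only; no instances needed.
    }

    /** Returns true if the file name ends with the assumed bzip2 extension. */
    static boolean looksLikeBzip2(File dumpFile) {
        // Only the file name is inspected; the file content is never read here.
        return dumpFile.getName().endsWith("." + ASSUMED_BZ2_EXTENSION);
    }
}
```

A parser built this way would wrap the input in a bzip2 decompression stream when the check succeeds and read the file directly otherwise. The Java standard library has no bzip2 decoder, so that step would need an external library.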

onParserStart

protected void onParserStart()
Hotspot that is invoked on starting the parser. Use this hotspot to initialize your data.


onElementStart

protected abstract void onElementStart(String name,
                                       XMLDumpParser.XMLDumpHandler handler)
Hotspot that is invoked for each opening XML element.


onElementEnd

protected abstract void onElementEnd(String name,
                                     XMLDumpParser.XMLDumpHandler handler)
Hotspot that is invoked for each closing XML element.


onParserEnd

protected void onParserEnd()
Hotspot that is invoked when parsing finishes. Use this hotspot for cleaning up and closing resources.
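
The four hotspots follow a template-method pattern driven by SAX events. The sketch below shows one way such a lifecycle could be wired up, using only the JDK. It is a stand-in, not the JWKTL source: HotspotParserSketch, its inner Handler class, getContents, TitleCollectorSketch, and the sample XML are all hypothetical. The order in which the hooks fire, and the use of IllegalStateException in place of WiktionaryException, are assumptions of the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for the base class; NOT the JWKTL implementation. */
abstract class HotspotParserSketch {

    /** Buffers element text; assumed to play a role similar to XMLDumpHandler. */
    protected final class Handler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        /** Text collected since the most recent opening tag. */
        String getContents() {
            return text.toString();
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);            // start a fresh buffer for this element
            onElementStart(qName, this);  // delegate to the subclass hotspot
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            onElementEnd(qName, this);    // subclass reads the buffer here
            text.setLength(0);
        }
    }

    protected void onParserStart() { }
    protected abstract void onElementStart(String name, Handler handler);
    protected abstract void onElementEnd(String name, Handler handler);
    protected void onParserEnd() { }

    /** Template method: start hook, SAX run, end hook (ordering assumed by this sketch). */
    final void parse(InputStream in) {
        onParserStart();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, new Handler());
        } catch (Exception e) {
            // Stand-in for WiktionaryException, which is not available in this sketch.
            throw new IllegalStateException("parser error", e);
        }
        onParserEnd();
    }
}

/** Hypothetical specialization that collects the text of every title element. */
final class TitleCollectorSketch extends HotspotParserSketch {
    final List<String> titles = new ArrayList<>();
    boolean finished;

    @Override
    protected void onParserStart() {
        titles.clear();                   // initialize data
        finished = false;
    }

    @Override
    protected void onElementStart(String name, Handler handler) {
        // Nothing to do on opening tags in this example.
    }

    @Override
    protected void onElementEnd(String name, Handler handler) {
        if ("title".equals(name)) {
            titles.add(handler.getContents());
        }
    }

    @Override
    protected void onParserEnd() {
        finished = true;                  // place for cleanup in a real subclass
    }

    /** Runs the sketch on a small in-memory document made up for this example. */
    static List<String> demo() {
        String xml = "<mediawiki><page><title>apple</title></page>"
                + "<page><title>pear</title></page></mediawiki>";
        TitleCollectorSketch collector = new TitleCollectorSketch();
        collector.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return collector.titles;
    }
}
```

Calling TitleCollectorSketch.demo() returns the two title strings in document order. A real subclass of XMLDumpParser would override the same four hotspots but receive an XMLDumpParser.XMLDumpHandler instead of the hypothetical Handler shown here.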



Copyright © 2011-2013 Ubiquitous Knowledge Processing (UKP) Lab. All Rights Reserved.