de.tudarmstadt.ukp.jwktl.parser
Class WiktionaryArticleParser

java.lang.Object
  extended by de.tudarmstadt.ukp.jwktl.parser.WiktionaryPageParser<WiktionaryPage>
      extended by de.tudarmstadt.ukp.jwktl.parser.WiktionaryArticleParser
All Implemented Interfaces:
IWiktionaryPageParser

public class WiktionaryArticleParser
extends WiktionaryPageParser<WiktionaryPage>

Parses a Wiktionary XML dump and stores the parsed information as a Berkeley DB within a specified directory. The parsed Wiktionary dump can then be accessed using the main JWKTL API. This implementation parses only article pages within the main namespace; discussions, user pages, revisions, etc. are not handled. An article page's text is passed to an implementation of IWiktionaryEntryParser, which is either automatically detected from the Wiktionary's base URL, or specified in the constructor. Note that each directory can only contain one Wiktionary database.
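The flow described above (keep only main-namespace articles, pass the page text to an entry parser, store the result) can be sketched without JWKTL on the classpath. Every type in the block is a hypothetical stand-in rather than a JWKTL class: `ArticleFilterSketch`, its nested `EntryParser` interface, and the in-memory map that takes the place of the Berkeley DB are assumptions for illustration. The sketch also assumes that the main namespace is represented by an empty namespace name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative stand-ins only; none of these types belong to JWKTL. */
class ArticleFilterSketch {

    /** Hypothetical counterpart of an entry parser: turns raw wiki text into stored content. */
    interface EntryParser {
        String parse(String title, String text);
    }

    /** In-memory replacement for the database that the real parser writes to. */
    private final Map<String, String> store = new LinkedHashMap<>();
    private final EntryParser entryParser;

    ArticleFilterSketch(EntryParser entryParser) {
        this.entryParser = entryParser;
    }

    /** Returns true if the page was parsed and stored, false if it was skipped. */
    boolean handlePage(String namespace, String title, String text) {
        // Assumption: the main namespace is represented by an empty (or missing) name.
        if (namespace != null && !namespace.isEmpty()) {
            return false; // discussions, user pages, etc. are not handled
        }
        store.put(title, entryParser.parse(title, text));
        return true;
    }

    Map<String, String> stored() {
        return store;
    }

    public static void main(String[] args) {
        // The lambda is a trivial placeholder for a language-specific entry parser.
        ArticleFilterSketch sketch = new ArticleFilterSketch((title, text) -> text.trim());
        sketch.handlePage("", "boat", " ==English== ... ");
        sketch.handlePage("Talk", "boat", "discussion text");
        System.out.println(sketch.stored().keySet());
    }
}
```

Only the article page reaches the store; the talk page is rejected before the entry parser is ever called, which is the behaviour the class description attributes to the main-namespace restriction.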

Author:
Christian M. Meyer

Field Summary
protected  IWiktionaryEntryParser entryParser
           
protected  IWritableWiktionaryEdition wiktionaryDB
           
 
Fields inherited from class de.tudarmstadt.ukp.jwktl.parser.WiktionaryPageParser
currentNamespace, dumpInfo, page
 
Constructor Summary
WiktionaryArticleParser(IWritableWiktionaryEdition wiktionaryDB)
          Creates a caching article parser that saves the parsed Wiktionary data into the given writable Wiktionary edition (wiktionaryDB); the entry parser is detected from the dump's base URL.
WiktionaryArticleParser(IWritableWiktionaryEdition wiktionaryDB, IWiktionaryEntryParser entryParser)
          Creates a caching article parser that saves the parsed Wiktionary data into the given writable Wiktionary edition (wiktionaryDB), using the specified entry parser.
 
Method Summary
protected  WiktionaryPage createPage()
           
protected  boolean isAllowed(IWiktionaryPage page)
           
 void onClose(IDumpInfo dumpInfo)
          Hotspot that is invoked after the parser has finished its work.
 void onPageEnd()
          Hotspot that is invoked upon finishing the current article page.
 void onSiteInfoComplete(IDumpInfo dumpInfo)
          Hotspot that is invoked after the siteinfo header has been read.
protected  void saveParsedWiktionaryPage()
           
 void setText(String text)
          Hotspot that is invoked after the current page's text is read.
 
Methods inherited from class de.tudarmstadt.ukp.jwktl.parser.WiktionaryPageParser
onPageStart, onParserEnd, onParserStart, setAuthor, setPageId, setRevision, setTimestamp, setTitle
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

wiktionaryDB

protected IWritableWiktionaryEdition wiktionaryDB

entryParser

protected IWiktionaryEntryParser entryParser

Constructor Detail

WiktionaryArticleParser

public WiktionaryArticleParser(IWritableWiktionaryEdition wiktionaryDB)
                        throws WiktionaryException
Creates a caching article parser that saves the parsed Wiktionary data into the given writable Wiktionary edition (wiktionaryDB). Whether a previously parsed Wiktionary database is replaced depends on overwriteExisting, which is not a parameter of this constructor and is presumably set when wiktionaryDB is created. The entry parser is created based on the dump's base URL.

Throws:
WiktionaryException - if the target directory is not empty and overwriteExisting was set to false.

WiktionaryArticleParser

public WiktionaryArticleParser(IWritableWiktionaryEdition wiktionaryDB,
                               IWiktionaryEntryParser entryParser)
                        throws WiktionaryException
Creates a caching article parser that saves the parsed Wiktionary data into the given writable Wiktionary edition (wiktionaryDB). Whether a previously parsed Wiktionary database is replaced depends on overwriteExisting, which is not a parameter of this constructor and is presumably set when wiktionaryDB is created. The specified entry parser is used instead of auto-detecting the language-specific parser.

Throws:
WiktionaryException - if the target directory is not empty and overwriteExisting was set to false.
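
The difference between the two constructors (auto-detected versus explicitly supplied entry parser) can be illustrated with a small, self-contained sketch. It does not reproduce JWKTL's detection logic: `EntryParserSelectionSketch`, its nested `EntryParser` class, the `onBaseUrlKnown` method, and the rule that the language code is the first label of the host name are all assumptions made for illustration.

```java
import java.net.URI;

/** Illustrative stand-ins only; the real detection logic in JWKTL may differ. */
class EntryParserSelectionSketch {

    /** Hypothetical entry parser type, identified here only by a language code. */
    static final class EntryParser {
        final String languageCode;

        EntryParser(String languageCode) {
            this.languageCode = languageCode;
        }
    }

    private EntryParser entryParser; // null until detected or supplied

    /** Mirrors the one-argument constructor: the parser is detected later from the base URL. */
    EntryParserSelectionSketch() {
        this(null);
    }

    /** Mirrors the two-argument constructor: the caller fixes the parser up front. */
    EntryParserSelectionSketch(EntryParser entryParser) {
        this.entryParser = entryParser;
    }

    /** Called once the dump header is known; an explicitly supplied parser is left untouched. */
    void onBaseUrlKnown(String baseUrl) {
        if (entryParser == null) {
            entryParser = new EntryParser(detectLanguageCode(baseUrl));
        }
    }

    /** Assumption: the language code is the first host label, e.g. "en" in en.wiktionary.org. */
    static String detectLanguageCode(String baseUrl) {
        String host = URI.create(baseUrl).getHost();
        if (host == null || host.indexOf('.') < 0) {
            throw new IllegalArgumentException("Cannot detect language from: " + baseUrl);
        }
        return host.substring(0, host.indexOf('.'));
    }

    EntryParser entryParser() {
        return entryParser;
    }

    public static void main(String[] args) {
        EntryParserSelectionSketch auto = new EntryParserSelectionSketch();
        auto.onBaseUrlKnown("http://de.wiktionary.org/wiki/Wiktionary:Hauptseite");
        System.out.println(auto.entryParser().languageCode);
    }
}
```

Deferring detection until the base URL is available is one plausible reason the one-argument constructor needs no language argument; supplying a parser explicitly is useful when the base URL is unusual or a custom parser is wanted.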

Method Detail

onSiteInfoComplete

public void onSiteInfoComplete(IDumpInfo dumpInfo)
Description copied from interface: IWiktionaryPageParser
Hotspot that is invoked after the siteinfo header has been read. At this point, the dump info is complete, including the dump language and namespaces.

Specified by:
onSiteInfoComplete in interface IWiktionaryPageParser
Overrides:
onSiteInfoComplete in class WiktionaryPageParser<WiktionaryPage>

onPageEnd

public void onPageEnd()
Description copied from interface: IWiktionaryPageParser
Hotspot that is invoked upon finishing the current article page.

Specified by:
onPageEnd in interface IWiktionaryPageParser
Overrides:
onPageEnd in class WiktionaryPageParser<WiktionaryPage>

onClose

public void onClose(IDumpInfo dumpInfo)
Description copied from interface: IWiktionaryPageParser
Hotspot that is invoked after the parser has finished its work. This method is supposed to close and clean up any resources (e.g., closing a database connection). It is called after all IWiktionaryPageParser.onParserEnd(IDumpInfo) calls have been handled.

Specified by:
onClose in interface IWiktionaryPageParser
Overrides:
onClose in class WiktionaryPageParser<WiktionaryPage>
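
The hotspots documented in this section are callbacks that a dump reader invokes in sequence. The sketch below records one possible call order with a hypothetical driver. Only two ordering facts come from the descriptions above: onSiteInfoComplete follows the siteinfo header, and onClose follows onParserEnd. The remaining order, the reduced `PageHandler` interface, `RecordingHandler`, and `drive` are assumptions for illustration and not JWKTL code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative stand-ins only; the interface below merely borrows the hotspot names. */
class HotspotOrderSketch {

    /** Hypothetical, reduced version of a page parser callback interface. */
    interface PageHandler {
        void onParserStart();
        void onSiteInfoComplete();
        void onPageStart();
        void setText(String text);
        void onPageEnd();
        void onParserEnd();
        void onClose();
    }

    /** Handler that only records which hotspot was invoked, in order. */
    static final class RecordingHandler implements PageHandler {
        final List<String> calls = new ArrayList<>();

        public void onParserStart()      { calls.add("onParserStart"); }
        public void onSiteInfoComplete() { calls.add("onSiteInfoComplete"); }
        public void onPageStart()        { calls.add("onPageStart"); }
        public void setText(String text) { calls.add("setText"); }
        public void onPageEnd()          { calls.add("onPageEnd"); }
        public void onParserEnd()        { calls.add("onParserEnd"); }
        public void onClose()            { calls.add("onClose"); }
    }

    /** Hypothetical driver: feeds page texts through the handler in the assumed order. */
    static void drive(PageHandler handler, List<String> pageTexts) {
        handler.onParserStart();
        handler.onSiteInfoComplete(); // the siteinfo header has been read
        for (String text : pageTexts) {
            handler.onPageStart();
            handler.setText(text);
            handler.onPageEnd();      // the page is complete and could be saved here
        }
        handler.onParserEnd();
        handler.onClose();            // last call: release resources such as a database
    }

    public static void main(String[] args) {
        RecordingHandler handler = new RecordingHandler();
        drive(handler, Arrays.asList("first page", "second page"));
        System.out.println(handler.calls);
    }
}
```

Keeping resource cleanup in the final callback means a handler can hold its database open across all pages and release it exactly once.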

createPage

protected WiktionaryPage createPage()
Specified by:
createPage in class WiktionaryPageParser<WiktionaryPage>

setText

public void setText(String text)
Description copied from interface: IWiktionaryPageParser
Hotspot that is invoked after the current page's text is read.

Specified by:
setText in interface IWiktionaryPageParser
Specified by:
setText in class WiktionaryPageParser<WiktionaryPage>

saveParsedWiktionaryPage

protected void saveParsedWiktionaryPage()

isAllowed

protected boolean isAllowed(IWiktionaryPage page)


Copyright © 2011-2013 Ubiquitous Knowledge Processing (UKP) Lab. All Rights Reserved.