org.apache.nutch.tools
Class DmozParser
- java.lang.Object
- org.apache.nutch.tools.DmozParser
public class DmozParser extends Object
Utility that converts DMOZ RDF into a flat file of URLs to be injected.
Field Summary
Modifier and Type          Field and Description
static org.slf4j.Logger    LOG
Constructor Summary
Constructor and Description
DmozParser()
Method Summary
Modifier and Type    Method and Description
static void          main(String[] argv)
                     Command-line access.
void                 parseDmozFile(File dmozFile, int subsetDenom, boolean includeAdult, int skew, Pattern topicPattern)
                     Iterate through all the items in this structured DMOZ file.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
LOG
public static final org.slf4j.Logger LOG
Constructor Detail
DmozParser
public DmozParser()
Method Detail
parseDmozFile
public void parseDmozFile(File dmozFile, int subsetDenom, boolean includeAdult, int skew, Pattern topicPattern) throws IOException, SAXException, ParserConfigurationException
Iterate through all the items in this structured DMOZ file and emit the URL of each selected item, ready to be injected into the web db.
- Throws:
- IOException
- SAXException
- ParserConfigurationException
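The declared exceptions suggest that the DMOZ file is read with a SAX parser. The following is a minimal, JDK-only sketch of that style of iteration, not the Nutch implementation. The class DmozSketch, the helper extractUrls, the element names ExternalPage and topic, the about attribute, and the use of Pattern.find() for topic matching are all assumptions made for illustration; subsetDenom, skew, and includeAdult are not modeled here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class DmozSketch {

    // Hypothetical helper, not part of Nutch: collects the "about" URL of each
    // ExternalPage element whose topic text matches topicPattern (null keeps all).
    // Checked parser exceptions are wrapped to keep the sketch short.
    static List<String> extractUrls(String rdfXml, Pattern topicPattern) {
        List<String> urls = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private String currentUrl;   // URL of the page currently being read
            private boolean inTopic;     // true while inside a <topic> element
            private final StringBuilder topic = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("ExternalPage".equals(qName)) {
                    currentUrl = atts.getValue("about");
                    topic.setLength(0);
                } else if ("topic".equals(qName)) {
                    inTopic = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTopic) {
                    topic.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("topic".equals(qName)) {
                    inTopic = false;
                } else if ("ExternalPage".equals(qName) && currentUrl != null) {
                    // Keep the URL only if its topic passes the (optional) filter.
                    if (topicPattern == null || topicPattern.matcher(topic).find()) {
                        urls.add(currentUrl);
                    }
                    currentUrl = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse RDF input", e);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Tiny made-up input in the assumed DMOZ layout.
        String rdf = "<RDF>"
            + "<ExternalPage about=\"http://example.org/music\"><topic>Top/Arts/Music</topic></ExternalPage>"
            + "<ExternalPage about=\"http://example.org/chess\"><topic>Top/Games/Chess</topic></ExternalPage>"
            + "</RDF>";
        // prints [http://example.org/music]
        System.out.println(extractUrls(rdf, Pattern.compile("^Top/Arts")));
    }
}
```

A streaming SAX handler suits this kind of input because a large RDF dump can be processed one element at a time without being loaded into memory.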
main
public static void main(String[] argv) throws Exception
Command-line access. The user may add URLs via a flat text file or the structured DMOZ file. By default, adult material (as categorized by DMOZ) is ignored.
- Throws:
- <code>Exception</code>
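The default handling of adult material can be pictured as a simple topic check. This is a sketch under stated assumptions, not the Nutch code: the class AdultFilterSketch and the helper keepTopic are hypothetical, and the "Top/Adult" prefix is an assumption about how DMOZ labels such categories.

```java
class AdultFilterSketch {

    // Assumption: DMOZ files adult material under topics starting with this prefix.
    static final String ADULT_PREFIX = "Top/Adult";

    // Hypothetical helper, not part of Nutch: returns true if a page with the
    // given topic should be kept for the chosen includeAdult setting.
    static boolean keepTopic(String topic, boolean includeAdult) {
        return includeAdult || !topic.startsWith(ADULT_PREFIX);
    }

    public static void main(String[] args) {
        System.out.println(keepTopic("Top/Adult/Arts", false)); // prints false
        System.out.println(keepTopic("Top/Adult/Arts", true));  // prints true
        System.out.println(keepTopic("Top/Arts/Music", false)); // prints true
    }
}
```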