org.apache.nutch.parse
Class ParserChecker
- java.lang.Object
- org.apache.nutch.parse.ParserChecker
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class ParserChecker extends Object implements org.apache.hadoop.util.Tool
Parser checker, useful for testing parsers. It also accurately reports possible fetching and parsing failures and presents protocol status signals to aid debugging. The tool retrieves the following data for any URL:
- contentType: The content type of the URL.
- signature: A digest used to identify pages (like a unique ID) and to remove duplicates during the dedup procedure. It is calculated using MD5Signature or TextProfileSignature.
- Version: From ParseData.
- Status: From ParseData.
- Title: The title of the URL.
- Outlinks: The outlinks associated with the URL.
- Content Metadata: Such as X-AspNet-Version, Date, Content-length, servedBy, Content-Type, Cache-Control, etc.
- Parse Metadata: Such as CharEncodingForConversion, OriginalCharEncoding, language, etc.
- ParseText: The parsed text of the page, which varies in length depending on the content.length configuration.
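The signature entry above describes a digest that acts as a page ID for deduplication. The following is a minimal, self-contained sketch of that idea using only the JDK. It is not the implementation of Nutch's MD5Signature; the class name SignatureSketch and the helpers md5Hex and dedup are hypothetical names chosen for illustration, and hashing the raw page bytes is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Apache Nutch.
public class SignatureSketch {

    /** Returns the MD5 digest of the given bytes as a lowercase hex string. */
    public static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so each byte prints as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Keeps the first page seen for each distinct signature, preserving order. */
    public static List<String> dedup(List<String> pages) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String page : pages) {
            String signature = md5Hex(page.getBytes(StandardCharsets.UTF_8));
            if (seen.add(signature)) {
                unique.add(page);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Assumed sample pages; the third is an exact duplicate of the first.
        List<String> pages = Arrays.asList(
                "<html>a</html>", "<html>b</html>", "<html>a</html>");
        for (String page : dedup(pages)) {
            System.out.println(
                    md5Hex(page.getBytes(StandardCharsets.UTF_8)) + "  " + page);
        }
    }
}
```

An exact digest like this only catches byte-identical pages; a profile-based signature such as TextProfileSignature is the kind of alternative to consider when near-duplicates should also collapse.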
- Author:
- John Xing
Field Summary
Fields
Modifier and Type, Field and Description:
static org.slf4j.Logger LOG
Constructor Summary
Constructors
Constructor and Description:
ParserChecker()
Method Summary
Methods
Modifier and Type, Method and Description:
org.apache.hadoop.conf.Configuration getConf()
static void main(String[] args)
int run(String[] args)
void setConf(org.apache.hadoop.conf.Configuration c)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
LOG
public static final org.slf4j.Logger LOG
Constructor Detail
ParserChecker
public ParserChecker()
Method Detail
run
public int run(String[] args) throws Exception
- Specified by:
- run in interface org.apache.hadoop.util.Tool
- Throws:
- Exception
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
- getConf in interface org.apache.hadoop.conf.Configurable
setConf
public void setConf(org.apache.hadoop.conf.Configuration c)
- Specified by:
- setConf in interface org.apache.hadoop.conf.Configurable
main
public static void main(String[] args) throws Exception
- Throws:
- Exception