org.apache.nutch.parse

Class ParserChecker

All Implemented Interfaces:
    org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class ParserChecker
extends Object
implements org.apache.hadoop.util.Tool

Parser checker, useful for testing parsers. It also accurately reports possible fetching and parsing failures and presents protocol status signals to aid debugging. The tool retrieves the following data for any URL:

  • contentType: The content type of the URL.
  • signature: A digest used to identify pages (like a unique ID) and to remove duplicates during the dedup procedure. It is calculated using MD5Signature or TextProfileSignature.
  • Version: From ParseData.
  • Status: From ParseData.
  • Title: The title of the URL.
  • Outlinks: The outlinks associated with the URL.
  • Content Metadata: such as X-AspNet-Version, Date, Content-length, servedBy, Content-Type, Cache-Control, etc.
  • Parse Metadata: such as CharEncodingForConversion, OriginalCharEncoding, language, etc.
  • ParseText: The parsed text of the page, which varies in length depending on the content.length configuration.

Author:
    John Xing
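
The role of the signature in deduplication can be illustrated with a minimal sketch. It uses only the JDK's MessageDigest and is not Nutch's MD5Signature implementation. The class name PageSignatureSketch, the helper methods, and the example URLs are all hypothetical, and hashing the raw page body is an assumption made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Nutch API.
class PageSignatureSketch {

    // Returns the hex-encoded MD5 digest of the given page bytes.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Maps each distinct signature to the first URL seen with that content,
    // so later pages with identical content are dropped as duplicates.
    static Map<String, String> dedup(Map<String, String> pagesByUrl) {
        Map<String, String> firstUrlBySignature = new LinkedHashMap<>();
        for (Map.Entry<String, String> page : pagesByUrl.entrySet()) {
            byte[] body = page.getValue().getBytes(StandardCharsets.UTF_8);
            firstUrlBySignature.putIfAbsent(md5Hex(body), page.getKey());
        }
        return firstUrlBySignature;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", "same body");
        pages.put("http://example.com/b", "same body");
        pages.put("http://example.com/c", "other body");
        // Print one line per distinct signature with the URL that was kept.
        dedup(pages).forEach((sig, url) -> System.out.println(sig + "  " + url));
    }
}
```

An exact digest like this only catches byte-identical content. A profile-based signature such as TextProfileSignature is meant for pages that differ in minor ways, which this sketch does not attempt to model.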

Field Summary

Fields
Modifier and Type          Field and Description
static org.slf4j.Logger    LOG

Constructor Summary

Constructors
Constructor and Description
ParserChecker()

Method Summary

Methods
Modifier and Type                       Method and Description
org.apache.hadoop.conf.Configuration    getConf()
static void                             main(String[] args)
int                                     run(String[] args)
void                                    setConf(org.apache.hadoop.conf.Configuration c)

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

LOG

public static final org.slf4j.Logger LOG

Constructor Detail

ParserChecker

public ParserChecker()

Method Detail

run

public int run(String[] args)
        throws Exception
Specified by:
    run in interface org.apache.hadoop.util.Tool
Throws:
    Exception

getConf

public org.apache.hadoop.conf.Configuration getConf()
Specified by:
    getConf in interface org.apache.hadoop.conf.Configurable

setConf

public void setConf(org.apache.hadoop.conf.Configuration c)
Specified by:
    setConf in interface org.apache.hadoop.conf.Configurable

main

public static void main(String[] args)
                 throws Exception
Throws:
    Exception
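
The relationship between setConf, getConf, and run follows the Configurable/Tool contract: a driver hands the tool a configuration and then calls run with the command-line arguments, which returns an exit code. The sketch below models that contract with hypothetical stand-in types (Conf, Configurable, Tool, CheckerLikeTool, runTool) so that it compiles without Hadoop on the classpath. It is not the Hadoop or Nutch code, and the argument handling and return codes are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the Hadoop types; illustration only.
class ToolContractSketch {

    // Stand-in for org.apache.hadoop.conf.Configuration.
    static class Conf {
        final Map<String, String> values = new HashMap<>();
    }

    // Stand-in for org.apache.hadoop.conf.Configurable.
    interface Configurable {
        void setConf(Conf conf);
        Conf getConf();
    }

    // Stand-in for org.apache.hadoop.util.Tool, which extends Configurable.
    interface Tool extends Configurable {
        int run(String[] args) throws Exception;
    }

    // A tool shaped like ParserChecker: it stores the configuration it is
    // given and treats the last argument as the URL to check (assumption).
    static class CheckerLikeTool implements Tool {
        private Conf conf;

        @Override
        public void setConf(Conf conf) {
            this.conf = conf;
        }

        @Override
        public Conf getConf() {
            return conf;
        }

        @Override
        public int run(String[] args) {
            if (args.length == 0) {
                System.err.println("Usage: CheckerLikeTool <url>");
                return -1;
            }
            System.out.println("would fetch and parse: " + args[args.length - 1]);
            return 0;
        }
    }

    // Minimal driver: configure the tool, run it, and map any exception
    // to a non-zero exit code.
    static int runTool(Conf conf, Tool tool, String[] args) {
        tool.setConf(conf);
        try {
            return tool.run(args);
        } catch (Exception e) {
            System.err.println("tool failed: " + e);
            return 1;
        }
    }

    public static void main(String[] args) {
        int exitCode = runTool(new Conf(), new CheckerLikeTool(), args);
        System.exit(exitCode);
    }
}
```

In a Hadoop-based program, the driver role played here by runTool is normally filled by org.apache.hadoop.util.ToolRunner, which also parses generic Hadoop options before calling run.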

Copyright © 2014 The Apache Software Foundation