org.apache.nutch.parse.html

Class HtmlParser

    • All Implemented Interfaces:
    • org.apache.hadoop.conf.Configurable, Parser, Pluggable

public class HtmlParser
extends Object
implements Parser

Field Summary

| Modifier and Type | Field and Description |
| --- | --- |
| <code>static org.slf4j.Logger</code> | <code>LOG</code> |

Fields inherited from interface org.apache.nutch.parse.Parser

X_POINT_ID

Constructor Summary

| Constructor and Description |
| --- |
| <code>HtmlParser()</code> |

Method Summary

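| Modifier and Type | Method and Description |
| --- | --- |
| <code>org.apache.hadoop.conf.Configuration</code> | <code>getConf()</code> |
| <code>ParseResult</code> | <code>getParse(Content content)</code> This method parses the given content and returns a map of <key, parse> pairs. |
| <code>static void</code> | <code>main(String[] args)</code> |
| <code>void</code> | <code>setConf(org.apache.hadoop.conf.Configuration conf)</code> |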

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

LOG

public static final org.slf4j.Logger LOG

Constructor Detail

HtmlParser

public HtmlParser()

Method Detail

getParse

public ParseResult getParse(Content content)

Description copied from interface: Parser

This method parses the given content and returns a map of <key, parse> pairs. Parse instances will be persisted under the given key.

Note: Meta-redirects should be followed only when they come from the original URL. That is:

Assume the fetcher is in parsing mode and is currently processing foo.bar.com/redirect.html. If this URL contains a meta redirect to another URL, the fetcher should follow the redirect only if the map contains an entry of the form <"foo.bar.com/redirect.html", Parse with a ParseStatus indicating the redirect>.

  - Specified by: 
  - <code>getParse</code> in interface <code>Parser</code> 
  - Parameters:
  - <code>content</code> - Content to be parsed 
  - Returns:
  - a map containing <key, parse> pairs
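
The meta-redirect rule described above can be sketched as a lookup keyed by the URL being fetched. This is a minimal illustration, not Nutch code: `MetaRedirectSketch`, `ParseStub`, and `redirectToFollow` are hypothetical names, and a plain `java.util.Map` stands in for `ParseResult` so that the example runs without Nutch on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

public class MetaRedirectSketch {

    /** Hypothetical stand-in for a Parse and its status; not the Nutch Parse class. */
    static final class ParseStub {
        final boolean redirect;   // true if the status indicates a meta redirect
        final String target;      // redirect target, meaningful only when redirect is true

        ParseStub(boolean redirect, String target) {
            this.redirect = redirect;
            this.target = target;
        }
    }

    /**
     * Returns the redirect target the fetcher should follow, or null if none.
     * A redirect is followed only when the entry keyed by the URL currently
     * being fetched carries it; redirects stored under other keys are ignored.
     */
    static String redirectToFollow(String fetchedUrl, Map<String, ParseStub> parses) {
        ParseStub parse = parses.get(fetchedUrl);
        if (parse != null && parse.redirect) {
            return parse.target;
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, ParseStub> parses = new HashMap<>();
        // Entry keyed by the original URL: its redirect may be followed.
        parses.put("foo.bar.com/redirect.html",
                new ParseStub(true, "foo.bar.com/target.html"));
        // Entry keyed by a different URL: its redirect must not be followed
        // while the fetcher is processing foo.bar.com/page.html.
        parses.put("foo.bar.com/embedded.html",
                new ParseStub(true, "foo.bar.com/elsewhere.html"));

        System.out.println(redirectToFollow("foo.bar.com/redirect.html", parses));
        System.out.println(redirectToFollow("foo.bar.com/page.html", parses));
    }
}
```

With the real API, the same check would presumably look up the original URL in the returned `ParseResult` and inspect the `ParseStatus` of that entry; the exact calls are left out here because this page does not document them.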

main

public static void main(String[] args)
                 throws Exception
  - Throws: 
  - <code>Exception</code>       

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
  - Specified by: 
  - <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>        

getConf

public org.apache.hadoop.conf.Configuration getConf()
  - Specified by: 
  - <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>       

Copyright © 2014 The Apache Software Foundation