org.apache.nutch.parse.html
Class HtmlParser
- java.lang.Object
- org.apache.nutch.parse.html.HtmlParser
public class HtmlParser extends Object implements Parser
Field Summary
Fields
Modifier and Type          Field and Description
static org.slf4j.Logger    LOG
-
Fields inherited from interface org.apache.nutch.parse.Parser
X_POINT_ID
Constructor Summary
Constructors
Constructor and Description
HtmlParser()
Method Summary
Methods
Modifier and Type                       Method and Description
org.apache.hadoop.conf.Configuration    getConf()
ParseResult                             getParse(Content content)
                                        This method parses the given content and returns a map of <key, parse> pairs.
static void                             main(String[] args)
void                                    setConf(org.apache.hadoop.conf.Configuration conf)
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
Constructor Detail
-
HtmlParser
public HtmlParser()
Method Detail
-
getParse
public ParseResult getParse(Content content)
Description copied from interface: Parser
This method parses the given content and returns a map of <key, parse> pairs. Parse instances will be persisted under the given key.
Note: Meta-redirects should be followed only when they come from the original URL. For example, assume the fetcher is in parsing mode and is currently processing foo.bar.com/redirect.html. If this URL contains a meta redirect to another URL, the fetcher should follow the redirect only if the map contains an entry of the form <"foo.bar.com/redirect.html", Parse with a ParseStatus indicating the redirect>.
- Specified by:
- <code>getParse</code> in interface <code>Parser</code>
- Parameters:
- <code>content</code> - Content to be parsed
- Returns:
- a map containing <key, parse> pairs
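The meta-redirect rule in the note above can be illustrated with a minimal, self-contained sketch. It deliberately does not use the Nutch classes: `RedirectRuleSketch`, `ParseEntry`, `Status`, `redirectToFollow`, and the target URLs are hypothetical stand-ins introduced only for this example, and a plain `Map<String, ParseEntry>` plays the role of the returned <key, parse> map. The only behaviour shown is the one described above: a redirect is followed only when the entry keyed by the URL currently being fetched carries it.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-ins only; these are NOT the Nutch Parse/ParseStatus/ParseResult classes.
final class RedirectRuleSketch {

    /** Hypothetical, simplified parse outcome. */
    enum Status { SUCCESS, REDIRECT }

    /** Hypothetical stand-in for a parse entry: a status plus an optional redirect target. */
    static final class ParseEntry {
        final Status status;
        final String redirectTarget; // null unless status is REDIRECT

        ParseEntry(Status status, String redirectTarget) {
            this.status = status;
            this.redirectTarget = redirectTarget;
        }
    }

    /**
     * Returns the redirect target to follow, or null if no redirect should be followed.
     * Only the entry stored under the URL being fetched is consulted; redirect
     * entries stored under any other key are ignored.
     */
    static String redirectToFollow(String fetchedUrl, Map<String, ParseEntry> parseMap) {
        ParseEntry entry = parseMap.get(fetchedUrl);
        if (entry == null || entry.status != Status.REDIRECT) {
            return null;
        }
        return entry.redirectTarget;
    }

    public static void main(String[] args) {
        Map<String, ParseEntry> parseMap = new HashMap<>();
        parseMap.put("foo.bar.com/redirect.html",
                new ParseEntry(Status.REDIRECT, "foo.bar.com/target.html"));

        // The entry is keyed by the URL being processed, so the redirect is followed.
        System.out.println(redirectToFollow("foo.bar.com/redirect.html", parseMap));

        // No entry exists under this URL, so the redirect is not followed.
        System.out.println(redirectToFollow("foo.bar.com/page.html", parseMap));
    }
}
```

In real code the same check would be made against the entries of the ParseResult and the ParseStatus of each Parse; the exact accessors are not listed on this page, so they are left out of the sketch.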
-
main
public static void main(String[] args) throws Exception
- Throws:
- <code>Exception</code>
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
- <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
-
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
- <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>