org.apache.nutch.parse.js
Class JSParseFilter
- java.lang.Object
- org.apache.nutch.parse.js.JSParseFilter
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, HtmlParseFilter, Parser, Pluggable
public class JSParseFilter extends Object implements HtmlParseFilter, Parser
This class is a heuristic link extractor for JavaScript files and code snippets. The general idea of two-pass regex matching comes from Heritrix. Parts of the code come from OutlinkExtractor.java.
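As an illustration of the two-pass idea, the sketch below first collects quoted string literals from a script and then keeps only those that look like a path or URL. The class name `TwoPassLinkSketch`, the method `extractLinks`, and both regular expressions are assumptions made for this example; they are simplified and are not the patterns used by JSParseFilter or Heritrix.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of two-pass regex link extraction (not the Nutch code).
final class TwoPassLinkSketch {
    // Pass 1 (assumed pattern): a quoted literal with no whitespace or quotes inside.
    private static final Pattern STRING_LITERAL =
        Pattern.compile("([\"'])([^\\s\"']+?)\\1");
    // Pass 2 (assumed pattern): the literal must contain a slash or a dot
    // with non-space characters on both sides.
    private static final Pattern LIKELY_URI =
        Pattern.compile("^/?\\S+?[/.]\\S+$");

    static List<String> extractLinks(String script, String baseUrl) {
        List<String> links = new ArrayList<>();
        URI base;
        try {
            base = URI.create(baseUrl);
        } catch (IllegalArgumentException e) {
            return links; // unusable base URL: nothing can be resolved
        }
        Matcher literals = STRING_LITERAL.matcher(script);
        while (literals.find()) {
            String candidate = literals.group(2);
            if (!LIKELY_URI.matcher(candidate).matches()) {
                continue; // a plain string, not link-like
            }
            try {
                // Resolve relative candidates against the base URL.
                links.add(base.resolve(candidate).toString());
            } catch (IllegalArgumentException e) {
                // Heuristic match that is not a valid URI: skip it.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String script = "var a = \"/img/logo.png\"; var s = \"hello\";";
        System.out.println(extractLinks(script, "http://example.com/dir/page.html"));
    }
}
```

Because the second pass is only a shape check, a heuristic like this will report some strings that are not links (for example `"a.b"`), which is the trade-off implied by calling the extractor heuristic.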
Field Summary
Fields
- static org.slf4j.Logger LOG
-
Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter
X_POINT_ID
-
Fields inherited from interface org.apache.nutch.parse.Parser
X_POINT_ID
Constructor Summary
Constructors
- JSParseFilter()
Method Summary
Methods
- ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc): Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
- org.apache.hadoop.conf.Configuration getConf()
- ParseResult getParse(Content c): This method parses the given content and returns a map of <key, parse> pairs.
- static void main(String[] args)
- void setConf(org.apache.hadoop.conf.Configuration conf)
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
Constructor Detail
-
JSParseFilter
public JSParseFilter()
Method Detail
-
filter
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Description copied from interface: HtmlParseFilter
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
- Specified by:
- <code>filter</code> in interface <code>HtmlParseFilter</code>
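This page does not say which parts of the DOM tree the filter inspects. Purely as an illustration, the sketch below assumes that the "code snippets" of interest are the text of `script` elements and the values of attributes whose names start with `on`. The class `ScriptSnippetWalkSketch` and its methods are hypothetical, and the example parses well-formed XHTML with the JDK's XML parser rather than using any Nutch type.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch: collect JavaScript snippets from a DOM tree.
final class ScriptSnippetWalkSketch {

    // Depth-first walk that appends script bodies and on* handler values.
    static void collect(Node node, List<String> snippets) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            if ("script".equalsIgnoreCase(node.getNodeName())) {
                String text = node.getTextContent();
                if (text != null && !text.trim().isEmpty()) {
                    snippets.add(text.trim());
                }
                return; // script content is code, not markup to descend into
            }
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                String name = attr.getNodeName().toLowerCase(Locale.ROOT);
                if (name.startsWith("on")) { // assumed event-handler convention
                    snippets.add(attr.getNodeValue());
                }
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), snippets);
        }
    }

    // Convenience wrapper for the example: parse XHTML text, then walk it.
    static List<String> collectFromXhtml(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            List<String> snippets = new ArrayList<>();
            collect(doc, snippets);
            return snippets;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }
}
```

Each collected snippet could then be handed to a link-extraction heuristic; how the results are merged into the `ParseResult` is specific to the real implementation and is not shown here.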
-
getParse
public ParseResult getParse(Content c)
Description copied from interface: Parser
This method parses the given content and returns a map of <key, parse> pairs. Parse instances will be persisted under the given key.
Note: Meta-redirects should be followed only when they come from the original URL. That is:
Assume the fetcher is in parsing mode and is currently processing foo.bar.com/redirect.html. If this URL contains a meta redirect to another URL, the fetcher should follow the redirect only if the map contains an entry of the form <"foo.bar.com/redirect.html", Parse with a ParseStatus indicating the redirect>.
- Specified by:
- <code>getParse</code> in interface <code>Parser</code>
- Parameters:
- <code>c</code> - Content to be parsed
- Returns:
- a map containing <key, parse> pairs
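The meta-redirect rule in the note can be sketched with plain collections. In this minimal example, `RedirectRuleSketch`, `ParseEntry`, and `redirectToFollow` are hypothetical stand-ins invented for illustration; they are not the Nutch `Parse`, `ParseStatus`, or `ParseResult` types. The point is only that the lookup uses the URL being fetched as the key, so redirects recorded under any other key are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of "follow a meta redirect only for the original URL".
final class RedirectRuleSketch {

    // Stand-in for a parse whose status may indicate a redirect.
    static final class ParseEntry {
        final boolean redirect;      // true if the status indicates a redirect
        final String redirectTarget; // target URL, meaningful only if redirect

        ParseEntry(boolean redirect, String redirectTarget) {
            this.redirect = redirect;
            this.redirectTarget = redirectTarget;
        }
    }

    // Returns the target only if the entry keyed by the original URL is a redirect.
    static Optional<String> redirectToFollow(String originalUrl,
                                             Map<String, ParseEntry> parses) {
        ParseEntry entry = parses.get(originalUrl);
        if (entry != null && entry.redirect) {
            return Optional.of(entry.redirectTarget);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, ParseEntry> parses = new LinkedHashMap<>();
        parses.put("foo.bar.com/redirect.html",
                   new ParseEntry(true, "foo.bar.com/target.html"));
        System.out.println(redirectToFollow("foo.bar.com/redirect.html", parses));
    }
}
```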
-
main
public static void main(String[] args) throws Exception
- Throws:
- <code>Exception</code>
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
- <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
-
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
- <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>