org.apache.nutch.parse.js

Class JSParseFilter


public class JSParseFilter
extends Object
implements HtmlParseFilter, Parser

This class is a heuristic link extractor for JavaScript files and code snippets. The general idea of two-pass regex matching comes from Heritrix. Parts of the code come from OutlinkExtractor.java.
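
To illustrate what a two-pass regex heuristic can look like, here is a minimal sketch. It is not the code of JSParseFilter, Heritrix, or OutlinkExtractor; the class name `TwoPassLinkSketch`, the method `extractLinks`, and both patterns are assumptions chosen for illustration. The first pass collects quoted string literals, and the second pass keeps only those that look like a URL or path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the class name and both patterns are assumptions, not Nutch code.
class TwoPassLinkSketch {

    // Pass 1: quoted string literals with no whitespace or quotes inside.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("([\"'])([^\\s\"']+?)\\1");

    // Pass 2: a literal is treated as URL-like if it contains a '/' or '.'
    // between other non-space characters. This is deliberately loose, so
    // false positives such as "1.5" are expected.
    private static final Pattern URL_LIKE =
            Pattern.compile("/?\\S+?[/.]\\S+");

    static List<String> extractLinks(String javascript) {
        List<String> links = new ArrayList<>();
        Matcher literals = STRING_LITERAL.matcher(javascript);
        while (literals.find()) {
            String candidate = literals.group(2); // text between the quotes
            if (URL_LIKE.matcher(candidate).matches()) {
                links.add(candidate);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String js = "var a = \"http://example.com/a.js\"; var b = 'hello';"
                + " window.open('/docs/index.html');";
        System.out.println(extractLinks(js));
    }
}
```

A real extractor would typically also resolve relative candidates against the page's base URL and discard values that fail to parse as URLs. Both steps are omitted here to keep the two passes visible.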

Field Summary

Fields

| Modifier and Type | Field and Description |
|---|---|
| static org.slf4j.Logger | LOG |


Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter

X_POINT_ID


Fields inherited from interface org.apache.nutch.parse.Parser

X_POINT_ID

Constructor Summary

Constructors

| Constructor and Description |
|---|
| JSParseFilter() |

Method Summary

Methods

| Modifier and Type | Method and Description |
|---|---|
| ParseResult | filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page. |
| org.apache.hadoop.conf.Configuration | getConf() |
| ParseResult | getParse(Content c) This method parses the given content and returns a map of <key, parse> pairs. |
| static void | main(String[] args) |
| void | setConf(org.apache.hadoop.conf.Configuration conf) |


Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail


LOG

public static final org.slf4j.Logger LOG

Constructor Detail


JSParseFilter

public JSParseFilter()

Method Detail


filter

public ParseResult filter(Content content,
                 ParseResult parseResult,
                 HTMLMetaTags metaTags,
                 DocumentFragment doc)

Description copied from interface: HtmlParseFilter

Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.

  - Specified by: 
  - <code>filter</code> in interface <code>HtmlParseFilter</code>        

getParse

public ParseResult getParse(Content c)

Description copied from interface: Parser

This method parses the given content and returns a map of <key, parse> pairs. Parse instances will be persisted under the given key.

Note: Meta-redirects should be followed only when they are coming from the original URL. That is:

Assume the fetcher is in parsing mode and is currently processing foo.bar.com/redirect.html. If this URL contains a meta-redirect to another URL, the fetcher should follow the redirect only if the map contains an entry of the form <"foo.bar.com/redirect.html", Parse with a ParseStatus indicating the redirect>.
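
The rule above can be sketched with plain Java collections. This is an illustration under stated assumptions, not Nutch's fetcher logic: `RedirectRuleSketch`, `ParseEntry`, and `redirectToFollow` are hypothetical names, and `ParseEntry` is a simplified stand-in for a Parse together with its ParseStatus.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: none of these types are Nutch classes.
class RedirectRuleSketch {

    // Hypothetical stand-in for a Parse plus its ParseStatus.
    static final class ParseEntry {
        final boolean isRedirect;
        final String redirectTarget; // null when isRedirect is false

        ParseEntry(boolean isRedirect, String redirectTarget) {
            this.isRedirect = isRedirect;
            this.redirectTarget = redirectTarget;
        }
    }

    // Returns the redirect target to follow, or null if no redirect should
    // be followed. Only the entry keyed by the URL being fetched counts;
    // redirect entries stored under any other key are ignored.
    static String redirectToFollow(String fetchedUrl, Map<String, ParseEntry> parses) {
        ParseEntry entry = parses.get(fetchedUrl);
        return (entry != null && entry.isRedirect) ? entry.redirectTarget : null;
    }

    public static void main(String[] args) {
        Map<String, ParseEntry> parses = new HashMap<>();
        parses.put("foo.bar.com/redirect.html",
                new ParseEntry(true, "foo.bar.com/target.html"));
        System.out.println(redirectToFollow("foo.bar.com/redirect.html", parses));
    }
}
```

Keying the lookup on the fetched URL is what enforces the rule: a redirect status attached to any other key in the map never reaches the caller.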

  - Specified by: 
  - <code>getParse</code> in interface <code>Parser</code> 
  - Parameters:
  - <code>c</code> - Content to be parsed 
  - Returns:
  - a map containing <key, parse> pairs

main

public static void main(String[] args)
                 throws Exception
  - Throws: 
  - <code>Exception</code>       

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
  - Specified by: 
  - <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>        

getConf

public org.apache.hadoop.conf.Configuration getConf()
  - Specified by: 
  - <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>       

Copyright © 2014 The Apache Software Foundation