org.apache.nutch.parse.headings

Class HeadingsParseFilter


public class HeadingsParseFilter
extends Object
implements HtmlParseFilter

An HtmlParseFilter that retrieves h1 and h2 values from the DOM.

Field Summary

Fields

| Modifier and Type | Field and Description |
| --- | --- |
| protected static Pattern | whitespacePattern: Pattern used to strip surplus whitespace. |

Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter

X_POINT_ID

Constructor Summary

Constructors

| Constructor and Description |
| --- |
| HeadingsParseFilter() |

Method Summary

Methods

| Modifier and Type | Method and Description |
| --- | --- |
| ParseResult | filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc): Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page. |
| org.apache.hadoop.conf.Configuration | getConf() |
| protected List<String> | getElement(DocumentFragment doc, String element): Finds the specified element and returns its text values. |
| protected static String | getNodeValue(Node node): Returns the text value of the specified Node and its child nodes. |
| void | setConf(org.apache.hadoop.conf.Configuration conf) |

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

whitespacePattern

protected static Pattern whitespacePattern

Pattern used to strip surplus whitespace.
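
As a rough illustration of how a pattern like this might be applied to heading text, the sketch below collapses runs of whitespace into single spaces. The regular expression `\s+`, the class name `WhitespaceSketch`, and the `collapse` helper are assumptions made for the example; this page does not show the actual value of `whitespacePattern`.

```java
import java.util.regex.Pattern;

public class WhitespaceSketch {

    // Assumption: one or more whitespace characters. The real pattern
    // value is not documented on this page.
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces each run of whitespace with a single space and trims the ends.
    public static String collapse(String text) {
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints "Breaking News"
        System.out.println(collapse("  Breaking \n\t News  "));
    }
}
```

Compiling the pattern once into a static field avoids recompiling the expression for every heading that is processed.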

Constructor Detail

HeadingsParseFilter

public HeadingsParseFilter()

Method Detail

filter

public ParseResult filter(Content content,
                 ParseResult parseResult,
                 HTMLMetaTags metaTags,
                 DocumentFragment doc)

Description copied from interface: HtmlParseFilter

Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.

  - Specified by: <code>filter</code> in interface <code>HtmlParseFilter</code>

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
  - Specified by: <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>

getConf

public org.apache.hadoop.conf.Configuration getConf()
  - Specified by: <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>

getElement

protected List<String> getElement(DocumentFragment doc,
                      String element)

Finds the specified element and returns its text values.
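
The lookup described here could be sketched with the standard `org.w3c.dom` API alone, as below. This is not the Nutch implementation: the class `HeadingLookupSketch`, the helper names, the case-insensitive tag match, and the sample fragment are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class HeadingLookupSketch {

    // Collects the trimmed text of every element whose tag name matches.
    public static List<String> findElementText(DocumentFragment doc, String element) {
        List<String> values = new ArrayList<>();
        walk(doc, element, values);
        return values;
    }

    // Depth-first walk; tag names are compared case-insensitively (an assumption).
    private static void walk(Node node, String element, List<String> values) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && element.equalsIgnoreCase(node.getNodeName())) {
            values.add(node.getTextContent().trim());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), element, values);
        }
    }

    // Builds a small hypothetical fragment with one h1 and two h2 headings.
    public static DocumentFragment sampleFragment() {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            DocumentFragment fragment = dom.createDocumentFragment();
            fragment.appendChild(heading(dom, "h1", "Main title"));
            fragment.appendChild(heading(dom, "h2", "First section"));
            fragment.appendChild(heading(dom, "h2", "Second section"));
            return fragment;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    private static Element heading(Document dom, String tag, String text) {
        Element e = dom.createElement(tag);
        e.setTextContent(text);
        return e;
    }

    public static void main(String[] args) {
        // prints [First section, Second section]
        System.out.println(findElementText(sampleFragment(), "h2"));
    }
}
```

A recursive walk is used because `DocumentFragment` does not offer `getElementsByTagName`, which is only available on `Document` and `Element`.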

getNodeValue

protected static String getNodeValue(Node node)

Returns the text value of the specified Node and its child nodes.
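
One way to gather text from a node and its descendants is a recursive walk that appends every text node it meets. The sketch below assumes that approach; `NodeTextSketch`, `textOf`, and `parseRoot` are hypothetical names, and the actual method may differ, for example in how it handles whitespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeTextSketch {

    // Concatenates the text of the node and all of its descendants, in document order.
    public static String textOf(Node node) {
        StringBuilder buffer = new StringBuilder();
        append(node, buffer);
        return buffer.toString();
    }

    private static void append(Node node, StringBuilder buffer) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            buffer.append(node.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            append(children.item(i), buffer);
        }
    }

    // Helper for the example only: parses well-formed markup and returns its root element.
    public static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints "Nutch parse filters"
        System.out.println(textOf(parseRoot("<h1>Nutch <em>parse</em> filters</h1>")));
    }
}
```

Walking child nodes matters for headings with inline markup: the text of `<em>parse</em>` lives in a nested text node, so reading only the direct value of the `h1` element would miss it.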

Copyright © 2014 The Apache Software Foundation