org.apache.nutch.parse.headings
Class HeadingsParseFilter
- java.lang.Object
- org.apache.nutch.parse.headings.HeadingsParseFilter
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, HtmlParseFilter, Pluggable
public class HeadingsParseFilter extends Object implements HtmlParseFilter
HtmlParseFilter to retrieve h1 and h2 values from the DOM.
Field Summary
Fields (Modifier and Type, Field and Description):
protected static Pattern
whitespacePattern
Pattern used to strip surplus whitespace
-
Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter
X_POINT_ID
Constructor Summary
Constructors (Constructor and Description):
HeadingsParseFilter()
Method Summary
Methods (Modifier and Type, Method and Description):
ParseResult
filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
org.apache.hadoop.conf.Configuration
getConf()
protected List<String>
getElement(DocumentFragment doc,
String element)
Finds elements with the specified name in the DOM and returns their text values
protected static String
getNodeValue(Node node)
Returns the text value of the specified Node and child nodes
void
setConf(org.apache.hadoop.conf.Configuration conf)
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
whitespacePattern
protected static Pattern whitespacePattern
Pattern used to strip surplus whitespace
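This page does not show the regular expression behind the field. The following is a minimal sketch of how such a pattern is typically applied, assuming it matches runs of whitespace (`\s+`); the class name `WhitespaceSketch` and the method `normalize` are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

class WhitespaceSketch {
    // Assumption: one or more whitespace characters. The actual regex used by
    // HeadingsParseFilter.whitespacePattern is not given on this page.
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Collapses every whitespace run to a single space and trims both ends.
    static String normalize(String raw) {
        return WHITESPACE.matcher(raw).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Apache \n\t Nutch   headings "));
    }
}
```

Compiling the pattern once in a static field avoids recompiling the regex for every heading that is processed.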
Constructor Detail
-
HeadingsParseFilter
public HeadingsParseFilter()
Method Detail
-
filter
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Description copied from interface: HtmlParseFilter
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
- Specified by:
- filter in interface HtmlParseFilter
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
- setConf in interface org.apache.hadoop.conf.Configurable
-
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
- getConf in interface org.apache.hadoop.conf.Configurable
-
getElement
protected List<String> getElement(DocumentFragment doc, String element)
Finds elements with the specified name in the DOM and returns their text values
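To illustrate the kind of lookup this method describes, here is a sketch that walks a DocumentFragment with the standard JDK DOM API and collects the text of every element with a given name. It is not the Nutch implementation: the class `ElementFinderSketch`, the methods `findElementText` and `sampleFragment`, the case-insensitive name match, and the choice to return every match are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class ElementFinderSketch {

    // Collects the trimmed text of every element whose name matches, in document order.
    static List<String> findElementText(Node root, String elementName) {
        List<String> values = new ArrayList<>();
        collect(root, elementName, values);
        return values;
    }

    // Depth-first walk. Assumption: names are compared case-insensitively.
    private static void collect(Node node, String elementName, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && elementName.equalsIgnoreCase(node.getNodeName())) {
            out.add(node.getTextContent().trim());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, elementName, out);
        }
    }

    // Builds a small fragment by hand so the example needs no HTML parser.
    static DocumentFragment sampleFragment() {
        try {
            Document document =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            DocumentFragment fragment = document.createDocumentFragment();
            Element body = document.createElement("body");
            fragment.appendChild(body);
            String[][] entries = {{"h1", "Title"}, {"p", "Text"}, {"h2", "First"}, {"h2", "Second"}};
            for (String[] entry : entries) {
                Element element = document.createElement(entry[0]);
                element.setTextContent(entry[1]);
                body.appendChild(element);
            }
            return fragment;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DocumentFragment fragment = sampleFragment();
        System.out.println(findElementText(fragment, "h1"));
        System.out.println(findElementText(fragment, "h2"));
    }
}
```

Returning a list rather than a single string matches the declared return type List<String>, which leaves room for pages that contain several headings of the same level.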
-
getNodeValue
protected static String getNodeValue(Node node)
Returns the text value of the specified Node and child nodes
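As a rough illustration of gathering text from a node and its children, the sketch below concatenates the values of all descendant text nodes and then collapses whitespace. It uses only the JDK DOM API; the class `NodeValueSketch`, the methods `nodeText` and `sampleHeading`, and the `\s+` pattern are assumptions for the example, not the actual Nutch code.

```java
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodeValueSketch {
    // Assumption: surplus whitespace means any run of whitespace characters.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Returns the text of the node and all of its descendants, whitespace-normalized.
    static String nodeText(Node node) {
        StringBuilder buffer = new StringBuilder();
        appendText(node, buffer);
        return WHITESPACE.matcher(buffer).replaceAll(" ").trim();
    }

    // Appends text node values in document order; other node types only recurse.
    private static void appendText(Node node, StringBuilder buffer) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            buffer.append(node.getNodeValue());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            appendText(child, buffer);
        }
    }

    // Builds <h1>  Release\n  notes for <em>Nutch</em></h1> by hand.
    static Element sampleHeading() {
        try {
            Document document =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element heading = document.createElement("h1");
            heading.appendChild(document.createTextNode("  Release\n  notes for "));
            Element emphasis = document.createElement("em");
            emphasis.appendChild(document.createTextNode("Nutch"));
            heading.appendChild(emphasis);
            return heading;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(nodeText(sampleHeading()));
    }
}
```

Recursing into child nodes matters for headings that contain inline markup such as em or a elements, since the visible text is then spread over several text nodes.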