org.apache.nutch.parse.html
Class DOMContentUtils
- java.lang.Object
- org.apache.nutch.parse.html.DOMContentUtils
public class DOMContentUtils extends Object
A collection of methods for extracting content from DOM trees. This class holds a few utility methods for pulling content out of DOM nodes, such as getOutlinks, getText, etc.
Nested Class Summary
Modifier and Type, Class and Description
static class
DOMContentUtils.LinkParams
Constructor Summary
Constructor and Description
DOMContentUtils(org.apache.hadoop.conf.Configuration conf)
Method Summary
Modifier and Type, Method and Description
URL
getBase(Node node)
If the node contains a BASE tag, its HREF is returned.
void
getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node)
This method finds all anchors below the supplied DOM node, creates an appropriate Outlink record for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
void
getText(StringBuffer sb, Node node)
This is a convenience method, equivalent to getText(sb, node, false).
boolean
getText(StringBuffer sb, Node node, boolean abortOnNestedAnchors)
This method takes a StringBuffer and a DOM Node, and appends all the content text found beneath the DOM node to the StringBuffer.
boolean
getTitle(StringBuffer sb, Node node)
This method takes a StringBuffer and a DOM Node, and appends the content text found beneath the first title node to the StringBuffer.
void
setConf(org.apache.hadoop.conf.Configuration conf)
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
-
DOMContentUtils
public DOMContentUtils(org.apache.hadoop.conf.Configuration conf)
Method Detail
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
-
getText
public boolean getText(StringBuffer sb, Node node, boolean abortOnNestedAnchors)
This method takes a StringBuffer and a DOM Node, and appends all the content text found beneath the DOM node to the StringBuffer.
If abortOnNestedAnchors is true, DOM traversal is aborted when a nested anchor is found, and the StringBuffer will not contain any text encountered after that point.
- Returns:
- true if nested anchors were found
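The traversal and the effect of abortOnNestedAnchors can be illustrated with a minimal, JDK-only sketch. It is not Nutch's implementation: the class TextSketch, the helpers collectText and parse, and the whitespace handling are assumptions made for illustration, and the input is parsed as well-formed XML rather than real-world HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextSketch {
    /** Hypothetical stand-in for getText: appends text below node, reports nested anchors. */
    static boolean collectText(StringBuffer sb, Node node, boolean abortOnNestedAnchors) {
        return walk(sb, node, abortOnNestedAnchors, 0);
    }

    private static boolean walk(StringBuffer sb, Node node, boolean abort, int anchorDepth) {
        boolean nested = false;
        if (node.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(node.getNodeName())) {
            anchorDepth++;
            if (anchorDepth > 1) {          // an <a> inside another <a>
                nested = true;
                if (abort) return true;     // stop before appending anything further
            }
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(text);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (walk(sb, c, abort, anchorDepth)) {
                nested = true;
                if (abort) return true;     // propagate the abort up the tree
            }
        }
        return nested;
    }

    /** Parses well-formed XML into a DOM tree (assumption: a real crawler would use an HTML parser). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p>one <a href=\"x\">two <a href=\"y\">three</a> four</a> five</p>";
        StringBuffer all = new StringBuffer();
        StringBuffer cut = new StringBuffer();
        System.out.println(collectText(all, parse(xml), false) + " -> " + all);
        System.out.println(collectText(cut, parse(xml), true) + " -> " + cut);
    }
}
```

In this sketch both calls report a nested anchor, but only the second one stops collecting text at the inner anchor.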
-
getText
public void getText(StringBuffer sb, Node node)
This is a convenience method, equivalent to getText(sb, node, false).
-
getTitle
public boolean getTitle(StringBuffer sb, Node node)
This method takes a StringBuffer and a DOM Node, and appends the content text found beneath the first title node to the StringBuffer.
- Returns:
- true if a title node was found, false otherwise
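As a rough illustration of the "first title node" contract, the following JDK-only sketch searches the tree in document order and appends the text of the first title element it meets. The names TitleSketch, appendTitle, findFirst and parse are hypothetical, the trimming of the title text is an assumption, and none of this is taken from the Nutch source.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TitleSketch {
    /** Hypothetical stand-in for getTitle: true if a title element was found. */
    static boolean appendTitle(StringBuffer sb, Node root) {
        Node title = findFirst(root, "title");
        if (title == null) {
            return false;                       // nothing appended
        }
        sb.append(title.getTextContent().trim());
        return true;
    }

    /** Pre-order depth-first search for the first element with the given name. */
    static Node findFirst(Node node, String name) {
        if (node.getNodeType() == Node.ELEMENT_NODE && name.equalsIgnoreCase(node.getNodeName())) {
            return node;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            Node hit = findFirst(c, name);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    /** Parses well-formed XML into a DOM tree (assumption: real pages need an HTML parser). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer();
        boolean found = appendTitle(sb,
                parse("<html><head><title> Hello </title></head><body><p>x</p></body></html>"));
        System.out.println(found + " -> " + sb);
    }
}
```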
-
getBase
public URL getBase(Node node)
If the node contains a BASE tag, its HREF is returned.
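A minimal sketch of such a lookup, using only the JDK, might look like the following. The names BaseSketch, findBase and parse are hypothetical. Returning null when no usable BASE element exists is an assumption, since the description does not say what happens in that case.

```java
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class BaseSketch {
    /** Hypothetical stand-in for getBase: first BASE href that parses as an absolute URL, else null. */
    static URL findBase(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE && "base".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href"); // "" when the attribute is absent
            if (!href.isEmpty()) {
                try {
                    return new URI(href).toURL();
                } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
                    // Unusable href: keep searching the rest of the tree (assumption).
                }
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            URL found = findBase(c);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Parses well-formed XML into a DOM tree (assumption: real pages need an HTML parser). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><base href=\"http://example.com/dir/\"/></head><body/></html>");
        System.out.println(findBase(doc));
    }
}
```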
-
getOutlinks
public void getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node)
This method finds all anchors below the supplied DOM node, creates an appropriate Outlink record for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
Links without inner structure (tags, text, etc.) are discarded, as are links that contain only a single nested link and empty text nodes (this is a common DOM-fixup artifact, at least with NekoHTML).
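The collection and discard rules can be sketched with the JDK alone. This is a simplified reading of the rules above, not the Nutch code: the Link class is a hypothetical stand-in for Outlink, the names OutlinkSketch, collectLinks, shouldDiscard, url and parse are assumptions, and only a elements with an href attribute are considered.

```java
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class OutlinkSketch {
    /** Hypothetical stand-in for Nutch's Outlink: resolved target URL plus anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;
        Link(String toUrl, String anchor) { this.toUrl = toUrl; this.anchor = anchor; }
    }

    /** Walks the tree and appends one Link per anchor that survives the discard rules. */
    static void collectLinks(URL base, List<Link> out, Node node) {
        if (isAnchor(node) && !shouldDiscard(node)) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                try {
                    URL target = base.toURI().resolve(href).toURL(); // resolve relative to base
                    out.add(new Link(target.toString(), node.getTextContent().trim()));
                } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
                    // Unresolvable href: skip this link (assumption).
                }
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectLinks(base, out, c); // nested anchors are still visited
        }
    }

    static boolean isAnchor(Node n) {
        return n.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(n.getNodeName());
    }

    /** True for anchors with no children, or with exactly one nested anchor and only blank text. */
    static boolean shouldDiscard(Node anchor) {
        if (!anchor.hasChildNodes()) {
            return true;
        }
        int nestedAnchors = 0;
        for (Node c = anchor.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isAnchor(c)) {
                nestedAnchors++;
            } else if (!(c.getNodeType() == Node.TEXT_NODE && c.getNodeValue().trim().isEmpty())) {
                return false; // real content: keep the link
            }
        }
        return nestedAnchors == 1;
    }

    static URL url(String s) {
        try {
            return new URI(s).toURL();
        } catch (Exception e) {
            throw new IllegalArgumentException("bad URL: " + s, e);
        }
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        List<Link> links = new ArrayList<>();
        collectLinks(url("http://example.com/dir/page.html"), links, parse(
                "<div><a href=\"other.html\">Other</a><a href=\"/empty\"></a>"
                + "<a href=\"wrap\"> <a href=\"inner.html\">Inner</a></a></div>"));
        for (Link l : links) {
            System.out.println(l.toUrl + " [" + l.anchor + "]");
        }
    }
}
```

In this sketch the empty anchor and the wrapper anchor are dropped, while the anchor nested inside the wrapper is still collected because traversal continues below discarded nodes.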