org.apache.nutch.parse.tika
Class DOMContentUtils
- java.lang.Object
- org.apache.nutch.parse.tika.DOMContentUtils
public class DOMContentUtils extends Object
A collection of methods for extracting content from DOM trees. This class holds a few utility methods for pulling content out of DOM nodes, such as getOutlinks, getText, etc.
Constructor Summary
DOMContentUtils(org.apache.hadoop.conf.Configuration conf)
Method Summary
void getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node)
Finds all anchors below the supplied DOM node, creates an appropriate Outlink record for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
void getText(StringBuffer sb, Node node)
A convenience method, equivalent to getText(sb, node, false).
boolean getTitle(StringBuffer sb, Node node)
Takes a StringBuffer and a DOM Node, and appends the content text found beneath the first title node to the StringBuffer.
void setConf(org.apache.hadoop.conf.Configuration conf)
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
-
DOMContentUtils
public DOMContentUtils(org.apache.hadoop.conf.Configuration conf)
Method Detail
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
-
getText
public void getText(StringBuffer sb, Node node)
This is a convenience method, equivalent to getText(sb, node, false).
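The delegation described above can be illustrated with a minimal sketch that uses only the JDK's DOM API. It is not the Nutch implementation: the class TextSketch, the helper parse, and the method name appendText are hypothetical, and the meaning given to the boolean flag (stop at the first nested anchor) is an assumption, since this page does not say what the third argument controls.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextSketch {

    /** Parses well-formed markup into a DOM tree; checked exceptions are wrapped for brevity. */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xml, e);
        }
    }

    /** Convenience overload: delegates to the three-argument form with the flag off. */
    static void appendText(StringBuffer sb, Node node) {
        appendText(sb, node, false);
    }

    /**
     * Appends the trimmed text found below node, separated by single spaces.
     * ASSUMPTION: the flag aborts the walk at the first nested anchor element.
     * Returns true if the walk was aborted.
     */
    static boolean appendText(StringBuffer sb, Node node, boolean abortOnNestedAnchors) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                String text = c.getNodeValue().trim();
                if (!text.isEmpty()) {
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append(text);
                }
            } else if (c.getNodeType() == Node.ELEMENT_NODE) {
                if (abortOnNestedAnchors && "a".equalsIgnoreCase(c.getNodeName())) {
                    return true; // stop at the first nested anchor
                }
                if (appendText(sb, c, abortOnNestedAnchors)) {
                    return true; // propagate the abort upwards
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer();
        appendText(sb, parse("<p>Hello <b>big</b> <a href='x'>world</a></p>"));
        System.out.println(sb);
    }
}
```

The two-argument overload keeps the common call site short, while callers that need the extra behaviour can still reach the three-argument form.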
-
getTitle
public boolean getTitle(StringBuffer sb, Node node)
This method takes a StringBuffer and a DOM Node, and appends the content text found beneath the first title node to the StringBuffer.
- Returns:
- true if a title node was found, false otherwise
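The contract above (append the text of the first title node, report whether one was found) can be sketched with the JDK's DOM API alone. This is an illustration under stated assumptions, not the Nutch code: TitleSketch, parse, and appendTitle are hypothetical names, and the plain depth-first search and the trimming of the title text are simplifications chosen for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TitleSketch {

    /** Parses well-formed markup into a DOM tree; checked exceptions are wrapped for brevity. */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xml, e);
        }
    }

    /**
     * Depth-first search for the first element named "title" at or below node.
     * Appends its trimmed text to sb and returns true, or returns false
     * (leaving sb untouched) when no title element exists.
     */
    static boolean appendTitle(StringBuffer sb, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            sb.append(node.getTextContent().trim());
            return true;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (appendTitle(sb, c)) {
                return true; // first match wins; later title nodes are ignored
            }
        }
        return false;
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer();
        boolean found = appendTitle(sb,
                parse("<html><head><title>Example page</title></head><body/></html>"));
        System.out.println(found + ": " + sb);
    }
}
```

Returning a boolean lets the caller tell "no title node" apart from "title node with empty text", which an empty StringBuffer alone could not convey.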
-
getOutlinks
public void getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node)
This method finds all anchors below the supplied DOM node, creates an appropriate Outlink record for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
Links without inner structure (tags, text, etc.) are discarded, as are links that contain only a single nested link and empty text nodes (a common DOM-fixup artifact, at least with NekoHTML).
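The collection and discard rules can be sketched as follows, again with only the JDK. Everything here is an assumption made for illustration: OutlinkSketch, its Link class (a stand-in for Outlink), parse, shouldDiscard, and collect are hypothetical names, java.net.URI is used in place of URL for resolution, and the discard check is one simplified reading of the rule, not the Nutch logic.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class OutlinkSketch {

    /** Hypothetical stand-in for an Outlink record: target URL plus anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;
        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    /** Parses well-formed markup into a DOM tree; checked exceptions are wrapped for brevity. */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xml, e);
        }
    }

    /** True for anchors holding nothing but empty text nodes and at most one nested anchor. */
    static boolean shouldDiscard(Node anchor) {
        int nestedAnchors = 0;
        for (Node c = anchor.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE && c.getNodeValue().trim().isEmpty()) {
                continue; // empty text node
            }
            if (c.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(c.getNodeName())) {
                nestedAnchors++;
                continue;
            }
            return false; // real inner structure: keep the link
        }
        return nestedAnchors <= 1;
    }

    /** Walks the tree below node and adds one Link per retained anchor, resolved against base. */
    static void collect(URI base, List<Link> links, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty() && !shouldDiscard(node)) {
                try {
                    links.add(new Link(base.resolve(href).toString(),
                            node.getTextContent().trim()));
                } catch (IllegalArgumentException e) {
                    // href is not a valid URI reference: skip this anchor
                }
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(base, links, c);
        }
    }

    public static void main(String[] args) {
        List<Link> links = new ArrayList<>();
        collect(URI.create("http://example.com/docs/index.html"), links,
                parse("<div><a href='a.html'>First</a><a href='/b'></a></div>"));
        for (Link l : links) {
            System.out.println(l.toUrl + " [" + l.anchor + "]");
        }
    }
}
```

Note that a discarded wrapper anchor does not hide its contents: the walk still descends into it, so a nested link with real text is collected on its own.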