org.apache.nutch.parse
Class OutlinkExtractor
- java.lang.Object
- org.apache.nutch.parse.OutlinkExtractor
public class OutlinkExtractor extends Object
Extracts Outlinks (URLs) from plain text using regular expressions.
- Since:
- 0.7
- Version:
- 1.0
- Author:
- Stephan Strittmatter - http://www.sybit.de
- See Also:
- Comparison of different regexp-Implementations, Overview about Java Regexp APIs
Constructor Summary
Constructors
OutlinkExtractor()
Method Summary
Methods
static Outlink[] getOutlinks(String plainText, org.apache.hadoop.conf.Configuration conf)
Extracts Outlinks from the given plain text.
static Outlink[] getOutlinks(String plainText, String anchor, org.apache.hadoop.conf.Configuration conf)
Extracts Outlinks from the given plain text and adds the anchor to each extracted Outlink.
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
-
OutlinkExtractor
public OutlinkExtractor()
Method Detail
-
getOutlinks
public static Outlink[] getOutlinks(String plainText, org.apache.hadoop.conf.Configuration conf)
Extracts Outlinks from the given plain text. Applying this method to input that is not plain text can result in extremely long runtimes in pathological cases (PostScript is a known example).
- Parameters:
- plainText - the plain text from which URLs should be extracted.
- Returns:
- Array of Outlinks found in plainText
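The general technique described above, scanning plain text with a regular expression and collecting each match as a link, can be illustrated with a small standalone sketch. This is not the Nutch implementation: the class name PlainTextLinkSketch, the method extractUrls, and the pattern itself are assumptions made for illustration, and only java.util.regex is used. The pattern is deliberately a single negated character class with no nested quantifiers, which keeps matching time roughly linear in the input length.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the pattern or logic used by OutlinkExtractor.
final class PlainTextLinkSketch {

    // Assumed pattern: a scheme, "://", then a run of characters that are
    // neither whitespace nor common delimiters around URLs in running text.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "\\b(?:https?|ftp)://[^\\s<>\"'()\\[\\]]+",
        Pattern.CASE_INSENSITIVE);

    // Returns the distinct URLs found in plainText, in order of first appearance.
    static List<String> extractUrls(String plainText) {
        Set<String> found = new LinkedHashSet<>();
        if (plainText == null) {
            return new ArrayList<>();
        }
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            // Drop sentence punctuation that directly follows a URL.
            String url = matcher.group().replaceAll("[.,;:!?]+$", "");
            found.add(url);
        }
        return new ArrayList<>(found);
    }
}
```

Calling PlainTextLinkSketch.extractUrls on a sentence that mentions the same URL twice returns it once, because the sketch collects matches in a LinkedHashSet.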
-
getOutlinks
public static Outlink[] getOutlinks(String plainText, String anchor, org.apache.hadoop.conf.Configuration conf)
Extracts Outlinks from the given plain text and adds the anchor to each extracted Outlink.
- Parameters:
- plainText - the plain text from which URLs should be extracted.
- anchor - the anchor of the URL
- Returns:
- Array of Outlinks found in plainText
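The anchor variant can be pictured as the same scan with one anchor string attached to every link that is found. The following standalone sketch shows that idea under stated assumptions: the classes AnchoredLinkSketch and Link, the method extract, the pattern, and the choice to substitute an empty string for a null anchor are all hypothetical and are not taken from the Nutch source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; Link stands in for an outlink-like value type.
final class AnchoredLinkSketch {

    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    // Assumed pattern: http or https URLs up to the next whitespace or quote.
    private static final Pattern URL_PATTERN =
        Pattern.compile("\\bhttps?://[^\\s<>\"']+");

    // Pairs every URL found in plainText with the same anchor text.
    static Link[] extract(String plainText, String anchor) {
        List<Link> links = new ArrayList<>();
        String safeAnchor = (anchor == null) ? "" : anchor; // assumption: null becomes ""
        if (plainText != null) {
            Matcher matcher = URL_PATTERN.matcher(plainText);
            while (matcher.find()) {
                links.add(new Link(matcher.group(), safeAnchor));
            }
        }
        return links.toArray(new Link[0]);
    }
}
```

In this sketch, text containing no URLs yields an empty array rather than null, so callers can iterate over the result without a null check.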