Class OutlinkExtractor

org.apache.nutch.parse

java.lang.Object
- org.apache.nutch.parse.OutlinkExtractor

public class OutlinkExtractor
extends Object

Extracts outlinks (URLs) from plain text using regular expressions.

Since:
0.7
Version:
1.0
Author:
Stephan Strittmatter - http://www.sybit.de
See Also:
Comparison of different regexp-Implementations, Overview about Java Regexp APIs
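
To make the idea of regex-based URL extraction concrete, here is a minimal sketch using only <code>java.util.regex</code>. It is not the Nutch implementation: the class name <code>UrlRegexSketch</code>, the method <code>extractUrls</code>, and the pattern itself are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; this is not the Nutch implementation. */
final class UrlRegexSketch {

    // Assumed pattern: an http, https, or ftp scheme followed by "://" and a
    // run of characters that are not whitespace, angle brackets, or quotes.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "\\b(?:https?|ftp)://[^\\s<>\"']+", Pattern.CASE_INSENSITIVE);

    /** Returns every substring of plainText that matches URL_PATTERN, in order. */
    static List<String> extractUrls(String plainText) {
        List<String> urls = new ArrayList<>();
        if (plainText == null) {
            return urls;
        }
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See http://nutch.apache.org/ and ftp://example.org/file.txt for details";
        for (String url : extractUrls(text)) {
            System.out.println(url);
        }
    }
}
```

A pattern this simple also captures trailing punctuation that directly follows a URL, so a real extractor would likely need stricter rules.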

Constructor Summary

| Constructor and Description |
| --- |
| <code>OutlinkExtractor()</code> |

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| <code>static Outlink[]</code> | <code>getOutlinks(String plainText, org.apache.hadoop.conf.Configuration conf)</code><br>Extracts outlinks from the given plain text. |
| <code>static Outlink[]</code> | <code>getOutlinks(String plainText, String anchor, org.apache.hadoop.conf.Configuration conf)</code><br>Extracts outlinks from the given plain text and adds the anchor to the extracted outlinks. |

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

OutlinkExtractor

public OutlinkExtractor()

Method Detail

getOutlinks

public static Outlink[] getOutlinks(String plainText,
                    org.apache.hadoop.conf.Configuration conf)

Extracts outlinks from the given plain text. Applying this method to input that is not plain text can result in extremely long runtimes in pathological cases (PostScript is a known example).

  - Parameters:
  - <code>plainText</code> - the plain text from which URLs should be extracted.
  - <code>conf</code> - the Hadoop configuration to use.
  - Returns:
  - Array of <code>Outlink</code>s found in <code>plainText</code>.
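
The runtime warning above is typical of backtracking regex engines. One general way to bound such matching, sketched here under stated assumptions, is to wrap the input in a <code>CharSequence</code> that checks a deadline on every character read. Nothing in this sketch is part of Nutch: <code>BoundedRegexSketch</code>, <code>DeadlineCharSequence</code>, <code>RegexTimeoutException</code>, and <code>findAll</code> are hypothetical names, and whether <code>OutlinkExtractor</code> applies any such guard is not stated on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; the names here are hypothetical, not Nutch API. */
final class BoundedRegexSketch {

    /** Thrown when matching runs past its time budget. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException() {
            super("regex matching exceeded its time budget");
        }
    }

    /** CharSequence view that checks a deadline on every character read. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads input through charAt, so this check
            // interrupts runaway backtracking once the deadline has passed.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException();
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Collects all matches, stopping early if the time budget runs out. */
    static List<String> findAll(Pattern pattern, String text, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        Matcher matcher = pattern.matcher(new DeadlineCharSequence(text, deadline));
        List<String> matches = new ArrayList<>();
        try {
            while (matcher.find()) {
                matches.add(matcher.group());
            }
        } catch (RegexTimeoutException e) {
            // Budget exhausted: keep the matches collected so far.
        }
        return matches;
    }
}
```

The deadline check costs one clock read per character access, so this trades some throughput on well-behaved input for a bounded worst case.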

getOutlinks

public static Outlink[] getOutlinks(String plainText,
                    String anchor,
                    org.apache.hadoop.conf.Configuration conf)

Extracts outlinks from the given plain text and adds the given anchor to each extracted outlink.

  - Parameters:
  - <code>plainText</code> - the plain text from which URLs should be extracted.
  - <code>anchor</code> - the anchor to set on each extracted outlink.
  - <code>conf</code> - the Hadoop configuration to use.
  - Returns:
  - Array of <code>Outlink</code>s found in <code>plainText</code>.
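
The relationship between the two overloads can be sketched with a self-contained stand-in. The <code>Link</code> class below is a hypothetical substitute for <code>Outlink</code>, the URL pattern is an assumption, and having the shorter overload delegate with an empty anchor is also an assumption, since this page does not say how the actual methods relate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; Link is a hypothetical stand-in for Outlink. */
final class AnchoredLinkSketch {

    /** Minimal pair of target URL and anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    // Assumed, deliberately simple URL pattern.
    private static final Pattern URL_PATTERN =
            Pattern.compile("\\bhttps?://[^\\s<>\"']+", Pattern.CASE_INSENSITIVE);

    /** Overload without an anchor: delegates with an empty anchor (an assumption). */
    static Link[] extractLinks(String plainText) {
        return extractLinks(plainText, "");
    }

    /** Pairs every URL found in plainText with the same anchor. */
    static Link[] extractLinks(String plainText, String anchor) {
        List<Link> links = new ArrayList<>();
        if (plainText != null) {
            Matcher matcher = URL_PATTERN.matcher(plainText);
            while (matcher.find()) {
                links.add(new Link(matcher.group(), anchor));
            }
        }
        return links.toArray(new Link[0]);
    }
}
```

In this sketch every link extracted from one text shares a single anchor, which mirrors the method description above: the anchor is supplied by the caller, not derived from the text.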
