
org.apache.nutch.parse

Class OutlinkExtractor


public class OutlinkExtractor
extends Object

Extracts Outlinks (URLs) from plain text using regular expressions.
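As a rough illustration of regular-expression based URL extraction from plain text, the following is a minimal sketch using only the JDK. The class `PlainTextUrlSketch`, the method `extractUrls`, and the pattern itself are hypothetical and chosen for readability; they are not the expression or the code that OutlinkExtractor uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of Nutch.
final class PlainTextUrlSketch {

    // Assumed, deliberately simple pattern: a known scheme, "://", then a run
    // of characters that are not whitespace, angle brackets, or double quotes.
    private static final Pattern URL_PATTERN =
            Pattern.compile("(?:https?|ftp)://[^\\s<>\"]+");

    // Returns every substring of plainText that matches URL_PATTERN, in order.
    static List<String> extractUrls(String plainText) {
        List<String> urls = new ArrayList<>();
        if (plainText == null) {
            return urls; // nothing to scan
        }
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See http://nutch.apache.org/ and ftp://example.org/file.txt for details";
        for (String url : extractUrls(text)) {
            System.out.println(url);
        }
    }
}
```

A pattern this simple will also capture trailing punctuation that directly follows a URL, so a real extractor would likely need a stricter expression or a validation step.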

Constructor Summary

  - <code>OutlinkExtractor()</code>

Method Summary

  - <code>static Outlink[] getOutlinks(String plainText, org.apache.hadoop.conf.Configuration conf)</code> - Extracts Outlinks from the given plain text.
  - <code>static Outlink[] getOutlinks(String plainText, String anchor, org.apache.hadoop.conf.Configuration conf)</code> - Extracts Outlinks from the given plain text and adds the anchor to each extracted Outlink.


Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail


OutlinkExtractor

public OutlinkExtractor()

Method Detail


getOutlinks

public static Outlink[] getOutlinks(String plainText,
                    org.apache.hadoop.conf.Configuration conf)

Extracts Outlinks from the given plain text. Applying this method to input that is not plain text can result in extremely long runtimes in pathological cases (PostScript is a known example).

  - Parameters:
  - <code>plainText</code> - the plain text from which URLs should be extracted.
  - Returns:
  - Array of <code>Outlink</code>s found in <code>plainText</code>
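Given the runtime warning above, a caller may want to confirm that content is plain text before handing it to the extractor. The following is a minimal sketch of such a caller-side check, assuming the caller knows the content's MIME type. The class `PlainTextGuardSketch` and the method `isPlainText` are hypothetical and not part of Nutch.

```java
import java.util.Locale;

// Hypothetical caller-side guard; not part of Nutch.
final class PlainTextGuardSketch {

    // Returns true only when the base MIME type is exactly "text/plain".
    static boolean isPlainText(String contentType) {
        if (contentType == null) {
            return false; // unknown type: be conservative and skip extraction
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String base = (semicolon >= 0)
                ? contentType.substring(0, semicolon)
                : contentType;
        return base.trim().toLowerCase(Locale.ROOT).equals("text/plain");
    }

    public static void main(String[] args) {
        // A caller would invoke the extractor only when this check passes.
        System.out.println(isPlainText("text/plain; charset=UTF-8"));
        System.out.println(isPlainText("application/postscript"));
    }
}
```

Treating an unknown content type as "not plain text" is a conservative assumption here; it trades some missed links for protection against the slow cases described above.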

getOutlinks

public static Outlink[] getOutlinks(String plainText,
                    String anchor,
                    org.apache.hadoop.conf.Configuration conf)

Extracts Outlinks from the given plain text and adds the anchor to each extracted Outlink.

  - Parameters:
  - <code>plainText</code> - the plain text from which URLs should be extracted.
  - <code>anchor</code> - the anchor text to attach to each extracted URL.
  - Returns:
  - Array of <code>Outlink</code>s found in <code>plainText</code>
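The idea of attaching one anchor to every extracted link can be sketched with the JDK alone. In the sketch below, `AnchoredLinkSketch`, the nested `Link` type (a stand-in for Outlink), `extractLinks`, and the URL pattern are all hypothetical; the real method's matching rules and return type differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of Nutch.
final class AnchoredLinkSketch {

    // Stand-in for an outlink: a target URL plus its anchor text.
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        @Override
        public String toString() {
            return toUrl + " [" + anchor + "]";
        }
    }

    // Assumed, simplified URL pattern used only for this illustration.
    private static final Pattern URL_PATTERN =
            Pattern.compile("(?:https?|ftp)://[^\\s<>\"]+");

    // Pairs every URL found in plainText with the same anchor text.
    static List<Link> extractLinks(String plainText, String anchor) {
        List<Link> links = new ArrayList<>();
        if (plainText == null) {
            return links;
        }
        // Sketch-level choice: a null anchor is stored as an empty string.
        String label = (anchor == null) ? "" : anchor;
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            links.add(new Link(matcher.group(), label));
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "Docs at http://example.org/a and http://example.org/b end";
        for (Link link : extractLinks(text, "docs")) {
            System.out.println(link);
        }
    }
}
```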

Copyright © 2014 The Apache Software Foundation