
org.apache.nutch.urlfilter.api

Class RegexURLFilterBase


public abstract class RegexURLFilterBase
extends Object
implements URLFilter

Generic URL filter based on regular expressions. The regular expression rules are expressed in a file. Each implementation supplies its file of rules through the getRulesReader(Configuration conf) method.

The file contains one rule per line, in the format:

[+-]<regex>

where a leading plus (+) means that URLs matching the expression are accepted and a leading minus (-) means that they are rejected.
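The rule format can be illustrated with a small standalone sketch. It uses only java.util.regex and does not depend on Nutch; the class name RuleFileSketch is hypothetical. Skipping blank lines and lines starting with #, letting the first matching rule decide, and rejecting URLs that match no rule are assumptions of this sketch, not documented behavior of this class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a rule-list filter; not the Nutch implementation. */
final class RuleFileSketch {

    /** One parsed rule: the sign and the compiled expression. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses rules given one per line in the form [+-]<regex>. */
    RuleFileSketch(String ruleText) {
        for (String line : ruleText.split("\\R")) {
            // Assumption: blank lines and '#' comment lines are ignored.
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Invalid first character: " + line);
            }
            rules.add(new Rule(sign == '+', line.substring(1)));
        }
    }

    /** Returns the URL if the first matching rule is a '+' rule, otherwise null. */
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // Assumption: a URL that matches no rule is rejected.
    }

    public static void main(String[] args) {
        String ruleText = String.join("\n",
            "# skip common image suffixes",
            "-\\.(gif|jpg|png)$",
            "+^https?://example\\.org/");
        RuleFileSketch filter = new RuleFileSketch(ruleText);
        System.out.println(filter.filter("http://example.org/index.html"));
        System.out.println(filter.filter("http://example.org/logo.png"));
        System.out.println(filter.filter("ftp://example.org/file"));
    }
}
```

In this sketch the order of the rules matters: the image rule is listed first, so a PNG on example.org is rejected before the accepting rule is consulted.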

  • Author:
  • Jérôme Charron

Field Summary


Fields inherited from interface org.apache.nutch.net.URLFilter

X_POINT_ID

Constructor Summary

Constructors

  - <code>RegexURLFilterBase()</code> - Constructs a new empty RegexURLFilterBase.
  - <code>RegexURLFilterBase(File filename)</code> - Constructs a new RegexURLFilterBase and initializes it with a file of rules.
  - <code>protected RegexURLFilterBase(Reader reader)</code> - Constructs a new RegexURLFilterBase and initializes it with a Reader of rules.
  - <code>RegexURLFilterBase(String rules)</code> - Constructs a new RegexURLFilterBase and initializes it with a list of rules.

Method Summary

Methods

  - <code>protected abstract RegexRule createRule(boolean sign, String regex)</code> - Creates a new RegexRule.
  - <code>String filter(String url)</code>
  - <code>org.apache.hadoop.conf.Configuration getConf()</code>
  - <code>protected abstract Reader getRulesReader(org.apache.hadoop.conf.Configuration conf)</code> - Returns a reader for the rules to use for a particular implementation.
  - <code>static void main(RegexURLFilterBase filter, String[] args)</code> - Filters the standard input using a RegexURLFilterBase.
  - <code>void setConf(org.apache.hadoop.conf.Configuration conf)</code>


Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail


RegexURLFilterBase

public RegexURLFilterBase()

Constructs a new empty RegexURLFilterBase


RegexURLFilterBase

public RegexURLFilterBase(File filename)
                   throws IOException,
                          IllegalArgumentException

Constructs a new RegexURLFilterBase and initializes it with a file of rules.

  - Parameters:
  - <code>filename</code> - the file containing the rules.
  - Throws:
  - <code>IOException</code>
  - <code>IllegalArgumentException</code>

RegexURLFilterBase

public RegexURLFilterBase(String rules)
                   throws IOException,
                          IllegalArgumentException

Constructs a new RegexURLFilterBase and initializes it with a list of rules.

  - Parameters:
  - <code>rules</code> - a string containing the list of rules, one rule per line.
  - Throws:
  - <code>IOException</code>
  - <code>IllegalArgumentException</code>

RegexURLFilterBase

protected RegexURLFilterBase(Reader reader)
                      throws IOException,
                             IllegalArgumentException

Constructs a new RegexURLFilterBase and initializes it with a Reader of rules.

  - Parameters:
  - <code>reader</code> - a reader of rules.
  - Throws:
  - <code>IOException</code>
  - <code>IllegalArgumentException</code>

Method Detail


createRule

protected abstract RegexRule createRule(boolean sign,
                   String regex)

Creates a new RegexRule.

  - Parameters:
  - <code>sign</code> - the sign of the regular expression. A <code>true</code> value means that any URL matching this rule must be included, whereas a <code>false</code> value means that any URL matching this rule must be excluded.
  - <code>regex</code> - the regular expression associated with this rule.
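As a rough illustration of what a rule factory of this shape might produce, the sketch below pairs a sign with a match test backed by java.util.regex. It is standalone and does not use the Nutch RegexRule class; SketchRule, JdkSketchRule and SketchRuleFactory are hypothetical names, and the choice of a partial match (find) rather than a full match is an assumption.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for a rule type: a sign plus a match test. */
abstract class SketchRule {
    private final boolean sign;

    SketchRule(boolean sign) {
        this.sign = sign;
    }

    /** True if URLs matching this rule must be included, false if excluded. */
    boolean accept() {
        return sign;
    }

    /** True if the URL matches this rule's expression. */
    abstract boolean match(String url);
}

/** Rule backed by java.util.regex. */
final class JdkSketchRule extends SketchRule {
    private final Pattern pattern;

    JdkSketchRule(boolean sign, String regex) {
        super(sign);
        this.pattern = Pattern.compile(regex);
    }

    @Override
    boolean match(String url) {
        // Assumption: the expression may match anywhere in the URL.
        return pattern.matcher(url).find();
    }
}

final class SketchRuleFactory {
    /** Plays the role of createRule(sign, regex) in this sketch. */
    static SketchRule createRule(boolean sign, String regex) {
        return new JdkSketchRule(sign, regex);
    }
}
```

Keeping rule creation behind a factory method lets each implementation choose its own regular expression engine while the rule-list handling stays shared.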

getRulesReader

protected abstract Reader getRulesReader(org.apache.hadoop.conf.Configuration conf)
                                  throws IOException

Returns a reader for the rules to use for a particular implementation.

  - Parameters:
  - <code>conf</code> - the current configuration.
  - Returns:
  - a reader over the resource containing the rules to use.
  - Throws:
  - <code>IOException</code>

filter

public String filter(String url)
  - Specified by: 
  - <code>filter</code> in interface <code>URLFilter</code>        

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
  - Specified by: 
  - <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>        

getConf

public org.apache.hadoop.conf.Configuration getConf()
  - Specified by: 
  - <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>        

main

public static void main(RegexURLFilterBase filter,
        String[] args)
                 throws IOException,
                        IllegalArgumentException

Filters the standard input using a RegexURLFilterBase.

  - Parameters:
  - <code>filter</code> - is the RegexURLFilterBase to use for filtering the standard input.
  - <code>args</code> - some optional parameters (not used). 
  - Throws: 
  - <code>IOException</code> 
  - <code>IllegalArgumentException</code>      
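
A command-line driver of this kind can be sketched as a loop that reads one URL per line and reports the filter's verdict. The sketch below is standalone; StdinFilterSketch and mark are hypothetical names, the filter is a simple lambda rather than a RegexURLFilterBase, and the +/- output prefix is an assumption about a convenient output format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

/** Hypothetical driver that filters URLs read from standard input. */
final class StdinFilterSketch {

    /** Prefixes the URL with '+' if the filter keeps it, '-' if it returns null. */
    static String mark(String url, UnaryOperator<String> filter) {
        String out = filter.apply(url);
        return out != null ? "+" + out : "-" + url;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical filter: keep only http and https URLs.
        UnaryOperator<String> filter =
            url -> url.matches("^https?://.*") ? url : null;

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(mark(line, filter));
            }
        }
    }
}
```

Piping a file of candidate URLs through such a driver is a quick way to check a rule set before using it in a crawl.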

Copyright © 2014 The Apache Software Foundation