org.apache.nutch.urlfilter.api
Class RegexURLFilterBase
- java.lang.Object
- org.apache.nutch.urlfilter.api.RegexURLFilterBase
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, URLFilter, Pluggable
- Direct Known Subclasses:
- AutomatonURLFilter, RegexURLFilter
public abstract class RegexURLFilterBase extends Object implements URLFilter
Generic URL filter based on regular expressions.
The regular expression rules are read from a file. Each implementation determines which file of rules to use through its getRulesReader(Configuration conf) method.
The file contains one rule per line, each in the form:
[+-]<regex>
where plus (+) means the URL is included (go ahead and index it) and minus (-) means it is excluded.
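The rule format described above can be illustrated with a small parser. This is a minimal sketch, not the Nutch implementation: the class `RuleLineSketch` and its nested `ParsedRule` type are hypothetical names, and skipping blank lines and lines starting with `#` is an assumption about how such a rules file is usually written.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Nutch): splits rule lines into a sign and a regex. */
final class RuleLineSketch {

    /** One parsed rule: include is true for '+', false for '-'. */
    static final class ParsedRule {
        final boolean include;
        final String regex;

        ParsedRule(boolean include, String regex) {
            this.include = include;
            this.regex = regex;
        }
    }

    /** Parses rule text, one rule per line, in the [+-]<regex> form. */
    static List<ParsedRule> parse(String rulesText) {
        List<ParsedRule> rules = new ArrayList<>();
        for (String line : rulesText.split("\\R")) {
            // Assumption: blank lines and '#' comment lines carry no rule.
            if (line.trim().isEmpty() || line.charAt(0) == '#') {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            // Everything after the sign is treated as the regular expression.
            rules.add(new ParsedRule(sign == '+', line.substring(1)));
        }
        return rules;
    }
}
```

Under these assumptions, a line such as `-\.gif$` yields an exclusion rule with the regex `\.gif$`, and a line such as `+^https?://` yields an inclusion rule.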
- Author:
- Jérôme Charron
Field Summary
-
Fields inherited from interface org.apache.nutch.net.URLFilter
X_POINT_ID
Constructor Summary
Constructors Modifier Constructor and Description
RegexURLFilterBase()
Constructs a new empty RegexURLFilterBase
RegexURLFilterBase(File filename)
Constructs a new RegexURLFilter and initializes it with a file of rules.
protected
RegexURLFilterBase(Reader reader)
Constructs a new RegexURLFilter and initializes it with a Reader of rules.
RegexURLFilterBase(String rules)
Constructs a new RegexURLFilter and initializes it with a list of rules.
Method Summary
Methods Modifier and Type Method and Description
protected abstract RegexRule
createRule(boolean sign, String regex)
Creates a new RegexRule.
String
filter(String url)
org.apache.hadoop.conf.Configuration
getConf()
protected abstract Reader
getRulesReader(org.apache.hadoop.conf.Configuration conf)
Returns a Reader over the rules to use for a particular implementation.
static void
main(RegexURLFilterBase filter,
String[] args)
Filter the standard input using a RegexURLFilterBase.
void
setConf(org.apache.hadoop.conf.Configuration conf)
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
-
RegexURLFilterBase
public RegexURLFilterBase()
Constructs a new empty RegexURLFilterBase
-
RegexURLFilterBase
public RegexURLFilterBase(File filename) throws IOException, IllegalArgumentException
Constructs a new RegexURLFilter and initializes it with a file of rules.
- Parameters:
- <code>filename</code> - is the name of the rules file.
- Throws:
- <code>IOException</code>
- <code>IllegalArgumentException</code>
-
RegexURLFilterBase
public RegexURLFilterBase(String rules) throws IOException, IllegalArgumentException
Constructs a new RegexURLFilter and initializes it with a list of rules.
- Parameters:
- <code>rules</code> - string with a list of rules, one rule per line
- Throws:
- <code>IOException</code>
- <code>IllegalArgumentException</code>
-
RegexURLFilterBase
protected RegexURLFilterBase(Reader reader) throws IOException, IllegalArgumentException
Constructs a new RegexURLFilter and initializes it with a Reader of rules.
- Parameters:
- <code>reader</code> - is a reader of rules.
- Throws:
- <code>IOException</code>
- <code>IllegalArgumentException</code>
Method Detail
-
createRule
protected abstract RegexRule createRule(boolean sign, String regex)
Creates a new RegexRule.
- Parameters:
- <code>sign</code> - of the regular expression. A <code>true</code> value means that any URL matching this rule must be included, whereas a <code>false</code> value means that any URL matching this rule must be excluded.
- <code>regex</code> - is the regular expression associated with this rule.
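To illustrate the sign and regex parameters, here is a minimal sketch of a createRule-style factory backed by java.util.regex. The types `SketchRule` and `PatternRuleFactory` are hypothetical stand-ins, since the RegexRule API is not shown on this page, and the use of a partial match (`find()`) rather than a full match is an assumption.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for a rule type; not the Nutch RegexRule class. */
abstract class SketchRule {
    private final boolean sign;

    SketchRule(boolean sign) {
        this.sign = sign;
    }

    /** True if URLs matching this rule must be included, false if excluded. */
    boolean accept() {
        return sign;
    }

    /** True if the given URL matches this rule's regular expression. */
    abstract boolean match(String url);
}

/** Hypothetical factory mirroring the createRule(boolean, String) signature. */
final class PatternRuleFactory {
    static SketchRule createRule(boolean sign, String regex) {
        // Compile once so every later match reuses the same Pattern.
        final Pattern pattern = Pattern.compile(regex);
        return new SketchRule(sign) {
            @Override
            boolean match(String url) {
                // Assumption: a partial match anywhere in the URL counts.
                return pattern.matcher(url).find();
            }
        };
    }
}
```

A concrete subclass could swap in a different regex engine here without changing how rules are read or applied, which is presumably why rule creation is left abstract.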
-
getRulesReader
protected abstract Reader getRulesReader(org.apache.hadoop.conf.Configuration conf) throws IOException
Returns a Reader over the rules to use for a particular implementation.
- Parameters:
- <code>conf</code> - is the current configuration.
- Returns:
- a Reader over the resource containing the rules to use.
- Throws:
- <code>IOException</code>
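As a rough illustration of how an implementation might turn configuration into a Reader of rules, the sketch below uses java.util.Properties as a stand-in for the Hadoop Configuration. The class `RulesReaderSketch` and the keys `sketch.rules.inline` and `sketch.rules.file` are hypothetical and do not correspond to any Nutch property.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical sketch: picks a source of rules from configuration and opens it. */
final class RulesReaderSketch {

    static Reader openRules(Properties conf) throws IOException {
        // Hypothetical key: rules given directly as text, one rule per line.
        String inline = conf.getProperty("sketch.rules.inline");
        if (inline != null) {
            return new StringReader(inline);
        }
        // Hypothetical key: path of a rules file on the local file system.
        String file = conf.getProperty("sketch.rules.file");
        if (file == null) {
            throw new IOException("No rules configured");
        }
        return Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8);
    }
}
```

Returning a Reader rather than a file name lets the caller parse rules the same way whether they come from a string, a file, or another resource.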
-
filter
public String filter(String url)
- Specified by:
- <code>filter</code> in interface <code>URLFilter</code>
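This page does not describe what filter returns, so the following is only a sketch of one plausible behavior: rules are tried in order, the first matching rule decides, an accepted URL is returned unchanged, and a rejected or unmatched URL yields null. The class `FirstMatchFilterSketch` is hypothetical, and the first-match-wins ordering and the null return are assumptions rather than something this page states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of ordered, first-match-wins URL filtering. */
final class FirstMatchFilterSketch {

    private static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, String regex) {
            this.include = include;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Appends a rule; rules are evaluated in insertion order. */
    FirstMatchFilterSketch add(boolean include, String regex) {
        rules.add(new Rule(include, regex));
        return this;
    }

    /** Returns the URL if the first matching rule includes it, otherwise null. */
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.include ? url : null;
            }
        }
        // Assumption: a URL that matches no rule is rejected.
        return null;
    }
}
```

With an exclusion rule for `\.gif$` placed before an inclusion rule for `^https?://`, an image URL is rejected even though it also matches the later inclusion rule, so rule order matters in this sketch.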
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
- <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
-
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
- <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
-
main
public static void main(RegexURLFilterBase filter, String[] args) throws IOException, IllegalArgumentException
Filter the standard input using a RegexURLFilterBase.
- Parameters:
- <code>filter</code> - is the RegexURLFilterBase to use for filtering the standard input.
- <code>args</code> - some optional parameters (not used).
- Throws:
- <code>IOException</code>
- <code>IllegalArgumentException</code>
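A loop that filters line-oriented input, as main is described to do for the standard input, could look roughly like the sketch below. The class `StdinFilterSketch` is hypothetical, the filter is modelled as a plain function that returns null for rejected URLs, and the `+`/`-` output prefixes are an assumption about a convenient output format, not something this page specifies.

```java
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: runs a URL filter over each input line. */
final class StdinFilterSketch {

    /**
     * Reads one URL per line from in and writes one result line per URL to out.
     * The filter returns the URL to keep it, or null to reject it.
     */
    static void run(UnaryOperator<String> filter, Reader in, Writer out) {
        PrintWriter writer = new PrintWriter(out);
        new BufferedReader(in).lines().forEach(line -> {
            String kept = filter.apply(line);
            // Assumed output format: '+' marks accepted URLs, '-' rejected ones.
            writer.println(kept != null ? "+" + kept : "-" + line);
        });
        writer.flush();
    }
}
```

To use it on real standard input, the caller would pass `new InputStreamReader(System.in)` and `new OutputStreamWriter(System.out)`.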