org.apache.nutch.urlfilter.suffix

Class SuffixURLFilter


public class SuffixURLFilter
extends Object
implements URLFilter

Filters URLs based on a file of URL suffixes. The file is named by

  • property "urlfilter.suffix.file" in ./conf/nutch-default.xml, and
  • attribute "file" in plugin.xml of this plugin

Attribute "file" has higher precedence if defined. If the config file is missing, all URLs will be rejected.

This filter can be configured to work in one of two modes:

  • default to reject ('-'): in this mode, only URLs that match suffixes specified in the config file are accepted; all other URLs are rejected.
  • default to accept ('+'): in this mode, only URLs that match suffixes specified in the config file are rejected; all other URLs are accepted.

The format of this config file is one URL suffix per line, with no preceding whitespace. The order in which suffixes are specified doesn't matter. Blank lines and comments (#) are allowed.

A single '+' or '-' sign not followed by any suffix must be used once, to signify the mode this plugin operates in. An optional single 'I' can be appended to the sign, to signify that suffix matches should be case-insensitive. The default, if not specified, is to use case-sensitive matches, i.e. suffix '.JPG' does not match '.jpg'.

NOTE: the format of this file is different from that of urlfilter-prefix, because that plugin supports only allowed prefixes, not prohibited ones. Please note that this plugin does not support regular expressions; it accepts only literal suffixes. For example, a suffix "+*.jpg" is most probably wrong; you should use "+.jpg" instead.
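
The file format described above can be illustrated with a small standalone parser. This is only a sketch under stated assumptions, not the plugin's own code: the class name SuffixConfigSketch and its parse method are hypothetical, the mode defaults to reject when no mode line is present, and lines are trimmed before inspection (more lenient than the documented "no preceding whitespace" rule). Only the exact lines "+", "-", "+I" and "-I" are treated as mode lines here; how the plugin treats other lines that start with a sign is not covered.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for the documented file format; hypothetical, not Nutch code. */
final class SuffixConfigSketch {
    boolean modeAccept = false; // assumption: reject by default until a mode line is seen
    boolean ignoreCase = false; // documented default: case-sensitive matching
    final List<String> suffixes = new ArrayList<>();

    static SuffixConfigSketch parse(String text) {
        SuffixConfigSketch cfg = new SuffixConfigSketch();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank lines and comments are allowed
            }
            boolean isModeLine = line.equals("+") || line.equals("-")
                    || line.equals("+I") || line.equals("-I");
            if (isModeLine) {
                cfg.modeAccept = line.charAt(0) == '+'; // '+' accepts unknown, '-' rejects unknown
                cfg.ignoreCase = line.endsWith("I");    // optional 'I' enables case-insensitive matching
            } else {
                cfg.suffixes.add(line); // everything else is a literal suffix
            }
        }
        return cfg;
    }

    public static void main(String[] args) {
        SuffixConfigSketch cfg = parse("# allow all unknown\n+I\n\n.gif\n.png\n");
        System.out.println(cfg.modeAccept + " " + cfg.ignoreCase + " " + cfg.suffixes);
    }
}
```

Because order doesn't matter in this format, the sketch collects suffixes into a list regardless of where the mode line appears.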

Example 1

The configuration shown below will accept all URLs with '.html' or '.htm' suffixes and reject all other URLs. Matching is case-sensitive, so URLs ending in '.HTML' or '.HTM' will be rejected.

  # this is a comment

  # prohibit all unknown, case-sensitive matching
  -

  # collect only HTML files.
  .html
  .htm
 

Example 2

The configuration shown below will accept all URLs except those ending in the suffixes of common graphics formats, matched case-insensitively.

  # this is a comment

  # allow all unknown, case-insensitive matching
  +I

  # prohibited suffixes
  .gif
  .png
  .jpg
  .jpeg
  .bmp
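
The two examples can be checked against a minimal sketch of the accept/reject decision. It is an illustration of the documented semantics only, not the plugin's implementation: the class SuffixModeSketch and its accepts method are hypothetical, and the sketch assumes the suffix is compared against the end of the whole URL string (so a URL carrying a query string would not match).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative accept/reject decision for the two modes; hypothetical, not Nutch code. */
final class SuffixModeSketch {
    static boolean accepts(String url, boolean modeAccept, boolean ignoreCase,
                           List<String> suffixes) {
        // Assumption: the suffix is matched against the end of the full URL string.
        String candidate = ignoreCase ? url.toLowerCase(Locale.ROOT) : url;
        boolean matched = false;
        for (String suffix : suffixes) {
            String s = ignoreCase ? suffix.toLowerCase(Locale.ROOT) : suffix;
            if (candidate.endsWith(s)) {
                matched = true;
                break;
            }
        }
        // '+' mode: listed suffixes are rejected, everything else is accepted.
        // '-' mode: listed suffixes are accepted, everything else is rejected.
        return modeAccept ? !matched : matched;
    }

    public static void main(String[] args) {
        List<String> html = Arrays.asList(".html", ".htm");
        List<String> images = Arrays.asList(".gif", ".png", ".jpg", ".jpeg", ".bmp");
        // Example 1: '-' mode, case-sensitive
        System.out.println(accepts("http://example.com/index.html", false, false, html));
        System.out.println(accepts("http://example.com/INDEX.HTML", false, false, html));
        // Example 2: '+I' mode, case-insensitive
        System.out.println(accepts("http://example.com/photo.JPG", true, true, images));
        System.out.println(accepts("http://example.com/page.html", true, true, images));
    }
}
```

The example.com URLs are placeholders chosen for illustration.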
 
Author: Andrzej Bialecki

Field Summary

Fields inherited from interface org.apache.nutch.net.URLFilter

X_POINT_ID

Constructor Summary

Constructor and Description

  • SuffixURLFilter()
  • SuffixURLFilter(Reader reader)

Method Summary

Modifier and Type                       Method and Description

String                                  filter(String url)
org.apache.hadoop.conf.Configuration    getConf()
boolean                                 isIgnoreCase()
boolean                                 isModeAccept()
static void                             main(String[] args)
void                                    readConfiguration(Reader reader)
void                                    setConf(org.apache.hadoop.conf.Configuration conf)
void                                    setFilterFromPath(boolean filterFromPath)
void                                    setIgnoreCase(boolean ignoreCase)
void                                    setModeAccept(boolean modeAccept)

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

SuffixURLFilter

public SuffixURLFilter()
                throws IOException

Throws: IOException

SuffixURLFilter

public SuffixURLFilter(Reader reader)
                throws IOException

Throws: IOException

Method Detail

filter

public String filter(String url)

Specified by: filter in interface URLFilter

readConfiguration

public void readConfiguration(Reader reader)
                       throws IOException

Throws: IOException

main

public static void main(String[] args)
                 throws IOException

Throws: IOException

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

Specified by: setConf in interface org.apache.hadoop.conf.Configurable

getConf

public org.apache.hadoop.conf.Configuration getConf()

Specified by: getConf in interface org.apache.hadoop.conf.Configurable

isModeAccept

public boolean isModeAccept()

setModeAccept

public void setModeAccept(boolean modeAccept)

isIgnoreCase

public boolean isIgnoreCase()

setIgnoreCase

public void setIgnoreCase(boolean ignoreCase)

setFilterFromPath

public void setFilterFromPath(boolean filterFromPath)

Copyright © 2014 The Apache Software Foundation