org.apache.nutch.indexer.basic

Class BasicIndexingFilter


public class BasicIndexingFilter
extends Object
implements IndexingFilter

Adds basic searchable fields to a document. The fields added are: domain, host, url, content, title, cache, tstamp.

  - domain is included depending on indexer.add.domain in nutch-default.xml.
  - title is truncated as per indexer.max.title.length in nutch-default.xml. As per NUTCH-1004, a zero-length title is not added.
  - content is truncated as per indexer.max.content.length in nutch-default.xml.
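The title and content rules described above can be illustrated with a small, self-contained sketch. It does not use the Nutch API: the class `TruncationSketch`, its method names, and the plain `Map` standing in for a `NutchDocument` are hypothetical, and treating a negative limit as "no limit" is an assumption that this page does not state.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch.
class TruncationSketch {

  // Cuts text to at most maxLength characters.
  // Assumption: a negative maxLength means "no limit".
  static String truncate(String text, int maxLength) {
    if (maxLength >= 0 && text.length() > maxLength) {
      return text.substring(0, maxLength);
    }
    return text;
  }

  // Builds the "title" and "content" fields in a plain map that
  // stands in for a NutchDocument.
  static Map<String, String> buildFields(String title, String content,
                                         int maxTitleLength, int maxContentLength) {
    Map<String, String> fields = new LinkedHashMap<>();
    String shortTitle = truncate(title, maxTitleLength);
    if (!shortTitle.isEmpty()) {
      // A zero-length title is not added (NUTCH-1004).
      fields.put("title", shortTitle);
    }
    fields.put("content", truncate(content, maxContentLength));
    return fields;
  }

  public static void main(String[] args) {
    // prints {title=Apache, content=Some page}
    System.out.println(buildFields("Apache Nutch crawler", "Some page text", 6, 9));
    // prints {content=Some page}
    System.out.println(buildFields("", "Some page text", 6, 9));
  }
}
```

In the second call the empty title produces no `title` entry at all, rather than an entry holding an empty string.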

Field Summary

| Modifier and Type | Field and Description |
| --- | --- |
| static org.slf4j.Logger | LOG |


Fields inherited from interface org.apache.nutch.indexer.IndexingFilter

X_POINT_ID

Constructor Summary

| Constructor and Description |
| --- |
| BasicIndexingFilter() |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| NutchDocument | filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) | Adds basic searchable fields to the document, controlled by a few configuration settings. |
| org.apache.hadoop.conf.Configuration | getConf() | Get the Configuration object |
| void | setConf(org.apache.hadoop.conf.Configuration conf) | Set the Configuration object |


Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail


LOG

public static final org.slf4j.Logger LOG

Constructor Detail


BasicIndexingFilter

public BasicIndexingFilter()

Method Detail


filter

public NutchDocument filter(NutchDocument doc,
                   Parse parse,
                   org.apache.hadoop.io.Text url,
                   CrawlDatum datum,
                   Inlinks inlinks)
                     throws IndexingException

Adds basic searchable fields to the document, controlled by a few configuration settings. See indexer.add.domain, indexer.max.title.length, and indexer.max.content.length in nutch-default.xml.

  - Specified by: 
  - <code>filter</code> in interface <code>IndexingFilter</code> 
  - Parameters:
  - <code>doc</code> - The [<code>NutchDocument</code>](../../../../../org/apache/nutch/indexer/NutchDocument.html) object
  - <code>parse</code> - The relevant [<code>Parse</code>](../../../../../org/apache/nutch/parse/Parse.html) object passing through the filter
  - <code>url</code> - URL of the document being indexed
  - <code>datum</code> - The [<code>CrawlDatum</code>](../../../../../org/apache/nutch/crawl/CrawlDatum.html) entry
  - <code>inlinks</code> - The [<code>Inlinks</code>](../../../../../org/apache/nutch/crawl/Inlinks.html) containing anchor text 
  - Returns:
  - filtered NutchDocument 
  - Throws: 
  - <code>IndexingException</code>       

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

Set the Configuration object

  - Specified by: 
  - <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>        

getConf

public org.apache.hadoop.conf.Configuration getConf()

Get the Configuration object

  - Specified by: 
  - <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>       
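The `setConf`/`getConf` pair comes from the `Configurable` interface. One common way to implement it is to read the relevant settings once in `setConf` and cache them for later calls to `filter`. The sketch below shows that pattern with `java.util.Properties` standing in for `org.apache.hadoop.conf.Configuration`. The class `ConfigurableSketch`, its accessor names, and the fallback values are hypothetical; whether `BasicIndexingFilter` caches its settings this way is an assumption, not something this page confirms.

```java
import java.util.Properties;

// Hypothetical sketch; Properties stands in for
// org.apache.hadoop.conf.Configuration.
class ConfigurableSketch {
  private Properties conf;
  private int maxTitleLength;
  private int maxContentLength;
  private boolean addDomain;

  // Stores the configuration and caches the three settings named on this page.
  // The fallback values are placeholders, not the real Nutch defaults.
  void setConf(Properties conf) {
    this.conf = conf;
    this.maxTitleLength =
        Integer.parseInt(conf.getProperty("indexer.max.title.length", "-1"));
    this.maxContentLength =
        Integer.parseInt(conf.getProperty("indexer.max.content.length", "-1"));
    this.addDomain =
        Boolean.parseBoolean(conf.getProperty("indexer.add.domain", "false"));
  }

  // Returns the same object that was passed to setConf.
  Properties getConf() {
    return conf;
  }

  int maxTitleLength() { return maxTitleLength; }
  int maxContentLength() { return maxContentLength; }
  boolean addDomain() { return addDomain; }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("indexer.max.title.length", "100");
    props.setProperty("indexer.add.domain", "true");

    ConfigurableSketch sketch = new ConfigurableSketch();
    sketch.setConf(props);
    System.out.println(sketch.maxTitleLength());   // prints 100
    System.out.println(sketch.addDomain());        // prints true
    System.out.println(sketch.maxContentLength()); // prints -1 (placeholder fallback)
  }
}
```

Caching in `setConf` keeps per-document work in `filter` free of repeated configuration lookups, at the cost of requiring `setConf` to be called before the first `filter` call.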


Copyright © 2014 The Apache Software Foundation