Interface IndexingFilter

org.apache.nutch.indexer

- All Superinterfaces:
  - org.apache.hadoop.conf.Configurable, Pluggable
- All Known Implementing Classes:
  - AnchorIndexingFilter, BasicIndexingFilter, CCIndexingFilter, FeedIndexingFilter, LanguageIndexingFilter, MetadataIndexer, MoreIndexingFilter, RelTagIndexingFilter, StaticFieldIndexer, SubcollectionIndexingFilter, TLDIndexingFilter, URLMetaIndexingFilter

public interface IndexingFilter
extends Pluggable, org.apache.hadoop.conf.Configurable

Extension point for indexing. Lets a plugin add metadata to the indexed fields. Every plugin found that implements this extension point is run sequentially on the parse.
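The sequential behaviour described above can be pictured with a minimal, self-contained sketch. It is not the Nutch implementation: `SketchDoc`, `SketchFilter`, and `SketchFilterChain` are hypothetical stand-ins with a reduced two-argument signature, and stopping the chain on a `null` result is an assumption based on the discard rule documented for `filter`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for NutchDocument: a plain multi-valued field map.
class SketchDoc {
    final Map<String, List<String>> fields = new LinkedHashMap<>();

    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }
}

// Hypothetical, reduced form of the filter contract. The real method also
// receives the parse, crawl datum and inlinks, and may throw IndexingException.
interface SketchFilter {
    SketchDoc filter(SketchDoc doc, String url);
}

// Runs every filter in order, handing the output of one to the next.
class SketchFilterChain {
    private final List<SketchFilter> filters;

    SketchFilterChain(List<SketchFilter> filters) {
        this.filters = filters;
    }

    SketchDoc filter(SketchDoc doc, String url) {
        for (SketchFilter f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) {
                // Assumption: a discarded document ends the chain early.
                return null;
            }
        }
        return doc;
    }
}
```

Because each filter receives the document returned by the previous one, a later filter in this sketch can read fields that an earlier filter added.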

Field Summary

| Modifier and Type | Field and Description |
| --- | --- |
| <code>static String</code> | <code>X_POINT_ID</code>: The name of the extension point. |

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| <code>NutchDocument</code> | <code>filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks)</code>: Adds fields or otherwise modifies the document that will be indexed for a parse. |

Methods inherited from interface org.apache.hadoop.conf.Configurable

getConf, setConf

Field Detail

X_POINT_ID

static final String X_POINT_ID

The name of the extension point.

Method Detail

filter

NutchDocument filter(NutchDocument doc,
                   Parse parse,
                   org.apache.hadoop.io.Text url,
                   CrawlDatum datum,
                   Inlinks inlinks)
                     throws IndexingException

Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.

  - Parameters:
    - <code>doc</code> - document instance for collecting fields
    - <code>parse</code> - parse data instance
    - <code>url</code> - page url
    - <code>datum</code> - crawl datum for the page
    - <code>inlinks</code> - page inlinks
  - Returns:
    - modified (or a new) document instance, or null (meaning the document should be discarded)
  - Throws:
    - <code>IndexingException</code>
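As an illustration of the contract above, the following sketch shows one way a filter might add a field and discard unwanted documents by returning null. So that it compiles on its own, it declares small stand-ins named after the real types (`NutchDocument`, `Parse`, `Text`, `CrawlDatum`, `Inlinks`, `IndexingException`, `IndexingFilter`); these are simplified assumptions, not the Nutch or Hadoop classes, and the stand-in interface omits `getConf`/`setConf`. The class `TextLengthIndexingFilter` and the field name `textLength` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-ins so the sketch compiles alone. A real plugin would import the
// Nutch/Hadoop types of the same names instead of declaring these.
class IndexingException extends Exception {
    IndexingException(String message) { super(message); }
}

class Text {
    private final String value;
    Text(String value) { this.value = value; }
    @Override public String toString() { return value; }
}

interface Parse {
    String getText();
}

class CrawlDatum {}

class Inlinks {}

class NutchDocument {
    final Map<String, List<Object>> fields = new LinkedHashMap<>();

    void add(String name, Object value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }
}

// Same method shape as documented above, without the Configurable methods.
interface IndexingFilter {
    NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks) throws IndexingException;
}

// Hypothetical filter: drops pages without text, otherwise records text length.
class TextLengthIndexingFilter implements IndexingFilter {
    // This implementation has no failure path, so it narrows the throws clause.
    @Override
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                                CrawlDatum datum, Inlinks inlinks) {
        String text = parse.getText();
        if (text == null || text.trim().isEmpty()) {
            return null; // null means the document should be discarded
        }
        doc.add("textLength", String.valueOf(text.length())); // made-up field name
        return doc;
    }
}
```

Returning the same `doc` instance after modifying it is the simplest choice here; the contract also allows returning a new document instance.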
