org.apache.nutch.indexer
Interface IndexingFilter
- All Superinterfaces:
- org.apache.hadoop.conf.Configurable, Pluggable
- All Known Implementing Classes:
- AnchorIndexingFilter, BasicIndexingFilter, CCIndexingFilter, FeedIndexingFilter, LanguageIndexingFilter, MetadataIndexer, MoreIndexingFilter, RelTagIndexingFilter, StaticFieldIndexer, SubcollectionIndexingFilter, TLDIndexingFilter, URLMetaIndexingFilter
public interface IndexingFilter extends Pluggable, org.apache.hadoop.conf.Configurable
Extension point for indexing. Allows plugins to add metadata to the indexed fields. All plugins that implement this extension point are run sequentially on the parse.
Field Summary
Modifier and Type: static String
Field and Description: X_POINT_ID - The name of the extension point.
Method Summary
Modifier and Type: NutchDocument
Method and Description: filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) - Adds fields or otherwise modifies the document that will be indexed for a parse.
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
X_POINT_ID
static final String X_POINT_ID
The name of the extension point.
Method Detail
filter
NutchDocument filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
- Parameters:
- <code>doc</code> - document instance for collecting fields
- <code>parse</code> - parse data instance
- <code>url</code> - URL of the page
- <code>datum</code> - crawl datum for the page
- <code>inlinks</code> - page inlinks
- Returns:
- modified (or a new) document instance, or null (meaning the document should be discarded)
- Throws:
- <code>IndexingException</code>
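Example (illustrative sketch only): the contract described above, where filters run sequentially and a null return discards the document, can be modelled with simplified stand-in types. Everything in this block (<code>IndexingFilterSketch</code>, <code>Doc</code>, <code>Filter</code>, <code>HostFilter</code>, <code>NonEmptyFilter</code>, <code>runChain</code>) is hypothetical and is not part of Nutch. The stand-in <code>filter</code> takes only a document, the parsed text and a URL string, and omits the <code>IndexingException</code> that the real method declares. The sketch also assumes that the chain stops at the first filter that returns null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins that mirror the filter contract; NOT the real Nutch classes.
class IndexingFilterSketch {

    /** Stand-in for NutchDocument: a map of multi-valued fields. */
    static class Doc {
        final Map<String, List<String>> fields = new LinkedHashMap<>();

        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /** Stand-in for IndexingFilter, reduced to three arguments. */
    interface Filter {
        /** Returns the (possibly modified) document, or null to discard it. */
        Doc filter(Doc doc, String parseText, String url);
    }

    /** Adds a "host" field derived from the page URL. */
    static class HostFilter implements Filter {
        @Override
        public Doc filter(Doc doc, String parseText, String url) {
            doc.add("host", java.net.URI.create(url).getHost());
            return doc;
        }
    }

    /** Discards documents whose parsed text is missing or blank. */
    static class NonEmptyFilter implements Filter {
        @Override
        public Doc filter(Doc doc, String parseText, String url) {
            boolean blank = parseText == null || parseText.trim().isEmpty();
            return blank ? null : doc;
        }
    }

    /** Runs the filters in order; assumes a null result ends the chain. */
    static Doc runChain(List<Filter> filters, Doc doc, String parseText, String url) {
        for (Filter f : filters) {
            doc = f.filter(doc, parseText, url);
            if (doc == null) {
                return null; // document is dropped from indexing
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        List<Filter> chain = new ArrayList<>();
        chain.add(new NonEmptyFilter());
        chain.add(new HostFilter());

        Doc kept = runChain(chain, new Doc(), "some text", "http://example.org/page");
        System.out.println("kept fields: " + kept.fields);

        Doc dropped = runChain(chain, new Doc(), "   ", "http://example.org/empty");
        System.out.println("dropped is null: " + (dropped == null));
    }
}
```

In this sketch the order of the chain matters: placing <code>NonEmptyFilter</code> first means later filters never see a document that is going to be discarded. A real plugin would implement <code>IndexingFilter</code> itself, together with the inherited <code>getConf</code> and <code>setConf</code> methods.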