org.apache.nutch.indexer.basic
Class BasicIndexingFilter
- java.lang.Object
- org.apache.nutch.indexer.basic.BasicIndexingFilter
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, IndexingFilter, Pluggable
public class BasicIndexingFilter extends Object implements IndexingFilter
Adds basic searchable fields to a document. The fields added are: domain, host, url, content, title, cache, tstamp.
- domain is included depending on indexer.add.domain in nutch-default.xml.
- title is truncated as per indexer.max.title.length in nutch-default.xml. As per NUTCH-1004, a zero-length title is not added.
- content is truncated as per indexer.max.content.length in nutch-default.xml.
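The truncation rules above can be illustrated with a minimal, self-contained sketch. It is not the Nutch source: the names FieldTruncationSketch, truncate, and addTitleAndContent are hypothetical, a plain Map stands in for NutchDocument, and treating a negative limit as "no truncation" is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the documented truncation rules; not the Nutch implementation.
final class FieldTruncationSketch {

    // Cuts value to at most maxLength characters.
    // Assumption: a negative maxLength means "no limit".
    static String truncate(String value, int maxLength) {
        if (maxLength >= 0 && value.length() > maxLength) {
            return value.substring(0, maxLength);
        }
        return value;
    }

    // Builds a map (stand-in for NutchDocument) holding the content and title fields.
    static Map<String, String> addTitleAndContent(String title, String content,
                                                  int maxTitleLength, int maxContentLength) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("content", truncate(content, maxContentLength));

        String truncatedTitle = truncate(title, maxTitleLength);
        // As per NUTCH-1004, a zero-length title is not added.
        if (!truncatedTitle.isEmpty()) {
            doc.put("title", truncatedTitle);
        }
        return doc;
    }
}
```

Note that substring counts UTF-16 code units, so a limit that falls inside a surrogate pair would split it; the sketch does not guard against that.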
Field Summary
Fields
| Modifier and Type | Field and Description |
| --- | --- |
| static org.slf4j.Logger | LOG |

Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
Constructor Summary
Constructors
| Constructor and Description |
| --- |
| BasicIndexingFilter() |

Method Summary
Methods
| Modifier and Type | Method and Description |
| --- | --- |
| NutchDocument | <code>filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks)</code> Adds basic searchable fields to the document, subject to a few configuration settings. |
| org.apache.hadoop.conf.Configuration | <code>getConf()</code> Get the Configuration object |
| void | <code>setConf(org.apache.hadoop.conf.Configuration conf)</code> Set the Configuration object |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
LOG
public static final org.slf4j.Logger LOG
Constructor Detail
BasicIndexingFilter
public BasicIndexingFilter()
Method Detail
filter
public NutchDocument filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Adds the basic searchable fields to the document. A few configuration settings control the result: see indexer.add.domain, indexer.max.title.length, and indexer.max.content.length in nutch-default.xml.
- Specified by:
- <code>filter</code> in interface <code>IndexingFilter</code>
- Parameters:
- <code>doc</code> - The [<code>NutchDocument</code>](../../../../../org/apache/nutch/indexer/NutchDocument.html) object
- <code>parse</code> - The relevant [<code>Parse</code>](../../../../../org/apache/nutch/parse/Parse.html) object passing through the filter
- <code>url</code> - URL of the page being indexed
- <code>datum</code> - The [<code>CrawlDatum</code>](../../../../../org/apache/nutch/crawl/CrawlDatum.html) entry
- <code>inlinks</code> - The [<code>Inlinks</code>](../../../../../org/apache/nutch/crawl/Inlinks.html) containing anchor text
- Returns:
- filtered NutchDocument
- Throws:
- <code>IndexingException</code>
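As a rough illustration of how the host, url, and optional domain fields relate to the url parameter, the following sketch may help. Everything in it is hypothetical: BasicFieldsSketch, basicFields, and naiveDomain are invented names, a plain Map stands in for NutchDocument, and the domain is approximated by keeping the last two host labels, which is a simplification and not how Nutch derives it.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of deriving URL-based fields; not the Nutch implementation.
final class BasicFieldsSketch {

    // Returns a map (stand-in for NutchDocument) with host, url and, optionally, domain.
    // addDomain mirrors the role of indexer.add.domain.
    static Map<String, String> basicFields(String url, boolean addDomain) {
        Map<String, String> doc = new LinkedHashMap<>();
        String host = URI.create(url).getHost(); // null when the URI has no parsable host
        if (host != null) {
            if (addDomain) {
                doc.put("domain", naiveDomain(host));
            }
            doc.put("host", host);
        }
        doc.put("url", url);
        return doc;
    }

    // Assumption: keeps only the last two labels of the host name.
    // This is wrong for suffixes such as co.uk and for IP addresses;
    // a real implementation would consult a public suffix list.
    static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }
}
```

The sketch covers only the fields that can be read off the URL; content, title, cache, and tstamp depend on the parse and datum arguments and are left out.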
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object
- Specified by:
- <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
getConf
public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object
- Specified by:
- <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
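A filter of this kind would typically read its settings once, when setConf is called. The sketch below shows that pattern under stated assumptions: java.util.Properties stands in for org.apache.hadoop.conf.Configuration so the example runs without Hadoop, FilterSettingsSketch and its accessors are hypothetical names, and the fallback values are assumed defaults that should be checked against nutch-default.xml.

```java
import java.util.Properties;

// Hypothetical sketch of caching the three settings in setConf; not the Nutch implementation.
final class FilterSettingsSketch {

    private Properties conf; // stand-in for the Hadoop Configuration object
    private boolean addDomain;
    private int maxTitleLength;
    private int maxContentLength;

    // Stores the configuration and caches the settings the filter needs.
    // The fallback values below are assumptions, not confirmed defaults.
    void setConf(Properties conf) {
        this.conf = conf;
        this.addDomain = Boolean.parseBoolean(conf.getProperty("indexer.add.domain", "false"));
        this.maxTitleLength = Integer.parseInt(conf.getProperty("indexer.max.title.length", "100"));
        this.maxContentLength = Integer.parseInt(conf.getProperty("indexer.max.content.length", "-1"));
    }

    // Returns the configuration object passed to setConf.
    Properties getConf() {
        return conf;
    }

    boolean isAddDomain() { return addDomain; }
    int getMaxTitleLength() { return maxTitleLength; }
    int getMaxContentLength() { return maxContentLength; }
}
```

Caching the values in setConf keeps the per-document filter call free of repeated configuration lookups.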
