org.apache.nutch.indexer.basic
Class BasicIndexingFilter
- java.lang.Object
- org.apache.nutch.indexer.basic.BasicIndexingFilter
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, IndexingFilter, Pluggable
public class BasicIndexingFilter extends Object implements IndexingFilter
Adds basic searchable fields to a document. The fields added are: domain, host, url, content, title, cache, tstamp.
- domain is included depending on indexer.add.domain in nutch-default.xml.
- title is truncated as per indexer.max.title.length in nutch-default.xml. (As per NUTCH-1004, a zero-length title is not added.)
- content is truncated as per indexer.max.content.length in nutch-default.xml.
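The truncation rules described above can be sketched without any Nutch dependency. This is a minimal illustration, not the plugin's actual source: the class `TruncationSketch`, its methods, and the plain `Map` used in place of `NutchDocument` are hypothetical, and treating a negative limit as "no truncation" is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for illustration; not part of Nutch. */
public class TruncationSketch {

  /**
   * Cuts text to at most maxLength characters.
   * Assumption: a negative limit means "do not truncate".
   */
  public static String truncate(String text, int maxLength) {
    if (maxLength > -1 && text.length() > maxLength) {
      return text.substring(0, maxLength);
    }
    return text;
  }

  /** Builds the title and content fields; a Map stands in for NutchDocument. */
  public static Map<String, String> basicFields(
      String title, String content, int maxTitleLength, int maxContentLength) {
    Map<String, String> doc = new LinkedHashMap<>();
    doc.put("content", truncate(content, maxContentLength));

    String shortTitle = truncate(title, maxTitleLength);
    // A zero-length title is not added (the rule attributed to NUTCH-1004).
    if (!shortTitle.isEmpty()) {
      doc.put("title", shortTitle);
    }
    return doc;
  }

  public static void main(String[] args) {
    System.out.println(basicFields("Apache Nutch", "Some page text", 6, -1));
    System.out.println(basicFields("", "Some page text", 6, 4));
  }
}
```

In the second call the empty title produces no `title` entry at all, rather than an entry with an empty value.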
Field Summary
| Modifier and Type | Field and Description |
| --- | --- |
| static org.slf4j.Logger | LOG |
-
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
Constructor Summary
| Constructor and Description |
| --- |
| BasicIndexingFilter() |
Method Summary
| Modifier and Type | Method and Description |
| --- | --- |
| NutchDocument | filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks): Adds basic searchable fields to the document, subject to a few configuration settings. |
| org.apache.hadoop.conf.Configuration | getConf(): Get the Configuration object |
| void | setConf(org.apache.hadoop.conf.Configuration conf): Set the Configuration object |
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
Constructor Detail
-
BasicIndexingFilter
public BasicIndexingFilter()
Method Detail
-
filter
public NutchDocument filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Adds basic searchable fields to the document, subject to a few configuration settings. See indexer.add.domain, indexer.max.title.length, and indexer.max.content.length in nutch-default.xml.
- Specified by:
- <code>filter</code> in interface <code>IndexingFilter</code>
- Parameters:
- <code>doc</code> - The [<code>NutchDocument</code>](../../../../../org/apache/nutch/indexer/NutchDocument.html) object
- <code>parse</code> - The relevant [<code>Parse</code>](../../../../../org/apache/nutch/parse/Parse.html) object passing through the filter
- <code>url</code> - URL to be filtered for anchor text
- <code>datum</code> - The [<code>CrawlDatum</code>](../../../../../org/apache/nutch/crawl/CrawlDatum.html) entry
- <code>inlinks</code> - The [<code>Inlinks</code>](../../../../../org/apache/nutch/crawl/Inlinks.html) containing anchor text
- Returns:
- filtered NutchDocument
- Throws:
- <code>IndexingException</code>
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object
- Specified by:
- <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
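As a rough illustration of how a `setConf`-style method might read and cache the three settings named on this page, the sketch below uses `java.util.Properties` as a stand-in for the Hadoop `Configuration` object so that it runs without Hadoop on the classpath. The class `FilterSettingsSketch`, its getters, and the fallback values are assumptions made for this example, not confirmed details of the plugin.

```java
import java.util.Properties;

/** Hypothetical stand-in for illustration; not part of Nutch or Hadoop. */
public class FilterSettingsSketch {
  private Properties conf;
  private boolean addDomain;
  private int maxTitleLength;
  private int maxContentLength;

  /** Stores the configuration and caches the settings the filter would consult. */
  public void setConf(Properties conf) {
    this.conf = conf;
    // Fallback values are assumptions for this sketch, not verified defaults.
    addDomain = Boolean.parseBoolean(conf.getProperty("indexer.add.domain", "false"));
    maxTitleLength = Integer.parseInt(conf.getProperty("indexer.max.title.length", "100"));
    maxContentLength = Integer.parseInt(conf.getProperty("indexer.max.content.length", "-1"));
  }

  /** Returns the configuration passed to setConf. */
  public Properties getConf() {
    return conf;
  }

  public boolean isAddDomain() {
    return addDomain;
  }

  public int getMaxTitleLength() {
    return maxTitleLength;
  }

  public int getMaxContentLength() {
    return maxContentLength;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("indexer.add.domain", "true");
    props.setProperty("indexer.max.title.length", "50");

    FilterSettingsSketch settings = new FilterSettingsSketch();
    settings.setConf(props);
    System.out.println("addDomain=" + settings.isAddDomain());
    System.out.println("maxTitleLength=" + settings.getMaxTitleLength());
    System.out.println("maxContentLength=" + settings.getMaxContentLength());
  }
}
```

Parsing the values once in `setConf` keeps the per-document `filter` path free of repeated lookups, which is a common reason for caching settings this way.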
-
getConf
public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object
- Specified by:
- <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>