org.apache.nutch.indexer.tld
Class TLDIndexingFilter
- java.lang.Object
- org.apache.nutch.indexer.tld.TLDIndexingFilter
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, IndexingFilter, Pluggable
public class TLDIndexingFilter extends Object implements IndexingFilter
Adds the top-level domain (TLD) extension of each document's URL to the index.
- Author:
- Enis Soztutar enis.soz.nutch@gmail.com
Field Summary
static org.slf4j.Logger LOG
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
Constructor Summary
TLDIndexingFilter()
Method Summary
NutchDocument filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text urlText, CrawlDatum datum, Inlinks inlinks)
Adds fields or otherwise modifies the document that will be indexed for a parse.
org.apache.hadoop.conf.Configuration getConf()
void setConf(org.apache.hadoop.conf.Configuration conf)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
LOG
public static final org.slf4j.Logger LOG
Constructor Detail
TLDIndexingFilter
public TLDIndexingFilter()
Method Detail
filter
public NutchDocument filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text urlText, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Description copied from interface: IndexingFilter
Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
- Specified by:
- <code>filter</code> in interface <code>IndexingFilter</code>
- Parameters:
- <code>doc</code> - document instance for collecting fields
- <code>parse</code> - parse data instance
- <code>urlText</code> - page url
- <code>datum</code> - crawl datum for the page
- <code>inlinks</code> - page inlinks
- Returns:
- modified (or a new) document instance, or null (meaning the document should be discarded)
- Throws:
- <code>IndexingException</code>
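This page does not show how the filter derives the top-level domain or which field it writes, so the following is only a minimal, self-contained sketch of the general idea. It assumes a field named "tld" and takes the last dot-separated label of the host as the TLD. SketchDocument, lastHostLabel, and the simplified filter signature are hypothetical stand-ins, not part of the Nutch API; a production filter would more plausibly consult a table of known domain suffixes.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: none of these types come from Nutch.
class TldFilterSketch {

    /** Hypothetical stand-in for an index document: field name -> values. */
    static final class SketchDocument {
        final Map<String, List<String>> fields = new HashMap<>();

        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /**
     * Returns the lower-cased last label of the URL's host, or null when the
     * URL is malformed, has no dotted host, or ends in a numeric label
     * (for example an IPv4 address).
     */
    static String lastHostLabel(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null || host.isEmpty()) {
                return null;
            }
            int dot = host.lastIndexOf('.');
            if (dot < 0 || dot == host.length() - 1) {
                return null;
            }
            String label = host.substring(dot + 1).toLowerCase(Locale.ROOT);
            if (label.chars().allMatch(Character::isDigit)) {
                return null;
            }
            return label;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /**
     * Mirrors the contract described above in simplified form: the document
     * is modified in place and returned. The assumed field name is "tld".
     * This sketch never discards a document, so it never returns null.
     */
    static SketchDocument filter(SketchDocument doc, String url) {
        String tld = lastHostLabel(url);
        if (tld != null) {
            doc.add("tld", tld);
        }
        return doc;
    }

    public static void main(String[] args) {
        SketchDocument doc =
            filter(new SketchDocument(), "http://www.example.org/index.html");
        System.out.println(doc.fields); // prints {tld=[org]}
    }
}
```

Taking the last host label is the simplest possible choice and treats "co.uk" as just "uk". It is used here only to keep the example free of external dependencies.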
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
- <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
- <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>