org.apache.nutch.indexer.urlmeta

Class URLMetaIndexingFilter


public class URLMetaIndexingFilter
extends Object
implements IndexingFilter

This is part of the URL Meta plugin. It is designed to enhance the NUTCH-655 patch by doing two things:

1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs, and can be directly queried, assuming you have done everything else correctly.

The flat file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:

[www.url.com]\t[key1]=[value1]\t[key2]=[value2]…[keyN]=[valueN]

Be aware that if you collide with keywords that are already in use (such as nutch.score/nutch.fetchInterval), the behavior is unpredictable.

Furthermore, in your nutch-site.xml config, you must specify (1) that this plugin is to be used, and (2) which Meta Tags it should actively look for. This does not mean that you must use these tags for every URL, but it does mean that you must list all of the meta tags that you have specified if you want them to be propagated and indexed.

1. As of Nutch 1.2, the property "plugin.includes" looks as follows:

protocol-http|urlfilter-regex|parse-(text|html|js|tika|rss)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

You must change "index-(basic|anchor)" to "index-(basic|anchor|urlmeta)" in order to call this plugin.

2. You must also specify the property "urlmeta.tags", whose value is a comma-delimited list of keys: key1, key2, key3

TODO: It may be ideal to offer two separate properties, to specify what gets indexed versus merely propagated.
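To make the tab-delimited seed format described above concrete, here is a minimal sketch that splits one seed line into its URL and its key=value pairs. It uses only the Java standard library; the class `SeedLineSketch`, its methods, and the sample URL and keys are hypothetical and are not part of Nutch or of the NUTCH-655 injector.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: illustrates the seed line layout
//   url \t key1=value1 \t key2=value2 ...
class SeedLineSketch {

    /** Returns the URL, which is the first tab-separated field. */
    static String url(String line) {
        return line.split("\t", -1)[0].trim();
    }

    /** Returns the key=value fields that follow the URL, in file order. */
    static Map<String, String> metadata(String line) {
        Map<String, String> meta = new LinkedHashMap<>();
        String[] fields = line.split("\t");
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // assumption of this sketch: ignore fields with no key or no '='
            }
            String key = fields[i].substring(0, eq).trim();
            String value = fields[i].substring(eq + 1).trim();
            meta.put(key, value);
        }
        return meta;
    }

    public static void main(String[] args) {
        // Sample line with made-up keys; real keys must also be listed in "urlmeta.tags".
        String line = "http://www.example.com/\tcategory=news\tlang=en";
        System.out.println(url(line));
        System.out.println(metadata(line));
    }
}
```

How malformed fields are actually handled during injection is not stated here, so skipping them is only a choice made for this sketch.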

Field Summary

Fields inherited from interface org.apache.nutch.indexer.IndexingFilter

X_POINT_ID

Constructor Summary

Constructors

  - <code>URLMetaIndexingFilter()</code>

Method Summary

Methods

  - <code>NutchDocument filter(NutchDocument doc, Parse parse, org.apache.hadoop.io.Text url, CrawlDatum datum, Inlinks inlinks)</code> - Takes the metatags that you have listed in your "urlmeta.tags" property and looks for them inside the CrawlDatum object.
  - <code>org.apache.hadoop.conf.Configuration getConf()</code> - Boilerplate: returns the stored configuration.
  - <code>void setConf(org.apache.hadoop.conf.Configuration conf)</code> - Handles conf assignment and reads the value of the "urlmeta.tags" property.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

URLMetaIndexingFilter

public URLMetaIndexingFilter()

Method Detail

filter

public NutchDocument filter(NutchDocument doc,
                   Parse parse,
                   org.apache.hadoop.io.Text url,
                   CrawlDatum datum,
                   Inlinks inlinks)
                     throws IndexingException

Takes the metatags that you have listed in your "urlmeta.tags" property and looks for them inside the CrawlDatum object. Each tag found there is added as an attribute of the NutchDocument.

  - Specified by: 
  - <code>filter</code> in interface <code>IndexingFilter</code> 
  - Parameters:
  - <code>doc</code> - document instance for collecting fields
  - <code>parse</code> - parse data instance
  - <code>url</code> - page url
  - <code>datum</code> - crawl datum for the page
  - <code>inlinks</code> - page inlinks 
  - Returns:
  - modified (or a new) document instance, or null (meaning the document should be discarded) 
  - Throws: 
  - <code>IndexingException</code>
  - See Also:
  - [<code>IndexingFilter.filter(org.apache.nutch.indexer.NutchDocument, org.apache.nutch.parse.Parse, org.apache.hadoop.io.Text, org.apache.nutch.crawl.CrawlDatum, org.apache.nutch.crawl.Inlinks)</code>](../../../../../org/apache/nutch/indexer/IndexingFilter.html#filter(org.apache.nutch.indexer.NutchDocument, org.apache.nutch.parse.Parse, org.apache.hadoop.io.Text, org.apache.nutch.crawl.CrawlDatum, org.apache.nutch.crawl.Inlinks))       
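The lookup-and-copy behaviour described for `filter` can be sketched with plain Java collections. This is an illustration under stated assumptions, not the plugin's source: a `Map<String, String>` stands in for the CrawlDatum metadata, a `Map<String, List<String>>` stands in for the NutchDocument fields, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the filter step; no Nutch or Hadoop types are used.
class UrlMetaFilterSketch {

    /**
     * Copies every configured tag that is present in the datum metadata
     * into the document fields, and returns the (possibly modified) document.
     */
    static Map<String, List<String>> filter(Map<String, List<String>> doc,
                                            Map<String, String> datumMeta,
                                            List<String> configuredTags) {
        if (doc == null || datumMeta == null || configuredTags == null) {
            return doc; // nothing to add; pass the document through unchanged
        }
        for (String tag : configuredTags) {
            String value = datumMeta.get(tag);
            if (value != null) {
                // Tags that are configured but absent for this URL are skipped.
                doc.computeIfAbsent(tag, k -> new ArrayList<>()).add(value);
            }
        }
        return doc;
    }
}
```

Metadata keys that are not listed in the configured tags are left out of the document, which matches the requirement that every tag to be indexed must appear in "urlmeta.tags".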

getConf

public org.apache.hadoop.conf.Configuration getConf()

Boilerplate: returns the stored configuration.

  - Specified by: 
  - <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>        

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

Handles conf assignment and reads the value of the "urlmeta.tags" property.

  - Specified by: 
  - <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>       
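Since "urlmeta.tags" holds a comma-delimited list, reading it comes down to splitting a string. The sketch below shows one way to do that with the standard library; the class `UrlMetaTagsSketch` and the choice to trim whitespace and drop empty entries are assumptions of this example, not documented behaviour of `setConf`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns a raw "urlmeta.tags" value into a list of tag names.
class UrlMetaTagsSketch {

    static List<String> parseTags(String rawValue) {
        List<String> tags = new ArrayList<>();
        if (rawValue == null) {
            return tags; // property not set: no tags to look for
        }
        for (String part : rawValue.split(",")) {
            String tag = part.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(parseTags("key1, key2, key3"));
    }
}
```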

Copyright © 2014 The Apache Software Foundation