Class OPICScoringFilter

org.apache.nutch.scoring.opic

java.lang.Object
- org.apache.nutch.scoring.opic.OPICScoringFilter

- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, Pluggable, ScoringFilter

public class OPICScoringFilter
extends Object
implements ScoringFilter

This plugin implements a variant of the Online Page Importance Computation (OPIC) score described in: Abiteboul, Serge and Preda, Mihai and Cobena, Gregory (2003), Adaptive On-Line Page Importance Computation.

Author:
Andrzej Bialecki

Field Summary

Fields inherited from interface org.apache.nutch.scoring.ScoringFilter

X_POINT_ID

Constructor Summary

Constructors

  - <code>OPICScoringFilter()</code>

Method Summary

Methods

  - <code>CrawlDatum distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection&lt;Map.Entry&lt;org.apache.hadoop.io.Text,CrawlDatum&gt;&gt; targets, CrawlDatum adjust, int allCount)</code> - Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply.
  - <code>float generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort)</code> - Use CrawlDatum.getScore().
  - <code>org.apache.hadoop.conf.Configuration getConf()</code>
  - <code>float indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)</code> - Dampen the boost value by scorePower.
  - <code>void initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum)</code> - Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level.
  - <code>void injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum)</code> - Set an initial score for newly injected pages.
  - <code>void passScoreAfterParsing(org.apache.hadoop.io.Text url, Content content, Parse parse)</code> - Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
  - <code>void passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content)</code> - Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
  - <code>void setConf(org.apache.hadoop.conf.Configuration conf)</code>
  - <code>void updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List&lt;CrawlDatum&gt; inlinked)</code> - Increase the score by a sum of inlinked scores.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

OPICScoringFilter

public OPICScoringFilter()

Method Detail

getConf

public org.apache.hadoop.conf.Configuration getConf()

  - Specified by: 
  - <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>        

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

  - Specified by: 
  - <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>        

injectedScore

public void injectedScore(org.apache.hadoop.io.Text url,
                 CrawlDatum datum)
                   throws ScoringFilterException

Description copied from interface: ScoringFilter

Set an initial score for newly injected pages. Note: newly injected pages may have no inlinks, so filter implementations may wish to set this score to a non-zero value, to give newly injected pages some initial credit.

  - Specified by: 
  - <code>injectedScore</code> in interface <code>ScoringFilter</code> 
  - Parameters:
  - <code>url</code> - url of the page
  - <code>datum</code> - new datum. Filters will modify it in-place. 
  - Throws: 
  - <code>ScoringFilterException</code>       

initialScore

public void initialScore(org.apache.hadoop.io.Text url,
                CrawlDatum datum)
                  throws ScoringFilterException

Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level. Newly discovered pages have at least one inlink.

  - Specified by: 
  - <code>initialScore</code> in interface <code>ScoringFilter</code> 
  - Parameters:
  - <code>url</code> - url of the page
  - <code>datum</code> - new datum. Filters will modify it in-place. 
  - Throws: 
  - <code>ScoringFilterException</code>       

generatorSortValue

public float generatorSortValue(org.apache.hadoop.io.Text url,
                       CrawlDatum datum,
                       float initSort)
                         throws ScoringFilterException

Use CrawlDatum.getScore().

  - Specified by: 
  - <code>generatorSortValue</code> in interface <code>ScoringFilter</code> 
  - Parameters:
  - <code>url</code> - url of the page
  - <code>datum</code> - page's datum, should not be modified
  - <code>initSort</code> - initial sort value, or a value from previous filters in chain 
  - Throws: 
  - <code>ScoringFilterException</code>       

updateDbScore

public void updateDbScore(org.apache.hadoop.io.Text url,
                 CrawlDatum old,
                 CrawlDatum datum,
                 List<CrawlDatum> inlinked)
                   throws ScoringFilterException

Increase the score by a sum of inlinked scores.

  - Specified by: 
  - <code>updateDbScore</code> in interface <code>ScoringFilter</code> 
  - Parameters:
  - <code>url</code> - url of the page
  - <code>old</code> - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values - the <code>datum</code> parameter may contain values that are no longer valid, if other updates occurred between generation and this update.
  - <code>datum</code> - the new datum, with the original score saved at the time when fetchlist was generated. Filters should update this in-place, and it will be saved in the crawldb.
  - <code>inlinked</code> - (partial) list of CrawlDatum-s (with their scores) from links pointing to this page, found in the current update batch. 
  - Throws: 
  - <code>ScoringFilterException</code>       
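
The update rule described above ("increase the score by a sum of inlinked scores") can be sketched without any Nutch classes. The class and method names below are hypothetical, and scores are modelled as plain floats rather than CrawlDatum instances. This is an illustration of the documented behaviour, not the plugin's actual code.

```java
import java.util.List;

// Hypothetical stand-alone sketch; not part of the Nutch API.
class OpicUpdateSketch {

    // oldScore models old.getScore() and may be null for a newly discovered page,
    // in which case the score carried by the new datum is used as the starting value.
    static float updatedScore(Float oldScore, float datumScore, List<Float> inlinkedScores) {
        float start = (oldScore != null) ? oldScore : datumScore;
        float adjust = 0.0f;
        for (float s : inlinkedScores) {
            adjust += s; // sum the contributions found in the current update batch
        }
        return start + adjust;
    }

    public static void main(String[] args) {
        System.out.println(updatedScore(1.0f, 0.0f, List.of(0.5f, 0.25f)));
        System.out.println(updatedScore(null, 0.0f, List.of(0.5f)));
    }
}
```

Starting from the old score when it is available mirrors the note on the <code>old</code> parameter: the new datum may hold a value that went stale between generation and update.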

passScoreBeforeParsing

public void passScoreBeforeParsing(org.apache.hadoop.io.Text url,
                          CrawlDatum datum,
                          Content content)

Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.

  - Specified by: 
  - <code>passScoreBeforeParsing</code> in interface <code>ScoringFilter</code> 
  - Parameters:
  - <code>url</code> - url of the page
  - <code>datum</code> - source datum. NOTE: modifications to this value are not persisted.
  - <code>content</code> - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.       

passScoreAfterParsing

public void passScoreAfterParsing(org.apache.hadoop.io.Text url,
                         Content content,
                         Parse parse)

Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.

  - Specified by: 
  - <code>passScoreAfterParsing</code> in interface <code>ScoringFilter</code> 
  - Parameters:
  - <code>url</code> - page url
  - <code>content</code> - original content. NOTE: modifications to this value are not persisted.
  - <code>parse</code> - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.       
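
Taken together, passScoreBeforeParsing and passScoreAfterParsing carry the score through the parse step as a metadata string. The sketch below models both metadata containers as plain maps. The key string and all class and method names are placeholders chosen for illustration; the real constant is Fetcher.SCORE_KEY, whose value is not given here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; Map<String, String> stands in for Nutch metadata.
class OpicPassScoreSketch {

    // Placeholder key, assumed for this example only.
    static final String SCORE_KEY = "example.score.key";

    // Before parsing: store the datum's score as text in the content metadata.
    static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    // After parsing: copy the stored value from content metadata to parse metadata.
    static void passScoreAfterParsing(Map<String, String> contentMeta, Map<String, String> parseMeta) {
        String value = contentMeta.get(SCORE_KEY);
        if (value != null) {
            parseMeta.put(SCORE_KEY, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        passScoreBeforeParsing(1.5f, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);
        System.out.println(Float.parseFloat(parseMeta.get(SCORE_KEY)));
    }
}
```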

distributeScoreToOutlinks

public CrawlDatum distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl,
                                   ParseData parseData,
                                   Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets,
                                   CrawlDatum adjust,
                                   int allCount)
                                     throws ScoringFilterException

Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply.

  - Specified by: 
  - <code>distributeScoreToOutlinks</code> in interface <code>ScoringFilter</code> 
  - Parameters:
  - <code>fromUrl</code> - url of the source page
  - <code>parseData</code> - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
  - <code>targets</code> - (url, CrawlDatum) pairs. NOTE: filters can modify this in-place, all changes will be persisted.
  - <code>adjust</code> - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to [<code>CrawlDatum.STATUS\_LINKED</code>](../../../../../org/apache/nutch/crawl/CrawlDatum.html#STATUS_LINKED).
  - <code>allCount</code> - number of all collected outlinks from the source page 
  - Returns:
  - if needed, implementations may return an instance of CrawlDatum, with status [<code>CrawlDatum.STATUS\_LINKED</code>](../../../../../org/apache/nutch/crawl/CrawlDatum.html#STATUS_LINKED), which contains adjustments to be applied to the original CrawlDatum score(s) and metadata. This can be null if not needed. 
  - Throws: 
  - <code>ScoringFilterException</code>       
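
A minimal sketch of the "divide by the number of outlinks and apply" step follows. It assumes that "apply" means setting each target's score to an equal share of the page score; the actual plugin may weight or count links differently. Targets are modelled as a map from URL string to score, and every name in the block is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the Nutch API.
class OpicDistributeSketch {

    // Equal share of the page score per outlink; zero when the page has no outlinks.
    static float perOutlinkShare(float pageScore, int allCount) {
        return allCount > 0 ? pageScore / allCount : 0.0f;
    }

    // Assumption: each target receives the share as its score, modified in place.
    static void distribute(float pageScore, Map<String, Float> targets, int allCount) {
        float share = perOutlinkShare(pageScore, allCount);
        for (Map.Entry<String, Float> target : targets.entrySet()) {
            target.setValue(share);
        }
    }

    public static void main(String[] args) {
        Map<String, Float> targets = new LinkedHashMap<>();
        targets.put("http://a.example/", 0.0f);
        targets.put("http://b.example/", 0.0f);
        // allCount may exceed targets.size() if some collected outlinks were dropped.
        distribute(1.0f, targets, 4);
        System.out.println(targets);
    }
}
```

Dividing by <code>allCount</code> rather than by the size of <code>targets</code> follows the parameter description, which defines it as the number of all collected outlinks from the source page.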

indexerScore

public float indexerScore(org.apache.hadoop.io.Text url,
                 NutchDocument doc,
                 CrawlDatum dbDatum,
                 CrawlDatum fetchDatum,
                 Parse parse,
                 Inlinks inlinks,
                 float initScore)
                   throws ScoringFilterException

Dampen the boost value by scorePower.

  - Specified by: 
  - <code>indexerScore</code> in interface <code>ScoringFilter</code> 
  - Parameters:
  - <code>url</code> - url of the page
  - <code>doc</code> - Lucene document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
  - <code>dbDatum</code> - current page from CrawlDb. NOTE: changes made to this instance are not persisted.
  - <code>fetchDatum</code> - datum from FetcherOutput (containing among others the fetching status)
  - <code>parse</code> - parsing result. NOTE: changes made to this instance are not persisted.
  - <code>inlinks</code> - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
  - <code>initScore</code> - initial boost value for the Lucene document. 
  - Returns:
  - boost value for the Lucene document. This value is passed as an argument to the next scoring filter in chain. NOTE: implementations may also express other scoring strategies by modifying Lucene document directly. 
  - Throws: 
  - <code>ScoringFilterException</code>
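
One plausible reading of "dampen the boost value by scorePower" is to raise the stored page score to the power scorePower and multiply the result by the incoming boost. The sketch below assumes that formula; it is not confirmed by this page, and the class and method names are hypothetical.

```java
// Hypothetical stand-alone sketch; the formula is an assumption, not the documented API.
class OpicIndexerScoreSketch {

    // dbScore models dbDatum.getScore(); a scorePower below 1 compresses large scores.
    static float dampenedBoost(float dbScore, float scorePower, float initScore) {
        return (float) Math.pow(dbScore, scorePower) * initScore;
    }

    public static void main(String[] args) {
        // With scorePower = 0.5 the page score contributes its square root.
        System.out.println(dampenedBoost(4.0f, 0.5f, 1.0f));
        System.out.println(dampenedBoost(100.0f, 0.5f, 2.0f));
    }
}
```

Under this assumption, a scorePower of 1 passes the score through unchanged, while values between 0 and 1 reduce the influence of very high link scores on the final document boost.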