org.apache.nutch.scoring.tld
Class TLDScoringFilter
- java.lang.Object
- org.apache.nutch.scoring.tld.TLDScoringFilter
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, Pluggable, ScoringFilter
public class TLDScoringFilter extends Object implements ScoringFilter
Scoring filter that boosts pages based on their top-level domain (TLD).
- Author:
- Enis Soztutar enis.soz.nutch@gmail.com
Field Summary
-
Fields inherited from interface org.apache.nutch.scoring.ScoringFilter
X_POINT_ID
Constructor Summary
Constructor and Description
TLDScoringFilter()
Method Summary
Modifier and Type, Method and Description
CrawlDatum
distributeScoreToOutlink(org.apache.hadoop.io.Text fromUrl,
org.apache.hadoop.io.Text toUrl,
ParseData parseData,
CrawlDatum target,
CrawlDatum adjust,
int allCount,
int validCount)
CrawlDatum
distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl,
ParseData parseData,
Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Distribute score value from the current page to all its outlinked pages.
float
generatorSortValue(org.apache.hadoop.io.Text url,
CrawlDatum datum,
float initSort)
This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.
org.apache.hadoop.conf.Configuration
getConf()
float
indexerScore(org.apache.hadoop.io.Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
This method calculates a Lucene document boost.
void
initialScore(org.apache.hadoop.io.Text url,
CrawlDatum datum)
Set an initial score for newly discovered pages.
void
injectedScore(org.apache.hadoop.io.Text url,
CrawlDatum datum)
Set an initial score for newly injected pages.
void
passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Currently a part of score distribution is performed using only data coming from the parsing process.
void
passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata.
void
setConf(org.apache.hadoop.conf.Configuration conf)
void
updateDbScore(org.apache.hadoop.io.Text url,
CrawlDatum old,
CrawlDatum datum,
List<CrawlDatum> inlinked)
This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages.
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
-
TLDScoringFilter
public TLDScoringFilter()
Method Detail
-
indexerScore
public float indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) throws ScoringFilterException
Description copied from interface: ScoringFilter
This method calculates a Lucene document boost.
- Specified by:
- <code>indexerScore</code> in interface <code>ScoringFilter</code>
- Parameters:
- <code>url</code> - url of the page
- <code>doc</code> - Lucene document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
- <code>dbDatum</code> - current page from CrawlDb. NOTE: changes made to this instance are not persisted.
- <code>fetchDatum</code> - datum from FetcherOutput (containing among others the fetching status)
- <code>parse</code> - parsing result. NOTE: changes made to this instance are not persisted.
- <code>inlinks</code> - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
- <code>initScore</code> - initial boost value for the Lucene document.
- Returns:
- boost value for the Lucene document. This value is passed as an argument to the next scoring filter in chain. NOTE: implementations may also express other scoring strategies by modifying Lucene document directly.
- Throws:
- <code>ScoringFilterException</code>
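The chaining contract described above (each filter receives the previous filter's return value as <code>initScore</code>) can be sketched without any Nutch dependency. This is a minimal illustration under stated assumptions, not the code of this class: <code>BoostStep</code>, <code>TLD_BOOST</code>, <code>tldOf</code> and the boost factors are hypothetical names and values introduced for the example.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TldBoostSketch {
  /** Hypothetical stand-in for one filter in the chain; not a Nutch type. */
  interface BoostStep {
    float apply(String host, float initScore);
  }

  // Illustrative boost factors only; these values are assumptions.
  static final Map<String, Float> TLD_BOOST = Map.of("edu", 1.5f, "gov", 1.25f);

  /** Returns the last dot-separated label of the host, lower-cased. */
  static String tldOf(String host) {
    int dot = host.lastIndexOf('.');
    String label = dot < 0 ? host : host.substring(dot + 1);
    return label.toLowerCase(Locale.ROOT);
  }

  /** Multiplies the incoming boost by the TLD's factor (1.0 when the TLD is unknown). */
  static float tldBoost(String host, float initScore) {
    return initScore * TLD_BOOST.getOrDefault(tldOf(host), 1.0f);
  }

  /** Feeds each step's return value into the next step as its initScore. */
  static float runChain(List<BoostStep> chain, String host, float initScore) {
    float score = initScore;
    for (BoostStep step : chain) {
      score = step.apply(host, score);
    }
    return score;
  }

  public static void main(String[] args) {
    // Second step is an arbitrary follow-up filter that doubles the boost.
    List<BoostStep> chain = List.of(TldBoostSketch::tldBoost, (h, s) -> s * 2.0f);
    System.out.println(runChain(chain, "example.edu", 1.0f)); // 3.0
    System.out.println(runChain(chain, "example.com", 1.0f)); // 2.0
  }
}
```

Because every step multiplies the value it is handed, the order of multiplicative filters in such a chain does not change the final boost.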
-
distributeScoreToOutlink
public CrawlDatum distributeScoreToOutlink(org.apache.hadoop.io.Text fromUrl, org.apache.hadoop.io.Text toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) throws ScoringFilterException
- Throws:
- <code>ScoringFilterException</code>
-
generatorSortValue
public float generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort) throws ScoringFilterException
Description copied from interface: ScoringFilter
This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.
- Specified by:
- <code>generatorSortValue</code> in interface <code>ScoringFilter</code>
- Parameters:
- <code>url</code> - url of the page
- <code>datum</code> - page's datum, should not be modified
- <code>initSort</code> - initial sort value, or a value from previous filters in chain
- Throws:
- <code>ScoringFilterException</code>
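To illustrate what "sorting and selecting top N scoring pages" means for the value returned here, the following sketch picks the N highest sort values from a map. It is a simplified, in-memory stand-in for fetchlist generation; the class name <code>TopNSketch</code>, the method <code>topN</code> and the sample URLs are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TopNSketch {
  /** Returns up to n URLs with the highest sort values, highest first. */
  static List<String> topN(Map<String, Float> sortValues, int n) {
    List<Map.Entry<String, Float>> entries = new ArrayList<>(sortValues.entrySet());
    // Sort descending by the value each URL received from the filter chain.
    entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());
    List<String> selected = new ArrayList<>();
    int limit = Math.max(0, Math.min(n, entries.size()));
    for (Map.Entry<String, Float> e : entries.subList(0, limit)) {
      selected.add(e.getKey());
    }
    return selected;
  }

  public static void main(String[] args) {
    Map<String, Float> sortValues = Map.of(
        "http://a.example/", 0.5f,
        "http://b.example/", 2.0f,
        "http://c.example/", 1.0f);
    System.out.println(topN(sortValues, 2)); // [http://b.example/, http://c.example/]
  }
}
```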
-
initialScore
public void initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum) throws ScoringFilterException
Description copied from interface: ScoringFilter
Set an initial score for newly discovered pages. Note: newly discovered pages have at least one inlink with its score contribution, so filter implementations may choose to set initial score to zero (unknown value), and then the inlink score contribution will set the "real" value of the new page.
- Specified by:
- <code>initialScore</code> in interface <code>ScoringFilter</code>
- Parameters:
- <code>url</code> - url of the page
- <code>datum</code> - new datum. Filters will modify it in-place.
- Throws:
- <code>ScoringFilterException</code>
-
injectedScore
public void injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum) throws ScoringFilterException
Description copied from interface: ScoringFilter
Set an initial score for newly injected pages. Note: newly injected pages may have no inlinks, so filter implementations may wish to set this score to a non-zero value, to give newly injected pages some initial credit.
- Specified by:
- <code>injectedScore</code> in interface <code>ScoringFilter</code>
- Parameters:
- <code>url</code> - url of the page
- <code>datum</code> - new datum. Filters will modify it in-place.
- Throws:
- <code>ScoringFilterException</code>
-
passScoreAfterParsing
public void passScoreAfterParsing(org.apache.hadoop.io.Text url, Content content, Parse parse) throws ScoringFilterException
Description copied from interface: ScoringFilter
Currently a part of score distribution is performed using only data coming from the parsing process. We need this method in order to ensure the presence of score data in these steps.
- Specified by:
- <code>passScoreAfterParsing</code> in interface <code>ScoringFilter</code>
- Parameters:
- <code>url</code> - page url
- <code>content</code> - original content. NOTE: modifications to this value are not persisted.
- <code>parse</code> - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
- Throws:
- <code>ScoringFilterException</code>
-
passScoreBeforeParsing
public void passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content) throws ScoringFilterException
Description copied from interface: ScoringFilter
This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata. This is needed in order to pass these values to the mechanism that distributes them to outlinked pages.
- Specified by:
- <code>passScoreBeforeParsing</code> in interface <code>ScoringFilter</code>
- Parameters:
- <code>url</code> - url of the page
- <code>datum</code> - source datum. NOTE: modifications to this value are not persisted.
- <code>content</code> - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
- Throws:
- <code>ScoringFilterException</code>
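The hand-off described by <code>passScoreBeforeParsing</code> and <code>passScoreAfterParsing</code> amounts to carrying a score through two metadata containers. The sketch below models both containers as plain string maps; the key <code>sketch.score</code>, the class name and the helper <code>scoreForDistribution</code> are hypothetical, and real filters choose their own keys and use the Nutch metadata classes.

```java
import java.util.HashMap;
import java.util.Map;

class ScorePassingSketch {
  // Hypothetical metadata key, introduced for this example only.
  static final String SCORE_KEY = "sketch.score";

  /** Before parsing: copy the datum's score into the content metadata. */
  static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
    contentMeta.put(SCORE_KEY, Float.toString(datumScore));
  }

  /** After parsing: copy the score from content metadata into parse metadata. */
  static void passScoreAfterParsing(Map<String, String> contentMeta,
                                    Map<String, String> parseMeta) {
    String score = contentMeta.get(SCORE_KEY);
    if (score != null) {
      parseMeta.put(SCORE_KEY, score);
    }
  }

  /** At distribution time: read the score back, falling back to 0 when absent. */
  static float scoreForDistribution(Map<String, String> parseMeta) {
    String score = parseMeta.get(SCORE_KEY);
    return score == null ? 0.0f : Float.parseFloat(score);
  }

  public static void main(String[] args) {
    Map<String, String> contentMeta = new HashMap<>();
    Map<String, String> parseMeta = new HashMap<>();
    passScoreBeforeParsing(2.5f, contentMeta);
    passScoreAfterParsing(contentMeta, parseMeta);
    System.out.println(scoreForDistribution(parseMeta)); // 2.5
  }
}
```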
-
updateDbScore
public void updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) throws ScoringFilterException
Description copied from interface: ScoringFilter
This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages.
- Specified by:
- <code>updateDbScore</code> in interface <code>ScoringFilter</code>
- Parameters:
- <code>url</code> - url of the page
- <code>old</code> - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values - the <code>datum</code> parameter may contain values that are no longer valid, if other updates occurred between generation and this update.
- <code>datum</code> - the new datum, with the original score saved at the time when fetchlist was generated. Filters should update this in-place, and it will be saved in the crawldb.
- <code>inlinked</code> - (partial) list of CrawlDatum-s (with their scores) from links pointing to this page, found in the current update batch.
- Throws:
- <code>ScoringFilterException</code>
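The parameter notes above imply a particular starting point: take the score from <code>old</code> when it is present, otherwise from <code>datum</code>, then write the result into <code>datum</code>. A minimal sketch follows, assuming a simple additive policy for inlink contributions. The additive rule and the <code>Datum</code> stand-in are assumptions for illustration, not the documented behaviour of this class.

```java
import java.util.List;

class UpdateDbScoreSketch {
  /** Hypothetical stand-in for CrawlDatum: just a mutable score. */
  static final class Datum {
    float score;

    Datum(float score) {
      this.score = score;
    }
  }

  /** Starts from old's score when available, adds inlink scores, updates datum in place. */
  static void updateDbScore(Datum old, Datum datum, List<Datum> inlinked) {
    // datum may hold a stale score, so prefer old when it exists.
    float base = (old != null) ? old.score : datum.score;
    float contributed = 0.0f;
    for (Datum in : inlinked) {
      contributed += in.score;
    }
    datum.score = base + contributed;
  }

  public static void main(String[] args) {
    Datum old = new Datum(1.0f);
    Datum datum = new Datum(9.0f); // stale value, ignored because old is present
    updateDbScore(old, datum, List.of(new Datum(0.25f), new Datum(0.5f)));
    System.out.println(datum.score); // 1.75
  }
}
```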
-
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
- <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
- <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
-
distributeScoreToOutlinks
public CrawlDatum distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) throws ScoringFilterException
Description copied from interface: ScoringFilter
Distribute score value from the current page to all its outlinked pages.
- Specified by:
- <code>distributeScoreToOutlinks</code> in interface <code>ScoringFilter</code>
- Parameters:
- <code>fromUrl</code> - url of the source page
- <code>parseData</code> - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
- <code>targets</code> - <code>&lt;url, CrawlDatum&gt;</code> pairs. NOTE: filters can modify this in-place, all changes will be persisted.
- <code>adjust</code> - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to [<code>CrawlDatum.STATUS\_LINKED</code>](../../../../../org/apache/nutch/crawl/CrawlDatum.html#STATUS_LINKED).
- <code>allCount</code> - number of all collected outlinks from the source page
- Returns:
- if needed, implementations may return an instance of CrawlDatum, with status [<code>CrawlDatum.STATUS\_LINKED</code>](../../../../../org/apache/nutch/crawl/CrawlDatum.html#STATUS_LINKED), which contains adjustments to be applied to the original CrawlDatum score(s) and metadata. This can be null if not needed.
- Throws:
- <code>ScoringFilterException</code>
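One way to read the contract above is shown in the sketch below, which gives each target an equal share of the source page's score and returns <code>adjust</code> unchanged. The even-split policy, the <code>Datum</code> stand-in and the use of <code>String</code> URLs are assumptions for illustration; this page does not state how this class distributes scores.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;

class DistributeSketch {
  /** Hypothetical stand-in for CrawlDatum: just a mutable score. */
  static final class Datum {
    float score;
  }

  /**
   * Assigns pageScore / allCount to every target in place.
   * Returns adjust as received; null signals that no adjustment is needed.
   */
  static Datum distribute(float pageScore,
                          Collection<Map.Entry<String, Datum>> targets,
                          Datum adjust,
                          int allCount) {
    if (allCount > 0) {
      // Divide by all collected outlinks, which may exceed targets.size().
      float share = pageScore / allCount;
      for (Map.Entry<String, Datum> target : targets) {
        target.getValue().score = share;
      }
    }
    return adjust;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Datum>> targets = List.of(
        Map.entry("http://a.example/", new Datum()),
        Map.entry("http://b.example/", new Datum()));
    Datum adjust = distribute(1.0f, targets, null, 4);
    System.out.println(targets.get(0).getValue().score); // 0.25
    System.out.println(adjust == null);                  // true
  }
}
```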