Class MimeAdaptiveFetchSchedule

org.apache.nutch.crawl

- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.crawl.AbstractFetchSchedule
      - org.apache.nutch.crawl.AdaptiveFetchSchedule
        - org.apache.nutch.crawl.MimeAdaptiveFetchSchedule

- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, FetchSchedule

public class MimeAdaptiveFetchSchedule
extends AdaptiveFetchSchedule

Extension of <code>AdaptiveFetchSchedule</code> that allows more flexible configuration of the DEC and INC factors for individual MIME-types. This class is typically useful when a recrawl covers many different MIME-types. MIME-types other than text/html rarely change frequently. With this class you can configure different factors per MIME-type, so that frequently changing MIME-types are preferred over others. To work, this class relies on the Content-Type MetaData key being present in the CrawlDB. This can be achieved either when injecting new URLs or by adding "Content-Type" to the db.parsemeta.to.crawldb configuration setting, which forces the MIME-types of newly discovered URLs to be added to the CrawlDB.
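The per-MIME-type lookup described above can be pictured with a minimal, self-contained sketch. It does not use the Nutch API: the class <code>MimeRateLookup</code>, its nested <code>Rates</code> holder, the method names, and the numeric factors are all hypothetical and chosen only for illustration. It assumes that the Content-Type value may carry parameters such as a charset, and that default factors apply when a type is missing or not configured.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: not part of Nutch. All names and values here are hypothetical.
public class MimeRateLookup {

    /** One pair of INC and DEC factors. */
    public static final class Rates {
        public final float inc;
        public final float dec;

        public Rates(float inc, float dec) {
            this.inc = inc;
            this.dec = dec;
        }
    }

    private final Map<String, Rates> byMime = new HashMap<>();
    private final Rates defaults;

    public MimeRateLookup(float defaultInc, float defaultDec) {
        this.defaults = new Rates(defaultInc, defaultDec);
    }

    /** Registers factors for one MIME-type. */
    public void put(String mime, float inc, float dec) {
        byMime.put(normalize(mime), new Rates(inc, dec));
    }

    /** Drops parameters such as "; charset=UTF-8" and lower-cases the type. */
    public static String normalize(String contentType) {
        int semi = contentType.indexOf(';');
        String bare = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return bare.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the configured factors, or the defaults if the type is absent or unknown. */
    public Rates ratesFor(String contentType) {
        if (contentType == null) {
            return defaults;
        }
        return byMime.getOrDefault(normalize(contentType), defaults);
    }

    public static void main(String[] args) {
        MimeRateLookup lookup = new MimeRateLookup(0.2f, 0.2f); // assumed defaults
        lookup.put("text/html", 0.1f, 0.4f);                    // assumed per-type factors

        Rates html = lookup.ratesFor("text/html; charset=UTF-8");
        Rates pdf = lookup.ratesFor("application/pdf");
        System.out.println("text/html       inc=" + html.inc + " dec=" + html.dec);
        System.out.println("application/pdf inc=" + pdf.inc + " dec=" + pdf.dec);
    }
}
```

In this sketch a page without a Content-Type entry simply falls back to the defaults, which mirrors the requirement that the key be present in the CrawlDB for per-type factors to have any effect.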

Author:
markus

Field Summary

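| Modifier and Type | Field and Description |
| --- | --- |
| <code>static org.slf4j.Logger</code> | <code>LOG</code> |
| <code>static String</code> | <code>SCHEDULE_DEC_RATE</code> |
| <code>static String</code> | <code>SCHEDULE_INC_RATE</code> |
| <code>static String</code> | <code>SCHEDULE_MIME_FILE</code> |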

Fields inherited from class org.apache.nutch.crawl.AdaptiveFetchSchedule

DEC_RATE, INC_RATE

Fields inherited from class org.apache.nutch.crawl.AbstractFetchSchedule

defaultInterval, maxInterval

Fields inherited from interface org.apache.nutch.crawl.FetchSchedule

SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN

Constructor Summary

| Constructor and Description |
| --- |
| <code>MimeAdaptiveFetchSchedule()</code> |

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| <code>static void</code> | <code>main(String[] args)</code> |
| <code>void</code> | <code>setConf(org.apache.hadoop.conf.Configuration conf)</code> |
| <code>CrawlDatum</code> | <code>setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)</code><br>Sets the fetchInterval and fetchTime on a successfully fetched page. |

Methods inherited from class org.apache.nutch.crawl.AbstractFetchSchedule

calculateLastFetchTime, forceRefetch, initializeSchedule, setPageGoneSchedule, setPageRetrySchedule, shouldFetch

Methods inherited from class org.apache.hadoop.conf.Configured

getConf

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable

getConf

Field Detail

LOG

public static final org.slf4j.Logger LOG

SCHEDULE_INC_RATE

public static final String SCHEDULE_INC_RATE

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.MimeAdaptiveFetchSchedule.SCHEDULE_INC_RATE)       

SCHEDULE_DEC_RATE

public static final String SCHEDULE_DEC_RATE

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.MimeAdaptiveFetchSchedule.SCHEDULE_DEC_RATE)       

SCHEDULE_MIME_FILE

public static final String SCHEDULE_MIME_FILE

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.MimeAdaptiveFetchSchedule.SCHEDULE_MIME_FILE)

Constructor Detail

MimeAdaptiveFetchSchedule

public MimeAdaptiveFetchSchedule()

Method Detail

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

  - Specified by: 
  - <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code> 
  - Overrides: 
  - <code>setConf</code> in class <code>AdaptiveFetchSchedule</code>        

setFetchSchedule

public CrawlDatum setFetchSchedule(org.apache.hadoop.io.Text url,
                          CrawlDatum datum,
                          long prevFetchTime,
                          long prevModifiedTime,
                          long fetchTime,
                          long modifiedTime,
                          int state)

Description copied from class: AbstractFetchSchedule

Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter - extending classes should call super.setFetchSchedule() to preserve this behavior.

  - Specified by: 
  - <code>setFetchSchedule</code> in interface <code>FetchSchedule</code> 
  - Overrides: 
  - <code>setFetchSchedule</code> in class <code>AdaptiveFetchSchedule</code> 
  - Parameters:
  - <code>url</code> - url of the page
  - <code>datum</code> - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
  - <code>prevFetchTime</code> - previous value of fetch time, or 0 if not available.
  - <code>prevModifiedTime</code> - previous value of modifiedTime, or 0 if not available.
  - <code>fetchTime</code> - the time at which the page was most recently re-fetched. Most FetchSchedule implementations should update the value in <code>CrawlDatum</code> to something greater than this value.
  - <code>modifiedTime</code> - last time the content was modified. This information comes from the protocol implementations, or is set to &lt; 0 if not available. Most FetchSchedule implementations should update the value in <code>CrawlDatum</code> to this value.
  - <code>state</code> - if [<code>FetchSchedule.STATUS\_MODIFIED</code>](../../../../org/apache/nutch/crawl/FetchSchedule.html#STATUS_MODIFIED), then the content is considered to be "changed" before the <code>fetchTime</code>; if [<code>FetchSchedule.STATUS\_NOTMODIFIED</code>](../../../../org/apache/nutch/crawl/FetchSchedule.html#STATUS_NOTMODIFIED), then the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to [<code>FetchSchedule.STATUS\_UNKNOWN</code>](../../../../org/apache/nutch/crawl/FetchSchedule.html#STATUS_UNKNOWN), then it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
  - Returns:
  - adjusted page information, including all original information. NOTE: this may be a different instance than the <code>CrawlDatum</code> passed in as <code>datum</code>, but implementations should make sure that it contains at least all information from that <code>CrawlDatum</code>.
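How the <code>state</code> parameter and the INC and DEC factors might interact can be sketched as follows. This is an illustration of the general adaptive idea (shorten the interval when a page changed, lengthen it when it did not), not the Nutch implementation. The class <code>AdaptiveIntervalSketch</code>, the <code>State</code> enum standing in for the <code>FetchSchedule.STATUS_*</code> constants, the clamping bounds, and all numeric values are assumptions.

```java
// Illustrative only: not part of Nutch. All names and values here are hypothetical.
public class AdaptiveIntervalSketch {

    /** Hypothetical stand-ins for the FetchSchedule.STATUS_* constants. */
    public enum State { MODIFIED, NOT_MODIFIED, UNKNOWN }

    /**
     * Computes the next fetch interval.
     * A modified page gets a shorter interval (scaled by 1 - decRate),
     * an unmodified page gets a longer one (scaled by 1 + incRate),
     * and an unknown state leaves the interval unchanged.
     * The result is clamped to [min, max].
     */
    public static float nextInterval(float interval, State state,
                                     float incRate, float decRate,
                                     float min, float max) {
        switch (state) {
            case MODIFIED:
                interval *= (1.0f - decRate);
                break;
            case NOT_MODIFIED:
                interval *= (1.0f + incRate);
                break;
            default:
                break; // UNKNOWN: keep the current interval
        }
        return Math.max(min, Math.min(max, interval));
    }

    public static void main(String[] args) {
        float inc = 0.5f;   // assumed INC factor for this MIME-type
        float dec = 0.25f;  // assumed DEC factor for this MIME-type
        System.out.println(nextInterval(100f, State.MODIFIED, inc, dec, 10f, 1000f));     // prints 75.0
        System.out.println(nextInterval(100f, State.NOT_MODIFIED, inc, dec, 10f, 1000f)); // prints 150.0
        System.out.println(nextInterval(100f, State.UNKNOWN, inc, dec, 10f, 1000f));      // prints 100.0
    }
}
```

Under these assumptions, choosing a larger DEC factor for a frequently changing MIME-type makes its pages come up for refetch sooner after a detected change, which is the kind of preference the class description refers to.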

main

public static void main(String[] args)
                 throws Exception

  - Throws: 
  - <code>Exception</code>
