
org.apache.nutch.crawl

Class MimeAdaptiveFetchSchedule

    • All Implemented Interfaces:
    • org.apache.hadoop.conf.Configurable, FetchSchedule

public class MimeAdaptiveFetchSchedule
extends AdaptiveFetchSchedule

Extension of AdaptiveFetchSchedule that allows more flexible configuration of the DEC and INC factors for individual MIME types. This class is typically useful when a recrawl covers many different MIME types. MIME types other than text/html rarely change frequently. With this class you can configure different factors per MIME type, so that frequently changing MIME types are preferred over others. To work, this class relies on the Content-Type metadata key being present in the CrawlDB. This can be achieved either when injecting new URLs or by adding "Content-Type" to the db.parsemeta.to.crawldb configuration setting, which forces the MIME types of newly discovered URLs to be added to the CrawlDB.
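The per-MIME-type lookup described above can be sketched as follows. This is a minimal illustration, not the Nutch implementation: the class `MimeRateTable`, its nested `Rate` type, and the tab-separated line format (`mime-type<TAB>inc<TAB>dec`, with `#` starting a comment) are all assumptions made for the example, and this page does not document the actual file format.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps a MIME type to its INC and DEC factors. */
class MimeRateTable {

    /** INC and DEC factors for one MIME type. */
    static final class Rate {
        final float inc;
        final float dec;

        Rate(float inc, float dec) {
            this.inc = inc;
            this.dec = dec;
        }
    }

    private final Map<String, Rate> rates = new HashMap<>();
    private final Rate fallback;

    /** The fallback rate is used for MIME types that have no entry. */
    MimeRateTable(Rate fallback) {
        this.fallback = fallback;
    }

    /** Parses lines of the assumed form "mime-type<TAB>inc<TAB>dec". */
    void load(String text) {
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment
            }
            String[] parts = line.split("\t");
            if (parts.length != 3) {
                continue; // skip lines without exactly three fields
            }
            rates.put(parts[0].toLowerCase(Locale.ROOT),
                    new Rate(Float.parseFloat(parts[1]), Float.parseFloat(parts[2])));
        }
    }

    /** Looks up a Content-Type value, ignoring parameters such as "; charset=UTF-8". */
    Rate lookup(String contentType) {
        if (contentType == null) {
            return fallback; // no Content-Type stored for this URL
        }
        int semi = contentType.indexOf(';');
        String mime = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim().toLowerCase(Locale.ROOT);
        return rates.getOrDefault(mime, fallback);
    }
}
```

Falling back to a default rate when the Content-Type key is missing keeps the sketch usable for URLs whose MIME type never reached the CrawlDB, which is the situation the paragraph above warns about.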

  • Author:
  • markus

Field Summary

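Fields

| Modifier and Type | Field and Description |
| --- | --- |
| <code>static org.slf4j.Logger</code> | <code>LOG</code> |
| <code>static String</code> | <code>SCHEDULE_DEC_RATE</code> |
| <code>static String</code> | <code>SCHEDULE_INC_RATE</code> |
| <code>static String</code> | <code>SCHEDULE_MIME_FILE</code> |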


Fields inherited from class org.apache.nutch.crawl.AdaptiveFetchSchedule

DEC_RATE, INC_RATE


Fields inherited from class org.apache.nutch.crawl.AbstractFetchSchedule

defaultInterval, maxInterval


Fields inherited from interface org.apache.nutch.crawl.FetchSchedule

SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN

Constructor Summary

Constructors

| Constructor and Description |
| --- |
| <code>MimeAdaptiveFetchSchedule()</code> |

Method Summary

Methods

| Modifier and Type | Method and Description |
| --- | --- |
| <code>static void</code> | <code>main(String[] args)</code> |
| <code>void</code> | <code>setConf(org.apache.hadoop.conf.Configuration conf)</code> |
| <code>CrawlDatum</code> | <code>setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)</code> Sets the fetchInterval and fetchTime on a successfully fetched page. |


Methods inherited from class org.apache.nutch.crawl.AbstractFetchSchedule

calculateLastFetchTime, forceRefetch, initializeSchedule, setPageGoneSchedule, setPageRetrySchedule, shouldFetch


Methods inherited from class org.apache.hadoop.conf.Configured

getConf


Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait


Methods inherited from interface org.apache.hadoop.conf.Configurable

getConf

Field Detail


LOG

public static final org.slf4j.Logger LOG

SCHEDULE_INC_RATE

public static final String SCHEDULE_INC_RATE
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.MimeAdaptiveFetchSchedule.SCHEDULE_INC_RATE)       

SCHEDULE_DEC_RATE

public static final String SCHEDULE_DEC_RATE
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.MimeAdaptiveFetchSchedule.SCHEDULE_DEC_RATE)       

SCHEDULE_MIME_FILE

public static final String SCHEDULE_MIME_FILE
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.MimeAdaptiveFetchSchedule.SCHEDULE_MIME_FILE)       

Constructor Detail


MimeAdaptiveFetchSchedule

public MimeAdaptiveFetchSchedule()

Method Detail


setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
  - Specified by: 
  - <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code> 
  - Overrides: 
  - <code>setConf</code> in class <code>AdaptiveFetchSchedule</code>        

setFetchSchedule

public CrawlDatum setFetchSchedule(org.apache.hadoop.io.Text url,
                          CrawlDatum datum,
                          long prevFetchTime,
                          long prevModifiedTime,
                          long fetchTime,
                          long modifiedTime,
                          int state)

Description copied from class: AbstractFetchSchedule

Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter - extending classes should call super.setFetchSchedule() to preserve this behavior.

  - Specified by: 
  - <code>setFetchSchedule</code> in interface <code>FetchSchedule</code> 
  - Overrides: 
  - <code>setFetchSchedule</code> in class <code>AdaptiveFetchSchedule</code> 
  - Parameters:
  - <code>url</code> - url of the page
  - <code>datum</code> - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
  - <code>prevFetchTime</code> - previous value of fetch time, or 0 if not available.
  - <code>prevModifiedTime</code> - previous value of modifiedTime, or 0 if not available.
  - <code>fetchTime</code> - the time at which the page was most recently re-fetched. Most FetchSchedule implementations should update the value in the CrawlDatum to something greater than this value.
  - <code>modifiedTime</code> - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in the CrawlDatum to this value.
  - <code>state</code> - if [<code>FetchSchedule.STATUS\_MODIFIED</code>](../../../../org/apache/nutch/crawl/FetchSchedule.html#STATUS_MODIFIED), then the content is considered to be "changed" before the <code>fetchTime</code>; if [<code>FetchSchedule.STATUS\_NOTMODIFIED</code>](../../../../org/apache/nutch/crawl/FetchSchedule.html#STATUS_NOTMODIFIED), then the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to [<code>FetchSchedule.STATUS\_UNKNOWN</code>](../../../../org/apache/nutch/crawl/FetchSchedule.html#STATUS_UNKNOWN), then it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
  - Returns:
  - adjusted page information, including all original information. NOTE: this may be a different instance than the CrawlDatum passed in as <code>datum</code>, but implementations should make sure that it contains at least all information from that CrawlDatum.
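
As a rough illustration of how the <code>state</code> parameter might interact with the INC and DEC factors, the sketch below shrinks the fetch interval when a page was modified and grows it when the page was unchanged. The multiplicative update, the clamping bounds, the class name `AdaptiveIntervalSketch`, and the local `State` enum are assumptions made for this example; this page does not spell out the formula the class applies.

```java
/** Hypothetical sketch of an adaptive interval update; not the Nutch code. */
class AdaptiveIntervalSketch {

    /** Local stand-in for the FetchSchedule.STATUS_* constants. */
    enum State { UNKNOWN, MODIFIED, NOTMODIFIED }

    /**
     * Returns the next fetch interval in seconds.
     *
     * @param interval current fetch interval
     * @param state    whether the page changed since the previous fetch
     * @param incRate  assumed INC factor, applied when the page is unchanged
     * @param decRate  assumed DEC factor, applied when the page changed
     * @param min      lower bound for the result
     * @param max      upper bound for the result
     */
    static float nextInterval(float interval, State state,
                              float incRate, float decRate,
                              float min, float max) {
        switch (state) {
            case MODIFIED:
                interval *= (1.0f - decRate); // changed: revisit sooner
                break;
            case NOTMODIFIED:
                interval *= (1.0f + incRate); // unchanged: revisit later
                break;
            default:
                break; // unknown: keep the current interval
        }
        // Keep the result inside the allowed range.
        return Math.max(min, Math.min(max, interval));
    }
}
```

With per-MIME-type factors, a caller would pass the INC and DEC values configured for the page's Content-Type instead of a single global pair.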

main

public static void main(String[] args)
                 throws Exception
  - Throws: 
  - <code>Exception</code>      

Copyright © 2014 The Apache Software Foundation