org.apache.nutch.crawl
Interface FetchSchedule
- All Superinterfaces:
- org.apache.hadoop.conf.Configurable
- All Known Implementing Classes:
- AbstractFetchSchedule, AdaptiveFetchSchedule, DefaultFetchSchedule, MimeAdaptiveFetchSchedule
public interface FetchSchedule extends org.apache.hadoop.conf.Configurable
This interface defines the contract for implementations that manipulate fetch times and re-fetch intervals.
- Author:
- Andrzej Bialecki
Field Summary
| Modifier and Type | Field | Description |
| --- | --- | --- |
| <code>static int</code> | <code>SECONDS_PER_DAY</code> | |
| <code>static int</code> | <code>STATUS_MODIFIED</code> | Page is known to have been modified since our last visit. |
| <code>static int</code> | <code>STATUS_NOTMODIFIED</code> | Page is known to remain unmodified since our last visit. |
| <code>static int</code> | <code>STATUS_UNKNOWN</code> | It is unknown whether page was changed since our last visit. |
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| <code>long</code> | <code>calculateLastFetchTime(CrawlDatum datum)</code> | Calculates last fetch time of the given CrawlDatum. |
| <code>CrawlDatum</code> | <code>forceRefetch(org.apache.hadoop.io.Text url, CrawlDatum datum, boolean asap)</code> | This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching. |
| <code>CrawlDatum</code> | <code>initializeSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum)</code> | Initialize fetch schedule related data. |
| <code>CrawlDatum</code> | <code>setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)</code> | Sets the <code>fetchInterval</code> and <code>fetchTime</code> on a successfully fetched page. |
| <code>CrawlDatum</code> | <code>setPageGoneSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)</code> | This method specifies how to schedule refetching of pages marked as GONE. |
| <code>CrawlDatum</code> | <code>setPageRetrySchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)</code> | This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. |
| <code>boolean</code> | <code>shouldFetch(org.apache.hadoop.io.Text url, CrawlDatum datum, long curTime)</code> | This method provides information whether the page is suitable for selection in the current fetchlist. |
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
-
STATUS_UNKNOWN
static final int STATUS_UNKNOWN
It is unknown whether page was changed since our last visit.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.FetchSchedule.STATUS_UNKNOWN)
-
STATUS_MODIFIED
static final int STATUS_MODIFIED
Page is known to have been modified since our last visit.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.FetchSchedule.STATUS_MODIFIED)
-
STATUS_NOTMODIFIED
static final int STATUS_NOTMODIFIED
Page is known to remain unmodified since our last visit.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.FetchSchedule.STATUS_NOTMODIFIED)
-
SECONDS_PER_DAY
static final int SECONDS_PER_DAY
Number of seconds in one day.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.FetchSchedule.SECONDS_PER_DAY)
Method Detail
-
initializeSchedule
CrawlDatum initializeSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum)
Initialize fetch schedule related data. Implementations should at least set the <code>fetchTime</code> and <code>fetchInterval</code>. The default implementation sets the <code>fetchTime</code> to now, using the default <code>fetchInterval</code>.
- Parameters:
- <code>url</code> - URL of the page.
- <code>datum</code> - datum instance to be initialized.
- Returns:
- adjusted page information, including all original information. NOTE: this may be a different instance than <code>datum</code>, but implementations should make sure that it contains at least all information from <code>datum</code>.
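The default behavior described above can be sketched roughly as follows. This is a minimal illustration, not the Nutch implementation: the `Datum` class is a hypothetical stand-in for `CrawlDatum`, the 30-day default interval is an assumed value, and the units (epoch milliseconds for times, seconds for intervals) are assumptions as well.

```java
/** Sketch only: hypothetical stand-in types, not the Nutch classes. */
final class InitializeScheduleSketch {

    /** Minimal stand-in for the scheduling fields of a crawl record. */
    static final class Datum {
        long fetchTimeMs;      // assumed unit: epoch milliseconds
        int fetchIntervalSec;  // assumed unit: seconds
    }

    /** Assumed default interval (30 days); a real deployment would read it from configuration. */
    static final int DEFAULT_INTERVAL_SEC = 30 * 24 * 3600;

    /** Sets fetchTime to "now" and fetchInterval to the default, returning the same instance. */
    static Datum initializeSchedule(String url, Datum datum, long nowMs) {
        datum.fetchTimeMs = nowMs;
        datum.fetchIntervalSec = DEFAULT_INTERVAL_SEC;
        return datum;
    }

    public static void main(String[] args) {
        Datum d = initializeSchedule("http://example.com/", new Datum(), System.currentTimeMillis());
        System.out.println("fetchTime=" + d.fetchTimeMs + " interval=" + d.fetchIntervalSec);
    }
}
```

Passing the current time as a parameter, rather than reading the clock inside the method, is a choice made here only to keep the sketch easy to test.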
-
setFetchSchedule
CrawlDatum setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Sets the <code>fetchInterval</code> and <code>fetchTime</code> on a successfully fetched page. Implementations may use supplied arguments to support different re-fetching schedules.
- Parameters:
- <code>url</code> - url of the page
- <code>datum</code> - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
- <code>prevFetchTime</code> - previous value of fetch time, or 0 if not available.
- <code>prevModifiedTime</code> - previous value of modifiedTime, or 0 if not available.
- <code>fetchTime</code> - the time at which the page was most recently fetched. Most FetchSchedule implementations should update the value in <code>datum</code> to something greater than this value.
- <code>modifiedTime</code> - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in <code>datum</code> to this value.
- <code>state</code> - if [<code>STATUS\_MODIFIED</code>](../../../../org/apache/nutch/crawl/FetchSchedule.html#STATUS_MODIFIED), the content is considered to have changed before the <code>fetchTime</code>; if [<code>STATUS\_NOTMODIFIED</code>](../../../../org/apache/nutch/crawl/FetchSchedule.html#STATUS_NOTMODIFIED), the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to [<code>STATUS\_UNKNOWN</code>](../../../../org/apache/nutch/crawl/FetchSchedule.html#STATUS_UNKNOWN), it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
- Returns:
- adjusted page information, including all original information. NOTE: this may be a different instance than <code>datum</code>, but implementations should make sure that it contains at least all information from <code>datum</code>.
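One way an implementation might use the <code>state</code> argument is sketched below. Everything here is hypothetical: the `Datum` stand-in, the numeric values given to the status constants, the halve-on-change and double-on-no-change policy, and the interval bounds are all assumptions chosen for illustration, not the behavior of any particular Nutch class.

```java
/** Sketch only: hypothetical stand-in types and policy, not the Nutch classes. */
final class SetFetchScheduleSketch {

    // Constant values picked for this sketch; they are assumptions.
    static final int STATUS_UNKNOWN = 0;
    static final int STATUS_MODIFIED = 1;
    static final int STATUS_NOTMODIFIED = 2;

    static final int MIN_INTERVAL_SEC = 60;              // assumed lower bound
    static final int MAX_INTERVAL_SEC = 90 * 24 * 3600;  // assumed upper bound (90 days)

    static final class Datum {
        long fetchTimeMs;      // assumed unit: epoch milliseconds
        long modifiedTimeMs;   // assumed unit: epoch milliseconds
        int fetchIntervalSec;  // assumed unit: seconds
    }

    static Datum setFetchSchedule(Datum datum, long fetchTimeMs, long modifiedTimeMs, int state) {
        long interval = datum.fetchIntervalSec;
        if (state == STATUS_MODIFIED) {
            // Page changed: revisit sooner.
            interval = Math.max(MIN_INTERVAL_SEC, interval / 2);
        } else if (state == STATUS_NOTMODIFIED) {
            // Page unchanged: revisit later.
            interval = Math.min(MAX_INTERVAL_SEC, interval * 2);
        }
        // STATUS_UNKNOWN: keep the current interval.

        datum.fetchIntervalSec = (int) interval;
        // The next fetch time is later than the time of the fetch that just completed.
        datum.fetchTimeMs = fetchTimeMs + interval * 1000L;
        // A negative modifiedTime means "not available", so the stored value is kept.
        if (modifiedTimeMs >= 0) {
            datum.modifiedTimeMs = modifiedTimeMs;
        }
        return datum;
    }
}
```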
-
setPageGoneSchedule
CrawlDatum setPageGoneSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. The default implementation increases <code>fetchInterval</code> by 50%, and if the result exceeds <code>maxInterval</code> it calls <code>forceRefetch(Text, CrawlDatum, boolean)</code>.
- Parameters:
- <code>url</code> - URL of the page
- <code>datum</code> - datum instance to be adjusted.
- Returns:
- adjusted page information, including all original information. NOTE: this may be a different instance than <code>datum</code>, but implementations should make sure that it contains at least all information from <code>datum</code>.
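The 50% back-off for GONE pages could look something like the sketch below. The `Datum` class, the 90-day `MAX_INTERVAL_SEC`, and the placeholder `forceRefetch` body are hypothetical; in particular, what the real `forceRefetch` does to each field is not reproduced here.

```java
/** Sketch only: hypothetical stand-in types, not the Nutch classes. */
final class PageGoneScheduleSketch {

    static final int MAX_INTERVAL_SEC = 90 * 24 * 3600;  // assumed maxInterval (90 days)

    static final class Datum {
        long fetchTimeMs;      // assumed unit: epoch milliseconds
        long modifiedTimeMs;   // assumed unit: epoch milliseconds
        int fetchIntervalSec;  // assumed unit: seconds
    }

    /** Placeholder for forceRefetch: clamps the interval and clears modifiedTime. */
    static Datum forceRefetch(Datum datum) {
        datum.fetchIntervalSec = MAX_INTERVAL_SEC;
        datum.modifiedTimeMs = 0L;
        return datum;
    }

    static Datum setPageGoneSchedule(Datum datum, long fetchTimeMs) {
        // Increase the interval by 50%, using long arithmetic to avoid int overflow.
        long increased = datum.fetchIntervalSec * 3L / 2;
        if (increased > MAX_INTERVAL_SEC) {
            // Interval grew past the maximum: hand over to the forced-refetch path.
            return forceRefetch(datum);
        }
        datum.fetchIntervalSec = (int) increased;
        datum.fetchTimeMs = fetchTimeMs + increased * 1000L;
        return datum;
    }
}
```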
-
setPageRetrySchedule
CrawlDatum setPageRetrySchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. The default implementation sets the next fetch time 1 day in the future and increases the retry counter.
- Parameters:
- <code>url</code> - URL of the page.
- <code>datum</code> - page information.
- <code>prevFetchTime</code> - previous fetch time.
- <code>prevModifiedTime</code> - previous modified time.
- <code>fetchTime</code> - current fetch time.
- Returns:
- adjusted page information, including all original information. NOTE: this may be a different instance than <code>datum</code>, but implementations should make sure that it contains at least all information from <code>datum</code>.
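As a rough illustration of the retry behavior described above (next fetch one day later, retry counter incremented), consider the following sketch. The `Datum` class and its `retriesSinceFetch` field are hypothetical stand-ins, and the millisecond time unit is an assumption.

```java
/** Sketch only: hypothetical stand-in types, not the Nutch classes. */
final class PageRetryScheduleSketch {

    static final int SECONDS_PER_DAY = 24 * 3600;

    static final class Datum {
        long fetchTimeMs;       // assumed unit: epoch milliseconds
        int retriesSinceFetch;  // hypothetical retry counter
    }

    static Datum setPageRetrySchedule(Datum datum, long fetchTimeMs) {
        // Try again one day after the failed attempt.
        datum.fetchTimeMs = fetchTimeMs + SECONDS_PER_DAY * 1000L;
        // Record that another transient failure occurred.
        datum.retriesSinceFetch++;
        return datum;
    }
}
```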
-
calculateLastFetchTime
long calculateLastFetchTime(CrawlDatum datum)
Calculates last fetch time of the given CrawlDatum.
- Returns:
- the date as a long.
-
shouldFetch
boolean shouldFetch(org.apache.hadoop.io.Text url, CrawlDatum datum, long curTime)
This method provides information whether the page is suitable for selection in the current fetchlist. NOTE: a true return value does not guarantee that the page will be fetched; it just allows it to be included in the further selection process based on scores. The default implementation checks <code>fetchTime</code>: if it is higher than <code>curTime</code> it returns false, and true otherwise. It will also check that <code>fetchTime</code> is not too remote (more than <code>maxInterval</code>).
- Parameters:
- <code>url</code> - URL of the page.
- <code>datum</code> - datum instance.
- <code>curTime</code> - reference time (usually set to the time when the fetchlist generation process was started).
- Returns:
- true, if the page should be considered for inclusion in the current fetchlist, otherwise false.
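The basic time comparison can be sketched as below, together with a hypothetical `candidates` helper that filters a map of URLs to next fetch times. Only the `fetchTime` versus `curTime` comparison is shown; the additional check for fetch times that are too remote is left out, and all names and units are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch only: hypothetical helper, not the Nutch implementation. */
final class ShouldFetchSketch {

    /** Returns false while the scheduled fetch time is still in the future. */
    static boolean shouldFetch(long fetchTimeMs, long curTimeMs) {
        return fetchTimeMs <= curTimeMs;
    }

    /** Collects the URLs that are due, in the iteration order of the input map. */
    static List<String> candidates(Map<String, Long> nextFetchTimes, long curTimeMs) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Long> e : nextFetchTimes.entrySet()) {
            if (shouldFetch(e.getValue(), curTimeMs)) {
                due.add(e.getKey());
            }
        }
        return due;
    }
}
```

A URL returned by `candidates` is only eligible; as noted above, later score-based selection may still leave it out of the fetchlist.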
-
forceRefetch
CrawlDatum forceRefetch(org.apache.hadoop.io.Text url, CrawlDatum datum, boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching.
- Parameters:
- <code>url</code> - URL of the page.
- <code>datum</code> - datum instance.
- <code>asap</code> - if true, force refetch as soon as possible; this sets the fetchTime to now. If false, force refetch whenever the next fetch time is set.
- Returns:
- adjusted page information, including all original information. NOTE: this may be a different instance than <code>datum</code>, but implementations should make sure that it contains at least all information from <code>datum</code>.
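A forced refetch might be sketched as follows. The `Datum` class is a hypothetical stand-in, and the values the fields are reset to (a 30-day default interval, zero for modifiedTime, `null` for the signature) are assumptions, since the description above does not say what the reset values are.

```java
/** Sketch only: hypothetical stand-in types and reset values, not the Nutch classes. */
final class ForceRefetchSketch {

    static final int DEFAULT_INTERVAL_SEC = 30 * 24 * 3600;  // assumed reset value (30 days)

    static final class Datum {
        long fetchTimeMs;      // assumed unit: epoch milliseconds
        long modifiedTimeMs;   // assumed unit: epoch milliseconds
        int fetchIntervalSec;  // assumed unit: seconds
        byte[] signature;      // page signature; null means "none recorded"
    }

    static Datum forceRefetch(Datum datum, boolean asap, long nowMs) {
        // Forget what was known about the page so the next fetch is treated as fresh.
        datum.fetchIntervalSec = DEFAULT_INTERVAL_SEC;
        datum.modifiedTimeMs = 0L;
        datum.signature = null;
        if (asap) {
            // Make the page due immediately.
            datum.fetchTimeMs = nowMs;
        }
        // When asap is false the existing fetchTime is left untouched.
        return datum;
    }
}
```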
Copyright © 2014 The Apache Software Foundation