org.apache.nutch.crawl
Class CrawlDatum
- java.lang.Object
- org.apache.nutch.crawl.CrawlDatum
- All Implemented Interfaces:
- Cloneable, Comparable<CrawlDatum>, org.apache.hadoop.io.Writable, org.apache.hadoop.io.WritableComparable<CrawlDatum>
public class CrawlDatum extends Object implements org.apache.hadoop.io.WritableComparable<CrawlDatum>, Cloneable
Nested Class Summary

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static class | CrawlDatum.Comparator | A Comparator optimized for CrawlDatum. |

Field Summary

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static String | FETCH_DIR_NAME | |
| static String | GENERATE_DIR_NAME | |
| static String | PARSE_DIR_NAME | |
| static HashMap<Byte,String> | statNames | |
| static byte | STATUS_DB_DUPLICATE | |
| static byte | STATUS_DB_FETCHED | Page was successfully fetched. |
| static byte | STATUS_DB_GONE | Page no longer exists. |
| static byte | STATUS_DB_MAX | Maximum value of DB-related status. |
| static byte | STATUS_DB_NOTMODIFIED | Page was successfully fetched and found not modified. |
| static byte | STATUS_DB_REDIR_PERM | Page permanently redirects to other page. |
| static byte | STATUS_DB_REDIR_TEMP | Page temporarily redirects to other page. |
| static byte | STATUS_DB_UNFETCHED | Page was not fetched yet. |
| static byte | STATUS_FETCH_GONE | Fetching unsuccessful - page is gone. |
| static byte | STATUS_FETCH_MAX | Maximum value of fetch-related status. |
| static byte | STATUS_FETCH_NOTMODIFIED | Fetching successful - page is not modified. |
| static byte | STATUS_FETCH_REDIR_PERM | Fetching permanently redirected to other page. |
| static byte | STATUS_FETCH_REDIR_TEMP | Fetching temporarily redirected to other page. |
| static byte | STATUS_FETCH_RETRY | Fetching unsuccessful, needs to be retried (transient errors). |
| static byte | STATUS_FETCH_SUCCESS | Fetching was successful. |
| static byte | STATUS_INJECTED | Page was newly injected. |
| static byte | STATUS_LINKED | Page discovered through a link. |
| static byte | STATUS_PARSE_META | Page got metadata from a parser. |
| static byte | STATUS_SIGNATURE | Page signature. |

Constructor Summary

| Constructor | Description |
| --- | --- |
| CrawlDatum() | |
| CrawlDatum(int status, int fetchInterval) | |
| CrawlDatum(int status, int fetchInterval, float score) | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| Object | clone() | |
| int | compareTo(CrawlDatum that) | Sort by decreasing score. |
| boolean | equals(Object o) | |
| int | getFetchInterval() | |
| long | getFetchTime() | Returns either the time of the last fetch, or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time. |
| org.apache.hadoop.io.MapWritable | getMetaData() | Returns the MapWritable if one was set or read in readFields(DataInput); returns an empty map if the CrawlDatum was freshly created (the map is instantiated lazily). |
| long | getModifiedTime() | |
| byte | getRetriesSinceFetch() | |
| float | getScore() | |
| byte[] | getSignature() | |
| byte | getStatus() | |
| static String | getStatusName(byte value) | |
| static boolean | hasDbStatus(CrawlDatum datum) | |
| static boolean | hasFetchStatus(CrawlDatum datum) | |
| int | hashCode() | |
| void | putAllMetaData(CrawlDatum other) | Add all metadata from other CrawlDatum to this CrawlDatum. |
| static CrawlDatum | read(DataInput in) | |
| void | readFields(DataInput in) | |
| void | set(CrawlDatum that) | Copy the contents of another instance into this instance. |
| void | setFetchInterval(float fetchInterval) | |
| void | setFetchInterval(int fetchInterval) | |
| void | setFetchTime(long fetchTime) | Sets either the time of the last fetch or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time. |
| void | setMetaData(org.apache.hadoop.io.MapWritable mapWritable) | |
| void | setModifiedTime(long modifiedTime) | |
| void | setRetriesSinceFetch(int retries) | |
| void | setScore(float score) | |
| void | setSignature(byte[] signature) | |
| void | setStatus(int status) | |
| String | toString() | |
| void | write(DataOutput out) | |

-
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
Field Detail
-
GENERATE_DIR_NAME
public static final String GENERATE_DIR_NAME
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.GENERATE_DIR_NAME)
-
FETCH_DIR_NAME
public static final String FETCH_DIR_NAME
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.FETCH_DIR_NAME)
-
PARSE_DIR_NAME
public static final String PARSE_DIR_NAME
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.PARSE_DIR_NAME)
-
STATUS_DB_UNFETCHED
public static final byte STATUS_DB_UNFETCHED
Page was not fetched yet.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_UNFETCHED)
-
STATUS_DB_FETCHED
public static final byte STATUS_DB_FETCHED
Page was successfully fetched.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_FETCHED)
-
STATUS_DB_GONE
public static final byte STATUS_DB_GONE
Page no longer exists.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_GONE)
-
STATUS_DB_REDIR_TEMP
public static final byte STATUS_DB_REDIR_TEMP
Page temporarily redirects to other page.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_REDIR_TEMP)
-
STATUS_DB_REDIR_PERM
public static final byte STATUS_DB_REDIR_PERM
Page permanently redirects to other page.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_REDIR_PERM)
-
STATUS_DB_NOTMODIFIED
public static final byte STATUS_DB_NOTMODIFIED
Page was successfully fetched and found not modified.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_NOTMODIFIED)
-
STATUS_DB_DUPLICATE
public static final byte STATUS_DB_DUPLICATE
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_DUPLICATE)
-
STATUS_DB_MAX
public static final byte STATUS_DB_MAX
Maximum value of DB-related status.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_MAX)
-
STATUS_FETCH_SUCCESS
public static final byte STATUS_FETCH_SUCCESS
Fetching was successful.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_SUCCESS)
-
STATUS_FETCH_RETRY
public static final byte STATUS_FETCH_RETRY
Fetching unsuccessful, needs to be retried (transient errors).
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_RETRY)
-
STATUS_FETCH_REDIR_TEMP
public static final byte STATUS_FETCH_REDIR_TEMP
Fetching temporarily redirected to other page.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_REDIR_TEMP)
-
STATUS_FETCH_REDIR_PERM
public static final byte STATUS_FETCH_REDIR_PERM
Fetching permanently redirected to other page.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_REDIR_PERM)
-
STATUS_FETCH_GONE
public static final byte STATUS_FETCH_GONE
Fetching unsuccessful - page is gone.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_GONE)
-
STATUS_FETCH_NOTMODIFIED
public static final byte STATUS_FETCH_NOTMODIFIED
Fetching successful - page is not modified.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_NOTMODIFIED)
-
STATUS_FETCH_MAX
public static final byte STATUS_FETCH_MAX
Maximum value of fetch-related status.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_MAX)
-
STATUS_SIGNATURE
public static final byte STATUS_SIGNATURE
Page signature.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_SIGNATURE)
-
STATUS_INJECTED
public static final byte STATUS_INJECTED
Page was newly injected.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_INJECTED)
-
STATUS_LINKED
public static final byte STATUS_LINKED
Page discovered through a link.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_LINKED)
-
STATUS_PARSE_META
public static final byte STATUS_PARSE_META
Page got metadata from a parser.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_PARSE_META)
-
statNames
public static final HashMap<Byte,String> statNames
Constructor Detail
-
CrawlDatum
public CrawlDatum()
-
CrawlDatum
public CrawlDatum(int status, int fetchInterval)
-
CrawlDatum
public CrawlDatum(int status, int fetchInterval, float score)
Method Detail
-
hasDbStatus
public static boolean hasDbStatus(CrawlDatum datum)
-
hasFetchStatus
public static boolean hasFetchStatus(CrawlDatum datum)
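
STATUS_DB_MAX and STATUS_FETCH_MAX are documented as the maximum values of the DB-related and fetch-related statuses, which suggests that hasDbStatus and hasFetchStatus are range checks on the status byte. The following is a minimal sketch of that idea, not the Nutch source: `StatusRangeSketch` is a hypothetical stand-in class, the numeric values are assumptions chosen only to make the ranges visible (the real ones are listed under Constant Field Values), and the methods take a raw byte rather than a CrawlDatum.

```java
// Illustrative stand-in; this is not the Nutch CrawlDatum source.
final class StatusRangeSketch {
    // Hypothetical numeric values, chosen only so the ranges are visible.
    static final byte STATUS_DB_UNFETCHED = 0x01;
    static final byte STATUS_DB_MAX = 0x1f;
    static final byte STATUS_FETCH_SUCCESS = 0x21;
    static final byte STATUS_FETCH_MAX = 0x3f;
    static final byte STATUS_INJECTED = 0x42;

    // Assumed rule: a DB status is any value up to STATUS_DB_MAX.
    static boolean hasDbStatus(byte status) {
        return status <= STATUS_DB_MAX;
    }

    // Assumed rule: a fetch status lies above the DB range, up to STATUS_FETCH_MAX.
    static boolean hasFetchStatus(byte status) {
        return status > STATUS_DB_MAX && status <= STATUS_FETCH_MAX;
    }

    public static void main(String[] args) {
        System.out.println("db_unfetched is a DB status: " + hasDbStatus(STATUS_DB_UNFETCHED));
        System.out.println("fetch_success is a fetch status: " + hasFetchStatus(STATUS_FETCH_SUCCESS));
        System.out.println("injected is a DB status: " + hasDbStatus(STATUS_INJECTED));
    }
}
```

Under these assumed values, statuses such as STATUS_INJECTED or STATUS_LINKED fall outside both ranges, so neither check would match them.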
-
getStatus
public byte getStatus()
-
getStatusName
public static String getStatusName(byte value)
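
Since statNames is declared as a HashMap<Byte,String>, getStatusName can plausibly be read as a lookup of the status byte in that map. The sketch below illustrates that pattern with a hypothetical stand-in class, `StatusNameSketch`. The byte values, the name strings, and the "unknown" fallback for unmapped values are all assumptions made for illustration, not documented behavior.

```java
import java.util.HashMap;

// Illustrative stand-in; this is not the Nutch CrawlDatum source.
final class StatusNameSketch {
    // Hypothetical status values and names.
    static final HashMap<Byte, String> statNames = new HashMap<>();
    static {
        statNames.put((byte) 0x01, "db_unfetched");
        statNames.put((byte) 0x21, "fetch_success");
    }

    // Looks the status byte up in the map; the fallback string is an assumption.
    static String getStatusName(byte value) {
        String name = statNames.get(value); // the byte is autoboxed to Byte
        return name != null ? name : "unknown";
    }

    public static void main(String[] args) {
        System.out.println(getStatusName((byte) 0x01));
        System.out.println(getStatusName((byte) 0x7f));
    }
}
```

Note that the map keys must be Byte, so integer literals need an explicit `(byte)` cast when entries are added.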
-
setStatus
public void setStatus(int status)
-
getFetchTime
public long getFetchTime()
Returns either the time of the last fetch, or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time.
-
setFetchTime
public void setFetchTime(long fetchTime)
Sets either the time of the last fetch or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time.
-
getModifiedTime
public long getModifiedTime()
-
setModifiedTime
public void setModifiedTime(long modifiedTime)
-
getRetriesSinceFetch
public byte getRetriesSinceFetch()
-
setRetriesSinceFetch
public void setRetriesSinceFetch(int retries)
-
getFetchInterval
public int getFetchInterval()
-
setFetchInterval
public void setFetchInterval(int fetchInterval)
-
setFetchInterval
public void setFetchInterval(float fetchInterval)
-
getScore
public float getScore()
-
setScore
public void setScore(float score)
-
getSignature
public byte[] getSignature()
-
setSignature
public void setSignature(byte[] signature)
-
setMetaData
public void setMetaData(org.apache.hadoop.io.MapWritable mapWritable)
-
putAllMetaData
public void putAllMetaData(CrawlDatum other)
Add all metadata from other CrawlDatum to this CrawlDatum.
- Parameters:
- <code>other</code> - CrawlDatum
-
getMetaData
public org.apache.hadoop.io.MapWritable getMetaData()
Returns the MapWritable if one was set or read in readFields(DataInput); returns an empty map if the CrawlDatum was freshly created (the map is instantiated lazily).
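
The lazy instantiation described for getMetaData, together with putAllMetaData, can be sketched as follows. This is an illustration under stated assumptions, not the Nutch implementation: `MetaDataSketch` is a hypothetical class, and a `java.util.Map<String, String>` stands in for `org.apache.hadoop.io.MapWritable` so the example runs without Hadoop on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in; java.util.Map replaces MapWritable here.
final class MetaDataSketch {
    // Stays null until the metadata is first set or requested.
    private Map<String, String> metaData;

    // Creates the map on first access, so a fresh instance yields an empty map.
    Map<String, String> getMetaData() {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        return metaData;
    }

    // Replaces the metadata with a copy of the given map.
    void setMetaData(Map<String, String> map) {
        this.metaData = new HashMap<>(map);
    }

    // Adds every entry of the other instance's metadata to this one.
    void putAllMetaData(MetaDataSketch other) {
        getMetaData().putAll(other.getMetaData());
    }

    public static void main(String[] args) {
        MetaDataSketch fresh = new MetaDataSketch();
        System.out.println("fresh metadata is empty: " + fresh.getMetaData().isEmpty());

        MetaDataSketch source = new MetaDataSketch();
        source.getMetaData().put("key", "value"); // hypothetical entry
        fresh.putAllMetaData(source);
        System.out.println("merged metadata: " + fresh.getMetaData());
    }
}
```

Going through the getter inside putAllMetaData means neither side has to check for a missing map first.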
-
read
public static CrawlDatum read(DataInput in) throws IOException
- Throws:
- <code>IOException</code>
-
readFields
public void readFields(DataInput in) throws IOException
- Specified by:
- <code>readFields</code> in interface <code>org.apache.hadoop.io.Writable</code>
- Throws:
- <code>IOException</code>
-
write
public void write(DataOutput out) throws IOException
- Specified by:
- <code>write</code> in interface <code>org.apache.hadoop.io.Writable</code>
- Throws:
- <code>IOException</code>
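
read, readFields and write follow the usual Hadoop Writable pattern: write serializes the fields to a DataOutput, readFields fills an existing instance from a DataInput, and the static read creates a new instance and delegates to readFields. The sketch below shows that round trip with plain `java.io` streams. `WritableRoundTripSketch` is a hypothetical class, and its three fields and their order are assumptions; the actual serialized layout of CrawlDatum is not described here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative stand-in; the field set and order are assumptions.
final class WritableRoundTripSketch {
    byte status;
    long fetchTime;
    float score;

    // Serializes the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeByte(status);
        out.writeLong(fetchTime);
        out.writeFloat(score);
    }

    // Reads the fields back in exactly the order write() emitted them.
    void readFields(DataInput in) throws IOException {
        status = in.readByte();
        fetchTime = in.readLong();
        score = in.readFloat();
    }

    // Static factory: creates an instance and fills it from the input.
    static WritableRoundTripSketch read(DataInput in) throws IOException {
        WritableRoundTripSketch result = new WritableRoundTripSketch();
        result.readFields(in);
        return result;
    }

    // Writes to an in-memory buffer and reads the bytes back into a new instance.
    static WritableRoundTripSketch roundTrip(WritableRoundTripSketch original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            return read(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WritableRoundTripSketch original = new WritableRoundTripSketch();
        original.status = (byte) 0x21; // hypothetical status value
        original.fetchTime = 1_000L;
        original.score = 1.5f;
        WritableRoundTripSketch copy = roundTrip(original);
        System.out.println(copy.status + " " + copy.fetchTime + " " + copy.score);
    }
}
```

The important property is that readFields consumes the fields in the same order and with the same types that write produced them.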
-
set
public void set(CrawlDatum that)
Copy the contents of another instance into this instance.
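
One common reason to copy one instance into another is that Hadoop frameworks often refill a single Writable object for each record, so any value that must outlive the current record needs its own copy. The sketch below illustrates that usage with a hypothetical class, `CopySketch`, that carries only a score. Both the single-field copy and the reuse loop are assumptions for illustration, not the Nutch implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in; a real CrawlDatum carries more fields than a score.
final class CopySketch {
    float score;

    // Copies the contents of another instance into this instance.
    void set(CopySketch that) {
        this.score = that.score;
    }

    // Simulates a framework that refills one object per record.
    static List<CopySketch> collect(float[] scores) {
        CopySketch reused = new CopySketch();
        List<CopySketch> kept = new ArrayList<>();
        for (float s : scores) {
            reused.score = s;              // the shared object is overwritten
            CopySketch copy = new CopySketch();
            copy.set(reused);              // keep an independent copy
            kept.add(copy);
        }
        return kept;
    }

    public static void main(String[] args) {
        for (CopySketch datum : collect(new float[] {1.0f, 2.0f})) {
            System.out.println(datum.score);
        }
    }
}
```

Adding `reused` itself to the list would leave every element pointing at the same object, holding only the last score.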
-
compareTo
public int compareTo(CrawlDatum that)
Sort by decreasing score.
- Specified by:
- <code>compareTo</code> in interface <code>Comparable&lt;CrawlDatum&gt;</code>
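
"Sort by decreasing score" means that a natural-order sort puts the highest-scoring entries first. The sketch below shows one way to get that ordering, using a hypothetical class, `ScoredDatumSketch`, that carries only a label and a score. It is an assumption that the score alone decides the order; the real compareTo may use further fields to break ties.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative stand-in; only the score takes part in the ordering here.
final class ScoredDatumSketch implements Comparable<ScoredDatumSketch> {
    final String label;
    final float score;

    ScoredDatumSketch(String label, float score) {
        this.label = label;
        this.score = score;
    }

    @Override
    public int compareTo(ScoredDatumSketch that) {
        // Arguments are swapped so that a higher score sorts first.
        return Float.compare(that.score, this.score);
    }

    // Sorts a copy of the input and returns the labels in sorted order.
    static List<String> sortedLabels(List<ScoredDatumSketch> data) {
        List<ScoredDatumSketch> copy = new ArrayList<>(data);
        Collections.sort(copy);
        List<String> labels = new ArrayList<>();
        for (ScoredDatumSketch datum : copy) {
            labels.add(datum.label);
        }
        return labels;
    }

    public static void main(String[] args) {
        List<ScoredDatumSketch> data = Arrays.asList(
            new ScoredDatumSketch("mid", 1.0f),
            new ScoredDatumSketch("low", 0.5f),
            new ScoredDatumSketch("high", 2.0f));
        System.out.println(sortedLabels(data));
    }
}
```

Using `Float.compare` rather than subtracting the scores avoids the truncation and NaN problems of a hand-written difference.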
-
toString
public String toString()
- Overrides:
- <code>toString</code> in class <code>Object</code>
-
equals
public boolean equals(Object o)
- Overrides:
- <code>equals</code> in class <code>Object</code>
-
hashCode
public int hashCode()
- Overrides:
- <code>hashCode</code> in class <code>Object</code>
-
clone
public Object clone()
- Overrides:
- <code>clone</code> in class <code>Object</code>