
org.apache.nutch.crawl

Class CrawlDatum


public class CrawlDatum
extends Object
implements org.apache.hadoop.io.WritableComparable<CrawlDatum>, Cloneable

Nested Class Summary

| Modifier and Type | Class | Description |
| --- | --- | --- |
| `static class` | `CrawlDatum.Comparator` | A Comparator optimized for CrawlDatum. |

Field Summary

| Modifier and Type | Field | Description |
| --- | --- | --- |
| `static String` | `FETCH_DIR_NAME` | |
| `static String` | `GENERATE_DIR_NAME` | |
| `static String` | `PARSE_DIR_NAME` | |
| `static HashMap<Byte,String>` | `statNames` | |
| `static byte` | `STATUS_DB_DUPLICATE` | |
| `static byte` | `STATUS_DB_FETCHED` | Page was successfully fetched. |
| `static byte` | `STATUS_DB_GONE` | Page no longer exists. |
| `static byte` | `STATUS_DB_MAX` | Maximum value of DB-related status. |
| `static byte` | `STATUS_DB_NOTMODIFIED` | Page was successfully fetched and found not modified. |
| `static byte` | `STATUS_DB_REDIR_PERM` | Page permanently redirects to other page. |
| `static byte` | `STATUS_DB_REDIR_TEMP` | Page temporarily redirects to other page. |
| `static byte` | `STATUS_DB_UNFETCHED` | Page was not fetched yet. |
| `static byte` | `STATUS_FETCH_GONE` | Fetching unsuccessful - page is gone. |
| `static byte` | `STATUS_FETCH_MAX` | Maximum value of fetch-related status. |
| `static byte` | `STATUS_FETCH_NOTMODIFIED` | Fetching successful - page is not modified. |
| `static byte` | `STATUS_FETCH_REDIR_PERM` | Fetching permanently redirected to other page. |
| `static byte` | `STATUS_FETCH_REDIR_TEMP` | Fetching temporarily redirected to other page. |
| `static byte` | `STATUS_FETCH_RETRY` | Fetching unsuccessful, needs to be retried (transient errors). |
| `static byte` | `STATUS_FETCH_SUCCESS` | Fetching was successful. |
| `static byte` | `STATUS_INJECTED` | Page was newly injected. |
| `static byte` | `STATUS_LINKED` | Page discovered through a link. |
| `static byte` | `STATUS_PARSE_META` | Page got metadata from a parser. |
| `static byte` | `STATUS_SIGNATURE` | Page signature. |

Constructor Summary

| Constructor |
| --- |
| `CrawlDatum()` |
| `CrawlDatum(int status, int fetchInterval)` |
| `CrawlDatum(int status, int fetchInterval, float score)` |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| `Object` | `clone()` | |
| `int` | `compareTo(CrawlDatum that)` | Sort by decreasing score. |
| `boolean` | `equals(Object o)` | |
| `int` | `getFetchInterval()` | |
| `long` | `getFetchTime()` | Returns either the time of the last fetch, or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time. |
| `org.apache.hadoop.io.MapWritable` | `getMetaData()` | Returns the MapWritable if one was set or read in by `readFields(DataInput)`; returns an empty map if the CrawlDatum was freshly created (the map is lazily instantiated). |
| `long` | `getModifiedTime()` | |
| `byte` | `getRetriesSinceFetch()` | |
| `float` | `getScore()` | |
| `byte[]` | `getSignature()` | |
| `byte` | `getStatus()` | |
| `static String` | `getStatusName(byte value)` | |
| `static boolean` | `hasDbStatus(CrawlDatum datum)` | |
| `static boolean` | `hasFetchStatus(CrawlDatum datum)` | |
| `int` | `hashCode()` | |
| `void` | `putAllMetaData(CrawlDatum other)` | Add all metadata from other CrawlDatum to this CrawlDatum. |
| `static CrawlDatum` | `read(DataInput in)` | |
| `void` | `readFields(DataInput in)` | |
| `void` | `set(CrawlDatum that)` | Copy the contents of another instance into this instance. |
| `void` | `setFetchInterval(float fetchInterval)` | |
| `void` | `setFetchInterval(int fetchInterval)` | |
| `void` | `setFetchTime(long fetchTime)` | Sets either the time of the last fetch or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time. |
| `void` | `setMetaData(org.apache.hadoop.io.MapWritable mapWritable)` | |
| `void` | `setModifiedTime(long modifiedTime)` | |
| `void` | `setRetriesSinceFetch(int retries)` | |
| `void` | `setScore(float score)` | |
| `void` | `setSignature(byte[] signature)` | |
| `void` | `setStatus(int status)` | |
| `String` | `toString()` | |
| `void` | `write(DataOutput out)` | |

-    

Methods inherited from class java.lang.Object

finalize, getClass, notify, notifyAll, wait, wait, wait

Field Detail

-  

GENERATE_DIR_NAME

public static final String GENERATE_DIR_NAME
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.GENERATE_DIR_NAME)       
-  

FETCH_DIR_NAME

public static final String FETCH_DIR_NAME
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.FETCH_DIR_NAME)       
-  

PARSE_DIR_NAME

public static final String PARSE_DIR_NAME
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.PARSE_DIR_NAME)       
-  

STATUS_DB_UNFETCHED

public static final byte STATUS_DB_UNFETCHED

Page was not fetched yet.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_UNFETCHED)       
-  

STATUS_DB_FETCHED

public static final byte STATUS_DB_FETCHED

Page was successfully fetched.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_FETCHED)       
-  

STATUS_DB_GONE

public static final byte STATUS_DB_GONE

Page no longer exists.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_GONE)       
-  

STATUS_DB_REDIR_TEMP

public static final byte STATUS_DB_REDIR_TEMP

Page temporarily redirects to other page.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_REDIR_TEMP)       
-  

STATUS_DB_REDIR_PERM

public static final byte STATUS_DB_REDIR_PERM

Page permanently redirects to other page.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_REDIR_PERM)       
-  

STATUS_DB_NOTMODIFIED

public static final byte STATUS_DB_NOTMODIFIED

Page was successfully fetched and found not modified.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_NOTMODIFIED)       
-  

STATUS_DB_DUPLICATE

public static final byte STATUS_DB_DUPLICATE
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_DUPLICATE)       
-  

STATUS_DB_MAX

public static final byte STATUS_DB_MAX

Maximum value of DB-related status.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_DB_MAX)       
-  

STATUS_FETCH_SUCCESS

public static final byte STATUS_FETCH_SUCCESS

Fetching was successful.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_SUCCESS)       
-  

STATUS_FETCH_RETRY

public static final byte STATUS_FETCH_RETRY

Fetching unsuccessful, needs to be retried (transient errors).

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_RETRY)       
-  

STATUS_FETCH_REDIR_TEMP

public static final byte STATUS_FETCH_REDIR_TEMP

Fetching temporarily redirected to other page.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_REDIR_TEMP)       
-  

STATUS_FETCH_REDIR_PERM

public static final byte STATUS_FETCH_REDIR_PERM

Fetching permanently redirected to other page.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_REDIR_PERM)       
-  

STATUS_FETCH_GONE

public static final byte STATUS_FETCH_GONE

Fetching unsuccessful - page is gone.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_GONE)       
-  

STATUS_FETCH_NOTMODIFIED

public static final byte STATUS_FETCH_NOTMODIFIED

Fetching successful - page is not modified.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_NOTMODIFIED)       
-  

STATUS_FETCH_MAX

public static final byte STATUS_FETCH_MAX

Maximum value of fetch-related status.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_FETCH_MAX)       
-  

STATUS_SIGNATURE

public static final byte STATUS_SIGNATURE

Page signature.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_SIGNATURE)       
-  

STATUS_INJECTED

public static final byte STATUS_INJECTED

Page was newly injected.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_INJECTED)       
-  

STATUS_LINKED

public static final byte STATUS_LINKED

Page discovered through a link.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_LINKED)       
-  

STATUS_PARSE_META

public static final byte STATUS_PARSE_META

Page got metadata from a parser.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDatum.STATUS_PARSE_META)       
-  

statNames

public static final HashMap<Byte,String> statNames

Constructor Detail

-  

CrawlDatum

public CrawlDatum()
-  

CrawlDatum

public CrawlDatum(int status,
          int fetchInterval)
-  

CrawlDatum

public CrawlDatum(int status,
          int fetchInterval,
          float score)

Method Detail

-  

hasDbStatus

public static boolean hasDbStatus(CrawlDatum datum)
-  

hasFetchStatus

public static boolean hasFetchStatus(CrawlDatum datum)
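
This page does not say how `hasDbStatus` and `hasFetchStatus` decide. Given that `STATUS_DB_MAX` and `STATUS_FETCH_MAX` are described as the maximum DB-related and fetch-related status values, one plausible reading is a simple range check. The sketch below is a hypothetical stand-in, not the Nutch class: `StatusRanges` is an invented name, the predicates take a raw `byte` instead of a `CrawlDatum`, and the numeric values are placeholders (see Constant Field Values for the real ones).

```java
// Hypothetical stand-in, NOT org.apache.nutch.crawl.CrawlDatum.
// Illustrates a range check against the *_MAX constants.
final class StatusRanges {
    // Placeholder values chosen for illustration only (assumption).
    static final byte STATUS_DB_UNFETCHED = 0x01;
    static final byte STATUS_DB_MAX = 0x1f;
    static final byte STATUS_FETCH_SUCCESS = 0x21;
    static final byte STATUS_FETCH_MAX = 0x3f;
    static final byte STATUS_LINKED = 0x43;

    // Assumed rule: DB statuses occupy the range up to STATUS_DB_MAX.
    static boolean hasDbStatus(byte status) {
        return status <= STATUS_DB_MAX;
    }

    // Assumed rule: fetch statuses sit above the DB range, up to STATUS_FETCH_MAX.
    static boolean hasFetchStatus(byte status) {
        return status > STATUS_DB_MAX && status <= STATUS_FETCH_MAX;
    }

    public static void main(String[] args) {
        System.out.println(hasDbStatus(STATUS_DB_UNFETCHED));
        System.out.println(hasFetchStatus(STATUS_FETCH_SUCCESS));
        System.out.println(hasFetchStatus(STATUS_LINKED));
    }
}
```

Under this reading, a status such as `STATUS_LINKED` would belong to neither group.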
-  

getStatus

public byte getStatus()
-  

getStatusName

public static String getStatusName(byte value)
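
The relationship between `getStatusName(byte)` and the `statNames` map is not spelled out here. A natural guess is a map lookup with a fallback for unknown codes. The sketch below assumes exactly that; `StatusNames`, the placeholder status values, the name strings, and the `"unknown"` fallback are all illustrative assumptions rather than documented behaviour.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, NOT the Nutch class: a byte-to-name lookup table.
final class StatusNames {
    // Placeholder values and names chosen for illustration only (assumption).
    static final byte STATUS_DB_UNFETCHED = 0x01;
    static final byte STATUS_DB_FETCHED = 0x02;

    static final Map<Byte, String> statNames = new HashMap<>();
    static {
        statNames.put(STATUS_DB_UNFETCHED, "db_unfetched");
        statNames.put(STATUS_DB_FETCHED, "db_fetched");
    }

    // Looks up the name; the byte argument is autoboxed to Byte for the map key.
    static String getStatusName(byte value) {
        String name = statNames.get(value);
        return name != null ? name : "unknown"; // assumed fallback
    }

    public static void main(String[] args) {
        System.out.println(getStatusName(STATUS_DB_FETCHED));
        System.out.println(getStatusName((byte) 99));
    }
}
```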
-  

setStatus

public void setStatus(int status)
-  

getFetchTime

public long getFetchTime()

Returns either the time of the last fetch, or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time.

-  

setFetchTime

public void setFetchTime(long fetchTime)

Sets either the time of the last fetch or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time.

-  

getModifiedTime

public long getModifiedTime()
-  

setModifiedTime

public void setModifiedTime(long modifiedTime)
-  

getRetriesSinceFetch

public byte getRetriesSinceFetch()
-  

setRetriesSinceFetch

public void setRetriesSinceFetch(int retries)
-  

getFetchInterval

public int getFetchInterval()
-  

setFetchInterval

public void setFetchInterval(int fetchInterval)
-  

setFetchInterval

public void setFetchInterval(float fetchInterval)
-  

getScore

public float getScore()
-  

setScore

public void setScore(float score)
-  

getSignature

public byte[] getSignature()
-  

setSignature

public void setSignature(byte[] signature)
-  

setMetaData

public void setMetaData(org.apache.hadoop.io.MapWritable mapWritable)
-  

putAllMetaData

public void putAllMetaData(CrawlDatum other)

Add all metadata from other CrawlDatum to this CrawlDatum.

  - Parameters:
  - <code>other</code> - the CrawlDatum whose metadata is added to this one
-  

getMetaData

public org.apache.hadoop.io.MapWritable getMetaData()

Returns the MapWritable if one was set or read in by <code>readFields(DataInput)</code>; returns an empty map if the CrawlDatum was freshly created (the map is lazily instantiated).
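
Lazy instantiation of the metadata map, together with the merge performed by `putAllMetaData`, can be sketched with plain collections. This is a hypothetical stand-in, not the Nutch class: `LazyMetaDatum` is an invented name, and a `HashMap<String, String>` substitutes for `org.apache.hadoop.io.MapWritable` so the example runs without Hadoop on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, NOT the Nutch class: metadata map created on first access.
final class LazyMetaDatum {
    private Map<String, String> metaData; // stays null until someone asks for it

    // Returns the existing map, or creates an empty one on first use.
    Map<String, String> getMetaData() {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        return metaData;
    }

    void setMetaData(Map<String, String> map) {
        this.metaData = map;
    }

    // Adds every entry of the other datum's metadata to this one.
    void putAllMetaData(LazyMetaDatum other) {
        getMetaData().putAll(other.getMetaData());
    }

    public static void main(String[] args) {
        LazyMetaDatum fresh = new LazyMetaDatum();
        System.out.println(fresh.getMetaData().isEmpty()); // fresh instance: empty map, never null

        LazyMetaDatum other = new LazyMetaDatum();
        other.getMetaData().put("source", "seed-list");
        fresh.putAllMetaData(other);
        System.out.println(fresh.getMetaData());
    }
}
```

The point of the pattern is that callers never see `null`, while instances that carry no metadata avoid allocating a map until it is first requested.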

-  

read

public static CrawlDatum read(DataInput in)
                       throws IOException
  - Throws: 
  - <code>IOException</code>       
-  

readFields

public void readFields(DataInput in)
                throws IOException
  - Specified by: 
  - <code>readFields</code> in interface <code>org.apache.hadoop.io.Writable</code> 
  - Throws: 
  - <code>IOException</code>       
-  

write

public void write(DataOutput out)
           throws IOException
  - Specified by: 
  - <code>write</code> in interface <code>org.apache.hadoop.io.Writable</code> 
  - Throws: 
  - <code>IOException</code>       
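
`write`, `readFields`, and the static `read` follow the usual Hadoop `Writable` pattern: fields are written to a `DataOutput` in a fixed order and read back in the same order. The sketch below shows that symmetry using only `java.io`. It is a hypothetical stand-in: `MiniDatum`, its three fields, and their order are assumptions for illustration and do not represent the actual CrawlDatum wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in, NOT the Nutch class: symmetric write/readFields.
final class MiniDatum {
    byte status;
    long fetchTime;
    float score;

    // Serialize fields in a fixed order (order is an assumption of this sketch).
    void write(DataOutput out) throws IOException {
        out.writeByte(status);
        out.writeLong(fetchTime);
        out.writeFloat(score);
    }

    // Deserialize in exactly the order used by write().
    void readFields(DataInput in) throws IOException {
        status = in.readByte();
        fetchTime = in.readLong();
        score = in.readFloat();
    }

    // Factory mirroring the static read(DataInput) convenience method.
    static MiniDatum read(DataInput in) throws IOException {
        MiniDatum datum = new MiniDatum();
        datum.readFields(in);
        return datum;
    }

    // Helper: serialize to a byte array and read a fresh copy back.
    static MiniDatum roundTrip(MiniDatum source) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            source.write(new DataOutputStream(buffer));
            return read(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A round trip through `roundTrip` should yield an instance whose fields equal those of the source, which is a useful sanity check whenever a field is added to a `Writable`.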
-  

set

public void set(CrawlDatum that)

Copy the contents of another instance into this instance.

-  

compareTo

public int compareTo(CrawlDatum that)

Sort by decreasing score.

  - Specified by: 
  - <code>compareTo</code> in interface <code>Comparable&lt;CrawlDatum&gt;</code>
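
"Sort by decreasing score" means the natural ordering places higher-scored entries first. The sketch below shows one way to get that ordering. It is a hypothetical stand-in: `ScoredDatum` and its `label` field are invented for the example, and it compares on score alone, whereas the real method may use further fields as tie-breakers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in, NOT the Nutch class: natural order is descending score.
final class ScoredDatum implements Comparable<ScoredDatum> {
    final String label; // illustrative only; used to make the sorted order visible
    final float score;

    ScoredDatum(String label, float score) {
        this.label = label;
        this.score = score;
    }

    // Arguments are swapped relative to ascending order, so higher scores sort first.
    @Override
    public int compareTo(ScoredDatum that) {
        return Float.compare(that.score, this.score);
    }

    // Returns the labels of the given data in natural (descending-score) order.
    static List<String> sortedLabels(List<ScoredDatum> data) {
        List<ScoredDatum> copy = new ArrayList<>(data);
        Collections.sort(copy);
        List<String> labels = new ArrayList<>();
        for (ScoredDatum datum : copy) {
            labels.add(datum.label);
        }
        return labels;
    }

    public static void main(String[] args) {
        List<ScoredDatum> data = Arrays.asList(
                new ScoredDatum("low", 0.1f),
                new ScoredDatum("high", 2.0f),
                new ScoredDatum("mid", 1.0f));
        System.out.println(sortedLabels(data)); // prints [high, mid, low]
    }
}
```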
-  

toString

public String toString()
  - Overrides: 
  - <code>toString</code> in class <code>Object</code>        
-  

equals

public boolean equals(Object o)
  - Overrides: 
  - <code>equals</code> in class <code>Object</code>        
-  

hashCode

public int hashCode()
  - Overrides: 
  - <code>hashCode</code> in class <code>Object</code>        
-  

clone

public Object clone()
  - Overrides: 
  - <code>clone</code> in class <code>Object</code>       

Copyright © 2014 The Apache Software Foundation