org.apache.nutch.crawl
Class CrawlDb
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.crawl.CrawlDb

All Implemented Interfaces: org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class CrawlDb extends org.apache.hadoop.conf.Configured implements org.apache.hadoop.util.Tool
This class takes the output of the fetcher and updates the crawldb accordingly.
Field Summary
| Modifier and Type | Field and Description |
| --- | --- |
| static String | CRAWLDB_ADDITIONS_ALLOWED |
| static String | CRAWLDB_PURGE_404 |
| static String | CURRENT_NAME |
| static String | LOCK_NAME |
| static org.slf4j.Logger | LOG |
Constructor Summary
| Constructor and Description |
| --- |
| CrawlDb() |
| CrawlDb(org.apache.hadoop.conf.Configuration conf) |
Method Summary
| Modifier and Type | Method and Description |
| --- | --- |
| static org.apache.hadoop.mapred.JobConf | createJob(org.apache.hadoop.conf.Configuration config, org.apache.hadoop.fs.Path crawlDb) |
| static void | install(org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.fs.Path crawlDb) |
| static void | main(String[] args) |
| int | run(String[] args) |
| void | update(org.apache.hadoop.fs.Path crawlDb, org.apache.hadoop.fs.Path[] segments, boolean normalize, boolean filter) |
| void | update(org.apache.hadoop.fs.Path crawlDb, org.apache.hadoop.fs.Path[] segments, boolean normalize, boolean filter, boolean additionsAllowed, boolean force) |
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
LOG
public static final org.slf4j.Logger LOG
CRAWLDB_ADDITIONS_ALLOWED
public static final String CRAWLDB_ADDITIONS_ALLOWED
- See Also: [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDb.CRAWLDB_ADDITIONS_ALLOWED)
CRAWLDB_PURGE_404
public static final String CRAWLDB_PURGE_404
- See Also: [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDb.CRAWLDB_PURGE_404)
CURRENT_NAME
public static final String CURRENT_NAME
- See Also: [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDb.CURRENT_NAME)
LOCK_NAME
public static final String LOCK_NAME
- See Also: [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.CrawlDb.LOCK_NAME)
Constructor Detail
CrawlDb
public CrawlDb()
CrawlDb
public CrawlDb(org.apache.hadoop.conf.Configuration conf)
Method Detail
update
public void update(org.apache.hadoop.fs.Path crawlDb, org.apache.hadoop.fs.Path[] segments, boolean normalize, boolean filter) throws IOException
- Throws:
- <code>IOException</code>
update
public void update(org.apache.hadoop.fs.Path crawlDb, org.apache.hadoop.fs.Path[] segments, boolean normalize, boolean filter, boolean additionsAllowed, boolean force) throws IOException
- Throws:
- <code>IOException</code>
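The two update overloads differ only in the trailing additionsAllowed and force parameters. The following is a minimal sketch of one way a shorter overload could forward to the longer one, assuming it reads additionsAllowed from configuration and passes force as false; neither default is stated on this page. UpdateOverloadSketch, its Map-based configuration, and the key example.additions.allowed are hypothetical stand-ins, not Nutch classes or property names, and the methods only describe the call instead of running a job.

```java
import java.util.Map;

// Hypothetical stand-in for illustration; this is not org.apache.nutch.crawl.CrawlDb.
class UpdateOverloadSketch {
    // Made-up key. The real value of CRAWLDB_ADDITIONS_ALLOWED is listed
    // under Constant Field Values.
    static final String ADDITIONS_ALLOWED_KEY = "example.additions.allowed";

    private final Map<String, String> conf;

    UpdateOverloadSketch(Map<String, String> conf) {
        this.conf = conf;
    }

    // Four-argument form: fills in the two extra parameters, then forwards.
    String update(String crawlDb, String[] segments, boolean normalize, boolean filter) {
        // Assumption: additions are allowed unless the configuration says otherwise.
        boolean additionsAllowed =
            Boolean.parseBoolean(conf.getOrDefault(ADDITIONS_ALLOWED_KEY, "true"));
        // Assumption: the short form never forces the update.
        return update(crawlDb, segments, normalize, filter, additionsAllowed, false);
    }

    // Six-argument form: every switch is explicit.
    String update(String crawlDb, String[] segments, boolean normalize, boolean filter,
                  boolean additionsAllowed, boolean force) {
        return String.format(
            "crawlDb=%s segments=%d normalize=%b filter=%b additionsAllowed=%b force=%b",
            crawlDb, segments.length, normalize, filter, additionsAllowed, force);
    }

    public static void main(String[] args) {
        UpdateOverloadSketch db = new UpdateOverloadSketch(Map.of());
        System.out.println(
            db.update("crawl/crawldb", new String[] {"crawl/segments/s1"}, true, false));
    }
}
```

Callers that need explicit control over additions or forcing would use the six-argument form directly.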
createJob
public static org.apache.hadoop.mapred.JobConf createJob(org.apache.hadoop.conf.Configuration config, org.apache.hadoop.fs.Path crawlDb) throws IOException
- Throws:
- <code>IOException</code>
install
public static void install(org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.fs.Path crawlDb) throws IOException
- Throws:
- <code>IOException</code>
main
public static void main(String[] args) throws Exception
- Throws:
- <code>Exception</code>
run
public int run(String[] args) throws Exception
- Specified by:
- <code>run</code> in interface <code>org.apache.hadoop.util.Tool</code>
- Throws:
- <code>Exception</code>
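run returns an int and main is static, which is the usual shape of a Hadoop Tool entry point. The following is a minimal sketch of how such an entry point can turn the value returned by run into a process exit code. ToolLike is a hypothetical stand-in for the single run method of org.apache.hadoop.util.Tool, and the argument check and the -1 failure code are assumptions; this page does not document the arguments CrawlDb accepts or the values it returns. In a real Hadoop program, org.apache.hadoop.util.ToolRunner is the usual launcher.

```java
// Hypothetical stand-in for the run method of org.apache.hadoop.util.Tool.
interface ToolLike {
    int run(String[] args) throws Exception;
}

class ExitCodeSketch {
    // Demo tool: the usage check is an assumption made for illustration only.
    static final ToolLike DEMO = args -> {
        if (args.length < 1) {
            System.err.println("Usage: <crawldb> ...");
            return -1; // non-zero signals failure to the caller
        }
        return 0; // zero signals success
    };

    // Runs the tool and maps any exception to a failure code.
    static int launch(ToolLike tool, String[] args) {
        try {
            return tool.run(args);
        } catch (Exception e) {
            System.err.println("Tool failed: " + e);
            return -1;
        }
    }

    public static void main(String[] args) {
        // The value returned by run becomes the process exit status.
        System.exit(launch(DEMO, args));
    }
}
```

Returning a status from run, instead of calling System.exit inside it, keeps the tool callable from other code and from tests.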