org.apache.nutch.crawl
Class CrawlDbMerger
- java.lang.Object
- org.apache.hadoop.conf.Configured
- org.apache.nutch.crawl.CrawlDbMerger
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class CrawlDbMerger extends org.apache.hadoop.conf.Configured implements org.apache.hadoop.util.Tool
This tool merges several CrawlDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited pages. It can also be used just for filtering, in which case only one CrawlDb should be specified in the arguments.
If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of CrawlDatum.getFetchTime(). However, all metadata information from all versions is accumulated, with newer values taking precedence over older values.
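The merge rule described above can be illustrated with a small, self-contained sketch. It does not use the Nutch classes: DatumSketch and MergeSketch are hypothetical stand-ins invented for this example, and the real CrawlDbMerger.Merger may differ in detail (for instance in how entries with equal fetch times are handled).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a CrawlDb entry; NOT the Nutch CrawlDatum class. */
final class DatumSketch {
    final long fetchTime;
    final Map<String, String> metadata;

    DatumSketch(long fetchTime, Map<String, String> metadata) {
        this.fetchTime = fetchTime;
        this.metadata = new HashMap<>(metadata); // defensive copy
    }
}

final class MergeSketch {
    /**
     * Merges all versions of one URL's entry: the result carries the newest
     * fetch time, and metadata from every version is accumulated so that
     * values from newer versions overwrite values from older ones.
     */
    static DatumSketch mergeVersions(List<DatumSketch> versions) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("at least one version is required");
        }
        // Sort oldest first so that later putAll calls come from newer versions.
        List<DatumSketch> sorted = new ArrayList<>(versions);
        sorted.sort(Comparator.comparingLong(d -> d.fetchTime));

        Map<String, String> merged = new HashMap<>();
        for (DatumSketch d : sorted) {
            merged.putAll(d.metadata); // newer values replace older ones
        }
        DatumSketch newest = sorted.get(sorted.size() - 1);
        return new DatumSketch(newest.fetchTime, merged);
    }

    public static void main(String[] args) {
        DatumSketch older = new DatumSketch(100L, Map.of("lang", "en", "score", "0.1"));
        DatumSketch newer = new DatumSketch(200L, Map.of("score", "0.7"));
        DatumSketch result = mergeVersions(List.of(newer, older));
        System.out.println(result.fetchTime + " " + result.metadata);
    }
}
```

In this sketch the "lang" key survives from the older version even though the newer version wins overall, which is the accumulation behavior the description refers to.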
- Author:
- Andrzej Bialecki
Nested Class Summary
Nested Classes (Modifier and Type, Class and Description):
- static class CrawlDbMerger.Merger
Constructor Summary
Constructors (Constructor and Description):
- CrawlDbMerger()
- CrawlDbMerger(org.apache.hadoop.conf.Configuration conf)
Method Summary
Methods (Modifier and Type, Method and Description):
- static org.apache.hadoop.mapred.JobConf createMergeJob(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path output, boolean normalize, boolean filter)
- static void main(String[] args)
- void merge(org.apache.hadoop.fs.Path output, org.apache.hadoop.fs.Path[] dbs, boolean normalize, boolean filter)
- int run(String[] args)
Methods inherited from class org.apache.hadoop.conf.Configured
- getConf, setConf
Methods inherited from class java.lang.Object
- clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable
- getConf, setConf
Constructor Detail
CrawlDbMerger
public CrawlDbMerger()
CrawlDbMerger
public CrawlDbMerger(org.apache.hadoop.conf.Configuration conf)
Method Detail
merge
public void merge(org.apache.hadoop.fs.Path output, org.apache.hadoop.fs.Path[] dbs, boolean normalize, boolean filter) throws Exception
- Throws:
- Exception
createMergeJob
public static org.apache.hadoop.mapred.JobConf createMergeJob(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path output, boolean normalize, boolean filter)
main
public static void main(String[] args) throws Exception
- Parameters:
- args - the command-line arguments
- Throws:
- Exception
run
public int run(String[] args) throws Exception
- Specified by:
- run in interface org.apache.hadoop.util.Tool
- Throws:
- Exception
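This page does not document the argument layout that run(String[] args) accepts. The sketch below shows one plausible way command-line arguments could be mapped onto the four parameters of merge (output, dbs, normalize, filter). The argument order and the flag names -normalize and -filter are assumptions made for illustration, and MergeArgsSketch is a hypothetical helper that is not part of Nutch; check the tool's own usage message for the authoritative form.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for the values that merge(output, dbs, normalize, filter) expects. */
final class MergeArgsSketch {
    final String output;
    final List<String> dbs;
    final boolean normalize;
    final boolean filter;

    private MergeArgsSketch(String output, List<String> dbs, boolean normalize, boolean filter) {
        this.output = output;
        this.dbs = dbs;
        this.normalize = normalize;
        this.filter = filter;
    }

    /**
     * Assumed layout (not confirmed by this page):
     * output_crawldb crawldb1 [crawldb2 ...] [-normalize] [-filter]
     */
    static MergeArgsSketch parse(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException(
                "expected: <output_crawldb> <crawldb1> [<crawldb2> ...] [-normalize] [-filter]");
        }
        String output = args[0];
        List<String> dbs = new ArrayList<>();
        boolean normalize = false;
        boolean filter = false;
        for (int i = 1; i < args.length; i++) {
            if ("-normalize".equals(args[i])) {
                normalize = true;
            } else if ("-filter".equals(args[i])) {
                filter = true;
            } else {
                dbs.add(args[i]); // anything that is not a flag is treated as an input CrawlDb
            }
        }
        if (dbs.isEmpty()) {
            throw new IllegalArgumentException("at least one input CrawlDb is required");
        }
        return new MergeArgsSketch(output, dbs, normalize, filter);
    }

    public static void main(String[] args) {
        MergeArgsSketch parsed = parse(new String[] {"merged_db", "db1", "-filter"});
        System.out.println(parsed.output + " " + parsed.dbs + " " + parsed.normalize + " " + parsed.filter);
    }
}
```

Passing a single input CrawlDb together with the assumed -filter flag corresponds to the filtering-only use described in the class overview.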