org.apache.nutch.crawl

Class CrawlDbMerger

  • java.lang.Object
    • org.apache.hadoop.conf.Configured
      • org.apache.nutch.crawl.CrawlDbMerger
  • All Implemented Interfaces:
    • org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class CrawlDbMerger
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool

This tool merges several CrawlDbs into one, optionally passing URLs through the current URLFilters to skip prohibited pages. It can also be used for filtering alone, in which case only one CrawlDb should be specified in the arguments.

If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of CrawlDatum.getFetchTime(). However, metadata from all versions is accumulated, with newer values taking precedence over older ones.
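
The merge rule described above can be illustrated with a minimal, self-contained sketch. It does not use the Nutch classes: `MergeRuleSketch`, `Datum`, `meta` and the `status` field are hypothetical stand-ins chosen for illustration, and the real reducer may differ in details such as how equal fetch times are handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: Datum is a hypothetical stand-in, not Nutch's CrawlDatum.
class MergeRuleSketch {

    static final class Datum {
        final long fetchTime;               // stands in for CrawlDatum.getFetchTime()
        final String status;                // illustrative payload of the record
        final Map<String, String> metadata; // illustrative per-URL metadata

        Datum(long fetchTime, String status, Map<String, String> metadata) {
            this.fetchTime = fetchTime;
            this.status = status;
            this.metadata = metadata;
        }
    }

    // Builds a metadata map from alternating key/value strings.
    static Map<String, String> meta(String... keyValues) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            m.put(keyValues[i], keyValues[i + 1]);
        }
        return m;
    }

    // Merges every version of one URL found across the input databases.
    static Datum merge(List<Datum> versions) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("at least one version is required");
        }
        // Sort oldest first so that later putAll calls let newer values win.
        List<Datum> sorted = new ArrayList<>(versions);
        sorted.sort(Comparator.comparingLong((Datum d) -> d.fetchTime));

        Map<String, String> merged = new LinkedHashMap<>();
        for (Datum d : sorted) {
            merged.putAll(d.metadata);
        }
        // Everything except metadata comes from the most recent version.
        Datum newest = sorted.get(sorted.size() - 1);
        return new Datum(newest.fetchTime, newest.status, merged);
    }

    public static void main(String[] args) {
        Datum older = new Datum(100L, "unfetched", meta("source", "seed", "lang", "en"));
        Datum newer = new Datum(200L, "fetched", meta("lang", "pl"));
        Datum result = merge(Arrays.asList(newer, older));
        System.out.println(result.fetchTime + " " + result.status + " " + result.metadata);
    }
}
```

In this sketch the merged record takes its fetch time and status from the newer version, keeps the `source` key that only the older version carried, and resolves the conflicting `lang` key in favour of the newer value.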

  • Author:
  • Andrzej Bialecki

Nested Class Summary

Nested Classes
  • static class CrawlDbMerger.Merger

Constructor Summary

Constructors
  • CrawlDbMerger()
  • CrawlDbMerger(org.apache.hadoop.conf.Configuration conf)

Method Summary

Methods
  • static org.apache.hadoop.mapred.JobConf createMergeJob(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path output, boolean normalize, boolean filter)
  • static void main(String[] args)
  • void merge(org.apache.hadoop.fs.Path output, org.apache.hadoop.fs.Path[] dbs, boolean normalize, boolean filter)
  • int run(String[] args)

Methods inherited from class org.apache.hadoop.conf.Configured

getConf, setConf

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable

getConf, setConf

Constructor Detail

CrawlDbMerger

public CrawlDbMerger()

CrawlDbMerger

public CrawlDbMerger(org.apache.hadoop.conf.Configuration conf)

Method Detail

merge

public void merge(org.apache.hadoop.fs.Path output,
         org.apache.hadoop.fs.Path[] dbs,
         boolean normalize,
         boolean filter)
           throws Exception
  - Throws: 
  - <code>Exception</code>       

createMergeJob

public static org.apache.hadoop.mapred.JobConf createMergeJob(org.apache.hadoop.conf.Configuration conf,
                                              org.apache.hadoop.fs.Path output,
                                              boolean normalize,
                                              boolean filter)

main

public static void main(String[] args)
                 throws Exception
  - Parameters:
  - <code>args</code> - command-line arguments
  - Throws: 
  - <code>Exception</code>       

run

public int run(String[] args)
        throws Exception
  - Specified by: 
  - <code>run</code> in interface <code>org.apache.hadoop.util.Tool</code> 
  - Throws: 
  - <code>Exception</code>      
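
As a rough illustration of what an argument list for this tool might carry, the sketch below splits `String[] args` into the values that `merge` expects: an output path, one or more input CrawlDb paths, and the `normalize` and `filter` booleans. The class `MergerArgsSketch`, the flag spellings `-normalize` and `-filter`, the argument order and the example paths are all assumptions made for illustration; none of them is stated on this page, and the actual parsing in `run` may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: flag names and argument order are assumptions.
class MergerArgsSketch {

    static final class Parsed {
        final String output;     // would become the 'output' Path
        final List<String> dbs;  // would become the 'dbs' Path[]
        final boolean normalize;
        final boolean filter;

        Parsed(String output, List<String> dbs, boolean normalize, boolean filter) {
            this.output = output;
            this.dbs = dbs;
            this.normalize = normalize;
            this.filter = filter;
        }
    }

    static Parsed parse(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException(
                "expected: <output> <crawldb> [<crawldb> ...] [-normalize] [-filter]");
        }
        String output = args[0];
        List<String> dbs = new ArrayList<>();
        boolean normalize = false;
        boolean filter = false;
        // Every argument after the output is either an option or an input CrawlDb.
        for (int i = 1; i < args.length; i++) {
            if ("-normalize".equals(args[i])) {
                normalize = true;
            } else if ("-filter".equals(args[i])) {
                filter = true;
            } else {
                dbs.add(args[i]);
            }
        }
        if (dbs.isEmpty()) {
            throw new IllegalArgumentException("at least one input CrawlDb is required");
        }
        return new Parsed(output, dbs, normalize, filter);
    }

    public static void main(String[] args) {
        // Filtering-only use: a single input CrawlDb plus the filter option.
        Parsed p = parse(new String[] {"crawl/filtered", "crawl/crawldb", "-filter"});
        System.out.println(p.output + " " + p.dbs + " normalize=" + p.normalize
            + " filter=" + p.filter);
    }
}
```

The `main` method shows the filtering-only case from the class description, where exactly one CrawlDb is passed.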

Copyright © 2014 The Apache Software Foundation