org.apache.nutch.crawl

Class LinkDbMerger

  • java.lang.Object
    • org.apache.hadoop.conf.Configured
      • org.apache.nutch.crawl.LinkDbMerger

  • All Implemented Interfaces:
  • Closeable, AutoCloseable, org.apache.hadoop.conf.Configurable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,Inlinks,org.apache.hadoop.io.Text,Inlinks>, org.apache.hadoop.util.Tool

public class LinkDbMerger
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,Inlinks,org.apache.hadoop.io.Text,Inlinks>

This tool merges several LinkDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited URLs and links. It can also be used for filtering alone; in that case, specify only one LinkDb in the arguments.

If more than one LinkDb contains information about the same URL, all inlinks are accumulated, but at most db.max.inlinks inlinks are ever added.

If activated, URLFilters are applied to both the target URLs and every incoming link URL. If a target URL is prohibited, its entry is removed entirely, including all of its inlinks. If only some of the incoming links are prohibited, only those links are removed, and they do not count toward the db.max.inlinks limit described above.
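The accumulation and filtering rules described above can be illustrated with a minimal, self-contained sketch. It is not the Nutch implementation: it models a LinkDb as a plain map from target URL to a list of inlink URLs, and the names LinkDbMergeSketch, merge, accept, and maxInlinks are hypothetical. The sketch also assumes that a target left with no inlinks is omitted from the output, which the description above does not state.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative model only; the real tool operates on Hadoop Text keys and Inlinks values.
public class LinkDbMergeSketch {

    /**
     * Merges simplified link databases.
     *
     * @param dbs        each map goes from a target URL to its inlink URLs
     * @param accept     stands in for URLFilters; returns false for prohibited URLs
     * @param maxInlinks stands in for the db.max.inlinks limit
     */
    public static Map<String, Set<String>> merge(List<Map<String, List<String>>> dbs,
                                                 Predicate<String> accept,
                                                 int maxInlinks) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> db : dbs) {
            for (Map.Entry<String, List<String>> entry : db.entrySet()) {
                String target = entry.getKey();
                if (!accept.test(target)) {
                    continue; // prohibited target: drop the entry and all its inlinks
                }
                Set<String> inlinks =
                        merged.computeIfAbsent(target, k -> new LinkedHashSet<>());
                for (String from : entry.getValue()) {
                    if (inlinks.size() >= maxInlinks) {
                        break; // limit reached for this target
                    }
                    if (!accept.test(from)) {
                        continue; // prohibited inlink: skipped, does not count toward the limit
                    }
                    inlinks.add(from);
                }
            }
        }
        // Assumption of this sketch: targets with no surviving inlinks are omitted.
        merged.values().removeIf(Set::isEmpty);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dbA = Map.of(
                "http://example.org/",
                List.of("http://a.example/1", "http://spam.example/x"),
                "http://spam.example/page",
                List.of("http://a.example/2"));
        Map<String, List<String>> dbB = Map.of(
                "http://example.org/",
                List.of("http://b.example/1", "http://b.example/2", "http://b.example/3"));

        // Hypothetical filter: reject anything on spam.example.
        Predicate<String> accept = url -> !url.contains("spam.example");

        System.out.println(merge(List.of(dbA, dbB), accept, 3));
    }
}
```

Passing a single map with a restrictive predicate corresponds to the filtering-only use of the tool, where only one LinkDb is given.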

  • Author:
  • Andrzej Bialecki

Constructor Summary

Constructors

  • LinkDbMerger()
  • LinkDbMerger(org.apache.hadoop.conf.Configuration conf)

Method Summary

Methods

  • void close()
  • void configure(org.apache.hadoop.mapred.JobConf job)
  • static org.apache.hadoop.mapred.JobConf createMergeJob(org.apache.hadoop.conf.Configuration config, org.apache.hadoop.fs.Path linkDb, boolean normalize, boolean filter)
  • static void main(String[] args)
  • void merge(org.apache.hadoop.fs.Path output, org.apache.hadoop.fs.Path[] dbs, boolean normalize, boolean filter)
  • void reduce(org.apache.hadoop.io.Text key, Iterator<Inlinks> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,Inlinks> output, org.apache.hadoop.mapred.Reporter reporter)
  • int run(String[] args)

Methods inherited from class org.apache.hadoop.conf.Configured

getConf, setConf

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable

getConf, setConf

Constructor Detail

LinkDbMerger

public LinkDbMerger()

LinkDbMerger

public LinkDbMerger(org.apache.hadoop.conf.Configuration conf)

Method Detail

reduce

public void reduce(org.apache.hadoop.io.Text key,
          Iterator<Inlinks> values,
          org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,Inlinks> output,
          org.apache.hadoop.mapred.Reporter reporter)
            throws IOException
  - Specified by: 
  - <code>reduce</code> in interface <code>org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,Inlinks,org.apache.hadoop.io.Text,Inlinks></code>
  - Throws: 
  - <code>IOException</code>       

configure

public void configure(org.apache.hadoop.mapred.JobConf job)
  - Specified by: 
  - <code>configure</code> in interface <code>org.apache.hadoop.mapred.JobConfigurable</code>        

close

public void close()
           throws IOException
  - Specified by: 
  - <code>close</code> in interface <code>Closeable</code> 
  - Specified by: 
  - <code>close</code> in interface <code>AutoCloseable</code> 
  - Throws: 
  - <code>IOException</code>       

merge

public void merge(org.apache.hadoop.fs.Path output,
         org.apache.hadoop.fs.Path[] dbs,
         boolean normalize,
         boolean filter)
           throws Exception
  - Throws: 
  - <code>Exception</code>       

createMergeJob

public static org.apache.hadoop.mapred.JobConf createMergeJob(org.apache.hadoop.conf.Configuration config,
                                              org.apache.hadoop.fs.Path linkDb,
                                              boolean normalize,
                                              boolean filter)

main

public static void main(String[] args)
                 throws Exception
  - Parameters:
  - <code>args</code> - command-line arguments
  - Throws: 
  - <code>Exception</code>       

run

public int run(String[] args)
        throws Exception
  - Specified by: 
  - <code>run</code> in interface <code>org.apache.hadoop.util.Tool</code> 
  - Throws: 
  - <code>Exception</code>      

Copyright © 2014 The Apache Software Foundation