
org.apache.nutch.fetcher

Class Fetcher

  • java.lang.Object
    • org.apache.hadoop.conf.Configured
    • org.apache.nutch.fetcher.Fetcher
    • All Implemented Interfaces:
    • org.apache.hadoop.conf.Configurable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.MapRunnable, org.apache.hadoop.util.Tool

public class Fetcher
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool, org.apache.hadoop.mapred.MapRunnable<org.apache.hadoop.io.Text,CrawlDatum,org.apache.hadoop.io.Text,NutchWritable>

A queue-based fetcher. This fetcher uses a well-known model of one producer (a QueueFeeder) and many consumers (FetcherThread-s).

QueueFeeder reads input fetchlists and populates a set of FetchItemQueue-s, which hold FetchItem-s that describe the items to be fetched. There are as many queues as there are unique hosts, but at any given time the total number of fetch items in all queues is less than a fixed number (currently set to a multiple of the number of threads).

As items are consumed from the queues, the QueueFeeder continues to add new input items so that their total count stays fixed (FetcherThread-s may also add new items to the queues, e.g. as a result of redirection). Once all input items are exhausted, the number of items in the queues begins to decrease. When this number reaches 0, the fetcher finishes.
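The feeding behaviour described above can be illustrated with a small, single-threaded simulation. This is only a sketch under stated assumptions, not the Nutch implementation: the names `FeederSketch`, `HostQueues`, `feed` and `simulate` are hypothetical, URLs stand in for FetchItem-s, and the cap is treated as an inclusive upper bound on the total number of queued items.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: one queue per host, topped up to a fixed cap by a feeder.
final class FeederSketch {

    /** One FIFO queue per host; the total item count is tracked across all queues. */
    static final class HostQueues {
        private final Map<String, Queue<String>> queues = new LinkedHashMap<>();
        private int total = 0;

        void add(String url) {
            String host = URI.create(url).getHost();
            queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
            total++;
        }

        /** Returns an item from the first non-empty queue, or null if all are empty. */
        String poll() {
            for (Queue<String> q : queues.values()) {
                String url = q.poll();
                if (url != null) {
                    total--;
                    return url;
                }
            }
            return null;
        }

        int total() { return total; }

        int hostCount() { return queues.size(); }
    }

    /** Tops the queues up to the cap from the input; returns how many items were added. */
    static int feed(Iterator<String> input, HostQueues queues, int cap) {
        int added = 0;
        while (queues.total() < cap && input.hasNext()) {
            queues.add(input.next());
            added++;
        }
        return added;
    }

    /** Alternates consuming and feeding; returns {items consumed, largest total seen}. */
    static int[] simulate(List<String> fetchlist, int cap) {
        Iterator<String> input = fetchlist.iterator();
        HostQueues queues = new HostQueues();
        int consumed = 0;
        int maxQueued = 0;
        feed(input, queues, cap);
        while (queues.total() > 0) {
            maxQueued = Math.max(maxQueued, queues.total());
            queues.poll();            // a consumer takes one item
            consumed++;
            feed(input, queues, cap); // the feeder refills while input remains
        }
        return new int[] {consumed, maxQueued};
    }
}
```

In this sketch the queued total stays at the cap while input remains and drains to 0 afterwards, at which point the loop ends. A real fetcher would run the feeder and the consumers on separate threads.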

This fetcher implementation handles per-host blocking itself, instead of delegating this work to protocol-specific plugins. Each per-host queue maintains its own "politeness" settings, such as the maximum number of concurrent requests and the crawl delay between consecutive requests, and also tracks the list of requests in progress and the time the last request finished. As FetcherThread-s ask for new items to fetch, a queue returns an eligible item, or null if for "politeness" reasons that host's queue is not yet ready.
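A per-host queue with these politeness rules might look roughly like the following. This is a minimal sketch, not the actual FetchItemQueue code: the class name `PoliteQueueSketch` and its methods are hypothetical, items are plain URL strings, and the current time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of one per-host queue that enforces politeness itself.
final class PoliteQueueSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> inProgress = new HashSet<>();
    private final int maxConcurrent;   // maximum simultaneous requests to this host
    private final long crawlDelayMs;   // pause required after a request finishes
    private long nextFetchTimeMs = Long.MIN_VALUE; // earliest time the next request may start

    PoliteQueueSketch(int maxConcurrent, long crawlDelayMs) {
        this.maxConcurrent = maxConcurrent;
        this.crawlDelayMs = crawlDelayMs;
    }

    synchronized void add(String url) {
        pending.add(url);
    }

    /** Returns an eligible item, or null if the host is busy, still delayed, or empty. */
    synchronized String next(long nowMs) {
        if (inProgress.size() >= maxConcurrent) {
            return null; // too many requests already in flight
        }
        if (nowMs < nextFetchTimeMs) {
            return null; // crawl delay since the last finished request has not elapsed
        }
        String url = pending.poll();
        if (url != null) {
            inProgress.add(url);
        }
        return url;
    }

    /** Records that a request finished and starts the crawl delay from that moment. */
    synchronized void finish(String url, long nowMs) {
        if (inProgress.remove(url)) {
            nextFetchTimeMs = nowMs + crawlDelayMs;
        }
    }

    synchronized int size() {
        return pending.size() + inProgress.size();
    }
}
```

A consumer thread would call `next(...)` with the current time, fetch the returned item, then call `finish(...)`. A null return means "try another host's queue or wait", not "this queue is done".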

If there are still unfetched items in the queues, but none of the items are ready, FetcherThread-s will spin-wait until either some items become available, or a timeout is reached (at which point the Fetcher will abort, assuming the task is hung).
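The wait-or-abort behaviour can be sketched as a polling loop with a progress timeout. This is an illustration only: `SpinWaitSketch`, `drain`, the `Outcome` values and the `pollMs` parameter are hypothetical names, and how the real Fetcher measures its timeout is not specified here.

```java
import java.util.function.IntSupplier;
import java.util.function.Supplier;

// Hypothetical sketch: poll for ready items and give up if nothing progresses in time.
final class SpinWaitSketch {
    enum Outcome { FINISHED, TIMED_OUT, INTERRUPTED }

    /**
     * Repeatedly asks for the next ready item while items remain.
     * nextItem returns null when items exist but none is ready yet.
     */
    static Outcome drain(Supplier<String> nextItem, IntSupplier remaining,
                         long timeoutMs, long pollMs) {
        long lastProgress = System.nanoTime();
        while (remaining.getAsInt() > 0) {
            String item = nextItem.get();
            if (item != null) {
                // A real consumer would fetch the item here.
                lastProgress = System.nanoTime();
                continue;
            }
            long idleNanos = System.nanoTime() - lastProgress;
            if (idleNanos > timeoutMs * 1_000_000L) {
                return Outcome.TIMED_OUT; // nothing became ready: assume the task is hung
            }
            try {
                Thread.sleep(pollMs); // brief pause before asking again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Outcome.INTERRUPTED;
            }
        }
        return Outcome.FINISHED;
    }
}
```

The short sleep between polls is a design choice of this sketch to avoid burning CPU while waiting; the timeout is measured from the last item that was successfully handed out.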

  • Author:
  • Andrzej Bialecki

Nested Class Summary

| Modifier and Type | Class and Description |
| --- | --- |
| static class | Fetcher.InputFormat |

Field Summary

| Modifier and Type | Field and Description |
| --- | --- |
| static String | CONTENT_REDIR |
| static org.slf4j.Logger | LOG |
| static int | PERM_REFRESH_TIME |
| static String | PROTOCOL_REDIR |

Constructor Summary

| Constructor and Description |
| --- |
| Fetcher() |
| Fetcher(org.apache.hadoop.conf.Configuration conf) |

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| void | close() |
| void | configure(org.apache.hadoop.mapred.JobConf job) |
| void | fetch(org.apache.hadoop.fs.Path segment, int threads) |
| static boolean | isParsing(org.apache.hadoop.conf.Configuration conf) |
| static boolean | isStoringContent(org.apache.hadoop.conf.Configuration conf) |
| static void | main(String[] args) Run the fetcher. |
| void | run(org.apache.hadoop.mapred.RecordReader input, org.apache.hadoop.mapred.OutputCollector output, org.apache.hadoop.mapred.Reporter reporter) |
| int | run(String[] args) |


Methods inherited from class org.apache.hadoop.conf.Configured

getConf, setConf


Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait


Methods inherited from interface org.apache.hadoop.conf.Configurable

getConf, setConf

Field Detail


PERM_REFRESH_TIME

public static final int PERM_REFRESH_TIME
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.fetcher.Fetcher.PERM_REFRESH_TIME)       

CONTENT_REDIR

public static final String CONTENT_REDIR
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.fetcher.Fetcher.CONTENT_REDIR)       

PROTOCOL_REDIR

public static final String PROTOCOL_REDIR
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.fetcher.Fetcher.PROTOCOL_REDIR)       

LOG

public static final org.slf4j.Logger LOG

Constructor Detail


Fetcher

public Fetcher()

Fetcher

public Fetcher(org.apache.hadoop.conf.Configuration conf)

Method Detail


configure

public void configure(org.apache.hadoop.mapred.JobConf job)
  - Specified by: 
  - <code>configure</code> in interface <code>org.apache.hadoop.mapred.JobConfigurable</code>        

close

public void close()

isParsing

public static boolean isParsing(org.apache.hadoop.conf.Configuration conf)

isStoringContent

public static boolean isStoringContent(org.apache.hadoop.conf.Configuration conf)

run

public void run(org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,CrawlDatum> input,
       org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,NutchWritable> output,
       org.apache.hadoop.mapred.Reporter reporter)
         throws IOException
  - Specified by: 
  - <code>run</code> in interface <code>org.apache.hadoop.mapred.MapRunnable&lt;org.apache.hadoop.io.Text,CrawlDatum,org.apache.hadoop.io.Text,NutchWritable&gt;</code>
  - Throws: 
  - <code>IOException</code>       

fetch

public void fetch(org.apache.hadoop.fs.Path segment,
         int threads)
           throws IOException
  - Throws: 
  - <code>IOException</code>       

main

public static void main(String[] args)
                 throws Exception

Run the fetcher.

  - Throws: 
  - <code>Exception</code>       

run

public int run(String[] args)
        throws Exception
  - Specified by: 
  - <code>run</code> in interface <code>org.apache.hadoop.util.Tool</code> 
  - Throws: 
  - <code>Exception</code>      

Copyright © 2014 The Apache Software Foundation