
org.apache.nutch.crawl

Class Generator

  • java.lang.Object
    • org.apache.hadoop.conf.Configured
      • org.apache.nutch.crawl.Generator

All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class Generator
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool

Generates a subset of a crawl db to fetch. This version can generate fetchlists for several segments in one run. Unlike the initial version (OldGenerator), IP resolution is done ONLY on the entries that have been selected for fetching. Within a segment, the URLs are partitioned by IP, domain or host. The mode used to count URLs when limiting the number of entries, i.e. by domain or by host, can be chosen separately.
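The counting behaviour described above can be illustrated with a minimal, standalone sketch. It is not Nutch code: the class `CountLimitSketch`, the enum `CountMode`, and the methods `countKey` and `limit` are hypothetical names, and the "domain" is approximated naively as the last two labels of the host name, which is an assumption made only for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: limits how many URLs are kept per host or per domain.
class CountLimitSketch {
    enum CountMode { HOST, DOMAIN }

    // Returns the key under which a URL is counted.
    // Assumption: the "domain" is the last two labels of the host name;
    // multi-label public suffixes such as "co.uk" are not handled.
    static String countKey(String url, CountMode mode) {
        String host = URI.create(url).getHost();
        if (mode == CountMode.HOST) {
            return host;
        }
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // Keeps at most maxCount URLs per counting key, preserving input order.
    static List<String> limit(List<String> urls, CountMode mode, int maxCount) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            int count = seen.merge(countKey(url, mode), 1, Integer::sum);
            if (count <= maxCount) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://a.example.com/1",
            "http://a.example.com/2",
            "http://b.example.com/1",
            "http://other.org/1");
        // Counting by host keeps one URL for each of the three hosts.
        System.out.println(limit(urls, CountMode.HOST, 1));
        // Counting by domain keeps one URL for example.com and one for other.org.
        System.out.println(limit(urls, CountMode.DOMAIN, 1));
    }
}
```

With a limit of one, counting by host keeps three of the four sample URLs, while counting by domain keeps only two, because both `example.com` hosts share one counter.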

Nested Class Summary

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static class | Generator.CrawlDbUpdater | Update the CrawlDB so that the next generate won't include the same URLs. |
| static class | Generator.DecreasingFloatComparator | |
| static class | Generator.GeneratorOutputFormat | |
| static class | Generator.HashComparator | Sort fetch lists by hash of URL. |
| static class | Generator.PartitionReducer | |
| static class | Generator.Selector | Selects entries due for fetch. |
| static class | Generator.SelectorEntry | |
| static class | Generator.SelectorInverseMapper | |

Field Summary

| Modifier and Type | Field |
| --- | --- |
| static String | GENERATE_MAX_PER_HOST_BY_IP |
| static String | GENERATE_UPDATE_CRAWLDB |
| static String | GENERATOR_COUNT_MODE |
| static String | GENERATOR_COUNT_VALUE_DOMAIN |
| static String | GENERATOR_COUNT_VALUE_HOST |
| static String | GENERATOR_CUR_TIME |
| static String | GENERATOR_DELAY |
| static String | GENERATOR_FILTER |
| static String | GENERATOR_MAX_COUNT |
| static String | GENERATOR_MAX_NUM_SEGMENTS |
| static String | GENERATOR_MIN_INTERVAL |
| static String | GENERATOR_MIN_SCORE |
| static String | GENERATOR_NORMALISE |
| static String | GENERATOR_RESTRICT_STATUS |
| static String | GENERATOR_TOP_N |
| static org.slf4j.Logger | LOG |

Constructor Summary

| Constructor |
| --- |
| Generator() |
| Generator(org.apache.hadoop.conf.Configuration conf) |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| org.apache.hadoop.fs.Path[] | generate(org.apache.hadoop.fs.Path dbDir, org.apache.hadoop.fs.Path segments, int numLists, long topN, long curTime) | |
| org.apache.hadoop.fs.Path[] | generate(org.apache.hadoop.fs.Path dbDir, org.apache.hadoop.fs.Path segments, int numLists, long topN, long curTime, boolean filter, boolean force) | Old signature kept for compatibility. It does not specify whether to normalise, and it sets the number of segments to 1. |
| org.apache.hadoop.fs.Path[] | generate(org.apache.hadoop.fs.Path dbDir, org.apache.hadoop.fs.Path segments, int numLists, long topN, long curTime, boolean filter, boolean norm, boolean force, int maxNumSegments) | Generate fetchlists in one or more segments. |
| static String | generateSegmentName() | |
| static void | main(String[] args) | Generate a fetchlist from the crawldb. |
| int | run(String[] args) | |

-    

Methods inherited from class org.apache.hadoop.conf.Configured

getConf, setConf

-    

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

-    

Methods inherited from interface org.apache.hadoop.conf.Configurable

getConf, setConf

Field Detail

-  

LOG

public static final org.slf4j.Logger LOG
-  

GENERATE_UPDATE_CRAWLDB

public static final String GENERATE_UPDATE_CRAWLDB
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATE_UPDATE_CRAWLDB)       
-  

GENERATOR_MIN_SCORE

public static final String GENERATOR_MIN_SCORE
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_MIN_SCORE)       
-  

GENERATOR_MIN_INTERVAL

public static final String GENERATOR_MIN_INTERVAL
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_MIN_INTERVAL)       
-  

GENERATOR_RESTRICT_STATUS

public static final String GENERATOR_RESTRICT_STATUS
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_RESTRICT_STATUS)       
-  

GENERATOR_FILTER

public static final String GENERATOR_FILTER
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_FILTER)       
-  

GENERATOR_NORMALISE

public static final String GENERATOR_NORMALISE
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_NORMALISE)       
-  

GENERATOR_MAX_COUNT

public static final String GENERATOR_MAX_COUNT
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_MAX_COUNT)       
-  

GENERATOR_COUNT_MODE

public static final String GENERATOR_COUNT_MODE
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_COUNT_MODE)       
-  

GENERATOR_COUNT_VALUE_DOMAIN

public static final String GENERATOR_COUNT_VALUE_DOMAIN
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_COUNT_VALUE_DOMAIN)       
-  

GENERATOR_COUNT_VALUE_HOST

public static final String GENERATOR_COUNT_VALUE_HOST
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_COUNT_VALUE_HOST)       
-  

GENERATOR_TOP_N

public static final String GENERATOR_TOP_N
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_TOP_N)       
-  

GENERATOR_CUR_TIME

public static final String GENERATOR_CUR_TIME
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_CUR_TIME)       
-  

GENERATOR_DELAY

public static final String GENERATOR_DELAY
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_DELAY)       
-  

GENERATOR_MAX_NUM_SEGMENTS

public static final String GENERATOR_MAX_NUM_SEGMENTS
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_MAX_NUM_SEGMENTS)       
-  

GENERATE_MAX_PER_HOST_BY_IP

public static final String GENERATE_MAX_PER_HOST_BY_IP
  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATE_MAX_PER_HOST_BY_IP)       

Constructor Detail

-  

Generator

public Generator()
-  

Generator

public Generator(org.apache.hadoop.conf.Configuration conf)

Method Detail

-  

generate

public org.apache.hadoop.fs.Path[] generate(org.apache.hadoop.fs.Path dbDir,
                                   org.apache.hadoop.fs.Path segments,
                                   int numLists,
                                   long topN,
                                   long curTime)
                                     throws IOException
  - Throws: 
  - <code>IOException</code>       
-  

generate

public org.apache.hadoop.fs.Path[] generate(org.apache.hadoop.fs.Path dbDir,
                                   org.apache.hadoop.fs.Path segments,
                                   int numLists,
                                   long topN,
                                   long curTime,
                                   boolean filter,
                                   boolean force)
                                     throws IOException

Old signature kept for compatibility. It does not specify whether to normalise, and it sets the number of segments to 1.

  - Throws: 
  - <code>IOException</code>       
-  

generate

public org.apache.hadoop.fs.Path[] generate(org.apache.hadoop.fs.Path dbDir,
                                   org.apache.hadoop.fs.Path segments,
                                   int numLists,
                                   long topN,
                                   long curTime,
                                   boolean filter,
                                   boolean norm,
                                   boolean force,
                                   int maxNumSegments)
                                     throws IOException

Generate fetchlists in one or more segments. Whether to filter URLs is read from the crawl.generate.filter property in the configuration files; if the property is not found, the URLs are filtered. The same applies to normalisation.

  - Parameters:
  - <code>dbDir</code> - Crawl database directory
  - <code>segments</code> - Segments directory
  - <code>numLists</code> - Number of reduce tasks
  - <code>topN</code> - Number of top URLs to be selected
  - <code>curTime</code> - Current time in milliseconds 
  - Returns:
  - Paths to the generated segments, or null if no entries were selected
  - Throws: 
  - <code>IOException</code> - When an I/O error occurs       
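
The roles of `topN` and `curTime` can be sketched in isolation. The following is a simplified, in-memory illustration, not the Hadoop job that Generator actually runs. The `Entry` class, the `select` method and the sample data are hypothetical. Ordering candidates by decreasing score is an assumption suggested by the names `Generator.DecreasingFloatComparator` and `GENERATOR_MIN_SCORE`; this page does not state it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: pick the topN highest-scoring entries that are due for fetch.
class TopNSketch {
    // Hypothetical stand-in for a crawl db record; not a Nutch class.
    static final class Entry {
        final String url;
        final float score;
        final long fetchTime; // time in milliseconds at which the URL becomes due

        Entry(String url, float score, long fetchTime) {
            this.url = url;
            this.score = score;
            this.fetchTime = fetchTime;
        }
    }

    static List<String> select(List<Entry> db, long topN, long curTime) {
        // Step 1: keep only entries that are due at curTime.
        List<Entry> due = new ArrayList<>();
        for (Entry e : db) {
            if (e.fetchTime <= curTime) {
                due.add(e);
            }
        }
        // Step 2 (assumption): order by decreasing score.
        due.sort((a, b) -> Float.compare(b.score, a.score));
        // Step 3: keep at most topN URLs.
        List<String> selected = new ArrayList<>();
        for (Entry e : due) {
            if (selected.size() >= topN) {
                break;
            }
            selected.add(e.url);
        }
        return selected;
    }

    // Small hypothetical data set used in main and for checking the sketch.
    static List<Entry> sampleDb() {
        List<Entry> db = new ArrayList<>();
        db.add(new Entry("http://example.com/a", 0.5f, 1000L));
        db.add(new Entry("http://example.com/b", 2.0f, 1000L));
        db.add(new Entry("http://example.com/c", 9.0f, 5000L)); // not yet due at 2000
        db.add(new Entry("http://example.com/d", 1.0f, 500L));
        return db;
    }

    public static void main(String[] args) {
        // Prints [http://example.com/b, http://example.com/d]
        System.out.println(select(sampleDb(), 2, 2000L));
    }
}
```

Entry `c` has the highest score but is excluded because it is not due at the given `curTime`, which is why the time check comes before the ranking.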
-  

generateSegmentName

public static String generateSegmentName()
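
This page does not document the format of the returned name. Nutch segment directories are commonly named after a timestamp such as `20140101123000`, so the sketch below assumes a `yyyyMMddHHmmss` pattern. The class `SegmentNameSketch` and the method `segmentName` are hypothetical and only illustrate that convention; they are not the actual implementation.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Illustrative only: builds a timestamp-style segment name.
class SegmentNameSketch {
    // Assumption: a 14-digit timestamp pattern; the real method may differ.
    static String segmentName(long timeMillis) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        return format.format(new Date(timeMillis));
    }

    public static void main(String[] args) {
        // Prints a 14-digit name derived from the current time in the default time zone.
        System.out.println(segmentName(System.currentTimeMillis()));
    }
}
```

Names built this way sort lexicographically in creation order, which makes it easy to find the most recent segment in a segments directory.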
-  

main

public static void main(String[] args)
                 throws Exception

Generate a fetchlist from the crawldb.

  - Throws: 
  - <code>Exception</code>       
-  

run

public int run(String[] args)
        throws Exception
  - Specified by: 
  - <code>run</code> in interface <code>org.apache.hadoop.util.Tool</code> 
  - Throws: 
  - <code>Exception</code>      

Copyright © 2014 The Apache Software Foundation