org.apache.nutch.crawl
Class Generator
- java.lang.Object
- org.apache.hadoop.conf.Configured
- org.apache.nutch.crawl.Generator
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class Generator extends org.apache.hadoop.conf.Configured implements org.apache.hadoop.util.Tool
Generates a subset of a crawl db to fetch. This version can generate fetchlists for several segments in one go. Unlike the initial version (OldGenerator), IP resolution is done ONLY on the entries that have been selected for fetching. The URLs are partitioned by IP, domain or host within a segment. How URLs are counted when limiting entries (by domain or by host) can be chosen separately.
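The selection idea described above (take the best-scoring entries, but cap how many come from one host) can be illustrated with a small, self-contained sketch. This is not the Nutch Selector implementation: the class name PerHostCapSketch, the Candidate type and the select method are hypothetical, and plain JDK collections stand in for the MapReduce job.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and structure are assumptions, not Nutch code.
class PerHostCapSketch {

    // Hypothetical candidate entry: a URL and its score.
    static final class Candidate {
        final String url;
        final float score;

        Candidate(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    // Returns at most topN URLs in decreasing score order,
    // keeping at most maxPerHost URLs for any single host.
    static List<String> select(List<Candidate> candidates, int topN, int maxPerHost) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        // Highest score first; List.sort is stable, so ties keep input order.
        sorted.sort((a, b) -> Float.compare(b.score, a.score));

        Map<String, Integer> perHost = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (Candidate c : sorted) {
            if (selected.size() >= topN) {
                break;
            }
            String host = URI.create(c.url).getHost();
            // Fall back to the whole URL when no host can be parsed.
            String key = (host == null) ? c.url : host.toLowerCase();
            int seen = perHost.getOrDefault(key, 0);
            if (seen >= maxPerHost) {
                continue; // this host has reached its cap
            }
            perHost.put(key, seen + 1);
            selected.add(c.url);
        }
        return selected;
    }
}
```

Counting by domain instead of host would only change how the map key is derived from the URL; the rest of the loop stays the same.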
Nested Class Summary
Nested Classes
Modifier and Type, Class and Description
static class
Generator.CrawlDbUpdater
Update the CrawlDB so that the next generate won't include the same URLs.
static class
Generator.DecreasingFloatComparator
static class
Generator.GeneratorOutputFormat
static class
Generator.HashComparator
Sort fetch lists by hash of URL.
static class
Generator.PartitionReducer
static class
Generator.Selector
Selects entries due for fetch.
static class
Generator.SelectorEntry
static class
Generator.SelectorInverseMapper
Field Summary
Fields
Modifier and Type, Field and Description
static String
GENERATE_MAX_PER_HOST_BY_IP
static String
GENERATE_UPDATE_CRAWLDB
static String
GENERATOR_COUNT_MODE
static String
GENERATOR_COUNT_VALUE_DOMAIN
static String
GENERATOR_COUNT_VALUE_HOST
static String
GENERATOR_CUR_TIME
static String
GENERATOR_DELAY
static String
GENERATOR_FILTER
static String
GENERATOR_MAX_COUNT
static String
GENERATOR_MAX_NUM_SEGMENTS
static String
GENERATOR_MIN_INTERVAL
static String
GENERATOR_MIN_SCORE
static String
GENERATOR_NORMALISE
static String
GENERATOR_RESTRICT_STATUS
static String
GENERATOR_TOP_N
static org.slf4j.Logger
LOG
Constructor Summary
Constructors
Constructor and Description
Generator()
Generator(org.apache.hadoop.conf.Configuration conf)
Method Summary
Methods
Modifier and Type, Method and Description
org.apache.hadoop.fs.Path[]
generate(org.apache.hadoop.fs.Path dbDir,
org.apache.hadoop.fs.Path segments,
int numLists,
long topN,
long curTime)
org.apache.hadoop.fs.Path[]
generate(org.apache.hadoop.fs.Path dbDir,
org.apache.hadoop.fs.Path segments,
int numLists,
long topN,
long curTime,
boolean filter,
boolean force)
Old signature kept for compatibility. It does not specify whether to normalise, and it sets the number of segments to 1.
org.apache.hadoop.fs.Path[]
generate(org.apache.hadoop.fs.Path dbDir,
org.apache.hadoop.fs.Path segments,
int numLists,
long topN,
long curTime,
boolean filter,
boolean norm,
boolean force,
int maxNumSegments)
Generate fetchlists in one or more segments.
static String
generateSegmentName()
static void
main(String[] args)
Generate a fetchlist from the crawldb.
int
run(String[] args)
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
-
GENERATE_UPDATE_CRAWLDB
public static final String GENERATE_UPDATE_CRAWLDB
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATE_UPDATE_CRAWLDB)
-
GENERATOR_MIN_SCORE
public static final String GENERATOR_MIN_SCORE
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_MIN_SCORE)
-
GENERATOR_MIN_INTERVAL
public static final String GENERATOR_MIN_INTERVAL
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_MIN_INTERVAL)
-
GENERATOR_RESTRICT_STATUS
public static final String GENERATOR_RESTRICT_STATUS
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_RESTRICT_STATUS)
-
GENERATOR_FILTER
public static final String GENERATOR_FILTER
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_FILTER)
-
GENERATOR_NORMALISE
public static final String GENERATOR_NORMALISE
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_NORMALISE)
-
GENERATOR_MAX_COUNT
public static final String GENERATOR_MAX_COUNT
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_MAX_COUNT)
-
GENERATOR_COUNT_MODE
public static final String GENERATOR_COUNT_MODE
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_COUNT_MODE)
-
GENERATOR_COUNT_VALUE_DOMAIN
public static final String GENERATOR_COUNT_VALUE_DOMAIN
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_COUNT_VALUE_DOMAIN)
-
GENERATOR_COUNT_VALUE_HOST
public static final String GENERATOR_COUNT_VALUE_HOST
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_COUNT_VALUE_HOST)
-
GENERATOR_TOP_N
public static final String GENERATOR_TOP_N
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_TOP_N)
-
GENERATOR_CUR_TIME
public static final String GENERATOR_CUR_TIME
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_CUR_TIME)
-
GENERATOR_DELAY
public static final String GENERATOR_DELAY
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_DELAY)
-
GENERATOR_MAX_NUM_SEGMENTS
public static final String GENERATOR_MAX_NUM_SEGMENTS
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATOR_MAX_NUM_SEGMENTS)
-
GENERATE_MAX_PER_HOST_BY_IP
public static final String GENERATE_MAX_PER_HOST_BY_IP
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.crawl.Generator.GENERATE_MAX_PER_HOST_BY_IP)
Constructor Detail
-
Generator
public Generator()
-
Generator
public Generator(org.apache.hadoop.conf.Configuration conf)
Method Detail
-
generate
public org.apache.hadoop.fs.Path[] generate(org.apache.hadoop.fs.Path dbDir, org.apache.hadoop.fs.Path segments, int numLists, long topN, long curTime) throws IOException
- Throws:
- <code>IOException</code>
-
generate
public org.apache.hadoop.fs.Path[] generate(org.apache.hadoop.fs.Path dbDir, org.apache.hadoop.fs.Path segments, int numLists, long topN, long curTime, boolean filter, boolean force) throws IOException
Old signature kept for compatibility. It does not specify whether to normalise, and it sets the number of segments to 1.
- Throws:
- <code>IOException</code>
-
generate
public org.apache.hadoop.fs.Path[] generate(org.apache.hadoop.fs.Path dbDir, org.apache.hadoop.fs.Path segments, int numLists, long topN, long curTime, boolean filter, boolean norm, boolean force, int maxNumSegments) throws IOException
Generate fetchlists in one or more segments. Whether to filter URLs is read from the crawl.generate.filter property in the configuration files. If the property is not found, the URLs are filtered. The same applies to normalisation.
- Parameters:
- <code>dbDir</code> - Crawl database directory
- <code>segments</code> - Segments directory
- <code>numLists</code> - Number of reduce tasks
- <code>topN</code> - Number of top URLs to be selected
- <code>curTime</code> - Current time in milliseconds
- Returns:
- Paths to the generated segments, or null if no entries were selected
- Throws:
- <code>IOException</code> - When an I/O error occurs
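The "filter unless the property says otherwise" rule described for this method can be sketched as a lookup with a default of true. The sketch below is an assumption-laden illustration: java.util.Properties stands in for the Hadoop configuration so that it runs without Hadoop on the classpath, and GenerateFlagSketch and readFlag are hypothetical names. In Hadoop code the equivalent lookup would typically go through Configuration.getBoolean(name, defaultValue).

```java
import java.util.Properties;

// Illustrative sketch only; Properties is a stand-in for the Hadoop Configuration.
class GenerateFlagSketch {

    // Reads a boolean flag, treating a missing property as "enabled",
    // which mirrors the documented default (filter when not configured).
    static boolean readFlag(Properties conf, String name) {
        return Boolean.parseBoolean(conf.getProperty(name, "true"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Property absent: filtering stays on.
        System.out.println(readFlag(conf, "crawl.generate.filter"));

        // Property explicitly set to false: filtering is switched off.
        conf.setProperty("crawl.generate.filter", "false");
        System.out.println(readFlag(conf, "crawl.generate.filter"));
    }
}
```

Running main prints true and then false, showing that only an explicit setting disables filtering.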
-
generateSegmentName
public static String generateSegmentName()
-
main
public static void main(String[] args) throws Exception
Generate a fetchlist from the crawldb.
- Throws:
- <code>Exception</code>
-
run
public int run(String[] args) throws Exception
- Specified by:
- <code>run</code> in interface <code>org.apache.hadoop.util.Tool</code>
- Throws:
- <code>Exception</code>