org.apache.nutch.crawl
Class Injector
- java.lang.Object
- org.apache.hadoop.conf.Configured
- org.apache.nutch.crawl.Injector
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class Injector extends org.apache.hadoop.conf.Configured implements org.apache.hadoop.util.Tool
This class takes a flat file of URLs and adds them to the database of pages to be crawled (the crawlDb). It is useful for bootstrapping the system. Each URL file contains one URL per line, optionally followed by custom metadata. Metadata entries are separated from the URL and from each other by tabs, and each metadata key is separated from its value by '='.
Note that some metadata keys are reserved:
nutch.score: sets a custom score for a specific URL
nutch.fetchInterval: sets a custom fetch interval for a specific URL
nutch.fetchInterval.fixed: sets a custom fetch interval for a specific URL that is not changed by AdaptiveFetchSchedule
e.g. (where \t stands for a tab character) http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
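The seed-line format described above can be illustrated with a small standalone parser. This is only a sketch of the format, not the code Injector itself uses: the class SeedLineSketch, the holder SeedEntry, and the method parse are hypothetical names introduced for this example, and the choice to skip fields that lack a key or an '=' is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative parser for the seed-line format; hypothetical, not Nutch's own code. */
class SeedLineSketch {

    /** Holds the URL and the key=value metadata found on one seed line. */
    static final class SeedEntry {
        final String url;
        final Map<String, String> metadata;

        SeedEntry(String url, Map<String, String> metadata) {
            this.url = url;
            this.metadata = metadata;
        }
    }

    /** Splits a line on tabs: the first field is the URL, the rest are key=value pairs. */
    static SeedEntry parse(String line) {
        String[] fields = line.split("\t");
        String url = fields[0].trim();
        // LinkedHashMap keeps the metadata in the order it appears on the line.
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // assumption: ignore fields with an empty key or no '='
            }
            String key = field.substring(0, eq).trim();
            String value = field.substring(eq + 1).trim();
            metadata.put(key, value);
        }
        return new SeedEntry(url, metadata);
    }

    public static void main(String[] args) {
        SeedEntry entry = parse(
            "http://www.nutch.org/\tnutch.score=10\tnutch.fetchInterval=2592000\tuserType=open_source");
        System.out.println(entry.url);
        System.out.println(entry.metadata);
    }
}
```

In this sketch the reserved keys (nutch.score, nutch.fetchInterval) and the custom key (userType) all end up in the same map; deciding which keys are reserved and how to apply them is left to the caller.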
Nested Class Summary
static class Injector.InjectMapper
Normalize and filter injected URLs.
static class Injector.InjectReducer
Combine multiple new entries for a URL.
Field Summary
static org.slf4j.Logger LOG
static String nutchFetchIntervalMDName
Metadata key reserved for setting a custom fetchInterval for a specific URL.
static String nutchFixedFetchIntervalMDName
Metadata key reserved for setting a fixed custom fetchInterval for a specific URL.
static String nutchScoreMDName
Metadata key reserved for setting a custom score for a specific URL.
Constructor Summary
Injector()
Injector(org.apache.hadoop.conf.Configuration conf)
Method Summary
void inject(org.apache.hadoop.fs.Path crawlDb, org.apache.hadoop.fs.Path urlDir)
static void main(String[] args)
int run(String[] args)
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
LOG
public static final org.slf4j.Logger LOG
nutchScoreMDName
public static String nutchScoreMDName
Metadata key reserved for setting a custom score for a specific URL.
nutchFetchIntervalMDName
public static String nutchFetchIntervalMDName
Metadata key reserved for setting a custom fetchInterval for a specific URL.
nutchFixedFetchIntervalMDName
public static String nutchFixedFetchIntervalMDName
Metadata key reserved for setting a fixed custom fetchInterval for a specific URL.
Constructor Detail
Injector
public Injector()
Injector
public Injector(org.apache.hadoop.conf.Configuration conf)
Method Detail
inject
public void inject(org.apache.hadoop.fs.Path crawlDb, org.apache.hadoop.fs.Path urlDir) throws IOException
- Throws:
- IOException
main
public static void main(String[] args) throws Exception
- Throws:
- Exception
run
public int run(String[] args) throws Exception
- Specified by:
- run in interface org.apache.hadoop.util.Tool
- Throws:
- Exception