org.apache.nutch.crawl

Class Injector

  • java.lang.Object
    • org.apache.hadoop.conf.Configured
      • org.apache.nutch.crawl.Injector

All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class Injector
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool

This class takes a flat file of URLs and adds them to the crawl database (crawlDb) of pages to be crawled. It is useful for bootstrapping the system. Each URL file contains one URL per line, optionally followed by custom metadata. Metadata entries are separated from the URL and from each other by tabs, and each entry has the form key=value.
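As an illustration of the line format described above, the following sketch builds one seed line from a URL and a metadata map. `SeedLineFormatter` and its `format` method are hypothetical names introduced here and are not part of Nutch; the `lang` metadata key is likewise made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Nutch: formats one line of a seed URL file.
class SeedLineFormatter {

    /** Joins the URL and every key=value pair with tab characters. */
    static String format(String url, Map<String, String> metadata) {
        StringJoiner line = new StringJoiner("\t");
        line.add(url);
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            line.add(entry.getKey() + "=" + entry.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the metadata in insertion order.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("userType", "open_source");
        metadata.put("lang", "en"); // made-up custom key
        System.out.println(format("http://www.nutch.org/", metadata));
    }
}
```

A URL with an empty metadata map yields a line containing only the URL, which matches the "optionally followed by" wording.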

Note that some metadata keys are reserved:

  • nutch.score: sets a custom score for a specific URL

  • nutch.fetchInterval: sets a custom fetch interval for a specific URL

  • nutch.fetchInterval.fixed: sets a custom fetch interval for a specific URL that is not changed by AdaptiveFetchSchedule

For example (\t stands for a tab character):

    http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
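To make the example line concrete, here is a minimal sketch of how such a line could be split into a URL and its metadata, and how reserved keys could be told apart from custom ones. It is not the Injector's actual implementation: `SeedLineParser` and its methods are hypothetical, and skipping fields that lack a key is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the Injector's actual code.
class SeedLineParser {
    // Reserved keys as listed in the class description.
    static final String SCORE_KEY = "nutch.score";
    static final String FETCH_INTERVAL_KEY = "nutch.fetchInterval";
    static final String FIXED_FETCH_INTERVAL_KEY = "nutch.fetchInterval.fixed";

    /** Returns the first tab-separated field, which holds the URL. */
    static String url(String line) {
        return line.split("\t")[0].trim();
    }

    /** Returns the key=value fields that follow the URL, in file order. */
    static Map<String, String> metadata(String line) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String[] fields = line.split("\t");
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // assumption: ignore fields without a key
            }
            metadata.put(field.substring(0, eq).trim(), field.substring(eq + 1).trim());
        }
        return metadata;
    }

    /** True if the key is one of the three reserved metadata keys. */
    static boolean isReserved(String key) {
        return SCORE_KEY.equals(key)
                || FETCH_INTERVAL_KEY.equals(key)
                || FIXED_FETCH_INTERVAL_KEY.equals(key);
    }

    public static void main(String[] args) {
        String line = "http://www.nutch.org/\tnutch.score=10"
                + "\tnutch.fetchInterval=2592000\tuserType=open_source";
        System.out.println("url = " + url(line));
        for (Map.Entry<String, String> entry : metadata(line).entrySet()) {
            String kind = isReserved(entry.getKey()) ? "reserved" : "custom";
            System.out.println(kind + ": " + entry.getKey() + " = " + entry.getValue());
        }
    }
}
```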

Nested Class Summary

Nested Classes (modifier and type, class, description):

  • static class Injector.InjectMapper: Normalize and filter injected urls.
  • static class Injector.InjectReducer: Combine multiple new entries for a url.

Field Summary

Fields (modifier and type, field, description):

  • static org.slf4j.Logger LOG
  • static String nutchFetchIntervalMDName: metadata key reserved for setting a custom fetchInterval for a specific URL
  • static String nutchFixedFetchIntervalMDName: metadata key reserved for setting a fixed custom fetchInterval for a specific URL
  • static String nutchScoreMDName: metadata key reserved for setting a custom score for a specific URL

Constructor Summary

Constructors:

  • Injector()
  • Injector(org.apache.hadoop.conf.Configuration conf)

Method Summary

Methods (modifier and type, method):

  • void inject(org.apache.hadoop.fs.Path crawlDb, org.apache.hadoop.fs.Path urlDir)
  • static void main(String[] args)
  • int run(String[] args)

Methods inherited from class org.apache.hadoop.conf.Configured

getConf, setConf

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable

getConf, setConf

Field Detail

LOG

public static final org.slf4j.Logger LOG

nutchScoreMDName

public static String nutchScoreMDName

metadata key reserved for setting a custom score for a specific URL

nutchFetchIntervalMDName

public static String nutchFetchIntervalMDName

metadata key reserved for setting a custom fetchInterval for a specific URL

nutchFixedFetchIntervalMDName

public static String nutchFixedFetchIntervalMDName

metadata key reserved for setting a fixed custom fetchInterval for a specific URL

Constructor Detail

Injector

public Injector()

Injector

public Injector(org.apache.hadoop.conf.Configuration conf)

Method Detail

inject

public void inject(org.apache.hadoop.fs.Path crawlDb,
                   org.apache.hadoop.fs.Path urlDir)
            throws IOException
  - Throws: IOException

main

public static void main(String[] args)
                 throws Exception
  - Throws: Exception

run

public int run(String[] args)
        throws Exception
  - Specified by: run in interface org.apache.hadoop.util.Tool
  - Throws: Exception

Copyright © 2014 The Apache Software Foundation