org.apache.nutch.protocol.file

Class File

    • All Implemented Interfaces:
    • org.apache.hadoop.conf.Configurable, Pluggable, Protocol

public class File
extends Object
implements Protocol

This class is a protocol plugin for the file: scheme. It creates a FileResponse object and gets the content of the URL from it. The configurable parameters are file.content.limit and file.crawl.parent, defined in the "file properties" section of nutch-default.xml.

  • Author:
  • John Xing

Field Summary

| Modifier and Type | Field and Description |
|---|---|
| <code>static org.slf4j.Logger</code> | <code>LOG</code> |

Fields inherited from interface org.apache.nutch.protocol.Protocol

CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID

Constructor Summary

| Constructor and Description |
|---|
| <code>File()</code> |

Method Summary

| Modifier and Type | Method and Description |
|---|---|
| <code>org.apache.hadoop.conf.Configuration</code> | <code>getConf()</code> Get the Configuration object. |
| <code>ProtocolOutput</code> | <code>getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum)</code> Creates a FileResponse object corresponding to the URL and returns a ProtocolOutput object for the content received. |
| <code>crawlercommons.robots.BaseRobotRules</code> | <code>getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum)</code> No robots parsing is done for the file protocol. |
| <code>static void</code> | <code>main(String[] args)</code> Quick way to run this class. |
| <code>void</code> | <code>setConf(org.apache.hadoop.conf.Configuration conf)</code> Set the Configuration object. |
| <code>void</code> | <code>setMaxContentLength(int maxContentLength)</code> Set the length after which content is truncated. |

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

LOG

public static final org.slf4j.Logger LOG

Constructor Detail

File

public File()

Method Detail

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

Set the Configuration object

  - Specified by: 
  - <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>        

getConf

public org.apache.hadoop.conf.Configuration getConf()

Get the Configuration object

  - Specified by: 
  - <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>        

setMaxContentLength

public void setMaxContentLength(int maxContentLength)

Set the length after which content is truncated.
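
To illustrate what truncating at a maximum content length means, here is a minimal, stdlib-only sketch. It is not Nutch's implementation: `TruncationSketch` and `readLimited` are hypothetical names, and the rule that a negative limit disables truncation is an assumption made for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class TruncationSketch {

    /**
     * Reads at most maxContentLength bytes from the given file.
     * Assumption for this sketch: a negative limit means "do not truncate".
     */
    public static byte[] readLimited(Path file, int maxContentLength) throws IOException {
        if (maxContentLength < 0) {
            return Files.readAllBytes(file);
        }
        // Never allocate more than the file actually holds.
        int cap = (int) Math.min((long) maxContentLength, Files.size(file));
        byte[] buf = new byte[cap];
        int off = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (off < buf.length) {
                int n = in.read(buf, off, buf.length - off);
                if (n < 0) {
                    break;
                }
                off += n;
            }
        }
        return Arrays.copyOf(buf, off);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("truncation-sketch", ".txt");
        try {
            Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
            System.out.println(readLimited(tmp, 4).length);  // truncated read
            System.out.println(readLimited(tmp, -1).length); // full read
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Capping the buffer at the file size keeps a very large limit from allocating memory the file can never fill.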

getProtocolOutput

public ProtocolOutput getProtocolOutput(org.apache.hadoop.io.Text url,
                               CrawlDatum datum)

Creates a FileResponse object corresponding to the URL and returns a ProtocolOutput object for the content received.

  - Specified by: 
  - <code>getProtocolOutput</code> in interface <code>Protocol</code> 
  - Parameters:
  - <code>url</code> - Text containing the url
  - <code>datum</code> - The CrawlDatum object corresponding to the url 
  - Returns:
  - [<code>ProtocolOutput</code>](../../../../../org/apache/nutch/protocol/ProtocolOutput.html) object for the content of the file indicated by url       
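
This page does not describe how FileResponse resolves the URL. As a rough illustration of what mapping a file: URL to local content involves, the stdlib-only sketch below converts such a URL to a local path. `FileUrlSketch` and `toLocalPath` are hypothetical names and are not part of Nutch.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileUrlSketch {

    /**
     * Maps a file: URL to a local filesystem path.
     * Rejects any other scheme, since this sketch only covers file: URLs.
     */
    public static Path toLocalPath(String url) {
        URI uri = URI.create(url);
        if (!"file".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("Not a file: URL: " + url);
        }
        // getPath() returns the percent-decoded path component.
        return Paths.get(uri.getPath());
    }

    public static void main(String[] args) {
        System.out.println(toLocalPath("file:/tmp/example.txt"));
        System.out.println(toLocalPath("file:///tmp/my%20notes.txt"));
    }
}
```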

main

public static void main(String[] args)
                 throws Exception

Quick way to run this class. Useful for debugging.

  - Throws: 
  - <code>Exception</code>       

getRobotRules

public crawlercommons.robots.BaseRobotRules getRobotRules(org.apache.hadoop.io.Text url,
                                                 CrawlDatum datum)

No robots parsing is done for the file protocol, so this returns an empty set of rules that allows every URL.

  - Specified by: 
  - <code>getRobotRules</code> in interface <code>Protocol</code> 
  - Parameters:
  - <code>url</code> - url to check
  - <code>datum</code> - page datum 
  - Returns:
  - robot rules (specific for this url or default), never null      

Copyright © 2014 The Apache Software Foundation