org.apache.nutch.protocol

Interface Protocol


public interface Protocol
extends Pluggable, org.apache.hadoop.conf.Configurable

A retriever of URL content. Implemented by protocol extensions.

Field Summary

| Modifier and Type | Field and Description |
| --- | --- |
| static String | CHECK_BLOCKING: Property name. |
| static String | CHECK_ROBOTS: Property name. |
| static String | X_POINT_ID: The name of the extension point. |

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| ProtocolOutput | <code>getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum)</code>: Returns the Content for a fetchlist entry. |
| crawlercommons.robots.BaseRobotRules | <code>getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum)</code>: Retrieve robot rules applicable for this url. |

Methods inherited from interface org.apache.hadoop.conf.Configurable

getConf, setConf

Field Detail

X_POINT_ID

static final String X_POINT_ID

The name of the extension point.

CHECK_BLOCKING

static final String CHECK_BLOCKING

Property name. If in the current configuration this property is set to true, protocol implementations should handle "politeness" limits internally. If this is set to false, it is assumed that these limits are enforced elsewhere, and protocol implementations should not enforce them internally.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.protocol.Protocol.CHECK_BLOCKING)       

CHECK_ROBOTS

static final String CHECK_ROBOTS

Property name. If in the current configuration this property is set to true, protocol implementations should handle robot exclusion rules internally. If this is set to false, it is assumed that these rules are enforced elsewhere, and protocol implementations should not enforce them internally.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.protocol.Protocol.CHECK_ROBOTS)       
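
The two property names above are boolean switches that a protocol implementation reads from its configuration. The following is a minimal sketch of that lookup, not Nutch code: `java.util.Properties` stands in for the Hadoop configuration object so the example runs without Nutch on the classpath, the helper `flag` is hypothetical, and both the string values of the two constants and the default of `true` are assumptions to be checked against the Constant Field Values page.

```java
import java.util.Properties;

public class ProtocolFlagsSketch {
    // Assumed string values of Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS;
    // verify them against the Constant Field Values page before relying on them.
    static final String CHECK_BLOCKING = "protocol.plugin.check.blocking";
    static final String CHECK_ROBOTS = "protocol.plugin.check.robots";

    /** Hypothetical helper: reads a boolean flag, using the default when it is unset. */
    static boolean flag(Properties conf, String name, boolean defaultValue) {
        String raw = conf.getProperty(name);
        return raw == null ? defaultValue : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Politeness limits are enforced elsewhere, so the protocol should not enforce them.
        conf.setProperty(CHECK_BLOCKING, "false");

        boolean enforcePoliteness = flag(conf, CHECK_BLOCKING, true);
        boolean enforceRobots = flag(conf, CHECK_ROBOTS, true); // unset, so the default applies

        System.out.println("enforce politeness internally: " + enforcePoliteness);
        System.out.println("enforce robot rules internally: " + enforceRobots);
    }
}
```

An implementation would typically read both flags once, when its configuration is set, and consult them on every request.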

Method Detail

getProtocolOutput

ProtocolOutput getProtocolOutput(org.apache.hadoop.io.Text url,
                               CrawlDatum datum)

Returns the Content for a fetchlist entry.

getRobotRules

crawlercommons.robots.BaseRobotRules getRobotRules(org.apache.hadoop.io.Text url,
                                                 CrawlDatum datum)

Retrieve robot rules applicable for this url.

  - Parameters:
    - <code>url</code> - url to check
    - <code>datum</code> - page datum
  - Returns:
    - robot rules (specific for this url or default), never null
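
One way a caller might combine the two methods is to consult the robot rules first and fetch only when the URL is allowed. The sketch below illustrates that flow under heavy simplification and is not the Nutch fetcher: `Rules` and `MiniProtocol` are hypothetical stand-ins for `BaseRobotRules` and `Protocol`, plain `String` replaces `Text` and `ProtocolOutput`, the `CrawlDatum` argument is dropped, and `demoProtocol` is a toy implementation that makes no network requests.

```java
public class RobotsThenFetchSketch {
    /** Hypothetical stand-in for crawlercommons.robots.BaseRobotRules. */
    interface Rules {
        boolean isAllowed(String url);
    }

    /** Hypothetical stand-in for Protocol: String replaces Text, the datum is omitted. */
    interface MiniProtocol {
        Rules getRobotRules(String url);

        String getProtocolOutput(String url);
    }

    /** Returns the fetched content, or null when the robot rules disallow the URL. */
    static String fetchIfAllowed(MiniProtocol protocol, String url) {
        // The documented contract says the rules are never null, so no null check here.
        Rules rules = protocol.getRobotRules(url);
        if (!rules.isAllowed(url)) {
            return null;
        }
        return protocol.getProtocolOutput(url);
    }

    /** Toy implementation: disallows any URL containing "/private/". */
    static MiniProtocol demoProtocol() {
        return new MiniProtocol() {
            @Override
            public Rules getRobotRules(String url) {
                return candidate -> !candidate.contains("/private/");
            }

            @Override
            public String getProtocolOutput(String url) {
                return "content of " + url;
            }
        };
    }

    public static void main(String[] args) {
        MiniProtocol protocol = demoProtocol();
        System.out.println(fetchIfAllowed(protocol, "http://example.com/index.html"));
        System.out.println(fetchIfAllowed(protocol, "http://example.com/private/a.html"));
    }
}
```

Because `getRobotRules` is documented to return default rules rather than null, the caller can skip a null check and treat "no robots file" like any other rule set.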

Copyright © 2014 The Apache Software Foundation