org.apache.nutch.protocol
Interface Protocol
public interface Protocol extends Pluggable, org.apache.hadoop.conf.Configurable
A retriever of URL content. Implemented by protocol extensions.
Field Summary
| Modifier and Type | Field | Description |
| --- | --- | --- |
| static String | CHECK_BLOCKING | Property name. |
| static String | CHECK_ROBOTS | Property name. |
| static String | X_POINT_ID | The name of the extension point. |
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| ProtocolOutput | getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum) | Returns the Content for a fetchlist entry. |
| crawlercommons.robots.BaseRobotRules | getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum) | Retrieve robot rules applicable for this url. |
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
X_POINT_ID
static final String X_POINT_ID
The name of the extension point.
CHECK_BLOCKING
static final String CHECK_BLOCKING
Property name. If this property is set to true in the current configuration, protocol implementations should enforce "politeness" limits internally. If it is set to false, these limits are assumed to be enforced elsewhere, and protocol implementations should not enforce them internally.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.protocol.Protocol.CHECK_BLOCKING)
CHECK_ROBOTS
static final String CHECK_ROBOTS
Property name. If this property is set to true in the current configuration, protocol implementations should enforce robot exclusion rules internally. If it is set to false, these rules are assumed to be enforced elsewhere, and protocol implementations should not enforce them internally.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.protocol.Protocol.CHECK_ROBOTS)
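As a rough illustration of how the two flags above might be consumed, the sketch below reads both properties once and stores them as booleans. It is not the Nutch implementation: `java.util.Properties` stands in for the Hadoop `Configuration` so the example runs without Nutch or Hadoop on the classpath, the class name `ProtocolFlagsSketch` is hypothetical, and both the property strings and the default of `true` are assumptions (see Constant Field Values for the actual strings).

```java
import java.util.Properties;

// Hypothetical sketch; Properties is a stand-in for org.apache.hadoop.conf.Configuration.
final class ProtocolFlagsSketch {
    // Assumed property names, not confirmed by this page; check Constant Field Values.
    static final String CHECK_BLOCKING = "protocol.plugin.check.blocking";
    static final String CHECK_ROBOTS = "protocol.plugin.check.robots";

    // true: this implementation enforces politeness limits itself.
    final boolean checkBlocking;
    // true: this implementation enforces robot exclusion rules itself.
    final boolean checkRobots;

    ProtocolFlagsSketch(Properties conf) {
        // The fallback value "true" is an assumption made for this sketch.
        this.checkBlocking = Boolean.parseBoolean(conf.getProperty(CHECK_BLOCKING, "true"));
        this.checkRobots = Boolean.parseBoolean(conf.getProperty(CHECK_ROBOTS, "true"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // A caller that enforces politeness elsewhere switches the flag off.
        conf.setProperty(CHECK_BLOCKING, "false");

        ProtocolFlagsSketch flags = new ProtocolFlagsSketch(conf);
        System.out.println("enforce politeness internally: " + flags.checkBlocking);
        System.out.println("enforce robots rules internally: " + flags.checkRobots);
    }
}
```

Reading the flags once at configuration time keeps the per-URL fetch path free of repeated lookups. With a real Hadoop `Configuration`, `getBoolean(name, defaultValue)` would play the role of the `Properties` lookup shown here.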
Method Detail
getProtocolOutput
ProtocolOutput getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum)
Returns the Content for a fetchlist entry.
getRobotRules
crawlercommons.robots.BaseRobotRules getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum)
Retrieves the robot rules applicable to this URL.
- Parameters:
- <code>url</code> - URL to check
- <code>datum</code> - page datum
- Returns:
- robot rules (specific for this url or default), never null
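Because the returned rules are never null, a caller can consult them without a null check before fetching. The sketch below shows one way a caller might combine the two methods. It uses hypothetical stand-ins so it runs on its own: `Rules` replaces `crawlercommons.robots.BaseRobotRules` (keeping only an `isAllowed(String)` check), `ProtocolLike` replaces `Protocol` with `String` in place of `Text`, no `CrawlDatum`, and a plain `String` in place of `ProtocolOutput`. The helper `fetchIfAllowed` is likewise an invention for illustration.

```java
import java.util.Optional;

// Hypothetical sketch of a caller that gates a fetch on robot rules.
final class FetchGateSketch {
    /** Stand-in for crawlercommons.robots.BaseRobotRules (assumption: only isAllowed is modelled). */
    interface Rules {
        boolean isAllowed(String url);
    }

    /** Stand-in for the two Protocol methods, with simplified parameter and return types. */
    interface ProtocolLike {
        Rules getRobotRules(String url);      // contract: never returns null
        String getProtocolOutput(String url); // stand-in for ProtocolOutput
    }

    /** Returns the fetched content, or an empty Optional when the rules deny the URL. */
    static Optional<String> fetchIfAllowed(ProtocolLike protocol, String url) {
        // No null check needed: the rules are either URL-specific or a default set.
        Rules rules = protocol.getRobotRules(url);
        if (!rules.isAllowed(url)) {
            return Optional.empty();
        }
        return Optional.of(protocol.getProtocolOutput(url));
    }

    public static void main(String[] args) {
        // Toy protocol: denies anything under /private/ and fabricates content otherwise.
        ProtocolLike demo = new ProtocolLike() {
            @Override
            public Rules getRobotRules(String url) {
                return u -> !u.contains("/private/");
            }

            @Override
            public String getProtocolOutput(String url) {
                return "content of " + url;
            }
        };

        System.out.println(fetchIfAllowed(demo, "http://example.com/index.html"));
        System.out.println(fetchIfAllowed(demo, "http://example.com/private/a.html"));
    }
}
```

Whether a check like this belongs in the caller or inside the protocol implementation depends on the CHECK_ROBOTS setting described under Field Detail.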