[TOC]

org.apache.nutch.protocol.http.api

Class HttpRobotRulesParser

    • All Implemented Interfaces: org.apache.hadoop.conf.Configurable

public class HttpRobotRulesParser
extends RobotRulesParser

This class is used for parsing robots.txt files for URLs belonging to the HTTP protocol. It extends the generic RobotRulesParser class and contains the HTTP protocol-specific implementation for obtaining the robots.txt file.
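For orientation, a parser is typically constructed from a Nutch/Hadoop Configuration. The sketch below is illustrative only; it assumes the standard `NutchConfiguration` helper for loading the Nutch property files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

public class RobotsParserSetup {
  public static void main(String[] args) {
    // Load nutch-default.xml / nutch-site.xml into a Hadoop Configuration.
    Configuration conf = NutchConfiguration.create();

    // The constructor hands the configuration to setConf(conf), which reads
    // the robots-related properties.
    HttpRobotRulesParser robotsParser = new HttpRobotRulesParser(conf);

    System.out.println("Parser configured: " + (robotsParser.getConf() != null));
  }
}
```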

Field Summary

| Modifier and Type | Field and Description |
| --- | --- |
| `protected boolean` | `allowForbidden` |
| `static org.slf4j.Logger` | `LOG` |

Fields inherited from class org.apache.nutch.protocol.RobotRulesParser

agentNames, CACHE, EMPTY_RULES, FORBID_ALL_RULES

Constructor Summary

| Constructor and Description |
| --- |
| `HttpRobotRulesParser(org.apache.hadoop.conf.Configuration conf)` |

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `protected static String` | `getCacheKey(URL url)` - Compose a unique key to store and access robot rules in the cache for the given URL |
| `crawlercommons.robots.BaseRobotRules` | `getRobotRulesSet(Protocol http, URL url)` - Get the rules from robots.txt which apply to the given URL |
| `void` | `setConf(org.apache.hadoop.conf.Configuration conf)` - Set the Configuration object |

Methods inherited from class org.apache.nutch.protocol.RobotRulesParser

getConf, getRobotRulesSet, main, parseRules

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

LOG

public static final org.slf4j.Logger LOG

allowForbidden

protected boolean allowForbidden

Constructor Detail

HttpRobotRulesParser

public HttpRobotRulesParser(org.apache.hadoop.conf.Configuration conf)

Method Detail

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

Description copied from class: RobotRulesParser

Set the Configuration object

  - Specified by:
  - `setConf` in interface `org.apache.hadoop.conf.Configurable`
  - Overrides:
  - `setConf` in class `RobotRulesParser`
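As an illustration of where the HTTP-specific robots settings come from, here is a hedged sketch. It assumes the `http.robots.403.allow` property is what drives the `allowForbidden` field; check nutch-default.xml for the authoritative property names and defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

public class RobotsConfSketch {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();

    // Assumption: http.robots.403.allow is the property behind the
    // allowForbidden field, i.e. whether a 403 (Forbidden) response for
    // robots.txt is treated as "allow all".
    conf.setBoolean("http.robots.403.allow", true);

    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);

    // setConf(conf) can also be called on an existing instance to apply a
    // Configuration, as required by org.apache.hadoop.conf.Configurable.
    parser.setConf(conf);
  }
}
```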

getCacheKey

protected static String getCacheKey(URL url)

Compose a unique key to store and access robot rules in the cache for the given URL.
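The exact key format is an implementation detail, but the idea is a key built from protocol, host, and port. The sketch below is hypothetical (`cacheKeySketch` is illustrative, not the actual Nutch code).

```java
import java.net.MalformedURLException;
import java.net.URL;

public class CacheKeySketch {

  // Illustrative composition only; the real getCacheKey(URL) may differ in
  // detail, e.g. in how missing or default ports are normalized.
  static String cacheKeySketch(URL url) {
    String protocol = url.getProtocol().toLowerCase();
    String host = url.getHost().toLowerCase();
    int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
    return protocol + ":" + host + ":" + port;
  }

  public static void main(String[] args) throws MalformedURLException {
    // Both URLs yield the same key, so they would share one cached
    // robots.txt entry for http://example.com:80.
    System.out.println(cacheKeySketch(new URL("http://example.com/a.html")));
    System.out.println(cacheKeySketch(new URL("http://example.com:80/b.html")));
  }
}
```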

getRobotRulesSet

public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http,
                                                             URL url)

Get the rules from robots.txt which apply to the given URL. Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch `protocol://host:port/robots.txt`. The robots.txt file is then parsed and the rules are cached to avoid re-fetching and re-parsing it. A usage sketch follows the parameter list below.

  - Specified by:
  - `getRobotRulesSet` in class `RobotRulesParser`
  - Parameters:
  - `http` - The [`Protocol`](../../../../../../org/apache/nutch/protocol/Protocol.html) object
  - `url` - URL robots.txt applies to
  - Returns:
  - `BaseRobotRules` holding the rules from robots.txt
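A hedged usage sketch: the HTTP `Protocol` plugin instance is assumed to be obtained elsewhere, e.g. via a protocol plugin lookup; the `obtainHttpProtocol` helper below is hypothetical and only marks that assumption.

```java
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

import crawlercommons.robots.BaseRobotRules;

public class RobotsRulesSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);

    // Assumed to be supplied elsewhere, e.g. by a protocol plugin lookup;
    // it is used to fetch http://example.com/robots.txt on a cache miss.
    Protocol httpProtocol = obtainHttpProtocol(conf);

    URL url = new URL("http://example.com/products/item.html");
    BaseRobotRules rules = parser.getRobotRulesSet(httpProtocol, url);

    if (rules.isAllowed(url.toString())) {
      long crawlDelay = rules.getCrawlDelay(); // Crawl-delay from robots.txt, if any
      // fetch the page, honouring the crawl delay
    }

    // A second call for the same protocol://host:port combination is served
    // from the cache and does not re-fetch robots.txt.
    parser.getRobotRulesSet(httpProtocol, new URL("http://example.com/other.html"));
  }

  // Hypothetical placeholder for obtaining an HTTP Protocol plugin instance.
  static Protocol obtainHttpProtocol(Configuration conf) {
    throw new UnsupportedOperationException("protocol plugin lookup omitted");
  }
}
```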

Copyright © 2014 The Apache Software Foundation