org.apache.nutch.protocol.http.api
Class HttpRobotRulesParser
- java.lang.Object
 - org.apache.nutch.protocol.RobotRulesParser
 - org.apache.nutch.protocol.http.api.HttpRobotRulesParser
 
- All Implemented Interfaces:
 - org.apache.hadoop.conf.Configurable
 
public class HttpRobotRulesParser extends RobotRulesParser
This class is used for parsing robots.txt files for URLs belonging to the HTTP protocol. It extends the generic RobotRulesParser class and contains the HTTP-specific implementation for obtaining the robots.txt file.
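As a rough orientation, here is a minimal construction sketch. It assumes `NutchConfiguration.create()` (Nutch's usual way to obtain a Hadoop `Configuration`), which is not part of this page's documentation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

public class ParserSetup {
  public static void main(String[] args) {
    // Assumption: NutchConfiguration.create() loads nutch-default.xml and
    // nutch-site.xml; any populated Hadoop Configuration would do here.
    Configuration conf = NutchConfiguration.create();
    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
  }
}
```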
Field Summary

| Modifier and Type | Field and Description |
| --- | --- |
| `protected boolean` | `allowForbidden` |
| `static org.slf4j.Logger` | `LOG` |
Fields inherited from class org.apache.nutch.protocol.RobotRulesParser

`agentNames`, `CACHE`, `EMPTY_RULES`, `FORBID_ALL_RULES`
Constructor Summary

| Constructor and Description |
| --- |
| `HttpRobotRulesParser(org.apache.hadoop.conf.Configuration conf)` |
Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `protected static String` | `getCacheKey(URL url)` Compose a unique key to store and access robot rules in the cache for a given URL. |
| `crawlercommons.robots.BaseRobotRules` | `getRobotRulesSet(Protocol http, URL url)` Get the rules from robots.txt which apply to the given URL. |
| `void` | `setConf(org.apache.hadoop.conf.Configuration conf)` Set the Configuration object. |
Methods inherited from class org.apache.nutch.protocol.RobotRulesParser

`getConf`, `getRobotRulesSet`, `main`, `parseRules`
Methods inherited from class java.lang.Object

`clone`, `equals`, `finalize`, `getClass`, `hashCode`, `notify`, `notifyAll`, `toString`, `wait`, `wait`, `wait`
Field Detail

LOG

public static final org.slf4j.Logger LOG

allowForbidden

protected boolean allowForbidden
Constructor Detail

HttpRobotRulesParser

public HttpRobotRulesParser(org.apache.hadoop.conf.Configuration conf)
Method Detail

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

Description copied from class: RobotRulesParser

Set the Configuration object.

- Specified by: `setConf` in interface `org.apache.hadoop.conf.Configurable`
- Overrides: `setConf` in class `RobotRulesParser`
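A hedged configuration sketch follows. It assumes the `http.robots.403.allow` property (a common Nutch setting, not documented on this page) is what drives the `allowForbidden` field:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

public class ConfigSketch {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    // Assumption: when http.robots.403.allow is true, a 403 (Forbidden)
    // response for robots.txt is treated as "no restrictions" rather than
    // "forbid all"; this presumably feeds the allowForbidden field.
    conf.setBoolean("http.robots.403.allow", true);
    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
    // setConf may also be called later to re-apply configuration.
    parser.setConf(conf);
  }
}
```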
getCacheKey

protected static String getCacheKey(URL url)

Compose a unique key to store and access robot rules in the cache for a given URL.
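The exact key format is not documented here; the sketch below is a hypothetical composition from protocol, host, and port, mirroring the cache granularity described under getRobotRulesSet, and is not Nutch's actual code:

```java
import java.net.URL;

public class CacheKeySketch {
  // Hypothetical re-implementation: combines the protocol, host, and port
  // of a URL into a single key, falling back to the protocol's default
  // port when the URL does not name one explicitly.
  static String cacheKey(URL url) {
    int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
    return url.getProtocol().toLowerCase() + ":"
        + url.getHost().toLowerCase() + ":" + port;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(cacheKey(new URL("http://example.com/a/page.html")));
    // prints: http:example.com:80
  }
}
```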
getRobotRulesSet

public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url)

Get the rules from robots.txt which apply to the given URL. Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch `protocol://host:port/robots.txt`. The robots.txt file is then parsed and the rules are cached to avoid re-fetching and re-parsing it.

- Specified by: `getRobotRulesSet` in class `RobotRulesParser`
- Parameters:
  - `http` - The [`Protocol`](../../../../../../org/apache/nutch/protocol/Protocol.html) object
  - `url` - URL robots.txt applies to
- Returns: `BaseRobotRules` holding the rules from robots.txt
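A usage sketch, under stated assumptions: the `Protocol` instance is assumed to come from Nutch's plugin framework (e.g. the protocol-http plugin), and `BaseRobotRules.isAllowed(String)` is the crawler-commons check applied to candidate URLs; neither detail is documented on this page.

```java
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;
import crawlercommons.robots.BaseRobotRules;

public class RobotRulesSketch {
  // httpProtocol is assumed to be obtained elsewhere via the plugin framework.
  static boolean mayFetch(Protocol httpProtocol, String pageUrl) throws Exception {
    Configuration conf = NutchConfiguration.create();
    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
    // The first call for a protocol/host/port combination fetches and parses
    // robots.txt; later calls for the same combination hit the cache.
    BaseRobotRules rules = parser.getRobotRulesSet(httpProtocol, new URL(pageUrl));
    return rules.isAllowed(pageUrl);
  }
}
```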
   