[TOC]

org.apache.nutch.protocol.http.api

Class HttpRobotRulesParser

    • All Implemented Interfaces: org.apache.hadoop.conf.Configurable

public class HttpRobotRulesParser
extends RobotRulesParser

This class is used for parsing robots.txt files for URLs belonging to the HTTP protocol. It extends the generic RobotRulesParser class and contains the HTTP protocol-specific implementation for obtaining the robots.txt file.
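For orientation, a parser is typically constructed from a Nutch/Hadoop Configuration. The sketch below is illustrative only; it assumes the standard `NutchConfiguration` helper for loading the Nutch property files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

public class RobotsParserSetup {
  public static void main(String[] args) {
    // Load nutch-default.xml / nutch-site.xml into a Hadoop Configuration.
    Configuration conf = NutchConfiguration.create();

    // The constructor hands the configuration to setConf(conf), which reads
    // the robots-related properties.
    HttpRobotRulesParser robotsParser = new HttpRobotRulesParser(conf);

    System.out.println("Parser configured: " + (robotsParser.getConf() != null));
  }
}
```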

Field Summary

| Modifier and Type | Field and Description |
| --- | --- |
| `protected boolean` | `allowForbidden` |
| `static org.slf4j.Logger` | `LOG` |

Fields inherited from class org.apache.nutch.protocol.RobotRulesParser

agentNames, CACHE, EMPTY_RULES, FORBID_ALL_RULES

Constructor Summary

| Constructor and Description |
| --- |
| `HttpRobotRulesParser(org.apache.hadoop.conf.Configuration conf)` |

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `protected static String` | `getCacheKey(URL url)` - Compose a unique key to store and access robot rules in the cache for the given URL |
| `crawlercommons.robots.BaseRobotRules` | `getRobotRulesSet(Protocol http, URL url)` - Get the rules from robots.txt which apply to the given URL |
| `void` | `setConf(org.apache.hadoop.conf.Configuration conf)` - Set the Configuration object |

Methods inherited from class org.apache.nutch.protocol.RobotRulesParser

getConf, getRobotRulesSet, main, parseRules

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

LOG

public static final org.slf4j.Logger LOG

allowForbidden

protected boolean allowForbidden

Constructor Detail

HttpRobotRulesParser

public HttpRobotRulesParser(org.apache.hadoop.conf.Configuration conf)

Method Detail

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

Description copied from class: RobotRulesParser

Set the Configuration object

  - Specified by:
  - `setConf` in interface `org.apache.hadoop.conf.Configurable`
  - Overrides:
  - `setConf` in class `RobotRulesParser`
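As an illustration of where the HTTP-specific robots settings come from, here is a hedged sketch. It assumes the `http.robots.403.allow` property is what drives the `allowForbidden` field; check nutch-default.xml for the authoritative property names and defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

public class RobotsConfSketch {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();

    // Assumption: http.robots.403.allow is the property behind the
    // allowForbidden field, i.e. whether a 403 (Forbidden) response for
    // robots.txt is treated as "allow all".
    conf.setBoolean("http.robots.403.allow", true);

    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);

    // setConf(conf) can also be called on an existing instance to apply a
    // Configuration, as required by org.apache.hadoop.conf.Configurable.
    parser.setConf(conf);
  }
}
```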

getCacheKey

protected static String getCacheKey(URL url)

Compose a unique key to store and access robot rules in the cache for the given URL.
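The exact key format is an implementation detail, but the idea is a key built from protocol, host, and port. The sketch below is hypothetical (`cacheKeySketch` is illustrative, not the actual Nutch code).

```java
import java.net.MalformedURLException;
import java.net.URL;

public class CacheKeySketch {

  // Illustrative composition only; the real getCacheKey(URL) may differ in
  // detail, e.g. in how missing or default ports are normalized.
  static String cacheKeySketch(URL url) {
    String protocol = url.getProtocol().toLowerCase();
    String host = url.getHost().toLowerCase();
    int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
    return protocol + ":" + host + ":" + port;
  }

  public static void main(String[] args) throws MalformedURLException {
    // Both URLs yield the same key, so they would share one cached
    // robots.txt entry for http://example.com:80.
    System.out.println(cacheKeySketch(new URL("http://example.com/a.html")));
    System.out.println(cacheKeySketch(new URL("http://example.com:80/b.html")));
  }
}
```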

getRobotRulesSet

public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http,
                                                             URL url)

Get the rules from robots.txt which apply to the given URL. Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch `protocol://host:port/robots.txt`. The robots.txt file is then parsed and the rules are cached to avoid re-fetching and re-parsing it. A usage sketch follows the parameter list below.

  - Specified by:
  - `getRobotRulesSet` in class `RobotRulesParser`
  - Parameters:
  - `http` - The [`Protocol`](../../../../../../org/apache/nutch/protocol/Protocol.html) object
  - `url` - URL robots.txt applies to
  - Returns:
  - `BaseRobotRules` holding the rules from robots.txt
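A hedged usage sketch: the HTTP `Protocol` plugin instance is assumed to be obtained elsewhere, e.g. via a protocol plugin lookup; the `obtainHttpProtocol` helper below is hypothetical and only marks that assumption.

```java
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

import crawlercommons.robots.BaseRobotRules;

public class RobotsRulesSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);

    // Assumed to be supplied elsewhere, e.g. by a protocol plugin lookup;
    // it is used to fetch http://example.com/robots.txt on a cache miss.
    Protocol httpProtocol = obtainHttpProtocol(conf);

    URL url = new URL("http://example.com/products/item.html");
    BaseRobotRules rules = parser.getRobotRulesSet(httpProtocol, url);

    if (rules.isAllowed(url.toString())) {
      long crawlDelay = rules.getCrawlDelay(); // Crawl-delay from robots.txt, if any
      // fetch the page, honouring the crawl delay
    }

    // A second call for the same protocol://host:port combination is served
    // from the cache and does not re-fetch robots.txt.
    parser.getRobotRulesSet(httpProtocol, new URL("http://example.com/other.html"));
  }

  // Hypothetical placeholder for obtaining an HTTP Protocol plugin instance.
  static Protocol obtainHttpProtocol(Configuration conf) {
    throw new UnsupportedOperationException("protocol plugin lookup omitted");
  }
}
```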

Copyright © 2014 The Apache Software Foundation