org.apache.nutch.protocol.http.api
Class HttpRobotRulesParser
- java.lang.Object
- org.apache.nutch.protocol.RobotRulesParser
- org.apache.nutch.protocol.http.api.HttpRobotRulesParser
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable
public class HttpRobotRulesParser extends RobotRulesParser
This class is used for parsing robots.txt for URLs belonging to the HTTP protocol. It extends the generic RobotRulesParser
class and contains the HTTP protocol-specific implementation for obtaining the robots.txt file.
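Below is a minimal usage sketch. It assumes a standard Nutch setup: NutchConfiguration.create() to build a Hadoop Configuration with the Nutch defaults, and ProtocolFactory to obtain the HTTP Protocol plugin for the URL; the exact setup (plugin registration, agent name properties) may differ between Nutch versions.

```java
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolFactory;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

import crawlercommons.robots.BaseRobotRules;

public class RobotRulesExample {
  public static void main(String[] args) throws Exception {
    // Configuration with the Nutch defaults (http.agent.name etc. must be set appropriately).
    Configuration conf = NutchConfiguration.create();

    URL url = new URL("http://example.com/some/page.html");

    // Obtain the Protocol implementation handling this URL (assumes the HTTP plugin is enabled).
    Protocol http = new ProtocolFactory(conf).getProtocol(url.toString());

    // Parse (or read from cache) the robots.txt rules for the URL's protocol/host/port.
    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
    BaseRobotRules rules = parser.getRobotRulesSet(http, url);

    // BaseRobotRules comes from crawler-commons; isAllowed checks a concrete URL.
    System.out.println("Allowed: " + rules.isAllowed(url.toString()));
    System.out.println("Crawl delay (ms): " + rules.getCrawlDelay());
  }
}
```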
Field Summary

| Modifier and Type | Field and Description |
| --- | --- |
| protected boolean | allowForbidden |
| static org.slf4j.Logger | LOG |
-
Fields inherited from class org.apache.nutch.protocol.RobotRulesParser
agentNames, CACHE, EMPTY_RULES, FORBID_ALL_RULES
Constructor Summary

| Constructor and Description |
| --- |
| HttpRobotRulesParser(org.apache.hadoop.conf.Configuration conf) |
Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| protected static String | getCacheKey(URL url): Compose a unique key to store and access robot rules in the cache for a given URL |
| crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol http, URL url): Get the rules from robots.txt that apply to the given URL |
| void | setConf(org.apache.hadoop.conf.Configuration conf): Set the Configuration object |
-
Methods inherited from class org.apache.nutch.protocol.RobotRulesParser
getConf, getRobotRulesSet, main, parseRules
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
-
allowForbidden
protected boolean allowForbidden
Constructor Detail
-
HttpRobotRulesParser
public HttpRobotRulesParser(org.apache.hadoop.conf.Configuration conf)
Method Detail
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
Description copied from class: RobotRulesParser
Set the Configuration object
- Specified by:
- <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
- Overrides:
- <code>setConf</code> in class <code>RobotRulesParser</code>
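setConf is where the parser picks up its settings from the Hadoop Configuration; presumably the inherited agent names and this class's allowForbidden switch are read here. A hedged sketch follows; the property names http.agent.name and http.robots.403.allow are assumptions based on common Nutch configuration and are not stated on this page.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

public class SetConfExample {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();

    // Assumed property names (commonly set in nutch-site.xml), not confirmed by this page:
    conf.set("http.agent.name", "my-crawler");       // agent name matched against robots.txt rules
    conf.setBoolean("http.robots.403.allow", true);  // presumed switch behind the allowForbidden field

    // The constructor takes the Configuration; setConf can also be called explicitly,
    // e.g. when the object is managed through the Configurable interface.
    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
    parser.setConf(conf);
  }
}
```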
-
getCacheKey
protected static String getCacheKey(URL url)
Compose a unique key to store and access robot rules in the cache for the given URL.
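The key has to distinguish rules by protocol, host, and port (see getRobotRulesSet below). A rough illustration of how such a key can be composed from a java.net.URL; the exact string format used by Nutch is not specified here and may differ.

```java
import java.net.URL;

public class CacheKeyExample {
  // Illustration only: combine the URL parts that determine where robots.txt lives
  // (protocol, host, port). Not necessarily the exact format produced by getCacheKey.
  static String composeCacheKey(URL url) {
    int port = url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
    return url.getProtocol().toLowerCase() + ":" + url.getHost().toLowerCase() + ":" + port;
  }

  public static void main(String[] args) throws Exception {
    // Prints "https:example.com:443" (default port filled in when none is given).
    System.out.println(composeCacheKey(new URL("https://example.com/a/b.html")));
  }
}
```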
-
getRobotRulesSet
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url)
Get the rules from robots.txt that apply to the given url. Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch <code>protocol://host:port/robots.txt</code>. The robots.txt is then parsed and the rules are cached to avoid re-fetching and re-parsing it.
- Specified by:
- <code>getRobotRulesSet</code> in class <code>RobotRulesParser</code>
- Parameters:
- <code>http</code> - The [<code>Protocol</code>](../../../../../../org/apache/nutch/protocol/Protocol.html) object
- <code>url</code> - URL robots.txt applies to
- Returns:
- <code>BaseRobotRules</code> holding the rules from robots.txt
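The caching flow described above can be sketched roughly as follows. This is an illustration of the idea with a plain HashMap and HttpURLConnection, not the actual Nutch implementation (which uses the inherited CACHE and crawler-commons to parse the rules), and it omits error handling for unreachable or forbidden robots.txt files.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class RobotsCacheSketch {
  // Illustration only: raw robots.txt text cached per protocol:host:port key
  // (composed like the getCacheKey sketch above).
  private final Map<String, String> cache = new HashMap<>();

  String getRobots(URL url) throws Exception {
    int port = url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
    String key = url.getProtocol() + ":" + url.getHost() + ":" + port;

    String robots = cache.get(key);
    if (robots == null) {
      // Cache miss: fetch protocol://host:port/robots.txt once for this combination.
      URL robotsUrl = new URL(url.getProtocol(), url.getHost(), port, "/robots.txt");
      HttpURLConnection conn = (HttpURLConnection) robotsUrl.openConnection();
      try (InputStream in = conn.getInputStream()) {
        robots = new String(in.readAllBytes());
      }
      cache.put(key, robots);
    }
    return robots;
  }
}
```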