org.apache.nutch.protocol
Class RobotRulesParser
- java.lang.Object
- org.apache.nutch.protocol.RobotRulesParser
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable
- Direct Known Subclasses:
- FtpRobotRulesParser, HttpRobotRulesParser
public abstract class RobotRulesParser extends Object implements org.apache.hadoop.conf.Configurable
This class uses crawler-commons to parse robots.txt files. It emits SimpleRobotRules objects, which describe download permissions as documented in SimpleRobotRulesParser.
Field Summary
Fields (Modifier and Type, Field and Description)
protected String agentNames
protected static Hashtable<String,crawlercommons.robots.BaseRobotRules> CACHE
static crawlercommons.robots.BaseRobotRules EMPTY_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
static org.slf4j.Logger LOG
Constructor Summary
Constructors (Constructor and Description)
RobotRulesParser()
RobotRulesParser(org.apache.hadoop.conf.Configuration conf)
Method Summary
Methods (Modifier and Type, Method and Description)
org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object
crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, org.apache.hadoop.io.Text url)
abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url)
static void main(String[] argv)
command-line main for testing
crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
Parses the robots content using the SimpleRobotRulesParser from crawler commons
void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
-
CACHE
protected static final Hashtable<String,crawlercommons.robots.BaseRobotRules> CACHE
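The declared type suggests a synchronized map from a string key to a parsed rule set. The page does not say how the key is built, so the following is only a minimal sketch: the key format (scheme plus host), the class RobotsCacheSketch, and the use of String in place of BaseRobotRules are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.Hashtable;
import java.util.Locale;

/** Hypothetical sketch of a rules cache; not the Nutch implementation. */
final class RobotsCacheSketch {
    // Same shape as the documented CACHE field, with String standing in for BaseRobotRules.
    private static final Hashtable<String, String> CACHE = new Hashtable<>();

    // Counts how often rules had to be built because the cache had no entry.
    static int parseCount = 0;

    // Assumed key format: lowercased scheme and host, so all pages of one site share an entry.
    static String cacheKey(String url) {
        URI uri = URI.create(url);
        return (uri.getScheme() + ":" + uri.getHost()).toLowerCase(Locale.ROOT);
    }

    // Returns the cached rules for the URL's site, building a placeholder entry on a miss.
    static String rulesFor(String url) {
        String key = cacheKey(url);
        String rules = CACHE.get(key);
        if (rules == null) {
            parseCount++;
            rules = "rules for " + key; // placeholder for fetching and parsing robots.txt
            CACHE.put(key, rules);
        }
        return rules;
    }

    public static void main(String[] args) {
        rulesFor("http://example.com/a");
        rulesFor("http://example.com/b/c"); // same site, served from the cache
        System.out.println(parseCount);
    }
}
```

Keying by site rather than by full URL means robots.txt is fetched and parsed once per site, which is the usual reason for such a cache.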
-
EMPTY_RULES
public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
-
FORBID_ALL_RULES
public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
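The two constants EMPTY_RULES and FORBID_ALL_RULES act as fallbacks when no robots.txt content is available to parse. A minimal sketch of how a caller might choose between them follows. The class FallbackRulesSketch and the method fallbackFor are hypothetical stand-ins, and treating HTTP 404 as "missing" is an assumption; the page itself only names the 403/Forbidden case explicitly.

```java
import java.net.HttpURLConnection;
import java.util.Optional;

/** Hypothetical stand-in for crawlercommons.robots.BaseRobotRules; not the real class. */
final class FallbackRulesSketch {
    // Mirrors the documented meaning of EMPTY_RULES: all requests are allowed.
    static final FallbackRulesSketch EMPTY_RULES = new FallbackRulesSketch(true);
    // Mirrors the documented meaning of FORBID_ALL_RULES: all requests are disallowed.
    static final FallbackRulesSketch FORBID_ALL_RULES = new FallbackRulesSketch(false);

    private final boolean allowAll;

    private FallbackRulesSketch(boolean allowAll) {
        this.allowAll = allowAll;
    }

    boolean isAllowed(String url) {
        return allowAll;
    }

    // Picks a fallback from the HTTP status of the robots.txt fetch.
    // An empty result means the response body should be parsed instead.
    static Optional<FallbackRulesSketch> fallbackFor(int httpStatus) {
        switch (httpStatus) {
            case HttpURLConnection.HTTP_FORBIDDEN: // 403: disallow everything
                return Optional.of(FORBID_ALL_RULES);
            case HttpURLConnection.HTTP_NOT_FOUND: // 404: assumed to mean "missing"
                return Optional.of(EMPTY_RULES);
            default:
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(fallbackFor(403).get().isAllowed("http://example.com/page")); // false
        System.out.println(fallbackFor(404).get().isAllowed("http://example.com/page")); // true
    }
}
```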
-
agentNames
protected String agentNames
Constructor Detail
-
RobotRulesParser
public RobotRulesParser()
-
RobotRulesParser
public RobotRulesParser(org.apache.hadoop.conf.Configuration conf)
Method Detail
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object
- Specified by:
- <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
-
getConf
public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object
- Specified by:
- <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
-
parseRules
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
Parses the robots content using the SimpleRobotRulesParser from crawler commons
- Parameters:
- <code>url</code> - The URL, as a string
- <code>content</code> - Contents of the robots file as a byte array
- <code>contentType</code> - The content type of the robots file
- <code>robotName</code> - A string containing all the robot agent names that the parser uses for matching
- Returns:
- BaseRobotRules object
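To illustrate how the content and robotName parameters interact, here is a deliberately simplified, self-contained sketch. It is not the crawler-commons algorithm: it handles only User-agent and Disallow lines, matches agent names exactly, and ignores Allow, wildcards, and Crawl-delay. The class ParseRulesSketch, its methods, and the comma-separated format of the agent names are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified illustration of robots.txt matching; not the crawler-commons parser. */
final class ParseRulesSketch {

    // Returns the Disallow prefixes that apply to any of the given agent names.
    // A group naming one of the agents takes precedence over the "*" group.
    static List<String> disallowedPrefixes(byte[] content, String robotNames) {
        List<String> names = new ArrayList<>();
        for (String name : robotNames.split(",")) { // assumed comma-separated
            String trimmed = name.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) names.add(trimmed);
        }
        List<String> specific = new ArrayList<>();
        List<String> wildcard = new ArrayList<>();
        boolean sawSpecific = false, inSpecific = false, inWildcard = false;
        boolean lastWasAgent = false;
        String text = new String(content, StandardCharsets.UTF_8);
        for (String raw : text.split("\\R")) {
            String line = raw.replaceFirst("#.*$", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) continue; // blank or malformed line
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) { // a new group starts here
                    inSpecific = false;
                    inWildcard = false;
                }
                String agent = value.toLowerCase(Locale.ROOT);
                if (agent.equals("*")) {
                    inWildcard = true;
                } else if (names.contains(agent)) {
                    inSpecific = true;
                    sawSpecific = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && !value.isEmpty()) {
                    if (inSpecific) specific.add(value);
                    if (inWildcard) wildcard.add(value);
                }
            }
        }
        return sawSpecific ? specific : wildcard;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean isAllowed(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] content = ("User-agent: *\nDisallow: /private/\n\n"
                + "User-agent: mybot\nDisallow: /tmp/\n").getBytes(StandardCharsets.UTF_8);
        List<String> rules = disallowedPrefixes(content, "mybot,otherbot");
        System.out.println(isAllowed(rules, "/tmp/file"));     // false
        System.out.println(isAllowed(rules, "/private/file")); // true
    }
}
```

In this sketch a crawler that is named in its own group follows only that group, which is why /private/ stays reachable for mybot while an unnamed crawler would fall back to the "*" group.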
-
getRobotRulesSet
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, org.apache.hadoop.io.Text url)
-
getRobotRulesSet
public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url)
-
main
public static void main(String[] argv)
command-line main for testing
