org.apache.nutch.protocol
Class RobotRulesParser
- java.lang.Object
- org.apache.nutch.protocol.RobotRulesParser
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable
- Direct Known Subclasses:
- FtpRobotRulesParser, HttpRobotRulesParser
public abstract class RobotRulesParser extends Object implements org.apache.hadoop.conf.Configurable
This class uses crawler-commons to parse robots.txt files. It emits SimpleRobotRules objects, which describe download permissions as documented in SimpleRobotRulesParser.
Field Summary
Fields (modifier and type, field, description)
protected String agentNames
protected static Hashtable CACHE
static crawlercommons.robots.BaseRobotRules EMPTY_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
static org.slf4j.Logger LOG
Constructor Summary
Constructors
RobotRulesParser()
RobotRulesParser(org.apache.hadoop.conf.Configuration conf)
Method Summary
Methods (modifier and type, method, description)
org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object.
crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, org.apache.hadoop.io.Text url)
abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url)
static void main(String[] argv)
Command-line main for testing.
crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
Parses the robots content using the SimpleRobotRulesParser from crawler-commons.
void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object.
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
-
CACHE
protected static final Hashtable<String,crawlercommons.robots.BaseRobotRules> CACHE
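The CACHE field maps a String key to a rules object so that robots.txt does not have to be fetched again for every URL. The following is a minimal sketch of that idea, not the Nutch implementation: the class name, the protocol:host:port key format, the default-port table, and the use of String as a stand-in for BaseRobotRules are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.Hashtable;
import java.util.Locale;
import java.util.function.Function;

// Sketch only: String stands in for crawlercommons.robots.BaseRobotRules,
// and the key format is an assumption, not taken from the Nutch source.
class RobotsCacheSketch {

  // Hashtable methods are synchronized, so threads can share the table.
  // Hashtable rejects null keys and values, so the fetcher must not return null.
  static final Hashtable<String, String> CACHE = new Hashtable<>();

  /** Hypothetical default ports for the schemes this sketch knows about. */
  static int defaultPort(String scheme) {
    switch (scheme) {
      case "http": return 80;
      case "https": return 443;
      case "ftp": return 21;
      default: return -1;
    }
  }

  /** Builds a hypothetical key of the form protocol:host:port from an absolute URL. */
  static String cacheKey(String url) {
    URI uri = URI.create(url);
    String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
    String host = uri.getHost().toLowerCase(Locale.ROOT);
    int port = uri.getPort() != -1 ? uri.getPort() : defaultPort(scheme);
    return scheme + ":" + host + ":" + port;
  }

  /** Returns the cached rules for the URL's host, calling the fetcher only on a miss. */
  static String rulesFor(String url, Function<String, String> fetcher) {
    String key = cacheKey(url);
    String rules = CACHE.get(key);
    if (rules == null) {
      rules = fetcher.apply(url);
      CACHE.put(key, rules);
    }
    return rules;
  }
}
```

Keying on the host rather than the full URL means two pages on the same site share one cache entry, which is the point of caching robots rules at all.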
-
EMPTY_RULES
public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
-
FORBID_ALL_RULES
public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
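The two constants cover the cases where there is nothing to parse. As a rough sketch of how a caller might choose between them, the example below maps a fetch outcome to one of three decisions. The class, enum, and method names are hypothetical, and the enum values are stand-ins for the real BaseRobotRules constants; only the two rules described above (403 forbids everything, empty or missing content allows everything) come from this page.

```java
// Sketch only: the enum is a hypothetical stand-in for the two
// BaseRobotRules constants plus the "go parse it" case.
class RobotsFallbackSketch {

  enum Decision { EMPTY_RULES, FORBID_ALL_RULES, PARSE_CONTENT }

  /**
   * Picks a fallback from the fetch outcome.
   * A null or zero-length body is treated as an empty or missing robots.txt.
   */
  static Decision decide(int httpStatus, byte[] content) {
    if (httpStatus == 403) {
      return Decision.FORBID_ALL_RULES; // 403/Forbidden: disallow all requests
    }
    if (content == null || content.length == 0) {
      return Decision.EMPTY_RULES;      // empty or missing: allow all requests
    }
    return Decision.PARSE_CONTENT;      // there is content to hand to the parser
  }
}
```

How other status codes (redirects, server errors) should be handled is outside what this page documents, so the sketch leaves them to the caller.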
-
agentNames
protected String agentNames
Constructor Detail
-
RobotRulesParser
public RobotRulesParser()
-
RobotRulesParser
public RobotRulesParser(org.apache.hadoop.conf.Configuration conf)
Method Detail
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object.
- Specified by:
- setConf in interface org.apache.hadoop.conf.Configurable
-
getConf
public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object.
- Specified by:
- getConf in interface org.apache.hadoop.conf.Configurable
-
parseRules
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
Parses the robots content using the SimpleRobotRulesParser from crawler-commons.
- Parameters:
- url - The URL, as a string
- content - Contents of the robots file as a byte array
- contentType - The content type of the robots file
- robotName - A string containing all the robot agent names the parser uses for matching
- Returns:
- A BaseRobotRules object
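To make the roles of content and robotName concrete, here is a deliberately simplified stand-in for a robots.txt parser. It is not the crawler-commons SimpleRobotRulesParser and does not reproduce its behavior: it assumes robotName is a comma-separated list, matches agent names by exact case-insensitive comparison, understands only User-agent and Disallow lines, and ignores the url and contentType arguments. All names in it are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: a toy parser for illustration, not the crawler-commons one.
class MiniRobotsSketch {

  /**
   * Returns the Disallow prefixes that apply to the given agent names.
   * Groups naming one of the agents win over the "*" group.
   */
  static List<String> disallowedPrefixes(byte[] content, String robotNames) {
    String[] names = robotNames.trim().toLowerCase(Locale.ROOT).split("\\s*,\\s*");
    List<String> named = new ArrayList<>();
    List<String> wildcard = new ArrayList<>();
    boolean namedSeen = false;
    boolean inAgentLines = false; // true while reading consecutive User-agent lines
    List<String> current = null;  // list receiving rules of the current group, if any

    for (String raw : new String(content, StandardCharsets.UTF_8).split("\\r?\\n")) {
      String line = raw.replaceFirst("#.*", "").trim(); // drop comments
      int colon = line.indexOf(':');
      if (colon < 0) continue; // blank or malformed line
      String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      String value = line.substring(colon + 1).trim();

      if (field.equals("user-agent")) {
        if (!inAgentLines) current = null; // a new group starts here
        inAgentLines = true;
        String agent = value.toLowerCase(Locale.ROOT);
        for (String name : names) {
          if (!name.isEmpty() && agent.equals(name)) {
            current = named;
            namedSeen = true;
          }
        }
        if (agent.equals("*") && current == null) current = wildcard;
      } else {
        inAgentLines = false;
        if (field.equals("disallow") && current != null && !value.isEmpty()) {
          current.add(value);
        }
      }
    }
    return namedSeen ? named : wildcard;
  }

  /** A path is allowed unless it starts with one of the disallowed prefixes. */
  static boolean isAllowed(List<String> prefixes, String path) {
    for (String prefix : prefixes) {
      if (path.startsWith(prefix)) return false;
    }
    return true;
  }
}
```

Given a file with a `*` group disallowing `/private/` and a `nutch` group disallowing `/tmp/`, calling `disallowedPrefixes(bytes, "nutch, mybot")` in this sketch yields only the `nutch` group's prefix, while an unlisted agent falls back to the `*` group.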
-
getRobotRulesSet
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, org.apache.hadoop.io.Text url)
-
getRobotRulesSet
public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url)
-
main
public static void main(String[] argv)
Command-line main for testing.