
org.apache.nutch.protocol

Class RobotRulesParser


public abstract class RobotRulesParser
extends Object
implements org.apache.hadoop.conf.Configurable

This class uses crawler-commons to parse robots.txt files. It emits SimpleRobotRules objects, which describe download permissions as documented in SimpleRobotRulesParser.

Field Summary

| Modifier and Type | Field | Description |
|---|---|---|
| <code>protected String</code> | <code>agentNames</code> | |
| <code>protected static Hashtable</code> | <code>CACHE</code> | |
| <code>static crawlercommons.robots.BaseRobotRules</code> | <code>EMPTY_RULES</code> | A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed. |
| <code>static crawlercommons.robots.BaseRobotRules</code> | <code>FORBID_ALL_RULES</code> | A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed. |
| <code>static org.slf4j.Logger</code> | <code>LOG</code> | |

Constructor Summary

  - <code>RobotRulesParser()</code>
  - <code>RobotRulesParser(org.apache.hadoop.conf.Configuration conf)</code>

Method Summary

| Modifier and Type | Method | Description |
|---|---|---|
| <code>org.apache.hadoop.conf.Configuration</code> | <code>getConf()</code> | Get the Configuration object |
| <code>crawlercommons.robots.BaseRobotRules</code> | <code>getRobotRulesSet(Protocol protocol, org.apache.hadoop.io.Text url)</code> | |
| <code>abstract crawlercommons.robots.BaseRobotRules</code> | <code>getRobotRulesSet(Protocol protocol, URL url)</code> | |
| <code>static void</code> | <code>main(String[] argv)</code> | Command-line main for testing |
| <code>crawlercommons.robots.BaseRobotRules</code> | <code>parseRules(String url, byte[] content, String contentType, String robotName)</code> | Parses the robots content using the SimpleRobotRulesParser from crawler commons |
| <code>void</code> | <code>setConf(org.apache.hadoop.conf.Configuration conf)</code> | Set the Configuration object |

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

LOG

public static final org.slf4j.Logger LOG

CACHE

protected static final Hashtable<String,crawlercommons.robots.BaseRobotRules> CACHE
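
The declaration shows only that parsed rules are cached in a static table keyed by a string. As a rough illustration of how such a cache could avoid repeated robots.txt fetches, here is a minimal sketch. The class `RobotsCacheSketch`, the methods `cacheKey` and `rulesFor`, and the key layout (scheme, host, optional port) are all assumptions made for this example, and `String` values stand in for `BaseRobotRules` so the code needs only the standard library.

```java
import java.net.URI;
import java.util.Hashtable;
import java.util.Locale;

final class RobotsCacheSketch {
    // Stand-in for CACHE; String values replace BaseRobotRules in this sketch.
    static final Hashtable<String, String> CACHE = new Hashtable<>();

    // Counts simulated fetches so the effect of caching is observable.
    static int fetchCount = 0;

    /** Hypothetical key: scheme and lower-cased host, plus the port when one is given. */
    static String cacheKey(String url) {
        URI uri = URI.create(url);
        String key = uri.getScheme() + ":" + uri.getHost().toLowerCase(Locale.ROOT);
        return uri.getPort() == -1 ? key : key + ":" + uri.getPort();
    }

    /** Returns cached rules for the URL's host, "fetching" them only on a cache miss. */
    static String rulesFor(String url) {
        String key = cacheKey(url);
        String rules = CACHE.get(key);
        if (rules == null) {
            fetchCount++;               // a real parser would fetch and parse robots.txt here
            rules = "rules for " + key; // placeholder value
            CACHE.put(key, rules);
        }
        return rules;
    }

    public static void main(String[] args) {
        rulesFor("http://Example.com/a.html");
        rulesFor("http://example.com/b/c.html");   // same key, served from the cache
        rulesFor("http://example.com:8080/a.html"); // different port, new entry
        System.out.println("entries=" + CACHE.size() + " fetches=" + fetchCount);
    }
}
```

Because `Hashtable` synchronizes individual calls, the table itself is safe to share between threads, although the check-then-put sequence above can still fetch twice under contention.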

EMPTY_RULES

public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES

A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.


FORBID_ALL_RULES

public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES

A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
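
EMPTY_RULES and FORBID_ALL_RULES cover the cases where no robots.txt content is parsed. The sketch below shows one way a caller might choose between them from an HTTP status code. `RobotsStatusSketch`, `Rules`, and `rulesForStatus` are hypothetical stand-ins, not Nutch or crawler-commons API; treating 404 as "missing" and leaving every other code undecided are assumptions of this example.

```java
import java.util.Optional;

final class RobotsStatusSketch {
    /** Hypothetical stand-in for BaseRobotRules: records only allow-all vs. allow-none. */
    enum Rules { ALLOW_ALL, ALLOW_NONE }

    // Stand-ins for the two constants documented above.
    static final Rules EMPTY_RULES = Rules.ALLOW_ALL;
    static final Rules FORBID_ALL_RULES = Rules.ALLOW_NONE;

    /**
     * Returns a fixed rule set when the status code alone decides the outcome.
     * Returns empty when the body still has to be parsed (for example 200)
     * or when this sketch defines no policy for the code.
     */
    static Optional<Rules> rulesForStatus(int httpStatus) {
        if (httpStatus == 403) {
            return Optional.of(FORBID_ALL_RULES); // robots.txt forbidden: disallow everything
        }
        if (httpStatus == 404) {
            return Optional.of(EMPTY_RULES);      // robots.txt missing: allow everything
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(rulesForStatus(403)); // prints Optional[ALLOW_NONE]
        System.out.println(rulesForStatus(404)); // prints Optional[ALLOW_ALL]
        System.out.println(rulesForStatus(200)); // prints Optional.empty
    }
}
```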


agentNames

protected String agentNames

Constructor Detail

RobotRulesParser

public RobotRulesParser()

RobotRulesParser

public RobotRulesParser(org.apache.hadoop.conf.Configuration conf)

Method Detail

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

Set the Configuration object

  - Specified by: <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>

getConf

public org.apache.hadoop.conf.Configuration getConf()

Get the Configuration object

  - Specified by: <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>

parseRules

public crawlercommons.robots.BaseRobotRules parseRules(String url,
                                              byte[] content,
                                              String contentType,
                                              String robotName)

Parses the robots.txt content using SimpleRobotRulesParser from crawler-commons.

  - Parameters:
    - <code>url</code> - The URL of the robots file, as a string
    - <code>content</code> - The contents of the robots file, as a byte array
    - <code>contentType</code> - The content type of the robots file
    - <code>robotName</code> - A string containing all the robot agent names that the parser uses for matching
  - Returns: a BaseRobotRules object

getRobotRulesSet

public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol,
                                                    org.apache.hadoop.io.Text url)

getRobotRulesSet

public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol,
                                                    URL url)
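
The two getRobotRulesSet overloads differ only in how the URL is passed, and only the URL variant is abstract. This page does not say how the Text variant behaves; one plausible arrangement is that it parses its argument and delegates to the abstract method. The sketch below illustrates that pattern under stated assumptions: `RulesLookupSketch` and `HostEchoLookup` are hypothetical, `String` replaces both `Text` and `BaseRobotRules`, the `Protocol` argument is omitted, and the fallback to allow-all rules for an unparsable URL is a guess rather than documented behavior.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

abstract class RulesLookupSketch {
    // Stand-in for EMPTY_RULES (all requests allowed).
    static final String EMPTY_RULES = "allow-all";

    /** Subclasses supply the protocol-specific lookup, as with the abstract overload. */
    abstract String getRobotRulesSet(URL url);

    /** Convenience overload: parse the string, then delegate to the URL variant. */
    String getRobotRulesSet(String url) {
        URL parsed;
        try {
            parsed = URI.create(url).toURL();
        } catch (IllegalArgumentException | MalformedURLException e) {
            return EMPTY_RULES; // assumed fallback for an unparsable URL
        }
        return getRobotRulesSet(parsed);
    }

    public static void main(String[] args) {
        RulesLookupSketch lookup = new HostEchoLookup();
        System.out.println(lookup.getRobotRulesSet("http://example.com/page.html")); // prints rules for example.com
        System.out.println(lookup.getRobotRulesSet("not a url"));                    // prints allow-all
    }
}

/** Hypothetical concrete subclass that only echoes the host it was asked about. */
final class HostEchoLookup extends RulesLookupSketch {
    @Override
    String getRobotRulesSet(URL url) {
        return "rules for " + url.getHost();
    }
}
```

Keeping the parsing in the base class means each protocol plugin implements only the URL-based lookup.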

main

public static void main(String[] argv)

Command-line main method for testing.

Copyright © 2014 The Apache Software Foundation