org.apache.nutch.protocol.file
Class File
- java.lang.Object
- org.apache.nutch.protocol.file.File
public class File extends Object implements Protocol
This class is a protocol plugin for the file: scheme. It creates a FileResponse
object and gets the content of the url from it. The configurable parameters
file.content.limit and file.crawl.parent are defined in nutch-default.xml
under the "file properties" section.
- Author:
- John Xing
Field Summary
Fields

| Modifier and Type | Field and Description |
| --- | --- |
| static org.slf4j.Logger | LOG |
-
Fields inherited from interface org.apache.nutch.protocol.Protocol
CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID
Constructor Summary
Constructors

| Constructor and Description |
| --- |
| File() |
Method Summary
Methods

| Modifier and Type | Method | Description |
| --- | --- | --- |
| org.apache.hadoop.conf.Configuration | getConf() | Get the Configuration object. |
| ProtocolOutput | getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum) | Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object for the content received. |
| crawlercommons.robots.BaseRobotRules | getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum) | No robots parsing is done for the file protocol. |
| static void | main(String[] args) | Quick way to run this class. |
| void | setConf(org.apache.hadoop.conf.Configuration conf) | Set the Configuration object. |
| void | setMaxContentLength(int maxContentLength) | Set the length after which content is truncated. |
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
Constructor Detail
-
File
public File()
Method Detail
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object.
- Specified by:
- <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
-
getConf
public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object.
- Specified by:
- <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
-
setMaxContentLength
public void setMaxContentLength(int maxContentLength)
Set the length after which content is truncated.
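The effect of a content limit can be pictured with a minimal sketch. This is not Nutch's implementation: `ContentLimitSketch` and `truncate` are hypothetical names, and treating a negative limit as "no truncation" is an assumption made for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in, not part of Nutch: shows what "truncate content
// after maxContentLength bytes" could look like.
final class ContentLimitSketch {

    // Returns at most maxContentLength bytes of content.
    // Assumption: a negative limit disables truncation.
    static byte[] truncate(byte[] content, int maxContentLength) {
        if (maxContentLength < 0 || content.length <= maxContentLength) {
            return content; // nothing to cut
        }
        return Arrays.copyOf(content, maxContentLength);
    }

    public static void main(String[] args) {
        byte[] data = "hello, file protocol".getBytes(StandardCharsets.UTF_8);
        System.out.println("limit 5  -> " + truncate(data, 5).length + " bytes");
        System.out.println("limit -1 -> " + truncate(data, -1).length + " bytes");
    }
}
```

In the real plugin the limit would be supplied through `setMaxContentLength(int)` or the file.content.limit property rather than passed per call.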
-
getProtocolOutput
public ProtocolOutput getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum)
Creates a FileResponse object corresponding to the url and returns a
ProtocolOutput object for the content received.
- Specified by:
- <code>getProtocolOutput</code> in interface <code>Protocol</code>
- Parameters:
- <code>url</code> - Text containing the url
- <code>datum</code> - The CrawlDatum object corresponding to the url
- Returns:
- [<code>ProtocolOutput</code>](../../../../../org/apache/nutch/protocol/ProtocolOutput.html) object for the content of the file indicated by url
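To make the idea of "getting the content of a file: url" concrete, here is a rough sketch that uses only the JDK. It is not how FileResponse is implemented; `FileUrlFetchSketch`, `fetch`, and `roundTrip` are hypothetical names, and the negative-limit convention is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical stand-in, not part of Nutch: resolves a file: url to a local
// path and returns its bytes, cut at an optional content limit.
final class FileUrlFetchSketch {

    static byte[] fetch(String url, int maxContentLength) throws IOException {
        URI uri = URI.create(url);
        if (!"file".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("not a file: url: " + url);
        }
        // Reads the whole file for simplicity; a real fetcher would likely
        // stop reading once the limit is reached.
        byte[] content = Files.readAllBytes(Paths.get(uri));
        if (maxContentLength >= 0 && content.length > maxContentLength) {
            return Arrays.copyOf(content, maxContentLength);
        }
        return content;
    }

    // Writes text to a temporary file, fetches it back through its file: url,
    // and deletes the file again.
    static String roundTrip(String text, int maxContentLength) {
        try {
            Path tmp = Files.createTempFile("file-url-sketch", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                byte[] fetched = fetch(tmp.toUri().toString(), maxContentLength);
                return new String(fetched, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("abcdef", 3));
        System.out.println(roundTrip("abcdef", -1));
    }
}
```

The actual method additionally wraps the result in a ProtocolOutput together with a protocol status, which this sketch leaves out.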
-
main
public static void main(String[] args) throws Exception
Quick way to run this class. Useful for debugging.
- Throws:
- <code>Exception</code>
-
getRobotRules
public crawlercommons.robots.BaseRobotRules getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum)
No robots parsing is done for the file protocol, so this returns an empty rule set that allows every url.
- Specified by:
- <code>getRobotRules</code> in interface <code>Protocol</code>
- Parameters:
- <code>url</code> - url to check
- <code>datum</code> - page datum
- Returns:
- robot rules (specific for this url or default), never null
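The "empty rules allow every url" behaviour can be illustrated with a small stand-in. The `RobotRules` interface below is a hypothetical simplification written for this sketch, not the crawler-commons `BaseRobotRules` class, and `rulesFor` is likewise an invented name.

```java
// Hypothetical stand-in, not part of Nutch or crawler-commons: a rule set
// with no entries that therefore permits every url.
final class AllowAllRulesSketch {

    // Simplified rules contract assumed for this sketch.
    interface RobotRules {
        boolean isAllowed(String url);
    }

    // A single shared instance is enough because the rules carry no state.
    static final RobotRules EMPTY_RULES = url -> true;

    // Mirrors the documented contract: never returns null, regardless of url.
    static RobotRules rulesFor(String url) {
        return EMPTY_RULES;
    }

    public static void main(String[] args) {
        String url = "file:///tmp/example.txt";
        System.out.println(rulesFor(url).isAllowed(url));
    }
}
```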