org.apache.nutch.protocol.ftp
Class Ftp
- java.lang.Object
- org.apache.nutch.protocol.ftp.Ftp
public class Ftp extends Object implements Protocol
This class is a protocol plugin for the ftp: scheme. It creates an FtpResponse object and gets the content of the URL from it. Configurable parameters are ftp.username, ftp.password, ftp.content.limit, ftp.timeout, ftp.server.timeout, ftp.keep.connection and ftp.follow.talk. For details, see the "FTP properties" section in nutch-default.xml.
Field Summary
Fields

| Modifier and Type | Field and Description |
| --- | --- |
| static org.slf4j.Logger | LOG |

Fields inherited from interface org.apache.nutch.protocol.Protocol
CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID
Constructor Summary
Constructors

| Constructor and Description |
| --- |
| Ftp() |

Method Summary
Methods

| Modifier and Type | Method and Description |
| --- | --- |
| protected void | finalize() |
| int | getBufferSize() |
| org.apache.hadoop.conf.Configuration | getConf(): Get the Configuration object. |
| ProtocolOutput | getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum): Creates an FtpResponse object corresponding to the URL and returns a ProtocolOutput object for the content received. |
| crawlercommons.robots.BaseRobotRules | getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum): Get the robots rules for a given URL. |
| static void | main(String[] args): For debugging. |
| void | setConf(org.apache.hadoop.conf.Configuration conf): Set the Configuration object. |
| void | setFollowTalk(boolean followTalk): Set followTalk. |
| void | setKeepConnection(boolean keepConnection): Set keepConnection. |
| void | setMaxContentLength(int length): Set the point at which content is truncated. |
| void | setTimeout(int to): Set the timeout. |

Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
Constructor Detail
-
Ftp
public Ftp()
Method Detail
-
setTimeout
public void setTimeout(int to)
Set the timeout.
-
setMaxContentLength
public void setMaxContentLength(int length)
Set the point at which content is truncated.
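To illustrate what truncating content at a limit can look like, here is a minimal, self-contained sketch. It is not the plugin's actual implementation: the class name TruncateSketch, the helper readTruncated, and the convention that a negative limit disables truncation are assumptions made for this example only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: shows one way a content limit
// could be applied while reading a stream in fixed-size chunks.
class TruncateSketch {

    /**
     * Reads from the stream and keeps at most maxContentLength bytes.
     * Assumption for this sketch: a negative limit means "do not truncate".
     */
    static byte[] readTruncated(InputStream in, int maxContentLength, int bufferSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[bufferSize];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                if (maxContentLength >= 0 && out.size() + n > maxContentLength) {
                    // Copy only the bytes that still fit, then stop reading.
                    out.write(buf, 0, maxContentLength - out.size());
                    break;
                }
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "0123456789".getBytes(StandardCharsets.US_ASCII);
        byte[] cut = readTruncated(new ByteArrayInputStream(data), 4, 3);
        System.out.println(new String(cut, StandardCharsets.US_ASCII)); // prints "0123"
    }
}
```

In this sketch the limit plays the role of the value passed to setMaxContentLength, and the chunk size plays the role of a read buffer such as the one whose size getBufferSize() reports; how the real class combines the two is not stated on this page.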
-
setFollowTalk
public void setFollowTalk(boolean followTalk)
Set followTalk
-
setKeepConnection
public void setKeepConnection(boolean keepConnection)
Set keepConnection
-
getProtocolOutput
public ProtocolOutput getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum)
Creates an FtpResponse object corresponding to the URL and returns a ProtocolOutput object for the content received.
- Specified by:
- <code>getProtocolOutput</code> in interface <code>Protocol</code>
- Parameters:
- <code>url</code> - Text containing the ftp url
- <code>datum</code> - The CrawlDatum object corresponding to the url
- Returns:
- <code>ProtocolOutput</code> object for the url
-
finalize
protected void finalize()
- Overrides:
- <code>finalize</code> in class <code>Object</code>
-
main
public static void main(String[] args) throws Exception
For debugging.
- Throws:
- <code>Exception</code>
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object.
- Specified by:
- <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
-
getConf
public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object.
- Specified by:
- <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>
-
getRobotRules
public crawlercommons.robots.BaseRobotRules getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum)
Get the robots rules for a given URL.
- Specified by:
- <code>getRobotRules</code> in interface <code>Protocol</code>
- Parameters:
- <code>url</code> - url to check
- <code>datum</code> - page datum
- Returns:
- robot rules (specific for this url or default), never null
-
getBufferSize
public int getBufferSize()