org.apache.nutch.protocol.ftp

Class Ftp

    • All Implemented Interfaces:
    • org.apache.hadoop.conf.Configurable, Pluggable, Protocol

public class Ftp
extends Object
implements Protocol

This class is a protocol plugin for the ftp: scheme. It creates an FtpResponse object and gets the content of the URL from it. Configurable parameters are ftp.username, ftp.password, ftp.content.limit, ftp.timeout, ftp.server.timeout, ftp.keep.connection and ftp.follow.talk. For details, see the "FTP properties" section in nutch-default.xml.
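The property names listed above can be illustrated with a small, library-free sketch. It is not how the Ftp class reads its configuration (that goes through the Hadoop Configuration object passed to setConf). The class name `FtpSettings`, its fields, and every fallback value are hypothetical and chosen only for illustration; consult nutch-default.xml for the actual defaults and units.

```java
import java.util.Properties;

// Hypothetical holder for the FTP property names documented above.
// Not part of Nutch; fallback values are illustrative assumptions only.
final class FtpSettings {
    final String username;
    final String password;
    final int contentLimit;
    final int timeout;
    final int serverTimeout;
    final boolean keepConnection;
    final boolean followTalk;

    FtpSettings(Properties props) {
        username = props.getProperty("ftp.username", "anonymous");
        password = props.getProperty("ftp.password", "anonymous@example.com");
        contentLimit = intOf(props, "ftp.content.limit", 65536);
        timeout = intOf(props, "ftp.timeout", 60000);
        serverTimeout = intOf(props, "ftp.server.timeout", 100000);
        keepConnection = boolOf(props, "ftp.keep.connection", false);
        followTalk = boolOf(props, "ftp.follow.talk", false);
    }

    // Parses an integer property, using the fallback when the key is absent.
    private static int intOf(Properties p, String key, int fallback) {
        String raw = p.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    // Parses a boolean property, using the fallback when the key is absent.
    private static boolean boolOf(Properties p, String key, boolean fallback) {
        String raw = p.getProperty(key);
        return raw == null ? fallback : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("ftp.timeout", "5000");
        props.setProperty("ftp.keep.connection", "true");

        FtpSettings settings = new FtpSettings(props);
        System.out.println("timeout=" + settings.timeout
                + " keepConnection=" + settings.keepConnection
                + " followTalk=" + settings.followTalk);
    }
}
```

The values parsed this way correspond to what the setters setTimeout, setMaxContentLength, setKeepConnection and setFollowTalk accept, so a caller could also apply them programmatically.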

Field Summary

| Modifier and Type | Field and Description |
|---|---|
| <code>static org.slf4j.Logger</code> | <code>LOG</code> |

Fields inherited from interface org.apache.nutch.protocol.Protocol

CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID

Constructor Summary

| Constructor and Description |
|---|
| <code>Ftp()</code> |

Method Summary

| Modifier and Type | Method and Description |
|---|---|
| <code>protected void</code> | <code>finalize()</code> |
| <code>int</code> | <code>getBufferSize()</code> |
| <code>org.apache.hadoop.conf.Configuration</code> | <code>getConf()</code> Get the Configuration object |
| <code>ProtocolOutput</code> | <code>getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum)</code> Creates a FtpResponse object corresponding to the url and returns a ProtocolOutput object as per the content received |
| <code>crawlercommons.robots.BaseRobotRules</code> | <code>getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum)</code> Get the robots rules for a given url |
| <code>static void</code> | <code>main(String[] args)</code> For debugging. |
| <code>void</code> | <code>setConf(org.apache.hadoop.conf.Configuration conf)</code> Set the Configuration object |
| <code>void</code> | <code>setFollowTalk(boolean followTalk)</code> Set followTalk |
| <code>void</code> | <code>setKeepConnection(boolean keepConnection)</code> Set keepConnection |
| <code>void</code> | <code>setMaxContentLength(int length)</code> Set the point at which content is truncated. |
| <code>void</code> | <code>setTimeout(int to)</code> Set the timeout. |

Methods inherited from class java.lang.Object

clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

LOG

public static final org.slf4j.Logger LOG

Constructor Detail

Ftp

public Ftp()

Method Detail

setTimeout

public void setTimeout(int to)

Set the timeout.

setMaxContentLength

public void setMaxContentLength(int length)

Set the point at which content is truncated.
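What truncating at a maximum content length means can be sketched as follows. This is a minimal illustration using only the Java standard library, not the plugin's actual implementation. The class `ContentLimit` and the method `readTruncated` are hypothetical names, and treating a negative limit as "no truncation" is an assumption made for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating content truncation; not part of Nutch.
final class ContentLimit {

    // Reads at most maxLength bytes from the stream.
    // Assumption for this sketch: a negative maxLength means "no limit".
    static byte[] readTruncated(InputStream in, int maxLength) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (maxLength < 0 || out.size() < maxLength) {
            // Never request more bytes than the remaining budget allows.
            int want = maxLength < 0
                    ? buf.length
                    : Math.min(buf.length, maxLength - out.size());
            int n = in.read(buf, 0, want);
            if (n < 0) {
                break; // end of stream reached before the limit
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
        byte[] cut = readTruncated(new ByteArrayInputStream(data), 5);
        System.out.println(new String(cut, StandardCharsets.UTF_8));
    }
}
```

Capping each read at the remaining budget keeps the result from overshooting the limit, even when the underlying stream would return a full buffer.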

setFollowTalk

public void setFollowTalk(boolean followTalk)

Set followTalk

setKeepConnection

public void setKeepConnection(boolean keepConnection)

Set keepConnection

getProtocolOutput

public ProtocolOutput getProtocolOutput(org.apache.hadoop.io.Text url,
                               CrawlDatum datum)

Creates an FtpResponse object for the URL and returns a ProtocolOutput object built from the content received.

  - Specified by: 
  - <code>getProtocolOutput</code> in interface <code>Protocol</code> 
  - Parameters:
  - <code>url</code> - Text containing the ftp url
  - <code>datum</code> - The CrawlDatum object corresponding to the url 
  - Returns:
  - [<code>ProtocolOutput</code>](../../../../../org/apache/nutch/protocol/ProtocolOutput.html) object for the url       

finalize

protected void finalize()
  - Overrides: 
  - <code>finalize</code> in class <code>Object</code>        

main

public static void main(String[] args)
                 throws Exception

For debugging.

  - Throws: 
  - <code>Exception</code>       

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

Set the Configuration object

  - Specified by: 
  - <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>        

getConf

public org.apache.hadoop.conf.Configuration getConf()

Get the Configuration object

  - Specified by: 
  - <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>        

getRobotRules

public crawlercommons.robots.BaseRobotRules getRobotRules(org.apache.hadoop.io.Text url,
                                                 CrawlDatum datum)

Get the robots rules for a given url

  - Specified by: 
  - <code>getRobotRules</code> in interface <code>Protocol</code> 
  - Parameters:
  - <code>url</code> - url to check
  - <code>datum</code> - page datum 
  - Returns:
  - robot rules (specific for this url or default), never null       

getBufferSize

public int getBufferSize()

Copyright © 2014 The Apache Software Foundation