
org.apache.nutch.protocol.http.api

Class HttpBase


public abstract class HttpBase
extends Object
implements Protocol

Field Summary

| Modifier and Type | Field | Description |
| --- | --- | --- |
| `protected String` | `accept` | The "Accept" request header value. |
| `protected String` | `acceptLanguage` | The "Accept-Language" request header value. |
| `static int` | `BUFFER_SIZE` | |
| `protected int` | `maxContent` | The length limit for downloaded content, in bytes. |
| `protected long` | `maxCrawlDelay` | Skip page if Crawl-Delay longer than this value. |
| `protected String` | `proxyHost` | The proxy hostname. |
| `protected int` | `proxyPort` | The proxy port. |
| `static org.apache.hadoop.io.Text` | `RESPONSE_TIME` | |
| `protected boolean` | `responseTime` | Record response time in CrawlDatum's meta data, see property http.store.responsetime. |
| `protected int` | `timeout` | The network timeout in milliseconds. |
| `protected Set<String>` | `tlsPreferredCipherSuites` | Which TLS/SSL cipher suites to support. |
| `protected Set<String>` | `tlsPreferredProtocols` | Which TLS/SSL protocols to support. |
| `protected boolean` | `useHttp11` | Do we use HTTP/1.1? |
| `protected boolean` | `useProxy` | Indicates if a proxy is used. |
| `protected String` | `userAgent` | The Nutch 'User-Agent' request header. |


Fields inherited from interface org.apache.nutch.protocol.Protocol

CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID

Constructor Summary

| Constructor | Description |
| --- | --- |
| `HttpBase()` | Creates a new instance of HttpBase |
| `HttpBase(org.slf4j.Logger logger)` | Creates a new instance of HttpBase |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| `String` | `getAccept()` | |
| `String` | `getAcceptLanguage()` | Value of "Accept-Language" request header sent by Nutch. |
| `org.apache.hadoop.conf.Configuration` | `getConf()` | |
| `int` | `getMaxContent()` | |
| `ProtocolOutput` | `getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum)` | Returns the Content for a fetchlist entry. |
| `String` | `getProxyHost()` | |
| `int` | `getProxyPort()` | |
| `protected abstract Response` | `getResponse(URL url, CrawlDatum datum, boolean followRedirects)` | |
| `crawlercommons.robots.BaseRobotRules` | `getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum)` | Retrieve robot rules applicable for this url. |
| `int` | `getTimeout()` | |
| `Set<String>` | `getTlsPreferredCipherSuites()` | |
| `Set<String>` | `getTlsPreferredProtocols()` | |
| `boolean` | `getUseHttp11()` | |
| `String` | `getUserAgent()` | |
| `protected void` | `logConf()` | |
| `protected static void` | `main(HttpBase http, String[] args)` | |
| `byte[]` | `processDeflateEncoded(byte[] compressed, URL url)` | |
| `byte[]` | `processGzipEncoded(byte[] compressed, URL url)` | |
| `void` | `setConf(org.apache.hadoop.conf.Configuration conf)` | |
| `boolean` | `useProxy()` | |


Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail


RESPONSE_TIME

public static final org.apache.hadoop.io.Text RESPONSE_TIME

BUFFER_SIZE

public static final int BUFFER_SIZE
  - See Also: [Constant Field Values](../../../../../../constant-values.html#org.apache.nutch.protocol.http.api.HttpBase.BUFFER_SIZE)

proxyHost

protected String proxyHost

The proxy hostname.


proxyPort

protected int proxyPort

The proxy port.


useProxy

protected boolean useProxy

Indicates whether a proxy is used.
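
As an illustration of how the three proxy fields relate, the following minimal sketch maps values shaped like `useProxy`, `proxyHost` and `proxyPort` onto a `java.net.Proxy`. The class `ProxySketch` and its method `toProxy` are hypothetical names introduced here; this is not how HttpBase or its subclasses necessarily wire up the proxy.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Hypothetical helper, not part of Nutch.
class ProxySketch {

    /** Returns an HTTP proxy when one is configured, otherwise a direct connection. */
    public static Proxy toProxy(boolean useProxy, String proxyHost, int proxyPort) {
        if (!useProxy || proxyHost == null || proxyHost.isEmpty()) {
            return Proxy.NO_PROXY;
        }
        // createUnresolved avoids a DNS lookup when the Proxy object is built.
        return new Proxy(Proxy.Type.HTTP,
                InetSocketAddress.createUnresolved(proxyHost, proxyPort));
    }

    public static void main(String[] args) {
        // Host and port below are placeholder values.
        System.out.println(toProxy(true, "proxy.example.org", 8080));
        System.out.println(toProxy(false, null, 0));
    }
}
```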


timeout

protected int timeout

The network timeout in milliseconds.


maxContent

protected int maxContent

The length limit for downloaded content, in bytes.


userAgent

protected String userAgent

The Nutch 'User-Agent' request header.


acceptLanguage

protected String acceptLanguage

The "Accept-Language" request header value.


accept

protected String accept

The "Accept" request header value.
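
The `userAgent`, `acceptLanguage` and `accept` fields each hold the value of one request header. A minimal sketch of assembling them into a header map could look like the following; `RequestHeaderSketch` and `buildHeaders` are hypothetical names, and skipping unset values is an assumption rather than documented HttpBase behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
class RequestHeaderSketch {

    /** Collects the non-empty header values in a stable order. */
    public static Map<String, String> buildHeaders(String userAgent,
                                                   String accept,
                                                   String acceptLanguage) {
        Map<String, String> headers = new LinkedHashMap<>();
        putIfPresent(headers, "User-Agent", userAgent);
        putIfPresent(headers, "Accept", accept);
        putIfPresent(headers, "Accept-Language", acceptLanguage);
        return headers;
    }

    // Assumption: a null or blank value means the header is not sent.
    private static void putIfPresent(Map<String, String> headers,
                                     String name, String value) {
        if (value != null && !value.trim().isEmpty()) {
            headers.put(name, value.trim());
        }
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        buildHeaders("ExampleCrawler/1.0", "text/html", "en-us,en;q=0.7")
                .forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```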


useHttp11

protected boolean useHttp11

Whether HTTP/1.1 is used.


responseTime

protected boolean responseTime

Record response time in CrawlDatum's meta data, see property http.store.responsetime.


maxCrawlDelay

protected long maxCrawlDelay

Skip page if Crawl-Delay longer than this value.
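
The check this field describes can be sketched as a simple comparison. The names `CrawlDelaySketch` and `shouldSkip` are hypothetical, both arguments are assumed to use the same time unit, and treating a negative limit as "no upper bound" is an assumption made for the example.

```java
// Hypothetical helper, not part of Nutch.
class CrawlDelaySketch {

    /**
     * Returns true when the Crawl-Delay requested by robots.txt exceeds the
     * configured maximum. Both values must use the same unit.
     */
    public static boolean shouldSkip(long crawlDelay, long maxCrawlDelay) {
        // Assumption: a negative maximum disables the check.
        return maxCrawlDelay >= 0 && crawlDelay > maxCrawlDelay;
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip(60_000L, 30_000L));
        System.out.println(shouldSkip(5_000L, 30_000L));
    }
}
```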


tlsPreferredProtocols

protected Set<String> tlsPreferredProtocols

Which TLS/SSL protocols to support.


tlsPreferredCipherSuites

protected Set<String> tlsPreferredCipherSuites

Which TLS/SSL cipher suites to support.
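
One plausible way for a subclass to use `tlsPreferredProtocols` and `tlsPreferredCipherSuites` is to intersect them with what the JVM supports before enabling them on a TLS engine or socket. The sketch below uses only the standard `javax.net.ssl` API; `TlsPreferenceSketch` and `retainPreferred` are hypothetical names, and the protocol names in `main` are example values, not Nutch defaults.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

// Hypothetical helper, not part of Nutch.
class TlsPreferenceSketch {

    /** Keeps only the supported entries that also appear in the preferred set. */
    public static String[] retainPreferred(String[] supported, Set<String> preferred) {
        Set<String> result = new LinkedHashSet<>(Arrays.asList(supported));
        result.retainAll(preferred);
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        Set<String> preferredProtocols =
                new LinkedHashSet<>(Arrays.asList("TLSv1.2", "TLSv1.3"));

        String[] protocols =
                retainPreferred(engine.getSupportedProtocols(), preferredProtocols);
        // Only override the JVM defaults when at least one preference is supported.
        if (protocols.length > 0) {
            engine.setEnabledProtocols(protocols);
        }
        System.out.println(Arrays.toString(engine.getEnabledProtocols()));
    }
}
```

The same `retainPreferred` call works for cipher suites with `getSupportedCipherSuites()` and `setEnabledCipherSuites(...)`.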

Constructor Detail


HttpBase

public HttpBase()

Creates a new instance of HttpBase


HttpBase

public HttpBase(org.slf4j.Logger logger)

Creates a new instance of HttpBase

Method Detail


setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
  - Specified by: <code>setConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>

getConf

public org.apache.hadoop.conf.Configuration getConf()
  - Specified by: <code>getConf</code> in interface <code>org.apache.hadoop.conf.Configurable</code>

getProtocolOutput

public ProtocolOutput getProtocolOutput(org.apache.hadoop.io.Text url,
                               CrawlDatum datum)

Description copied from interface: Protocol

Returns the Content for a fetchlist entry.

  - Specified by: <code>getProtocolOutput</code> in interface <code>Protocol</code>

getProxyHost

public String getProxyHost()

getProxyPort

public int getProxyPort()

useProxy

public boolean useProxy()

getTimeout

public int getTimeout()

getMaxContent

public int getMaxContent()

getUserAgent

public String getUserAgent()

getAcceptLanguage

public String getAcceptLanguage()

Value of "Accept-Language" request header sent by Nutch.

  - Returns: The value of the "Accept-Language" header.

getAccept

public String getAccept()

getUseHttp11

public boolean getUseHttp11()

getTlsPreferredCipherSuites

public Set<String> getTlsPreferredCipherSuites()

getTlsPreferredProtocols

public Set<String> getTlsPreferredProtocols()

logConf

protected void logConf()

processGzipEncoded

public byte[] processGzipEncoded(byte[] compressed,
                        URL url)
                          throws IOException
  - Throws: <code>IOException</code>
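
This page does not describe how `processGzipEncoded` decodes its input. As a rough illustration of the general technique, the sketch below decompresses a gzip byte array with `java.util.zip.GZIPInputStream` and truncates the result at a byte limit. `GzipSketch`, `gunzip` and `gzip` are hypothetical names, and both the `maxContent` parameter and the "negative means unlimited" rule are assumptions for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Nutch.
class GzipSketch {

    /** Decompresses gzip data, keeping at most maxContent bytes (negative = no limit). */
    public static byte[] gunzip(byte[] compressed, int maxContent) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                if (maxContent >= 0 && out.size() + n > maxContent) {
                    // Write only the bytes that still fit, then stop reading.
                    out.write(buffer, 0, maxContent - out.size());
                    break;
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Compresses data with gzip; used here only to produce test input. */
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = gzip("hello nutch".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(gunzip(packed, -1), StandardCharsets.UTF_8));
    }
}
```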

processDeflateEncoded

public byte[] processDeflateEncoded(byte[] compressed,
                           URL url)
                             throws IOException
  - Throws: <code>IOException</code>
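
Servers that send `Content-Encoding: deflate` may deliver either zlib-wrapped data or a raw deflate stream. The sketch below shows one common way to handle both with `java.util.zip.Inflater`; it illustrates the technique only and is not taken from the `processDeflateEncoded` implementation. `DeflateSketch` and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of Nutch.
class DeflateSketch {

    /** Inflates data; nowrap=true expects a raw stream without the zlib header. */
    public static byte[] inflate(byte[] compressed, boolean nowrap) throws IOException {
        Inflater inflater = new Inflater(nowrap);
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input or a preset dictionary is required
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IOException("Invalid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Tries the zlib-wrapped format first, then falls back to raw deflate. */
    public static byte[] inflateLenient(byte[] compressed) throws IOException {
        try {
            return inflate(compressed, false);
        } catch (IOException e) {
            return inflate(compressed, true);
        }
    }

    /** Compresses data; used here only to produce test input. */
    public static byte[] deflate(byte[] raw, boolean nowrap) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```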

main

protected static void main(HttpBase http,
        String[] args)
                    throws Exception
  - Throws: <code>Exception</code>

getResponse

protected abstract Response getResponse(URL url,
                   CrawlDatum datum,
                   boolean followRedirects)
                                 throws ProtocolException,
                                        IOException
  - Throws: <code>ProtocolException</code>, <code>IOException</code>

getRobotRules

public crawlercommons.robots.BaseRobotRules getRobotRules(org.apache.hadoop.io.Text url,
                                                 CrawlDatum datum)

Description copied from interface: Protocol

Retrieve robot rules applicable for this url.

  - Specified by: <code>getRobotRules</code> in interface <code>Protocol</code>
  - Parameters:
    - <code>url</code> - url to check
    - <code>datum</code> - page datum
  - Returns: robot rules (specific for this url or default), never null

Copyright © 2014 The Apache Software Foundation