org.apache.nutch.protocol.http.api
Class HttpBase
- java.lang.Object
- org.apache.nutch.protocol.http.api.HttpBase
public abstract class HttpBase extends Object implements Protocol
Field Summary
Fields

| Modifier and Type | Field and Description |
| --- | --- |
| `protected String` | `accept` The "Accept" request header value. |
| `protected String` | `acceptLanguage` The "Accept-Language" request header value. |
| `static int` | `BUFFER_SIZE` |
| `protected int` | `maxContent` The length limit for downloaded content, in bytes. |
| `protected long` | `maxCrawlDelay` Skip the page if the Crawl-Delay is longer than this value. |
| `protected String` | `proxyHost` The proxy hostname. |
| `protected int` | `proxyPort` The proxy port. |
| `static org.apache.hadoop.io.Text` | `RESPONSE_TIME` |
| `protected boolean` | `responseTime` Record the response time in the CrawlDatum's metadata; see the property http.store.responsetime. |
| `protected int` | `timeout` The network timeout, in milliseconds. |
| `protected Set<String>` | `tlsPreferredCipherSuites` Which TLS/SSL cipher suites to support |
| `protected Set<String>` | `tlsPreferredProtocols` Which TLS/SSL protocols to support |
| `protected boolean` | `useHttp11` Whether HTTP/1.1 is used. |
| `protected boolean` | `useProxy` Indicates whether a proxy is used. |
| `protected String` | `userAgent` The Nutch "User-Agent" request header value. |
-
Fields inherited from interface org.apache.nutch.protocol.Protocol
CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID
Constructor Summary
Constructors

| Constructor and Description |
| --- |
| `HttpBase()` Creates a new instance of HttpBase |
| `HttpBase(org.slf4j.Logger logger)` Creates a new instance of HttpBase |
Method Summary
Methods

| Modifier and Type | Method and Description |
| --- | --- |
| `String` | `getAccept()` |
| `String` | `getAcceptLanguage()` Value of "Accept-Language" request header sent by Nutch. |
| `org.apache.hadoop.conf.Configuration` | `getConf()` |
| `int` | `getMaxContent()` |
| `ProtocolOutput` | `getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum)` Returns the Content for a fetchlist entry. |
| `String` | `getProxyHost()` |
| `int` | `getProxyPort()` |
| `protected abstract Response` | `getResponse(URL url, CrawlDatum datum, boolean followRedirects)` |
| `crawlercommons.robots.BaseRobotRules` | `getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum)` Retrieve robot rules applicable for this url. |
| `int` | `getTimeout()` |
| `Set<String>` | `getTlsPreferredCipherSuites()` |
| `Set<String>` | `getTlsPreferredProtocols()` |
| `boolean` | `getUseHttp11()` |
| `String` | `getUserAgent()` |
| `protected void` | `logConf()` |
| `protected static void` | `main(HttpBase http, String[] args)` |
| `byte[]` | `processDeflateEncoded(byte[] compressed, URL url)` |
| `byte[]` | `processGzipEncoded(byte[] compressed, URL url)` |
| `void` | `setConf(org.apache.hadoop.conf.Configuration conf)` |
| `boolean` | `useProxy()` |
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
RESPONSE_TIME
public static final org.apache.hadoop.io.Text RESPONSE_TIME
-
BUFFER_SIZE
public static final int BUFFER_SIZE
- See Also:
- [Constant Field Values](../../../../../../constant-values.html#org.apache.nutch.protocol.http.api.HttpBase.BUFFER_SIZE)
-
proxyHost
protected String proxyHost
The proxy hostname.
-
proxyPort
protected int proxyPort
The proxy port.
-
useProxy
protected boolean useProxy
Indicates whether a proxy is used.
-
timeout
protected int timeout
The network timeout, in milliseconds.
-
maxContent
protected int maxContent
The length limit for downloaded content, in bytes.
-
userAgent
protected String userAgent
The Nutch "User-Agent" request header value.
-
acceptLanguage
protected String acceptLanguage
The "Accept-Language" request header value.
-
accept
protected String accept
The "Accept" request header value.
-
useHttp11
protected boolean useHttp11
Whether HTTP/1.1 is used.
-
responseTime
protected boolean responseTime
Record the response time in the CrawlDatum's metadata; see the property http.store.responsetime.
-
maxCrawlDelay
protected long maxCrawlDelay
Skip the page if the Crawl-Delay is longer than this value.
-
tlsPreferredProtocols
protected Set<String> tlsPreferredProtocols
Which TLS/SSL protocols to support
-
tlsPreferredCipherSuites
protected Set<String> tlsPreferredCipherSuites
Which TLS/SSL cipher suites to support
Constructor Detail
-
HttpBase
public HttpBase()
Creates a new instance of HttpBase
-
HttpBase
public HttpBase(org.slf4j.Logger logger)
Creates a new instance of HttpBase
Method Detail
-
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
- `setConf` in interface `org.apache.hadoop.conf.Configurable`
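For orientation, here is a minimal sketch of configuring an HttpBase-based protocol by hand. It assumes the protocol-http plugin's Http subclass is available and uses the standard Nutch property names (http.agent.name, http.timeout, http.content.limit); none of this is prescribed by this page.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.Http;      // concrete HttpBase subclass (protocol-http plugin)
import org.apache.nutch.util.NutchConfiguration;

public class ConfigureHttpBase {
  public static void main(String[] args) {
    // NutchConfiguration.create() layers nutch-default.xml / nutch-site.xml onto the Hadoop defaults.
    Configuration conf = NutchConfiguration.create();
    conf.set("http.agent.name", "my-test-crawler");  // feeds the userAgent field
    conf.setInt("http.timeout", 10000);              // network timeout in milliseconds
    conf.setInt("http.content.limit", 1024 * 1024);  // maxContent: truncate bodies beyond 1 MB

    Http http = new Http();
    http.setConf(conf);                              // HttpBase.setConf(conf) reads the properties above
    System.out.println("User agent: " + http.getUserAgent());
    System.out.println("Timeout:    " + http.getTimeout() + " ms");
  }
}
```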
-
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
- `getConf` in interface `org.apache.hadoop.conf.Configurable`
-
getProtocolOutput
public ProtocolOutput getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum)
Description copied from interface: Protocol
Returns the Content for a fetchlist entry.
- Specified by:
- `getProtocolOutput` in interface `Protocol`
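As a rough usage sketch (not part of this API page), fetching a single URL through an HttpBase subclass might look like the following; the choice of the Http plugin class and the ProtocolOutput/Content accessors are assumptions based on the wider Nutch API.

```java
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.protocol.http.Http;      // concrete HttpBase subclass (protocol-http plugin)
import org.apache.nutch.util.NutchConfiguration;

public class FetchOnePage {
  public static void main(String[] args) {
    Http http = new Http();
    http.setConf(NutchConfiguration.create());

    Text url = new Text("http://example.com/");
    CrawlDatum datum = new CrawlDatum();           // an empty fetchlist entry

    ProtocolOutput output = http.getProtocolOutput(url, datum);
    System.out.println("Status: " + output.getStatus());

    Content content = output.getContent();         // the fetched Content (may be empty on failure)
    System.out.println("Content-Type: " + content.getContentType());
    System.out.println("Bytes: " + content.getContent().length);
  }
}
```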
-
getProxyHost
public String getProxyHost()
-
getProxyPort
public int getProxyPort()
-
useProxy
public boolean useProxy()
-
getTimeout
public int getTimeout()
-
getMaxContent
public int getMaxContent()
-
getUserAgent
public String getUserAgent()
-
getAcceptLanguage
public String getAcceptLanguage()
Value of "Accept-Language" request header sent by Nutch.
- Returns:
- The value of the "Accept-Language" header.
-
getAccept
public String getAccept()
-
getUseHttp11
public boolean getUseHttp11()
-
getTlsPreferredCipherSuites
public Set<String> getTlsPreferredCipherSuites()
-
getTlsPreferredProtocols
public Set<String> getTlsPreferredProtocols()
-
logConf
protected void logConf()
-
processGzipEncoded
public byte[] processGzipEncoded(byte[] compressed, URL url) throws IOException
- Throws:
- `IOException`
-
processDeflateEncoded
public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException
- Throws:
- `IOException`
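These two helpers undo the gzip and deflate Content-Encoding of a fetched body while enforcing the configured content limit. A small round-trip sketch, assuming the protocol-http Http subclass; the compressed payload is fabricated here purely for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;
import org.apache.nutch.protocol.http.Http;      // concrete HttpBase subclass (protocol-http plugin)
import org.apache.nutch.util.NutchConfiguration;

public class GzipRoundTrip {
  public static void main(String[] args) throws IOException {
    Http http = new Http();
    http.setConf(NutchConfiguration.create());     // maxContent comes from http.content.limit

    // Compress a small payload to stand in for a gzip-encoded response body.
    byte[] original = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
      gz.write(original);
    }

    // processGzipEncoded inflates the body, truncating it at the configured maxContent.
    byte[] inflated = http.processGzipEncoded(buf.toByteArray(), new URL("http://example.com/"));
    System.out.println("Inflated " + inflated.length + " of " + original.length + " bytes");
  }
}
```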
-
main
protected static void main(HttpBase http, String[] args) throws Exception
- Throws:
- `Exception`
-
getResponse
protected abstract Response getResponse(URL url, CrawlDatum datum, boolean followRedirects) throws ProtocolException, IOException
- Throws:
- `ProtocolException`
- `IOException`
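This is the only abstract method listed here, so it is the hook a concrete protocol plugin supplies. A bare-bones subclass skeleton, purely illustrative: MyHttpResponse is a hypothetical stand-in for whatever Response implementation the plugin provides.

```java
import java.io.IOException;
import java.net.URL;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.net.protocols.Response;
import org.apache.nutch.protocol.ProtocolException;
import org.apache.nutch.protocol.http.api.HttpBase;

public class MyHttp extends HttpBase {

  public MyHttp() {
    super();                                       // use the default logger
  }

  @Override
  protected Response getResponse(URL url, CrawlDatum datum, boolean followRedirects)
      throws ProtocolException, IOException {
    // A real implementation opens the connection here, honouring timeout, maxContent,
    // userAgent, accept, acceptLanguage and the proxy settings, then wraps the result.
    return new MyHttpResponse(this, url, datum, followRedirects);  // hypothetical Response implementation
  }
}
```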
-
getRobotRules
public crawlercommons.robots.BaseRobotRules getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum)
Description copied from interface: Protocol
Retrieve robot rules applicable for this url.
- Specified by:
- `getRobotRules` in interface `Protocol`
- Parameters:
- `url` - url to check
- `datum` - page datum
- Returns:
- robot rules (specific for this url or default), never null
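A hedged sketch of acting on the returned rules, roughly mirroring how the fetcher applies maxCrawlDelay; the isAllowed and getCrawlDelay calls come from the crawler-commons BaseRobotRules API, and the 5-second threshold is only an example.

```java
import crawlercommons.robots.BaseRobotRules;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.http.Http;      // concrete HttpBase subclass (protocol-http plugin)
import org.apache.nutch.util.NutchConfiguration;

public class RobotsCheck {
  public static void main(String[] args) {
    Http http = new Http();
    http.setConf(NutchConfiguration.create());

    String pageUrl = "http://example.com/some/page.html";
    BaseRobotRules rules = http.getRobotRules(new Text(pageUrl), new CrawlDatum());

    if (!rules.isAllowed(pageUrl)) {
      System.out.println("Blocked by robots.txt: " + pageUrl);
    } else if (rules.getCrawlDelay() > 5000) {
      // Mirrors the maxCrawlDelay field: pages whose Crawl-Delay exceeds the limit are skipped.
      System.out.println("Crawl-Delay too long: " + rules.getCrawlDelay() + " ms");
    } else {
      System.out.println("OK to fetch: " + pageUrl);
    }
  }
}
```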