org.apache.nutch.crawl
Class CrawlDbReader
- java.lang.Object
- org.apache.nutch.crawl.CrawlDbReader
- All Implemented Interfaces:
- Closeable, AutoCloseable
public class CrawlDbReader extends Object implements Closeable
Read utility for the CrawlDB.
- Author:
- Andrzej Bialecki
Nested Class Summary
Nested Classes
Modifier and Type    Class and Description
static class    CrawlDbReader.CrawlDatumCsvOutputFormat
static class    CrawlDbReader.CrawlDbDumpMapper
static class    CrawlDbReader.CrawlDbStatCombiner
static class    CrawlDbReader.CrawlDbStatMapper
static class    CrawlDbReader.CrawlDbStatReducer
static class    CrawlDbReader.CrawlDbTopNMapper
static class    CrawlDbReader.CrawlDbTopNReducer
Field Summary
Fields
Modifier and Type    Field and Description
static org.slf4j.Logger    LOG
Constructor Summary
Constructors
Constructor and Description
CrawlDbReader()
Method Summary
Methods
Modifier and Type    Method and Description
void    close()
CrawlDatum    get(String crawlDb, String url, org.apache.hadoop.conf.Configuration config)
static void    main(String[] args)
void    processDumpJob(String crawlDb, String output, org.apache.hadoop.conf.Configuration config, String format, String regex, String status, Integer retry)
void    processStatJob(String crawlDb, org.apache.hadoop.conf.Configuration config, boolean sort)
void    processTopNJob(String crawlDb, long topN, float min, String output, org.apache.hadoop.conf.Configuration config)
void    readUrl(String crawlDb, String url, org.apache.hadoop.conf.Configuration config)
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
Constructor Detail
-
CrawlDbReader
public CrawlDbReader()
Method Detail
-
close
public void close()
- Specified by:
- close in interface Closeable
- Specified by:
- close in interface AutoCloseable
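Because the class implements Closeable, an instance can be managed with try-with-resources. The sketch below shows that pattern with a hypothetical stand-in class, StubReader, so that it runs without Nutch or Hadoop on the classpath. StubReader and its lookup method are illustrative assumptions, not part of Nutch; with the real class, new CrawlDbReader() would presumably take the place of new StubReader().

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for CrawlDbReader. It only mirrors the fact that the
// real class implements Closeable; it is NOT part of Nutch.
class StubReader implements Closeable {
    private boolean closed = false;

    // Placeholder for a lookup such as get(crawlDb, url, config).
    String lookup(String url) throws IOException {
        if (closed) {
            throw new IOException("reader already closed");
        }
        return "datum for " + url;
    }

    @Override
    public void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

class TryWithResourcesSketch {
    // Performs one lookup, then returns the reader so its state can be inspected.
    static StubReader lookupAndClose(String url) throws IOException {
        StubReader reader = new StubReader();
        try (StubReader r = reader) {
            System.out.println(r.lookup(url));
        } // r.close() runs here, even if lookup() throws
        return reader;
    }

    public static void main(String[] args) throws IOException {
        StubReader reader = lookupAndClose("http://example.com/");
        System.out.println("closed: " + reader.isClosed());
    }
}
```

The try block guarantees that close() is called once the lookup finishes, which is the main reason to prefer it over calling close() by hand.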
-
processStatJob
public void processStatJob(String crawlDb, org.apache.hadoop.conf.Configuration config, boolean sort) throws IOException
- Throws:
- IOException
-
get
public CrawlDatum get(String crawlDb, String url, org.apache.hadoop.conf.Configuration config) throws IOException
- Throws:
- IOException
-
readUrl
public void readUrl(String crawlDb, String url, org.apache.hadoop.conf.Configuration config) throws IOException
- Throws:
- IOException
-
processDumpJob
public void processDumpJob(String crawlDb, String output, org.apache.hadoop.conf.Configuration config, String format, String regex, String status, Integer retry) throws IOException
- Throws:
- IOException
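This page does not describe how the regex, status and retry arguments are applied. The following plain-Java sketch shows one plausible reading, under these assumptions: a null argument disables that filter, retry is a minimum retry count, status is compared case-insensitively, and regex must match the whole URL. The Entry type and the accept method are hypothetical and do not exist in Nutch.

```java
import java.util.regex.Pattern;

class DumpFilterSketch {
    // Hypothetical, simplified view of one CrawlDb record.
    static final class Entry {
        final String url;
        final String status;
        final int retries;

        Entry(String url, String status, int retries) {
            this.url = url;
            this.status = status;
            this.retries = retries;
        }
    }

    // Returns true if the entry passes every filter that is not null.
    // All filter semantics here are assumptions, not documented behavior.
    static boolean accept(Entry e, String regex, String status, Integer retry) {
        if (retry != null && e.retries < retry) {
            return false; // assumed: fewer retries than requested
        }
        if (status != null && !status.equalsIgnoreCase(e.status)) {
            return false; // assumed: status name must match
        }
        if (regex != null && !Pattern.matches(regex, e.url)) {
            return false; // assumed: regex must match the full URL
        }
        return true;
    }

    public static void main(String[] args) {
        Entry e = new Entry("http://example.com/a", "db_fetched", 2);
        System.out.println(accept(e, ".*example\\.com.*", "db_fetched", 1));
        System.out.println(accept(e, null, "db_unfetched", null));
    }
}
```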
-
processTopNJob
public void processTopNJob(String crawlDb, long topN, float min, String output, org.apache.hadoop.conf.Configuration config) throws IOException
- Throws:
- IOException
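The topN and min parameters suggest a selection by score. The sketch below illustrates one likely interpretation in plain Java, without Hadoop: entries scoring below min are dropped, the rest are ordered by descending score, and the first topN are kept. The inclusive threshold and the in-memory map are assumptions made for illustration; the real job runs over the CrawlDb and writes its result to output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopNSketch {
    // Returns up to topN URLs with score >= min, highest score first.
    // The threshold being inclusive is an assumption of this sketch.
    static List<String> topN(Map<String, Float> scores, long topN, float min) {
        List<Map.Entry<String, Float>> kept = new ArrayList<>();
        for (Map.Entry<String, Float> e : scores.entrySet()) {
            if (e.getValue() >= min) {
                kept.add(e);
            }
        }
        // Sort by score, descending.
        kept.sort(Map.Entry.<String, Float>comparingByValue().reversed());

        List<String> urls = new ArrayList<>();
        for (Map.Entry<String, Float> e : kept) {
            if (urls.size() >= topN) {
                break;
            }
            urls.add(e.getKey());
        }
        return urls;
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new LinkedHashMap<>();
        scores.put("a", 0.9f);
        scores.put("b", 0.2f);
        scores.put("c", 0.5f);
        scores.put("d", 0.7f);
        System.out.println(topN(scores, 2, 0.4f)); // prints [a, d]
    }
}
```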
-
main
public static void main(String[] args) throws IOException
- Throws:
- IOException