org.apache.nutch.crawl
Class CrawlDbReader
- java.lang.Object
- org.apache.nutch.crawl.CrawlDbReader
- All Implemented Interfaces:
- Closeable, AutoCloseable
public class CrawlDbReader extends Object implements Closeable
Read utility for the CrawlDB.
- Author:
- Andrzej Bialecki
Nested Class Summary
- static class CrawlDbReader.CrawlDatumCsvOutputFormat
- static class CrawlDbReader.CrawlDbDumpMapper
- static class CrawlDbReader.CrawlDbStatCombiner
- static class CrawlDbReader.CrawlDbStatMapper
- static class CrawlDbReader.CrawlDbStatReducer
- static class CrawlDbReader.CrawlDbTopNMapper
- static class CrawlDbReader.CrawlDbTopNReducer
Field Summary
- static org.slf4j.Logger LOG
Constructor Summary
- CrawlDbReader()
Method Summary
- void close()
- CrawlDatum get(String crawlDb, String url, org.apache.hadoop.conf.Configuration config)
- static void main(String[] args)
- void processDumpJob(String crawlDb, String output, org.apache.hadoop.conf.Configuration config, String format, String regex, String status, Integer retry)
- void processStatJob(String crawlDb, org.apache.hadoop.conf.Configuration config, boolean sort)
- void processTopNJob(String crawlDb, long topN, float min, String output, org.apache.hadoop.conf.Configuration config)
- void readUrl(String crawlDb, String url, org.apache.hadoop.conf.Configuration config)
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
Constructor Detail
-
CrawlDbReader
public CrawlDbReader()
Method Detail
-
close
public void close()
- Specified by:
- close in interface Closeable
- Specified by:
- close in interface AutoCloseable
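Because the class implements Closeable, an instance can be managed with a try-with-resources statement so that close() runs even if a lookup fails. The sketch below illustrates only that pattern. It is self-contained and uses a hypothetical stand-in class, StubReader, with a made-up lookup method. It does not call the Nutch API, and nothing here describes what CrawlDbReader.close() actually releases.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

public class CloseableReaderSketch {

    /** Hypothetical stand-in for a Closeable reader; NOT the Nutch class. */
    static class StubReader implements Closeable {
        private final List<String> events;

        StubReader(List<String> events) {
            this.events = events;
            events.add("open");
        }

        /** Pretend lookup (assumption); a real reader would consult the CrawlDb. */
        String lookup(String url) {
            events.add("lookup " + url);
            return "stub-result";
        }

        /** Records that the reader was closed. */
        @Override
        public void close() {
            events.add("close");
        }
    }

    /** Performs one lookup inside try-with-resources and returns the recorded events. */
    public static List<String> run(String url) {
        List<String> events = new ArrayList<>();
        try (StubReader reader = new StubReader(events)) {
            reader.lookup(url);
        } // close() is invoked here automatically, before the method returns
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run("http://example.org/"));
    }
}
```

With the real class, the same shape would apply: construct the reader in the try header and call get or readUrl in the body, handling the IOException those methods declare.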
-
processStatJob
public void processStatJob(String crawlDb, org.apache.hadoop.conf.Configuration config, boolean sort) throws IOException
- Throws:
- IOException
-
get
public CrawlDatum get(String crawlDb, String url, org.apache.hadoop.conf.Configuration config) throws IOException
- Throws:
- IOException
-
readUrl
public void readUrl(String crawlDb, String url, org.apache.hadoop.conf.Configuration config) throws IOException
- Throws:
- IOException
-
processDumpJob
public void processDumpJob(String crawlDb, String output, org.apache.hadoop.conf.Configuration config, String format, String regex, String status, Integer retry) throws IOException
- Throws:
- IOException
-
processTopNJob
public void processTopNJob(String crawlDb, long topN, float min, String output, org.apache.hadoop.conf.Configuration config) throws IOException
- Throws:
- IOException
-
main
public static void main(String[] args) throws IOException
- Throws:
- IOException
