org.apache.nutch.tools.arc
Class ArcRecordReader
- java.lang.Object
- org.apache.nutch.tools.arc.ArcRecordReader
- All Implemented Interfaces:
- org.apache.hadoop.mapred.RecordReader
public class ArcRecordReader extends Object implements org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable>
The ArcRecordReader class provides a record reader which reads records from arc files.
Arc files are essentially tars of gzips. Each record in an arc file is individually gzip-compressed, and the compressed records are concatenated to form a complete arc file. For more information on the arc file format, see http://www.archive.org/web/researcher/ArcFileFormat.php.
Arc files are used by the Internet Archive and Grub projects.
See http://www.archive.org/
See http://www.grub.org/
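The layout described above (independently gzipped records placed back to back) can be illustrated with the JDK alone. The following is a minimal sketch, not Nutch code: the class and method names are hypothetical, it uses only java.util.zip, and it does not parse real arc record headers. It shows that each compressed record starts with the gzip magic bytes and that a concatenation of such records is still readable as one gzip stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of org.apache.nutch.tools.arc.
final class ConcatenatedGzipSketch {

  /** Compresses one record into its own complete gzip member. */
  static byte[] gzipMember(String record) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
      gzip.write(record.getBytes(StandardCharsets.UTF_8));
    }
    return buffer.toByteArray();
  }

  /** Places the given gzip members back to back, as in an arc-style file. */
  static byte[] concatenate(byte[]... members) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] member : members) {
      out.write(member, 0, member.length);
    }
    return out.toByteArray();
  }

  /** Decompresses the data, continuing across concatenated gzip members. */
  static String gunzipAll(byte[] data) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
      byte[] chunk = new byte[4096];
      int n;
      while ((n = gzip.read(chunk)) != -1) {
        out.write(chunk, 0, n);
      }
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }
}
```

Because every member begins with the same two magic bytes, a reader that starts in the middle of a file can scan forward for them to find a candidate record boundary, which is presumably why this class exposes isMagic.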
Field Summary
Fields Modifier and Type Field and Description protected org.apache.hadoop.conf.Configuration
conf
protected long
fileLen
protected org.apache.hadoop.fs.FSDataInputStream
in
static org.slf4j.Logger
LOG
protected long
pos
protected long
splitEnd
protected long
splitLen
protected long
splitStart
Constructor Summary
Constructors Constructor and Description ArcRecordReader(org.apache.hadoop.conf.Configuration conf,
org.apache.hadoop.mapred.FileSplit split)
Constructor that sets the configuration and file split.
Method Summary
Methods Modifier and Type Method and Description void
close()
Closes the record reader resources.
org.apache.hadoop.io.Text
createKey()
Creates a new instance of the Text object for the key.
org.apache.hadoop.io.BytesWritable
createValue()
Creates a new instance of the BytesWritable object for the value.
long
getPos()
Returns the current position in the file.
float
getProgress()
Returns the percentage of progress in processing the file.
static boolean
isMagic(byte[] input)
Returns true if the byte array passed matches the gzip header magic number.
boolean
next(org.apache.hadoop.io.Text key,
org.apache.hadoop.io.BytesWritable value)
Returns true if the next record in the split is read into the key and value pair.
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
-
conf
protected org.apache.hadoop.conf.Configuration conf
-
splitStart
protected long splitStart
-
pos
protected long pos
-
splitEnd
protected long splitEnd
-
splitLen
protected long splitLen
-
fileLen
protected long fileLen
-
in
protected org.apache.hadoop.fs.FSDataInputStream in
Constructor Detail
-
ArcRecordReader
public ArcRecordReader(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.mapred.FileSplit split) throws IOException
Constructor that sets the configuration and file split.
- Parameters:
- <code>conf</code> - The job configuration.
- <code>split</code> - The file split to read from.
- Throws:
- <code>IOException</code> - If an IO error occurs while initializing file split.
Method Detail
-
isMagic
public static boolean isMagic(byte[] input)
Returns true if the byte array passed matches the gzip header magic number.
- Parameters:
- <code>input</code> - The byte array to check.
- Returns:
- True if the byte array matches the gzip header magic number.
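As an illustration of what a gzip magic-number check involves, here is a minimal standalone sketch. It is an assumption-laden stand-in, not the body of isMagic: the class and method names are hypothetical, and the real method may differ in details such as how it treats arrays longer than the magic number. The two byte values themselves come from the gzip format (RFC 1952).

```java
// Hypothetical illustration class; not part of org.apache.nutch.tools.arc.
final class GzipMagicSketch {

  // Every gzip member starts with ID1 = 0x1f and ID2 = 0x8b (RFC 1952).
  private static final byte[] GZIP_MAGIC = {(byte) 0x1f, (byte) 0x8b};

  /** Returns true if the array is non-null and begins with the gzip ID bytes. */
  static boolean startsWithGzipMagic(byte[] input) {
    if (input == null || input.length < GZIP_MAGIC.length) {
      return false;
    }
    for (int i = 0; i < GZIP_MAGIC.length; i++) {
      if (input[i] != GZIP_MAGIC[i]) {
        return false;
      }
    }
    return true;
  }
}
```

The casts to byte matter because Java bytes are signed, so the literal 0x8b does not fit in a byte without one.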
-
close
public void close() throws IOException
Closes the record reader resources.
- Specified by:
- <code>close</code> in interface <code>org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable></code>
- Throws:
- <code>IOException</code>
-
createKey
public org.apache.hadoop.io.Text createKey()
Creates a new instance of the Text object for the key.
- Specified by:
- <code>createKey</code> in interface <code>org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable></code>
-
createValue
public org.apache.hadoop.io.BytesWritable createValue()
Creates a new instance of the BytesWritable object for the value.
- Specified by:
- <code>createValue</code> in interface <code>org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable></code>
-
getPos
public long getPos() throws IOException
Returns the current position in the file.
- Specified by:
- <code>getPos</code> in interface <code>org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable></code>
- Returns:
- The current position in the file, as a long.
- Throws:
- <code>IOException</code>
-
getProgress
public float getProgress() throws IOException
Returns the progress made in processing the file. The value is a float from 0 to 1, with 1 meaning 100% completed.
- Specified by:
- <code>getProgress</code> in interface <code>org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable></code>
- Returns:
- The percentage of progress as a float from 0 to 1.
- Throws:
- <code>IOException</code>
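One plausible way to derive a 0-to-1 progress value from fields like splitStart, splitLen and pos is sketched below. This is an assumption about the arithmetic, not the documented implementation: the class and method names are hypothetical, and treating an empty split as complete and clamping out-of-range positions are choices made for the sketch.

```java
// Hypothetical illustration class; not part of org.apache.nutch.tools.arc.
final class SplitProgressSketch {

  /**
   * Returns how far pos has advanced through the split as a fraction in [0, 1].
   * An empty split is reported as complete (an assumption of this sketch).
   */
  static float progress(long splitStart, long splitLen, long pos) {
    if (splitLen <= 0) {
      return 1.0f;
    }
    float fraction = (float) (pos - splitStart) / (float) splitLen;
    // Clamp so positions before the start or past the end stay in range.
    return Math.max(0.0f, Math.min(1.0f, fraction));
  }
}
```

Clamping is useful here because a reader may read slightly past the end of its split to finish the last record, which would otherwise report more than 100%.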
-
next
public boolean next(org.apache.hadoop.io.Text key, org.apache.hadoop.io.BytesWritable value) throws IOException
Returns true if the next record in the split is read into the key and value pair. The key will be the arc record header and the value will be the raw content bytes of the arc record.
- Specified by:
- <code>next</code> in interface <code>org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable></code>
- Parameters:
- <code>key</code> - The record key
- <code>value</code> - The record value
- Returns:
- True if the next record is read.
- Throws:
- <code>IOException</code> - If an error occurs while reading the record value.
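The methods above follow the usual calling pattern for this style of record reader: create one key and one value, pass them to next repeatedly until it returns false, then close the reader. The sketch below shows that loop without depending on Hadoop. SimpleReader is a hypothetical stand-in that only mirrors the method shapes listed on this page, and the in-memory reader uses StringBuilder in place of Text and BytesWritable; none of these names exist in Nutch or Hadoop.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration class; not part of Nutch or Hadoop.
final class ReaderLoopSketch {

  /** Stand-in mirroring createKey, createValue, next and close. */
  interface SimpleReader<K, V> {
    K createKey();
    V createValue();
    boolean next(K key, V value) throws IOException;
    void close() throws IOException;
  }

  /** Drains a reader, reusing a single key and value object for every record. */
  static <K, V> int countRecords(SimpleReader<K, V> reader) throws IOException {
    K key = reader.createKey();
    V value = reader.createValue();
    int count = 0;
    try {
      while (reader.next(key, value)) {
        count++; // a real caller would process key and value here
      }
    } finally {
      reader.close();
    }
    return count;
  }

  /** In-memory reader over {header, body} pairs, used only to exercise the loop. */
  static SimpleReader<StringBuilder, StringBuilder> inMemory(List<String[]> records) {
    final Iterator<String[]> it = records.iterator();
    return new SimpleReader<StringBuilder, StringBuilder>() {
      public StringBuilder createKey() { return new StringBuilder(); }
      public StringBuilder createValue() { return new StringBuilder(); }
      public boolean next(StringBuilder key, StringBuilder value) {
        if (!it.hasNext()) {
          return false;
        }
        String[] record = it.next();
        key.setLength(0);   // overwrite the reused key in place
        key.append(record[0]);
        value.setLength(0); // overwrite the reused value in place
        value.append(record[1]);
        return true;
      }
      public void close() { }
    };
  }
}
```

Since the key and value objects are overwritten on each call, a caller that needs to keep a record after the next call to next should copy it first.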