org.apache.nutch.tools.arc

Class ArcRecordReader

    • All Implemented Interfaces:
    • org.apache.hadoop.mapred.RecordReader

public class ArcRecordReader
extends Object
implements org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable>

The ArcRecordReader class is a record reader that reads records from arc files.

Arc files are essentially tars of gzips: each record in an arc file is individually gzip-compressed, and the compressed records are concatenated to form a complete arc. For more information on the arc file format, see http://www.archive.org/web/researcher/ArcFileFormat.php .
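
The layout described above can be illustrated with the JDK alone. The following is a minimal sketch, not the Nutch implementation: the class name `ArcLayoutSketch` and its methods are hypothetical, and the record payloads are placeholders rather than real arc records. It compresses two records as separate gzip members, concatenates them, and relies on `java.util.zip.GZIPInputStream` continuing across member boundaries when decompressing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of Nutch or Hadoop.
final class ArcLayoutSketch {

    /** Compresses one record into a standalone gzip member. */
    static byte[] gzipMember(String record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(record.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Places the members back to back, with nothing in between. */
    static byte[] concatenate(List<byte[]> members) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] member : members) {
            out.write(member, 0, member.length);
        }
        return out.toByteArray();
    }

    /** Decompresses every member in sequence and returns the joined text. */
    static String gunzipAll(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

In this sketch each member begins with the gzip magic bytes, so the second record starts exactly at the offset where the first member ends.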

Arc files are used by the Internet Archive and Grub projects. See http://www.archive.org/ and http://www.grub.org/ .

Field Summary

Fields (Modifier and Type, Field):

  - <code>protected org.apache.hadoop.conf.Configuration conf</code>
  - <code>protected long fileLen</code>
  - <code>protected org.apache.hadoop.fs.FSDataInputStream in</code>
  - <code>static org.slf4j.Logger LOG</code>
  - <code>protected long pos</code>
  - <code>protected long splitEnd</code>
  - <code>protected long splitLen</code>
  - <code>protected long splitStart</code>

Constructor Summary

Constructors (Constructor and Description):

  - <code>ArcRecordReader(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.mapred.FileSplit split)</code> - Constructor that sets the configuration and file split.

Method Summary

Methods (Modifier and Type, Method and Description):

  - <code>void close()</code> - Closes the record reader resources.
  - <code>org.apache.hadoop.io.Text createKey()</code> - Creates a new instance of the Text object for the key.
  - <code>org.apache.hadoop.io.BytesWritable createValue()</code> - Creates a new instance of the BytesWritable object for the value.
  - <code>long getPos()</code> - Returns the current position in the file.
  - <code>float getProgress()</code> - Returns the progress in processing the file as a float from 0 to 1.
  - <code>static boolean isMagic(byte[] input)</code> - Returns true if the byte array passed matches the gzip header magic number.
  - <code>boolean next(org.apache.hadoop.io.Text key, org.apache.hadoop.io.BytesWritable value)</code> - Returns true if the next record in the split is read into the key and value pair.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

LOG

public static final org.slf4j.Logger LOG

conf

protected org.apache.hadoop.conf.Configuration conf

splitStart

protected long splitStart

pos

protected long pos

splitEnd

protected long splitEnd

splitLen

protected long splitLen

fileLen

protected long fileLen

in

protected org.apache.hadoop.fs.FSDataInputStream in

Constructor Detail

ArcRecordReader

public ArcRecordReader(org.apache.hadoop.conf.Configuration conf,
               org.apache.hadoop.mapred.FileSplit split)
                throws IOException

Constructor that sets the configuration and file split.

  - Parameters:
  - <code>conf</code> - The job configuration.
  - <code>split</code> - The file split to read from. 
  - Throws: 
  - <code>IOException</code> - If an I/O error occurs while initializing the file split.

Method Detail

isMagic

public static boolean isMagic(byte[] input)

Returns true if the byte array passed matches the gzip header magic number.

  - Parameters:
  - <code>input</code> - The byte array to check. 
  - Returns:
  - True if the byte array matches the gzip header magic number.       
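
As a rough illustration of such a check, the sketch below tests for the two-byte gzip magic number (0x1f 0x8b, per RFC 1952) using only the JDK. The class `GzipMagicSketch` and the method `startsWithGzipMagic` are hypothetical names, and the prefix-at-offset behavior is an assumption of this sketch: the documentation above does not say whether `isMagic` expects an array of exactly the magic length or accepts a longer buffer.

```java
// Hypothetical illustration class; not the Nutch implementation of isMagic.
final class GzipMagicSketch {

    // Every gzip member starts with ID1 = 0x1f and ID2 = 0x8b (RFC 1952).
    private static final byte[] GZIP_MAGIC = {(byte) 0x1f, (byte) 0x8b};

    /** Returns true if the bytes at the given offset begin with the gzip magic number. */
    static boolean startsWithGzipMagic(byte[] data, int offset) {
        if (data == null || offset < 0 || data.length - offset < GZIP_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < GZIP_MAGIC.length; i++) {
            if (data[offset + i] != GZIP_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A reader scanning a split for the start of the next record could apply a check like this at successive offsets, although a match inside compressed data is possible, so a match alone does not prove a record boundary.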

close

public void close()
           throws IOException

Closes the record reader resources.

  - Specified by: 
  - <code>close</code> in interface <code>org.apache.hadoop.mapred.RecordReader&lt;org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable&gt;</code>
  - Throws: 
  - <code>IOException</code>       

createKey

public org.apache.hadoop.io.Text createKey()

Creates a new instance of the Text object for the key.

  - Specified by: 
  - <code>createKey</code> in interface <code>org.apache.hadoop.mapred.RecordReader&lt;org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable&gt;</code>

createValue

public org.apache.hadoop.io.BytesWritable createValue()

Creates a new instance of the BytesWritable object for the value.

  - Specified by: 
  - <code>createValue</code> in interface <code>org.apache.hadoop.mapred.RecordReader&lt;org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable&gt;</code>

getPos

public long getPos()
            throws IOException

Returns the current position in the file.

  - Specified by: 
  - <code>getPos</code> in interface <code>org.apache.hadoop.mapred.RecordReader&lt;org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable&gt;</code>
  - Returns:
  - The current position in the file, as a long.
  - Throws: 
  - <code>IOException</code>       

getProgress

public float getProgress()
                  throws IOException

Returns the progress in processing the file, represented as a float from 0 to 1, where 1 means 100% completed.

  - Specified by: 
  - <code>getProgress</code> in interface <code>org.apache.hadoop.mapred.RecordReader&lt;org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable&gt;</code>
  - Returns:
  - The progress as a float from 0 to 1, where 1 means 100% completed.
  - Throws: 
  - <code>IOException</code>       
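
One plausible way to derive such a value from the `splitStart`, `splitEnd`, and `pos` fields is sketched below. This is an assumption about the arithmetic, not the confirmed Nutch implementation; the class `SplitProgressSketch` and its `progress` method are hypothetical.

```java
// Hypothetical illustration class; the formula is an assumption, not confirmed by the source.
final class SplitProgressSketch {

    /**
     * Fraction of the split consumed so far, clamped to the range [0, 1].
     * An empty split reports 0 to avoid dividing by zero.
     */
    static float progress(long splitStart, long splitEnd, long pos) {
        long splitLen = splitEnd - splitStart;
        if (splitLen <= 0) {
            return 0.0f;
        }
        float fraction = (pos - splitStart) / (float) splitLen;
        return Math.max(0.0f, Math.min(1.0f, fraction));
    }
}
```

Clamping matters because a reader may finish its last record slightly past `splitEnd`, which would otherwise report a value greater than 1.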

next

public boolean next(org.apache.hadoop.io.Text key,
           org.apache.hadoop.io.BytesWritable value)
             throws IOException

Returns true if the next record in the split is read into the key and value pair. The key will be the arc record header and the value will be the raw content bytes of the arc record.

  - Specified by: 
  - <code>next</code> in interface <code>org.apache.hadoop.mapred.RecordReader&lt;org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable&gt;</code>
  - Parameters:
  - <code>key</code> - The record key
  - <code>value</code> - The record value 
  - Returns:
  - True if the next record is read. 
  - Throws: 
  - <code>IOException</code> - If an error occurs while reading the record value.      
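
To make the key and value split concrete, the sketch below separates an already decompressed record into a header line and the remaining content bytes. It assumes the header is the first line of the record, ending at the first newline; that assumption, the class `ArcRecordSplitSketch`, and its methods are illustrative and are not taken from the Nutch source.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; assumes the header is the first line of the record.
final class ArcRecordSplitSketch {

    /** Returns the trimmed first line of the record, or the whole record if it has no newline. */
    static String headerOf(byte[] record) {
        int eol = firstNewline(record);
        int end = eol < 0 ? record.length : eol;
        return new String(record, 0, end, StandardCharsets.UTF_8).trim();
    }

    /** Returns the bytes after the first newline, or an empty array if there is none. */
    static byte[] contentOf(byte[] record) {
        int eol = firstNewline(record);
        return eol < 0 ? new byte[0] : Arrays.copyOfRange(record, eol + 1, record.length);
    }

    /** Index of the first '\n' byte, or -1 if the record contains none. */
    private static int firstNewline(byte[] record) {
        for (int i = 0; i < record.length; i++) {
            if (record[i] == '\n') {
                return i;
            }
        }
        return -1;
    }
}
```

With a record whose first line holds the URL, IP address, archive date, content type, and length, `headerOf` would correspond to the key and `contentOf` to the value in the terms used above.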

Copyright © 2014 The Apache Software Foundation