org.apache.nutch.parse
Class ParseSegment
- java.lang.Object
- org.apache.hadoop.conf.Configured
- org.apache.nutch.parse.ParseSegment
- All Implemented Interfaces:
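- Closeable, AutoCloseable, org.apache.hadoop.conf.Configurable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.WritableComparable<?>,Content,org.apache.hadoop.io.Text,ParseImpl>, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable,org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable>, org.apache.hadoop.util.Tool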
public class ParseSegment extends org.apache.hadoop.conf.Configured implements org.apache.hadoop.util.Tool, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.WritableComparable<?>,Content,org.apache.hadoop.io.Text,ParseImpl>, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable,org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable>
Field Summary
| Modifier and Type | Field and Description |
| --- | --- |
| static org.slf4j.Logger | LOG |
| static String | SKIP_TRUNCATED |
Constructor Summary
| Constructor and Description |
| --- |
| ParseSegment() |
| ParseSegment(org.apache.hadoop.conf.Configuration conf) |
Method Summary
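| Modifier and Type | Method and Description |
| --- | --- |
| void | close() |
| void | configure(org.apache.hadoop.mapred.JobConf job) |
| static boolean | isTruncated(Content content): Checks if the page's content is truncated. |
| static void | main(String[] args) |
| void | map(org.apache.hadoop.io.WritableComparable<?> key, Content content, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,ParseImpl> output, org.apache.hadoop.mapred.Reporter reporter) |
| void | parse(org.apache.hadoop.fs.Path segment) |
| void | reduce(org.apache.hadoop.io.Text key, Iterator<org.apache.hadoop.io.Writable> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable> output, org.apache.hadoop.mapred.Reporter reporter) |
| int | run(String[] args) |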
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
-
SKIP_TRUNCATED
public static final String SKIP_TRUNCATED
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.parse.ParseSegment.SKIP_TRUNCATED)
Constructor Detail
-
ParseSegment
public ParseSegment()
-
ParseSegment
public ParseSegment(org.apache.hadoop.conf.Configuration conf)
Method Detail
-
configure
public void configure(org.apache.hadoop.mapred.JobConf job)
- Specified by:
- <code>configure</code> in interface <code>org.apache.hadoop.mapred.JobConfigurable</code>
-
close
public void close()
- Specified by:
- <code>close</code> in interface <code>Closeable</code>
- Specified by:
- <code>close</code> in interface <code>AutoCloseable</code>
-
map
public void map(org.apache.hadoop.io.WritableComparable<?> key, Content content, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,ParseImpl> output, org.apache.hadoop.mapred.Reporter reporter) throws IOException
- Specified by:
- <code>map</code> in interface <code>org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.WritableComparable<?>,Content,org.apache.hadoop.io.Text,ParseImpl></code>
- Throws:
- <code>IOException</code>
-
isTruncated
public static boolean isTruncated(Content content)
Checks if the page's content is truncated.
- Parameters:
- <code>content</code> - the page content to check
- Returns:
- <code>true</code> if the page is truncated; <code>false</code> if it is not, or if truncation could not be determined.
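
The description above does not say how truncation is detected. One plausible reading, sketched here with plain JDK types instead of Nutch's `Content` class, compares the size declared in a `Content-Length` header with the number of bytes actually stored. The class name `TruncationSketch`, the `Map` of headers, and the case-sensitive header lookup are illustrative assumptions, not the Nutch API.

```java
import java.util.Map;

public class TruncationSketch {

    /**
     * Hypothetical stand-in for a truncation check: returns true only when
     * the declared length is known and larger than the stored body.
     * Any case where the answer cannot be determined yields false.
     */
    static boolean isTruncated(byte[] body, Map<String, String> headers) {
        if (body == null || headers == null) {
            return false; // nothing to compare against
        }
        // Assumption: the header is stored under exactly this key.
        String declared = headers.get("Content-Length");
        if (declared == null || declared.trim().isEmpty()) {
            return false; // no declared length, cannot be determined
        }
        int declaredSize;
        try {
            declaredSize = Integer.parseInt(declared.trim());
        } catch (NumberFormatException e) {
            return false; // malformed header, cannot be determined
        }
        // Fewer bytes stored than the server announced: treat as truncated.
        return declaredSize > body.length;
    }

    public static void main(String[] args) {
        byte[] body = new byte[100];
        System.out.println(isTruncated(body, Map.of("Content-Length", "250")));
        System.out.println(isTruncated(body, Map.of("Content-Length", "100")));
        System.out.println(isTruncated(body, Map.of()));
    }
}
```

Returning `false` for every undeterminable case mirrors the documented return contract: a page is reported as truncated only when there is positive evidence for it.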
-
reduce
public void reduce(org.apache.hadoop.io.Text key, Iterator<org.apache.hadoop.io.Writable> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable> output, org.apache.hadoop.mapred.Reporter reporter) throws IOException
- Specified by:
- <code>reduce</code> in interface <code>org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable,org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable></code>
- Throws:
- <code>IOException</code>
-
parse
public void parse(org.apache.hadoop.fs.Path segment) throws IOException
- Throws:
- <code>IOException</code>
-
main
public static void main(String[] args) throws Exception
- Throws:
- <code>Exception</code>
-
run
public int run(String[] args) throws Exception
- Specified by:
- <code>run</code> in interface <code>org.apache.hadoop.util.Tool</code>
- Throws:
- <code>Exception</code>