org.apache.nutch.parse
Class ParseSegment
- java.lang.Object
- org.apache.hadoop.conf.Configured
- org.apache.nutch.parse.ParseSegment
- All Implemented Interfaces:
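- Closeable, AutoCloseable, org.apache.hadoop.conf.Configurable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.WritableComparable<?>,Content,org.apache.hadoop.io.Text,ParseImpl>, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable,org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable>, org.apache.hadoop.util.Tool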
public class ParseSegment extends org.apache.hadoop.conf.Configured implements org.apache.hadoop.util.Tool, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.WritableComparable<?>,Content,org.apache.hadoop.io.Text,ParseImpl>, org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable,org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable>
Field Summary
Fields (Modifier and Type, Field and Description)
- static org.slf4j.Logger LOG
- static String SKIP_TRUNCATED
Constructor Summary
Constructors (Constructor and Description)
- ParseSegment()
- ParseSegment(org.apache.hadoop.conf.Configuration conf)
Method Summary
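Methods (Modifier and Type, Method and Description)
- void close()
- void configure(org.apache.hadoop.mapred.JobConf job)
- static boolean isTruncated(Content content): Checks if the page's content is truncated.
- static void main(String[] args)
- void map(org.apache.hadoop.io.WritableComparable<?> key, Content content, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,ParseImpl> output, org.apache.hadoop.mapred.Reporter reporter)
- void parse(org.apache.hadoop.fs.Path segment)
- void reduce(org.apache.hadoop.io.Text key, Iterator<org.apache.hadoop.io.Writable> values, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable> output, org.apache.hadoop.mapred.Reporter reporter)
- int run(String[] args)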
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
-
SKIP_TRUNCATED
public static final String SKIP_TRUNCATED
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.parse.ParseSegment.SKIP_TRUNCATED)
Constructor Detail
-
ParseSegment
public ParseSegment()
-
ParseSegment
public ParseSegment(org.apache.hadoop.conf.Configuration conf)
Method Detail
-
configure
public void configure(org.apache.hadoop.mapred.JobConf job)
- Specified by:
- <code>configure</code> in interface <code>org.apache.hadoop.mapred.JobConfigurable</code>
-
close
public void close()
- Specified by:
- <code>close</code> in interface <code>Closeable</code>
- Specified by:
- <code>close</code> in interface <code>AutoCloseable</code>
-
map
public void map(org.apache.hadoop.io.WritableComparable<?> key,
Content content,
org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,ParseImpl> output,
org.apache.hadoop.mapred.Reporter reporter)
throws IOException
- Specified by:
- <code>map</code> in interface <code>org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.WritableComparable<?>,Content,org.apache.hadoop.io.Text,ParseImpl></code>
- Throws:
- <code>IOException</code>
-
isTruncated
public static boolean isTruncated(Content content)
Checks if the page's content is truncated.
- Parameters:
- <code>content</code> - the page content to check
- Returns:
- <code>true</code> if the page is truncated; <code>false</code> if it is not, or if truncation could not be determined.
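- Example:
- The following is a minimal, self-contained sketch of one way such a truncation check could work. It is not the Nutch implementation. It assumes truncation is detected by comparing a declared length (for example an HTTP <code>Content-Length</code> header value) with the number of bytes actually stored, and that a missing or unparsable declared length counts as "could not be determined". The class <code>TruncationSketch</code> and its parameters are hypothetical stand-ins for <code>Content</code>.

```java
import java.nio.charset.StandardCharsets;

public class TruncationSketch {

    /**
     * Hypothetical stand-in for a truncation check (an assumption, not Nutch code).
     *
     * @param declaredLength the declared length as a string, or null if absent
     * @param body           the bytes that were actually stored (assumed non-null)
     * @return true only if the declared length is larger than the stored body
     */
    public static boolean isTruncated(String declaredLength, byte[] body) {
        if (declaredLength == null) {
            return false; // no declared length: cannot be determined
        }
        int declared;
        try {
            declared = Integer.parseInt(declaredLength.trim());
        } catch (NumberFormatException e) {
            return false; // unparsable declared length: cannot be determined
        }
        // Fewer stored bytes than declared means the body was cut short.
        return declared > body.length;
    }

    public static void main(String[] args) {
        byte[] body = "hello".getBytes(StandardCharsets.UTF_8); // 5 bytes
        System.out.println(isTruncated("10", body));  // declared 10, stored 5
        System.out.println(isTruncated("5", body));   // declared 5, stored 5
        System.out.println(isTruncated(null, body));  // no declared length
    }
}
```

- Returning <code>false</code> for the undetermined case keeps the check conservative: under this assumption a page is only reported as truncated when there is positive evidence that bytes are missing.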
-
reduce
public void reduce(org.apache.hadoop.io.Text key,
Iterator<org.apache.hadoop.io.Writable> values,
org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable> output,
org.apache.hadoop.mapred.Reporter reporter)
throws IOException
- Specified by:
- <code>reduce</code> in interface <code>org.apache.hadoop.mapred.Reducer<org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable,org.apache.hadoop.io.Text,org.apache.hadoop.io.Writable></code>
- Throws:
- <code>IOException</code>
-
parse
public void parse(org.apache.hadoop.fs.Path segment)
throws IOException
- Throws:
- <code>IOException</code>
-
main
public static void main(String[] args) throws Exception
- Throws:
- <code>Exception</code>
-
run
public int run(String[] args) throws Exception
- Specified by:
- <code>run</code> in interface <code>org.apache.hadoop.util.Tool</code>
- Throws:
- <code>Exception</code>
