org.apache.nutch.segment
Interface SegmentMergeFilter
public interface SegmentMergeFilter
Interface used to filter segments during a segment merge. It allows filtering on more sophisticated criteria than URLs alone. In particular, it allows filtering based on metadata collected while parsing a page.
Field Summary
Modifier and Type: static String
Field: X_POINT_ID
Description: The name of the extension point.
Method Summary
Modifier and Type: boolean
Method: filter(org.apache.hadoop.io.Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked)
Description: The filtering method which gets all information being merged for a given key (URL).
Field Detail
X_POINT_ID
static final String X_POINT_ID
The name of the extension point.
Method Detail
filter
boolean filter(org.apache.hadoop.io.Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked)
The filtering method which gets all information being merged for a given key (URL).
Returns:
true if the values for this key (URL) should be merged into the new segment; false if they should be left out.
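As a rough illustration of filtering on parse metadata, the sketch below implements the filter signature with a rule that leaves out any URL whose parse metadata carries a skip flag. To stay self-contained it does not use the real Nutch or Hadoop classes: Text, CrawlDatum, Content, ParseData and ParseText are simplified stand-ins, the nested SegmentMergeFilter interface only mirrors the signature documented above, and the SkipFlagFilter class and the "merge.skip" metadata key are invented for this example. Treating a null parseData as "keep the entry" is also an assumption, not documented behaviour.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class SegmentMergeFilterSketch {

    // Hypothetical stand-ins for the Hadoop/Nutch types (NOT the real classes),
    // reduced to what this sketch needs.
    static class Text {
        private final String value;
        Text(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }
    static class CrawlDatum {}
    static class Content {}
    static class ParseText {}
    static class ParseData {
        private final Map<String, String> meta = new HashMap<>();
        void setMeta(String name, String value) { meta.put(name, value); }
        String getMeta(String name) { return meta.get(name); }
    }

    // Mirrors the documented method signature, using the stand-in types.
    interface SegmentMergeFilter {
        boolean filter(Text key, CrawlDatum generateData, CrawlDatum fetchData,
                CrawlDatum sigData, Content content, ParseData parseData,
                ParseText parseText, Collection<CrawlDatum> linked);
    }

    // Hypothetical filter: leave out URLs whose parse metadata has merge.skip=true.
    static class SkipFlagFilter implements SegmentMergeFilter {
        static final String SKIP_KEY = "merge.skip"; // invented key, for illustration only

        @Override
        public boolean filter(Text key, CrawlDatum generateData, CrawlDatum fetchData,
                CrawlDatum sigData, Content content, ParseData parseData,
                ParseText parseText, Collection<CrawlDatum> linked) {
            if (parseData == null) {
                // Assumption: with no parse data there is nothing to inspect, so keep the entry.
                return true;
            }
            // true = merge into the new segment, false = leave out.
            return !"true".equalsIgnoreCase(parseData.getMeta(SKIP_KEY));
        }
    }

    public static void main(String[] args) {
        SegmentMergeFilter filter = new SkipFlagFilter();
        Text url = new Text("http://example.com/");
        Collection<CrawlDatum> noLinks = Collections.emptyList();

        ParseData flagged = new ParseData();
        flagged.setMeta(SkipFlagFilter.SKIP_KEY, "true");

        // Flagged page is left out; an unflagged page is kept.
        System.out.println(filter.filter(url, null, null, null, null, flagged, null, noLinks));
        System.out.println(filter.filter(url, null, null, null, null, new ParseData(), null, noLinks));
    }
}
```

A real implementation would implement org.apache.nutch.segment.SegmentMergeFilter directly and be registered as a plugin for the extension point named by X_POINT_ID. The decision logic inside filter would keep the same shape.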