Uses of Interface
org.apache.nutch.parse.Parse
Packages that use Parse
Package: Description
org.apache.nutch.analysis.lang: Text document language identifier.
org.apache.nutch.crawl: Crawl control code and tools to run the crawler.
org.apache.nutch.indexer: Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index.
org.apache.nutch.indexer.anchor: An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.basic: A basic indexing plugin, adds basic fields: url, host, title, content, etc.
org.apache.nutch.indexer.feed: Indexing filter to index metadata from RSS feeds.
org.apache.nutch.indexer.metadata: Indexing filter to add document metadata to the index.
org.apache.nutch.indexer.more: A more indexing plugin, adds "more" index fields: last modified date, MIME type, content length.
org.apache.nutch.indexer.staticfield: A simple plugin called at indexing that adds fields with static data.
org.apache.nutch.indexer.subcollection: Indexing filter to assign documents to subcollections.
org.apache.nutch.indexer.tld: Top Level Domain Indexing plugin.
org.apache.nutch.indexer.urlmeta: URL Meta Tag Indexing Plugin.
org.apache.nutch.microformats.reltag: A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.apache.nutch.parse: The Parse interface and related classes.
org.apache.nutch.scoring: The ScoringFilter interface.
org.apache.nutch.scoring.depth: Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs).
org.apache.nutch.scoring.link: Scoring filter used in conjunction with WebGraph.
org.apache.nutch.scoring.opic: Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.tld: Top Level Domain Scoring plugin.
org.apache.nutch.scoring.urlmeta: URL Meta Tag Scoring Plugin.
org.creativecommons.nutch: Sample plugins that parse and index Creative Commons metadata.
Uses of Parse in org.apache.nutch.analysis.lang
Methods in org.apache.nutch.analysis.lang with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument LanguageIndexingFilter.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
Uses of Parse in org.apache.nutch.crawl
Methods in org.apache.nutch.crawl with parameters of type Parse
Modifier and Type, Method and Description
byte[] TextProfileSignature.calculate(Content content,
Parse parse)
abstract byte[] Signature.calculate(Content content,
Parse parse)
byte[] MD5Signature.calculate(Content content,
Parse parse)
Uses of Parse in org.apache.nutch.indexer
Methods in org.apache.nutch.indexer with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument IndexingFilters.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
Run all defined filters.
NutchDocument IndexingFilter.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
Adds fields or otherwise modifies the document that will be indexed for a parse.
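The filter contract described above can be illustrated with a minimal, self-contained sketch. It does not use the Nutch classes: `Parse`, `Doc`, and `Filter` below are simplified stand-ins (assumptions for illustration only), the `url`, `datum`, and `inlinks` parameters are omitted, and `TextLengthFilter` and the `textLength` field are hypothetical. The null check in `runAll` assumes that a filter may signal "drop this document" by returning null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexingFilterSketch {

    /** Stand-in for Parse: only the text accessor is modeled here. */
    interface Parse {
        String getText();
    }

    /** Stand-in for NutchDocument: a simple multi-valued field map. */
    static class Doc {
        final Map<String, List<Object>> fields = new LinkedHashMap<>();

        void add(String name, Object value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /** Simplified filter contract; url, datum and inlinks are left out. */
    interface Filter {
        Doc filter(Doc doc, Parse parse);
    }

    /** Hypothetical filter that adds the length of the parsed text as a field. */
    static class TextLengthFilter implements Filter {
        @Override
        public Doc filter(Doc doc, Parse parse) {
            String text = parse.getText();
            doc.add("textLength", text == null ? 0 : text.length());
            return doc;
        }
    }

    /** Applies each filter in order; stops if a filter returns null. */
    static Doc runAll(List<? extends Filter> filters, Doc doc, Parse parse) {
        for (Filter f : filters) {
            doc = f.filter(doc, parse);
            if (doc == null) {
                return null; // assumed convention: null means "do not index"
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Parse parse = () -> "hello nutch";
        Doc out = runAll(List.of(new TextLengthFilter()), new Doc(), parse);
        System.out.println(out.fields);
    }
}
```

A real plugin would implement `IndexingFilter.filter` with the full five-parameter signature listed above rather than this reduced one.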
Uses of Parse in org.apache.nutch.indexer.anchor
Methods in org.apache.nutch.indexer.anchor with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument AnchorIndexingFilter.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
The AnchorIndexingFilter filter object which supports boolean configuration settings for the deduplication of anchors.
Uses of Parse in org.apache.nutch.indexer.basic
Methods in org.apache.nutch.indexer.basic with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument BasicIndexingFilter.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
The BasicIndexingFilter filter object which supports a few configuration settings for adding basic searchable fields.
Uses of Parse in org.apache.nutch.indexer.feed
Methods in org.apache.nutch.indexer.feed with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument FeedIndexingFilter.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
Extracts the relevant fields (FEED_AUTHOR, FEED_TAGS, FEED_PUBLISHED, FEED_UPDATED, FEED) and sends them to the Indexer for indexing within the Nutch index.
Uses of Parse in org.apache.nutch.indexer.metadata
Methods in org.apache.nutch.indexer.metadata with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument MetadataIndexer.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
Uses of Parse in org.apache.nutch.indexer.more
Methods in org.apache.nutch.indexer.more with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument MoreIndexingFilter.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
Uses of Parse in org.apache.nutch.indexer.staticfield
Methods in org.apache.nutch.indexer.staticfield with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument StaticFieldIndexer.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
The StaticFieldIndexer filter object which adds fields as per configuration setting.
Uses of Parse in org.apache.nutch.indexer.subcollection
Methods in org.apache.nutch.indexer.subcollection with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument SubcollectionIndexingFilter.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
Uses of Parse in org.apache.nutch.indexer.tld
Methods in org.apache.nutch.indexer.tld with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument TLDIndexingFilter.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text urlText,
CrawlDatum datum,
Inlinks inlinks)
Uses of Parse in org.apache.nutch.indexer.urlmeta
Methods in org.apache.nutch.indexer.urlmeta with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument URLMetaIndexingFilter.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the CrawlDatum object.
Uses of Parse in org.apache.nutch.microformats.reltag
Methods in org.apache.nutch.microformats.reltag with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument RelTagIndexingFilter.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
Uses of Parse in org.apache.nutch.parse
Classes in org.apache.nutch.parse that implement Parse
Modifier and Type, Class and Description
class ParseImpl
The result of parsing a page's raw content.
Methods in org.apache.nutch.parse that return Parse
Modifier and Type, Method and Description
Parse ParseResult.get(String key)
Retrieve a single parse output.
Parse ParseResult.get(org.apache.hadoop.io.Text key)
Retrieve a single parse output.
Parse ParseStatus.getEmptyParse(org.apache.hadoop.conf.Configuration conf)
A convenience method.
Methods in org.apache.nutch.parse that return types with arguments of type Parse
Modifier and Type, Method and Description
org.apache.hadoop.mapred.RecordWriter ParseOutputFormat.getRecordWriter(org.apache.hadoop.fs.FileSystem fs,
org.apache.hadoop.mapred.JobConf job,
String name,
org.apache.hadoop.util.Progressable progress)
Iterator ParseResult.iterator()
Iterate over all entries in the
Methods in org.apache.nutch.parse with parameters of type Parse
Modifier and Type, Method and Description
static ParseResult ParseResult.createParseResult(String url,
Parse parse)
Convenience method for obtaining ParseResult from a single Parse output.
Constructors in org.apache.nutch.parse with parameters of type Parse
Constructor and Description
ParseImpl(Parse parse)
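As a rough illustration of the `createParseResult` / `get` / `iterator` trio listed above, the sketch below models a parse result as a keyed collection of parse outputs. It is not the Nutch implementation: `Parse` and `Result` are simplified stand-ins, keys are plain `String` values rather than Hadoop `Text`, and the choice of a `LinkedHashMap` as backing store is an assumption made only for this example.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class ParseResultSketch {

    /** Stand-in for Parse: only the text accessor is modeled here. */
    interface Parse {
        String getText();
    }

    /** Stand-in for ParseResult: parse outputs keyed by URL. */
    static class Result implements Iterable<Map.Entry<String, Parse>> {
        private final Map<String, Parse> parses = new LinkedHashMap<>();

        /** Convenience factory wrapping a single parse output. */
        static Result createParseResult(String url, Parse parse) {
            Result result = new Result();
            result.parses.put(url, parse);
            return result;
        }

        /** Retrieves a single parse output, or null if the key is unknown. */
        Parse get(String key) {
            return parses.get(key);
        }

        /** Iterates over all (url, parse) entries. */
        @Override
        public Iterator<Map.Entry<String, Parse>> iterator() {
            return parses.entrySet().iterator();
        }
    }

    public static void main(String[] args) {
        Parse parse = () -> "page body";
        Result result = Result.createParseResult("http://example.com/", parse);
        for (Map.Entry<String, Parse> entry : result) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().getText());
        }
    }
}
```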
Uses of Parse in org.apache.nutch.scoring
Methods in org.apache.nutch.scoring with parameters of type Parse
Modifier and Type, Method and Description
float ScoringFilters.indexerScore(org.apache.hadoop.io.Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
float ScoringFilter.indexerScore(org.apache.hadoop.io.Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
This method calculates a Lucene document boost.
float AbstractScoringFilter.indexerScore(org.apache.hadoop.io.Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
void ScoringFilters.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
void ScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Currently a part of score distribution is performed using only data coming from the parsing process.
void AbstractScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Uses of Parse in org.apache.nutch.scoring.depth
Methods in org.apache.nutch.scoring.depth with parameters of type Parse
Modifier and Type, Method and Description
float DepthScoringFilter.indexerScore(org.apache.hadoop.io.Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
void DepthScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Uses of Parse in org.apache.nutch.scoring.link
Methods in org.apache.nutch.scoring.link with parameters of type Parse
Modifier and Type, Method and Description
float LinkAnalysisScoringFilter.indexerScore(org.apache.hadoop.io.Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
void LinkAnalysisScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Uses of Parse in org.apache.nutch.scoring.opic
Methods in org.apache.nutch.scoring.opic with parameters of type Parse
Modifier and Type, Method and Description
float OPICScoringFilter.indexerScore(org.apache.hadoop.io.Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
Dampen the boost value by scorePower.
void OPICScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
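The "dampen the boost value by scorePower" summary for `OPICScoringFilter.indexerScore` can be read as a power-law adjustment. The sketch below assumes the stored score is raised to the power `scorePower` and then multiplied by the initial score; that formula, the `dampen` helper, and the sample values are illustrative assumptions, not a statement of what the plugin does internally.

```java
public class BoostDampeningSketch {

    /**
     * Assumed dampening rule: dbScore^scorePower * initScore.
     * With scorePower below 1, large scores are compressed toward 1.
     */
    static float dampen(float dbScore, float scorePower, float initScore) {
        return (float) Math.pow(dbScore, scorePower) * initScore;
    }

    public static void main(String[] args) {
        float scorePower = 0.5f; // hypothetical setting
        float[] dbScores = {1.0f, 4.0f, 100.0f};
        for (float dbScore : dbScores) {
            System.out.println(dbScore + " -> " + dampen(dbScore, scorePower, 1.0f));
        }
    }
}
```

With an exponent of 0.5 this is a square root, so a page whose raw score is 100 times higher receives a boost only 10 times higher.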
Uses of Parse in org.apache.nutch.scoring.tld
Methods in org.apache.nutch.scoring.tld with parameters of type Parse
Modifier and Type, Method and Description
float TLDScoringFilter.indexerScore(org.apache.hadoop.io.Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
void TLDScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Uses of Parse in org.apache.nutch.scoring.urlmeta
Methods in org.apache.nutch.scoring.urlmeta with parameters of type Parse
Modifier and Type, Method and Description
float URLMetaScoringFilter.indexerScore(org.apache.hadoop.io.Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
Boilerplate
void URLMetaScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Takes the metadata, which was lumped inside the content, and replicates it within your parse data.
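The replication step described for `URLMetaScoringFilter.passScoreAfterParsing` amounts to copying selected keys from one metadata container to another. The sketch below shows that idea with plain maps standing in for the content metadata and the parse metadata; `copyTags`, the map types, and the sample tag names are assumptions for illustration and not the plugin's actual code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MetadataCopySketch {

    /**
     * Copies each configured tag that is present in the content metadata
     * into the parse metadata; tags without a value are skipped.
     */
    static void copyTags(List<String> tags,
                         Map<String, String> contentMeta,
                         Map<String, String> parseMeta) {
        for (String tag : tags) {
            String value = contentMeta.get(tag);
            if (value != null) {
                parseMeta.put(tag, value);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        contentMeta.put("category", "news");   // hypothetical tag
        contentMeta.put("unrelated", "x");     // not configured, so not copied

        Map<String, String> parseMeta = new HashMap<>();
        copyTags(List.of("category", "region"), contentMeta, parseMeta);
        System.out.println(parseMeta);
    }
}
```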
Uses of Parse in org.creativecommons.nutch
Methods in org.creativecommons.nutch with parameters of type Parse
Modifier and Type, Method and Description
NutchDocument CCIndexingFilter.filter(NutchDocument doc,
Parse parse,
org.apache.hadoop.io.Text url,
CrawlDatum datum,
Inlinks inlinks)
