Uses of Class
org.apache.nutch.protocol.Content
Packages that use Content
Package Description
org.apache.nutch.analysis.lang
Text document language identifier.
org.apache.nutch.crawl
Crawl control code and tools to run the crawler.
org.apache.nutch.microformats.reltag
A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.apache.nutch.parse
The Parse interface and related classes.
org.apache.nutch.parse.ext
Parse wrapper to run an external command to do the parsing.
org.apache.nutch.parse.feed
Parse RSS feeds.
org.apache.nutch.parse.headings
Parse filter to extract headings (h1, h2, etc.) from the DOM parse tree.
org.apache.nutch.parse.html
An HTML document parsing plugin.
org.apache.nutch.parse.js
Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets.
org.apache.nutch.parse.metatags
Parse filter to extract meta tags: keywords, description, etc.
org.apache.nutch.parse.swf
Parse Flash SWF files.
org.apache.nutch.parse.tika
Parse various document formats with the help of Apache Tika.
org.apache.nutch.parse.zip
Parse ZIP files: embedded files are recursively passed to appropriate parsers.
org.apache.nutch.protocol
Classes related to the Protocol interface; see also org.apache.nutch.net.protocols.
org.apache.nutch.protocol.file
Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp
Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.scoring
The ScoringFilter interface.
org.apache.nutch.scoring.depth
Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs).
org.apache.nutch.scoring.link
Scoring filter used in conjunction with WebGraph.
org.apache.nutch.scoring.opic
Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.tld
Top Level Domain Scoring plugin.
org.apache.nutch.scoring.urlmeta
URL Meta Tag Scoring Plugin.
org.apache.nutch.segment
A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
org.apache.nutch.util
Miscellaneous utility classes.
org.creativecommons.nutch
Sample plugins that parse and index Creative Commons metadata.
Uses of Content in org.apache.nutch.analysis.lang
Methods in org.apache.nutch.analysis.lang with parameters of type Content
Modifier and Type Method and Description
ParseResult
HTMLLanguageParser.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Scan the HTML document looking for possible indications of content language.
Uses of Content in org.apache.nutch.crawl
Methods in org.apache.nutch.crawl with parameters of type Content
Modifier and Type Method and Description
byte[]
TextProfileSignature.calculate(Content content,
Parse parse)
abstract byte[]
Signature.calculate(Content content,
Parse parse)
byte[]
MD5Signature.calculate(Content content,
Parse parse)
Uses of Content in org.apache.nutch.microformats.reltag
Methods in org.apache.nutch.microformats.reltag with parameters of type Content
Modifier and Type Method and Description
ParseResult
RelTagParser.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Scan the HTML document looking for possible rel-tags.
Uses of Content in org.apache.nutch.parse
Methods in org.apache.nutch.parse with parameters of type Content
Modifier and Type Method and Description
ParseResult
HtmlParseFilters.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Run all defined filters.
ParseResult
HtmlParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
ParseResult
Parser.getParse(Content c)
This method parses the given content and returns a map of
static boolean
ParseSegment.isTruncated(Content content)
Checks if the page's content is truncated.
void
ParseSegment.map(org.apache.hadoop.io.WritableComparable key,
Content content,
org.apache.hadoop.mapred.OutputCollector
ParseResult
ParseUtil.parse(Content content)
Performs a parse by iterating through a List of preferred Parsers until a successful parse is performed and a Parse object is returned.
ParseResult
ParseUtil.parseByExtensionId(String extId,
Content content)
Method parses a Content object using the Parser specified by the parameter extId, i.e., the Parser's extension ID.
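The fallback behaviour described for ParseUtil.parse(Content) can be pictured with a small self-contained sketch. It does not use any Nutch classes: SimpleParser, ParserChain and the String-based content are hypothetical stand-ins, and the real ParseUtil may differ in how it selects parsers and reports failures.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a parser plugin; this is NOT the Nutch Parser interface.
interface SimpleParser {
    // Returns the parsed text, or an empty Optional if this parser cannot handle the input.
    Optional<String> tryParse(String rawContent);
}

// Hypothetical helper that tries parsers in preference order.
final class ParserChain {
    private final List<SimpleParser> preferred;

    ParserChain(List<SimpleParser> preferred) {
        this.preferred = preferred;
    }

    // Returns the result of the first parser that succeeds, or empty if none does.
    Optional<String> parse(String rawContent) {
        for (SimpleParser parser : preferred) {
            Optional<String> result = parser.tryParse(rawContent);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}

class ParserChainDemo {
    public static void main(String[] args) {
        // Succeeds only for input that starts with an <html> tag.
        SimpleParser htmlOnly = raw -> raw.startsWith("<html>")
                ? Optional.of("html:" + raw.length())
                : Optional.empty();
        // Accepts anything and returns the trimmed text.
        SimpleParser plainText = raw -> Optional.of("text:" + raw.trim());

        ParserChain chain = new ParserChain(List.of(htmlOnly, plainText));
        System.out.println(chain.parse("  hello  ").orElse("unparsed"));
    }
}
```

In this sketch the first parser declines the input, so the chain falls through to the second one. Ordering the list from most specific to most general keeps the generic parser as a last resort.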
Uses of Content in org.apache.nutch.parse.ext
Methods in org.apache.nutch.parse.ext with parameters of type Content
Modifier and Type Method and Description
ParseResult
ExtParser.getParse(Content content)
Uses of Content in org.apache.nutch.parse.feed
Methods in org.apache.nutch.parse.feed with parameters of type Content
Modifier and Type Method and Description
ParseResult
FeedParser.getParse(Content content)
Parses the given feed, then extracts and parses all linked items within the feed, using the underlying ROME feed parsing library.
Uses of Content in org.apache.nutch.parse.headings
Methods in org.apache.nutch.parse.headings with parameters of type Content
Modifier and Type Method and Description
ParseResult
HeadingsParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Uses of Content in org.apache.nutch.parse.html
Methods in org.apache.nutch.parse.html with parameters of type Content
Modifier and Type Method and Description
ParseResult
HtmlParser.getParse(Content content)
Uses of Content in org.apache.nutch.parse.js
Methods in org.apache.nutch.parse.js with parameters of type Content
Modifier and Type Method and Description
ParseResult
JSParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
ParseResult
JSParseFilter.getParse(Content c)
Uses of Content in org.apache.nutch.parse.metatags
Methods in org.apache.nutch.parse.metatags with parameters of type Content
Modifier and Type Method and Description
ParseResult
MetaTagsParser.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Uses of Content in org.apache.nutch.parse.swf
Methods in org.apache.nutch.parse.swf with parameters of type Content
Modifier and Type Method and Description
ParseResult
SWFParser.getParse(Content content)
Uses of Content in org.apache.nutch.parse.tika
Methods in org.apache.nutch.parse.tika with parameters of type Content
Modifier and Type Method and Description
ParseResult
TikaParser.getParse(Content content)
Uses of Content in org.apache.nutch.parse.zip
Methods in org.apache.nutch.parse.zip with parameters of type Content
Modifier and Type Method and Description
ParseResult
ZipParser.getParse(Content content)
Uses of Content in org.apache.nutch.protocol
Methods in org.apache.nutch.protocol that return Content
Modifier and Type Method and Description
Content
ProtocolOutput.getContent()
static Content
Content.read(DataInput in)
Methods in org.apache.nutch.protocol with parameters of type Content
Modifier and Type Method and Description
void
ProtocolOutput.setContent(Content content)
Constructors in org.apache.nutch.protocol with parameters of type Content
Constructor and Description
ProtocolOutput(Content content)
ProtocolOutput(Content content,
ProtocolStatus status)
Uses of Content in org.apache.nutch.protocol.file
Methods in org.apache.nutch.protocol.file that return Content
Modifier and Type Method and Description
Content
FileResponse.toContent()
Uses of Content in org.apache.nutch.protocol.ftp
Methods in org.apache.nutch.protocol.ftp that return Content
Modifier and Type Method and Description
Content
FtpResponse.toContent()
Uses of Content in org.apache.nutch.scoring
Methods in org.apache.nutch.scoring with parameters of type Content
Modifier and Type Method and Description
void
ScoringFilters.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
void
ScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Currently a part of score distribution is performed using only data coming from the parsing process.
void
AbstractScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
void
ScoringFilters.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
void
ScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata.
void
AbstractScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
Uses of Content in org.apache.nutch.scoring.depth
Methods in org.apache.nutch.scoring.depth with parameters of type Content
Modifier and Type Method and Description
void
DepthScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
void
DepthScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
Uses of Content in org.apache.nutch.scoring.link
Methods in org.apache.nutch.scoring.link with parameters of type Content
Modifier and Type Method and Description
void
LinkAnalysisScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
void
LinkAnalysisScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
Uses of Content in org.apache.nutch.scoring.opic
Methods in org.apache.nutch.scoring.opic with parameters of type Content
Modifier and Type Method and Description
void
OPICScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
void
OPICScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
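The two OPICScoringFilter hooks above form a hand-off: the score is written into the content metadata before parsing and copied into the parse data afterwards. The following sketch illustrates that flow with plain maps. It is an assumption-laden illustration, not the Nutch implementation: ScorePassing, the map-based metadata and the key value "example.score" are all hypothetical (the actual value of Fetcher.SCORE_KEY is not shown on this page).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of passing a score through metadata; not Nutch code.
final class ScorePassing {
    // Hypothetical key; the real value of Fetcher.SCORE_KEY is not given here.
    static final String SCORE_KEY = "example.score";

    // Before parsing: record the datum's score in the content metadata.
    static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    // After parsing: copy the recorded score, if present, into the parse metadata.
    static void passScoreAfterParsing(Map<String, String> contentMeta,
                                      Map<String, String> parseMeta) {
        String score = contentMeta.get(SCORE_KEY);
        if (score != null) {
            parseMeta.put(SCORE_KEY, score);
        }
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();

        passScoreBeforeParsing(1.5f, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);

        System.out.println(parseMeta.get(SCORE_KEY));
    }
}
```

Storing the score as a string mirrors the idea that metadata is a text-based key/value store, which is itself an assumption of this sketch.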
Uses of Content in org.apache.nutch.scoring.tld
Methods in org.apache.nutch.scoring.tld with parameters of type Content
Modifier and Type Method and Description
void
TLDScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
void
TLDScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
Uses of Content in org.apache.nutch.scoring.urlmeta
Methods in org.apache.nutch.scoring.urlmeta with parameters of type Content
Modifier and Type Method and Description
void
URLMetaScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Takes the metadata, which was lumped inside the content, and replicates it within your parse data.
void
URLMetaScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
Takes the metadata, specified in your "urlmeta.tags" property, from the datum object and injects it into the content.
Uses of Content in org.apache.nutch.segment
Methods in org.apache.nutch.segment with parameters of type Content
Modifier and Type Method and Description
boolean
SegmentMergeFilters.filter(org.apache.hadoop.io.Text key,
CrawlDatum generateData,
CrawlDatum fetchData,
CrawlDatum sigData,
Content content,
ParseData parseData,
ParseText parseText,
Collection
Iterates over all SegmentMergeFilter extensions and if any of them returns false, it will return false as well.
boolean
SegmentMergeFilter.filter(org.apache.hadoop.io.Text key,
CrawlDatum generateData,
CrawlDatum fetchData,
CrawlDatum sigData,
Content content,
ParseData parseData,
ParseText parseText,
Collection
The filtering method which gets all information being merged for a given key (URL).
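The "any filter returning false rejects the entry" rule described for SegmentMergeFilters.filter can be sketched as a short-circuiting loop. The example below is a simplified illustration under stated assumptions: MergeFilterChain is a hypothetical class, and each filter sees only the key (URL) as a String rather than the full set of segment data that the real method receives.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of an all-must-accept filter chain; not Nutch code.
final class MergeFilterChain {
    private final List<Predicate<String>> filters;

    MergeFilterChain(List<Predicate<String>> filters) {
        this.filters = filters;
    }

    // Returns false as soon as any filter rejects the key; true if all accept it.
    boolean filter(String key) {
        for (Predicate<String> f : filters) {
            if (!f.test(key)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Predicate<String> httpOnly = key -> key.startsWith("http");
        Predicate<String> notTooLong = key -> key.length() < 100;
        MergeFilterChain chain = new MergeFilterChain(List.of(httpOnly, notTooLong));

        System.out.println(chain.filter("http://example.org/"));
        System.out.println(chain.filter("ftp://example.org/"));
    }
}
```

With an empty filter list this sketch accepts every key, which is one reasonable default. Whether the real class behaves the same way is not stated on this page.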
Uses of Content in org.apache.nutch.util
Methods in org.apache.nutch.util with parameters of type Content
Modifier and Type Method and Description
void
EncodingDetector.autoDetectClues(Content content,
boolean filter)
String
EncodingDetector.guessEncoding(Content content,
String defaultValue)
Guess the encoding with the previously specified list of clues.
Uses of Content in org.creativecommons.nutch
Methods in org.creativecommons.nutch with parameters of type Content
Modifier and Type Method and Description
ParseResult
CCParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Adds metadata or otherwise modifies a parse of an HTML document, given the DOM tree of a page.