apache-nutch 1.9 API

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Core
Package	Description
org.apache.nutch.crawl	Crawl control code and tools to run the crawler.
org.apache.nutch.fetcher	The Nutch robot.
org.apache.nutch.indexer	Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index.
org.apache.nutch.metadata	A Multi-valued Metadata container, and set of constant fields for Nutch Metadata.
org.apache.nutch.net	Web-related interfaces: URL `filters` and `normalizers`.
org.apache.nutch.net.protocols	Helper classes related to the `Protocol` interface, see also `org.apache.nutch.protocol`.
org.apache.nutch.parse	The `Parse` interface and related classes.
org.apache.nutch.plugin	The Nutch `Plugin` System.
org.apache.nutch.protocol	Classes related to the `Protocol` interface, see also `org.apache.nutch.net.protocols`.
org.apache.nutch.scoring	The `ScoringFilter` interface.
org.apache.nutch.scoring.webgraph	Scoring implementation based on link analysis (`LinkRank`), see `WebGraph`.
org.apache.nutch.segment	A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
org.apache.nutch.tools	Miscellaneous tools.
org.apache.nutch.tools.arc	Tools to read the Arc file format.
org.apache.nutch.util	Miscellaneous utility classes.
org.apache.nutch.util.domain	Classes for domain name analysis.

Plugins API
Package	Description
org.apache.nutch.protocol.http.api	Common API used by HTTP plugins (`http`, `httpclient`).
org.apache.nutch.urlfilter.api	Generic `URL filter` library, abstracting away from regular expression implementations.

Protocol Plugins
Package	Description
org.apache.nutch.protocol.file	Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp	Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.http	Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.httpclient	Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web servers as well as proxy servers.

URL Filter Plugins
Package	Description
org.apache.nutch.urlfilter.automaton	URL filter plugin based on dk.brics.automaton Finite-State Automata for Java™.
org.apache.nutch.urlfilter.domain	URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.domainblacklist	URL filter plugin to exclude URLs by domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.prefix	URL filter plugin to include only URLs which match one of a given list of URL prefixes.
org.apache.nutch.urlfilter.regex	URL filter plugin to include and/or exclude URLs matching Java regular expressions.
org.apache.nutch.urlfilter.suffix	URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes.
org.apache.nutch.urlfilter.validator	URL filter plugin that validates given URLs.
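
The include/exclude behaviour described for the prefix and suffix filters above can be illustrated with a minimal, self-contained sketch. It does not use the Nutch plugin classes: `SimpleFilter`, `prefixFilter`, `suffixExcludeFilter`, and `applyAll` are hypothetical names, and the convention that a filter returns the URL to accept it and `null` to reject it is an assumption of this sketch. The suffix check here also looks at the whole URL string rather than only the path.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: not the Nutch plugin implementation.
class UrlFilterSketch {

    // Hypothetical stand-in for a URL filter: return the URL to keep it, null to drop it.
    interface SimpleFilter {
        String filter(String url);
    }

    // Keep only URLs that start with one of the given prefixes.
    static SimpleFilter prefixFilter(List<String> prefixes) {
        return url -> prefixes.stream().anyMatch(url::startsWith) ? url : null;
    }

    // Drop URLs that end with one of the given suffixes.
    static SimpleFilter suffixExcludeFilter(List<String> suffixes) {
        return url -> suffixes.stream().anyMatch(url::endsWith) ? null : url;
    }

    // Run the filters in order; the first rejection (null) ends the chain.
    static String applyAll(List<SimpleFilter> filters, String url) {
        for (SimpleFilter f : filters) {
            if (url == null) {
                return null;
            }
            url = f.filter(url);
        }
        return url;
    }

    public static void main(String[] args) {
        List<SimpleFilter> filters = Arrays.asList(
                prefixFilter(Arrays.asList("http://example.com/")),
                suffixExcludeFilter(Arrays.asList(".jpg", ".css")));

        System.out.println(applyAll(filters, "http://example.com/page.html")); // kept
        System.out.println(applyAll(filters, "http://example.com/logo.jpg"));  // null (suffix excluded)
        System.out.println(applyAll(filters, "http://other.org/"));            // null (prefix not matched)
    }
}
```

Chaining several small filters like this mirrors the way multiple URL filter plugins can be active at once, each one only able to narrow the set of accepted URLs.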

URL Normalizer Plugins
Package	Description
org.apache.nutch.net.urlnormalizer.basic	URL normalizer performing basic normalizations: remove default ports and dot segments in path.
org.apache.nutch.net.urlnormalizer.host	URL normalizer renaming hosts to a canonical form listed in the configuration file.
org.apache.nutch.net.urlnormalizer.pass	Dummy URL normalizer which does not change URLs.
org.apache.nutch.net.urlnormalizer.querystring	URL normalizer which sorts the elements in the query part to avoid duplicates by permutations.
org.apache.nutch.net.urlnormalizer.regex	URL normalizer with configurable rules based on regular expressions (`Pattern`).
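
As a rough illustration of the normalizations listed above (default port removal, dot-segment removal, and query sorting), here is a small sketch built only on `java.net.URI` from the JDK. It is not the code of the Nutch plugins; the class and method names (`NormalizerSketch`, `basic`, `sortQuery`) are hypothetical, and only the default ports for `http` and `https` are handled. The query sort assumes the URL carries no fragment.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;

// Illustrative only: not the Nutch plugin implementation.
class NormalizerSketch {

    // Remove "." and ".." path segments and drop the scheme's default port.
    static String basic(String url) {
        URI uri = URI.create(url).normalize(); // normalize() resolves dot segments
        String scheme = uri.getScheme();
        int port = uri.getPort();
        boolean isDefaultPort = ("http".equalsIgnoreCase(scheme) && port == 80)
                || ("https".equalsIgnoreCase(scheme) && port == 443);
        if (isDefaultPort) {
            port = -1; // -1 means "no explicit port" for java.net.URI
        }
        try {
            return new URI(scheme, uri.getUserInfo(), uri.getHost(), port,
                    uri.getPath(), uri.getQuery(), uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Sort the "&"-separated query elements so that permutations of the same
    // parameters map to one URL. Assumes there is no "#fragment" part.
    static String sortQuery(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url;
        }
        String[] params = url.substring(q + 1).split("&");
        Arrays.sort(params);
        return url.substring(0, q + 1) + String.join("&", params);
    }

    public static void main(String[] args) {
        System.out.println(basic("http://example.com:80/a/./b/../c?x=1"));
        // prints http://example.com/a/c?x=1
        System.out.println(sortQuery("http://example.com/s?b=2&a=1"));
        // prints http://example.com/s?a=1&b=2
    }
}
```

Both steps serve the same purpose: different spellings of one resource collapse to a single string, so the crawler does not treat them as distinct URLs.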

Scoring Plugins
Package	Description
org.apache.nutch.scoring.depth	Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs).
org.apache.nutch.scoring.link	Scoring filter used in conjunction with `WebGraph`.
org.apache.nutch.scoring.opic	Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.tld	Top Level Domain Scoring plugin.
org.apache.nutch.scoring.urlmeta	URL Meta Tag Scoring Plugin.
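
The idea of stopping at a configurable depth, as described for the depth scoring filter above, can be sketched as a breadth-first traversal that records the number of hops from the seed URLs. This is a toy model over an in-memory link map, not the plugin's code: `DepthLimitSketch`, `crawl`, and the convention that seeds sit at depth 0 are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not the Nutch plugin implementation.
class DepthLimitSketch {

    // Returns every URL reachable within maxDepth hops of a seed,
    // mapped to its hop count (seeds are at depth 0 in this sketch).
    static Map<String, Integer> crawl(Map<String, List<String>> links,
                                      List<String> seeds, int maxDepth) {
        Map<String, Integer> depth = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        for (String seed : seeds) {
            if (depth.putIfAbsent(seed, 0) == null) {
                queue.add(seed);
            }
        }
        while (!queue.isEmpty()) {
            String url = queue.poll();
            int d = depth.get(url);
            if (d >= maxDepth) {
                continue; // outlinks of this page would exceed the limit
            }
            for (String out : links.getOrDefault(url, Collections.<String>emptyList())) {
                if (depth.putIfAbsent(out, d + 1) == null) {
                    queue.add(out);
                }
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b"));
        links.put("b", Arrays.asList("c"));
        links.put("c", Arrays.asList("d"));

        System.out.println(crawl(links, Arrays.asList("a"), 2));
        // prints {a=0, b=1, c=2}; "d" is three hops away and is never queued
    }
}
```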

Parse Plugins
Package	Description
org.apache.nutch.parse.ext	Parse wrapper to run an external command to do the parsing.
org.apache.nutch.parse.feed	Parse RSS feeds.
org.apache.nutch.parse.html	An HTML document parsing plugin.
org.apache.nutch.parse.js	Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets.
org.apache.nutch.parse.swf	Parse Flash SWF files.
org.apache.nutch.parse.tika	Parse various document formats with help of Apache Tika.
org.apache.nutch.parse.zip	Parse ZIP files: embedded files are recursively passed to appropriate parsers.

Parse Filter Plugins
Package	Description
org.apache.nutch.parse.headings	Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
org.apache.nutch.parse.metatags	Parse filter to extract meta tags: keywords, description, etc.

Indexing Filter Plugins
Package	Description
org.apache.nutch.indexer.anchor	An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.basic	A basic indexing plugin, adds basic fields: url, host, title, content, etc.
org.apache.nutch.indexer.feed	Indexing filter to index meta data from RSS feeds.
org.apache.nutch.indexer.metadata	Indexing filter to add document metadata to the index.
org.apache.nutch.indexer.more	An indexing plugin that adds "more" index fields: last modified date, MIME type, content length.
org.apache.nutch.indexer.staticfield	A simple plugin called at indexing that adds fields with static data.
org.apache.nutch.indexer.subcollection	Indexing filter to assign documents to subcollections.
org.apache.nutch.indexer.tld	Top Level Domain Indexing plugin.
org.apache.nutch.indexer.urlmeta	URL Meta Tag Indexing Plugin.

Indexer Plugins
Package	Description
org.apache.nutch.indexwriter.dummy	Index writer plugin for debugging that writes pairs of <action, url> to a text file; action is one of "add", "update", or "delete".
org.apache.nutch.indexwriter.elastic	Index writer plugin for Elasticsearch.
org.apache.nutch.indexwriter.solr	Index writer plugin for Apache Solr.

Misc. Plugins
Package	Description
org.apache.nutch.analysis.lang	Text document language identifier.
org.apache.nutch.collection	Subcollection is a subset of an index.
org.apache.nutch.microformats.reltag	A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.creativecommons.nutch	Sample plugins that parse and index Creative Commons metadata.

Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.
