apache-nutch 1.9 API
Apache Nutch is a highly extensible and scalable open source web crawler software project.
Package | Description |
---|---|
org.apache.nutch.crawl | Crawl control code and tools to run the crawler. |
org.apache.nutch.fetcher | The Nutch robot. |
org.apache.nutch.indexer | Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index. |
org.apache.nutch.metadata | A Multi-valued Metadata container, and set of constant fields for Nutch Metadata. |
org.apache.nutch.net | Web-related interfaces: URL filters and normalizers. |
org.apache.nutch.net.protocols | Helper classes related to the Protocol interface, see also org.apache.nutch.protocol. |
org.apache.nutch.parse | The Parse interface and related classes. |
org.apache.nutch.plugin | The Nutch Plugin System. |
org.apache.nutch.protocol | Classes related to the Protocol interface, see also org.apache.nutch.net.protocols. |
org.apache.nutch.scoring | The ScoringFilter interface. |
org.apache.nutch.scoring.webgraph | Scoring implementation based on link analysis (LinkRank), see WebGraph. |
org.apache.nutch.segment | A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links. |
org.apache.nutch.tools | Miscellaneous tools. |
org.apache.nutch.tools.arc | Tools to read the Arc file format. |
org.apache.nutch.util | Miscellaneous utility classes. |
org.apache.nutch.util.domain | Classes for domain name analysis. |
Package | Description |
---|---|
org.apache.nutch.protocol.http.api | Common API used by HTTP plugins (http, httpclient). |
org.apache.nutch.urlfilter.api | Generic URL filter library, abstracting away from regular expression implementations. |
Package | Description |
---|---|
org.apache.nutch.protocol.file | Protocol plugin which supports retrieving local file resources. |
org.apache.nutch.protocol.ftp | Protocol plugin which supports retrieving documents via the ftp protocol. |
org.apache.nutch.protocol.http | Protocol plugin which supports retrieving documents via the http protocol. |
org.apache.nutch.protocol.httpclient | Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. |
Package | Description |
---|---|
org.apache.nutch.urlfilter.automaton | URL filter plugin based on dk.brics.automaton Finite-State Automata for Java. |
org.apache.nutch.urlfilter.domain | URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names. |
org.apache.nutch.urlfilter.domainblacklist | URL filter plugin to exclude URLs by domain suffixes, domain names, and/or host names. |
org.apache.nutch.urlfilter.prefix | URL filter plugin to include only URLs which match one of a given list of URL prefixes. |
org.apache.nutch.urlfilter.regex | URL filter plugin to include and/or exclude URLs matching Java regular expressions. |
org.apache.nutch.urlfilter.suffix | URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes. |
org.apache.nutch.urlfilter.validator | URL filter plugin that validates given URLs. |
Package | Description |
---|---|
org.apache.nutch.net.urlnormalizer.basic | URL normalizer performing basic normalizations: remove default ports and dot segments in path. |
org.apache.nutch.net.urlnormalizer.host | URL normalizer renaming hosts to a canonical form listed in the configuration file. |
org.apache.nutch.net.urlnormalizer.pass | URL normalizer dummy which does not change URLs. |
org.apache.nutch.net.urlnormalizer.querystring | URL normalizer which sorts the elements in the query part to avoid duplicates by permutations. |
org.apache.nutch.net.urlnormalizer.regex | URL normalizer with configurable rules based on regular expressions (Pattern). |
Package | Description |
---|---|
org.apache.nutch.scoring.depth | Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs). |
org.apache.nutch.scoring.link | Scoring filter used in conjunction with WebGraph. |
org.apache.nutch.scoring.opic | Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm. |
org.apache.nutch.scoring.tld | Top Level Domain Scoring plugin. |
org.apache.nutch.scoring.urlmeta | URL Meta Tag Scoring Plugin |
Package | Description |
---|---|
org.apache.nutch.parse.ext | Parse wrapper to run external command to do the parsing. |
org.apache.nutch.parse.feed | Parse RSS feeds. |
org.apache.nutch.parse.html | An HTML document parsing plugin. |
org.apache.nutch.parse.js | Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets. |
org.apache.nutch.parse.swf | Parse Flash SWF files. |
org.apache.nutch.parse.tika | Parse various document formats with help of Apache Tika. |
org.apache.nutch.parse.zip | Parse ZIP files: embedded files are recursively passed to appropriate parsers. |
Package | Description |
---|---|
org.apache.nutch.parse.headings | Parse filter to extract headings (h1, h2, etc.) from DOM parse tree. |
org.apache.nutch.parse.metatags | Parse filter to extract meta tags: keywords, description, etc. |
Package | Description |
---|---|
org.apache.nutch.indexer.anchor | An indexing plugin for inbound anchor text. |
org.apache.nutch.indexer.basic | A basic indexing plugin, adds basic fields: url, host, title, content, etc. |
org.apache.nutch.indexer.feed | Indexing filter to index meta data from RSS feeds. |
org.apache.nutch.indexer.metadata | Indexing filter to add document metadata to the index. |
org.apache.nutch.indexer.more | An indexing plugin that adds "more" index fields: last modified date, MIME type, content length. |
org.apache.nutch.indexer.staticfield | A simple plugin called at indexing that adds fields with static data. |
org.apache.nutch.indexer.subcollection | Indexing filter to assign documents to subcollections. |
org.apache.nutch.indexer.tld | Top Level Domain Indexing plugin. |
org.apache.nutch.indexer.urlmeta | URL Meta Tag Indexing Plugin |
Package | Description |
---|---|
org.apache.nutch.indexwriter.dummy | Index writer plugin for debugging, writes pairs of <action, url> to a text file, action is one of "add", "update", or "delete". |
org.apache.nutch.indexwriter.elastic | Index writer plugin for Elasticsearch. |
org.apache.nutch.indexwriter.solr | Index writer plugin for Apache Solr. |
Package | Description |
---|---|
org.apache.nutch.analysis.lang | Text document language identifier. |
org.apache.nutch.collection | Subcollection is a subset of an index. |
org.apache.nutch.microformats.reltag | A microformats Rel-Tag Parser/Indexer/Querier plugin. |
org.creativecommons.nutch | Sample plugins that parse and index Creative Commons metadata. |
Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.