apache-nutch 1.9 API

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Core
Package Description
org.apache.nutch.crawl Crawl control code and tools to run the crawler.
org.apache.nutch.fetcher The Nutch robot.
org.apache.nutch.indexer Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index.
org.apache.nutch.metadata A Multi-valued Metadata container, and set of constant fields for Nutch Metadata.
org.apache.nutch.net Web-related interfaces: URL filters and normalizers.
org.apache.nutch.net.protocols Helper classes related to the Protocol interface, see also org.apache.nutch.protocol.
org.apache.nutch.parse The Parse interface and related classes.
org.apache.nutch.plugin The Nutch Plugin System.
org.apache.nutch.protocol Classes related to the Protocol interface, see also org.apache.nutch.net.protocols.
org.apache.nutch.scoring The ScoringFilter interface.
org.apache.nutch.scoring.webgraph Scoring implementation based on link analysis (LinkRank), see WebGraph.
org.apache.nutch.segment A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
org.apache.nutch.tools Miscellaneous tools.
org.apache.nutch.tools.arc Tools to read the Arc file format.
org.apache.nutch.util Miscellaneous utility classes.
org.apache.nutch.util.domain Classes for domain name analysis.
Plugins API
Package Description
org.apache.nutch.protocol.http.api Common API used by HTTP plugins (http, httpclient).
org.apache.nutch.urlfilter.api Generic URL filter library, abstracting away from regular expression implementations.
Protocol Plugins
Package Description
org.apache.nutch.protocol.file Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.http Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.httpclient Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
URL Filter Plugins
Package Description
org.apache.nutch.urlfilter.automaton URL filter plugin based on dk.brics.automaton Finite-State Automata for Java™.
org.apache.nutch.urlfilter.domain URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.domainblacklist URL filter plugin to exclude URLs by domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.prefix URL filter plugin to include only URLs which match one of a given list of URL prefixes.
org.apache.nutch.urlfilter.regex URL filter plugin to include and/or exclude URLs matching Java regular expressions.
org.apache.nutch.urlfilter.suffix URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes.
org.apache.nutch.urlfilter.validator URL filter plugin that validates given URLs.
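
To illustrate the idea behind the prefix filter listed above (include only URLs which match one of a given list of URL prefixes), here is a minimal, stand-alone sketch. It does not use or reproduce the Nutch plugin code: the class name PrefixFilterSketch and its filter method are hypothetical, and the convention of returning the URL to accept it and null to reject it is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the org.apache.nutch.urlfilter.prefix implementation.
class PrefixFilterSketch {
    private final List<String> prefixes;

    PrefixFilterSketch(String... prefixes) {
        this.prefixes = Arrays.asList(prefixes);
    }

    /** Returns the URL unchanged if it starts with an allowed prefix, otherwise null. */
    String filter(String url) {
        if (url == null) {
            return null;
        }
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return url; // accepted
            }
        }
        return null; // rejected: no prefix matched
    }
}
```

With the prefixes "http://example.org/docs/" and "https://example.org/", a call such as filter("http://example.org/docs/a.html") returns the URL, while a URL on any other host yields null. A linear scan is enough for a short prefix list; a trie would be one possible choice for very long lists.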
URL Normalizer Plugins
Package Description
org.apache.nutch.net.urlnormalizer.basic URL normalizer performing basic normalizations: remove default ports and dot segments in path.
org.apache.nutch.net.urlnormalizer.host URL normalizer renaming hosts to a canonical form listed in the configuration file.
org.apache.nutch.net.urlnormalizer.pass URL normalizer dummy which does not change URLs.
org.apache.nutch.net.urlnormalizer.querystring URL normalizer which sorts the elements in the query part to avoid duplicates caused by permutations.
org.apache.nutch.net.urlnormalizer.regex URL normalizer with configurable rules based on regular expressions (Pattern).
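
The query string normalization described above (sorting the elements of the query part so that permutations of the same parameters map to one URL) can be sketched as follows. This is a simplified stand-alone example, not the Nutch plugin's code: the class QuerySortSketch and its normalize method are hypothetical, and the sketch assumes that elements are separated by "&" and compared as plain strings.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the org.apache.nutch.net.urlnormalizer.querystring implementation.
class QuerySortSketch {

    /** Returns the URL with its "&"-separated query elements in sorted order. */
    static String normalize(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query part
        }
        // Keep a trailing fragment ("#...") out of the sort.
        int h = url.indexOf('#', q);
        String fragment = h < 0 ? "" : url.substring(h);
        String query = h < 0 ? url.substring(q + 1) : url.substring(q + 1, h);
        if (query.isEmpty()) {
            return url; // nothing to sort
        }
        String[] parts = query.split("&");
        Arrays.sort(parts); // plain lexicographic order (assumption of this sketch)
        return url.substring(0, q + 1) + String.join("&", parts) + fragment;
    }
}
```

Under these assumptions, "http://example.com/?b=2&a=1" and "http://example.com/?a=1&b=2" both normalize to the second form, so a crawler would treat them as the same URL.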
Scoring Plugins
Package Description
org.apache.nutch.scoring.depth Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs).
org.apache.nutch.scoring.link Scoring filter used in conjunction with WebGraph.
org.apache.nutch.scoring.opic Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.tld Top Level Domain Scoring plugin.
org.apache.nutch.scoring.urlmeta URL Meta Tag Scoring Plugin.
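
As a rough illustration of the depth limit mentioned above (stop crawling at a configurable number of "hops" from seed URLs), the following stand-alone sketch walks a small in-memory link graph. It is not how the Nutch scoring filter is implemented: the class DepthLimitSketch, the reachable method, and the choice to count seeds as 0 hops are all assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the org.apache.nutch.scoring.depth implementation.
class DepthLimitSketch {

    /** Returns all URLs reachable from the seeds within maxHops link hops. */
    static Set<String> reachable(Map<String, List<String>> links,
                                 List<String> seeds, int maxHops) {
        Map<String, Integer> hops = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        for (String seed : seeds) {
            if (hops.putIfAbsent(seed, 0) == null) {
                queue.add(seed); // seeds count as 0 hops (assumption)
            }
        }
        while (!queue.isEmpty()) {
            String url = queue.remove();
            int h = hops.get(url);
            if (h >= maxHops) {
                continue; // outlinks of this page would exceed the limit
            }
            for (String out : links.getOrDefault(url, Collections.<String>emptyList())) {
                if (hops.putIfAbsent(out, h + 1) == null) {
                    queue.add(out); // first visit is the shortest path in a breadth-first walk
                }
            }
        }
        return new LinkedHashSet<>(hops.keySet());
    }
}
```

For a chain a -> b -> c -> d with seed a and a limit of 2 hops, the sketch returns a, b, and c, and never follows the link to d.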
Parse Plugins
Package Description
org.apache.nutch.parse.ext Parse wrapper to run external command to do the parsing.
org.apache.nutch.parse.feed Parse RSS feeds.
org.apache.nutch.parse.html An HTML document parsing plugin.
org.apache.nutch.parse.js Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets.
org.apache.nutch.parse.swf Parse Flash SWF files.
org.apache.nutch.parse.tika Parse various document formats with help of Apache Tika.
org.apache.nutch.parse.zip Parse ZIP files: embedded files are recursively passed to appropriate parsers.
Parse Filter Plugins
Package Description
org.apache.nutch.parse.headings Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
org.apache.nutch.parse.metatags Parse filter to extract meta tags: keywords, description, etc.
Indexing Filter Plugins
Package Description
org.apache.nutch.indexer.anchor An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.basic A basic indexing plugin that adds basic fields: url, host, title, content, etc.
org.apache.nutch.indexer.feed Indexing filter to index meta data from RSS feeds.
org.apache.nutch.indexer.metadata Indexing filter to add document metadata to the index.
org.apache.nutch.indexer.more An indexing plugin that adds "more" index fields: last modified date, MIME type, content length.
org.apache.nutch.indexer.staticfield A simple plugin called at indexing that adds fields with static data.
org.apache.nutch.indexer.subcollection Indexing filter to assign documents to subcollections.
org.apache.nutch.indexer.tld Top Level Domain Indexing plugin.
org.apache.nutch.indexer.urlmeta URL Meta Tag Indexing Plugin.
Indexer Plugins
Package Description
org.apache.nutch.indexwriter.dummy Index writer plugin for debugging. Writes pairs of <action, url> to a text file, where action is one of "add", "update", or "delete".
org.apache.nutch.indexwriter.elastic Index writer plugin for Elasticsearch.
org.apache.nutch.indexwriter.solr Index writer plugin for Apache Solr.
Misc. Plugins
Package Description
org.apache.nutch.analysis.lang Text document language identifier.
org.apache.nutch.collection Subcollection is a subset of an index.
org.apache.nutch.microformats.reltag A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.creativecommons.nutch Sample plugins that parse and index Creative Commons metadata.

Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.

Copyright © 2014 The Apache Software Foundation