org.apache.nutch.parse
Class ParserFactory
- java.lang.Object
- org.apache.nutch.parse.ParserFactory
public final class ParserFactory extends Object
Creates and caches Parser plugins.
Field Summary
Fields
- static String DEFAULT_PLUGIN: Wildcard for default plugins.
- static org.slf4j.Logger LOG
Constructor Summary
Constructors
- ParserFactory(org.apache.hadoop.conf.Configuration conf)
Method Summary
Methods
- protected List<Extension> getExtensions(String contentType): Finds the best-suited parse plugin for a given contentType.
- Parser getParserById(String id): Returns a Parser instance with the specified extension ID.
- Parser[] getParsers(String contentType, String url): Returns an array of Parsers for a given content type.
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
-
DEFAULT_PLUGIN
public static final String DEFAULT_PLUGIN
Wildcard for default plugins.
- See Also:
- [Constant Field Values](../../../../constant-values.html#org.apache.nutch.parse.ParserFactory.DEFAULT_PLUGIN)
Constructor Detail
-
ParserFactory
public ParserFactory(org.apache.hadoop.conf.Configuration conf)
Method Detail
-
getParsers
public Parser[] getParsers(String contentType, String url) throws ParserNotFound
Returns an array of Parsers for a given content type. The function consults the ParserFactory's internal list of parse plugins to determine the list of pluginIds, then gets the appropriate extension points to instantiate as Parsers.
- Parameters:
- <code>contentType</code> - The contentType to return the <code>Array</code> of [<code>Parser</code>](../../../../org/apache/nutch/parse/Parser.html)s for.
- <code>url</code> - The url for the content that may allow us to get the type from the file suffix.
- Returns:
- An <code>Array</code> of [<code>Parser</code>](../../../../org/apache/nutch/parse/Parser.html)s for the given contentType. Plugins that are mapped to a contentType in the <code>parse-plugins.xml</code> file but never enabled via the <code>plugin.includes</code> Nutch conf are skipped and are not part of this array. For example, if the ordered list of parsing plugins for <code>text/plain</code> was <code>[parse-text, parse-html, parse-rtf]</code>, and only <code>parse-html</code> and <code>parse-rtf</code> were enabled via <code>plugin.includes</code>, then this ordered Array would consist of two [<code>Parser</code>](../../../../org/apache/nutch/parse/Parser.html) instances, <code>[parse-html, parse-rtf]</code>.
- Throws:
- <code>ParserNotFound</code>
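
The skipping rule described for the return value of getParsers can be illustrated with a small standalone sketch. It does not use or reproduce Nutch code: the class name `EnabledParserFilterSketch` and the method `filterEnabled` are hypothetical, and plugins are modelled as plain strings. It only shows that the configured order is preserved while plugins that are not enabled are dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not part of the Nutch API.
final class EnabledParserFilterSketch {

    // Keeps the configured order and drops every plugin id that is not enabled.
    static List<String> filterEnabled(List<String> orderedPluginIds, Set<String> enabledPluginIds) {
        List<String> result = new ArrayList<>();
        for (String id : orderedPluginIds) {
            if (enabledPluginIds.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Ordered mapping for text/plain, as in the example above.
        List<String> ordered = Arrays.asList("parse-text", "parse-html", "parse-rtf");
        // Only these two are assumed to be enabled via plugin.includes.
        Set<String> enabled = new HashSet<>(Arrays.asList("parse-html", "parse-rtf"));

        System.out.println(filterEnabled(ordered, enabled)); // prints [parse-html, parse-rtf]
    }
}
```

In this model, a plugin that is enabled but not mapped to the content type never appears in the result, because only the ordered mapping is iterated.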
-
getParserById
public Parser getParserById(String id) throws ParserNotFound
Returns a Parser instance with the specified extension ID. If the Parser instance isn't found, the function throws a ParserNotFound exception. If the function finds the Parser in the internal PARSER_CACHE, it returns the already instantiated Parser. Otherwise, it instantiates the Parser itself and caches it in the internal PARSER_CACHE.
- Parameters:
- <code>id</code> - The string extension ID (e.g., "org.apache.nutch.parse.rss.RSSParser", "org.apache.nutch.parse.rtf.RTFParseFactory") of the [<code>Parser</code>](../../../../org/apache/nutch/parse/Parser.html) implementation to return.
- Returns:
- A [<code>Parser</code>](../../../../org/apache/nutch/parse/Parser.html) implementation specified by the parameter <code>id</code>.
- Throws:
- <code>ParserNotFound</code> - If the Parser is not found (i.e., it is not registered with the extension point), or if a [<code>PluginRuntimeException</code>](../../../../org/apache/nutch/plugin/PluginRuntimeException.html) occurs while instantiating the [<code>Parser</code>](../../../../org/apache/nutch/parse/Parser.html).
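
The lookup-then-cache behaviour described for getParserById can be sketched in a few lines of standalone Java. This is a simplified model under stated assumptions, not the Nutch implementation: `ParserCacheSketch`, `register`, `getById`, and `NotFoundException` are hypothetical names, parsers are modelled as plain `Object`s, and a `HashMap` stands in for the internal PARSER_CACHE.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration only; this is not part of the Nutch API.
final class ParserCacheSketch {

    // Stand-in for ParserNotFound in this sketch.
    static final class NotFoundException extends Exception {
        NotFoundException(String message) {
            super(message);
        }
    }

    // Known extension ids mapped to a way of constructing the parser.
    private final Map<String, Supplier<Object>> registry = new HashMap<>();
    // Already instantiated parsers, keyed by extension id.
    private final Map<String, Object> cache = new HashMap<>();
    private int instantiations = 0;

    void register(String id, Supplier<Object> constructor) {
        registry.put(id, constructor);
    }

    // Returns the cached instance if present; otherwise instantiates and caches it.
    synchronized Object getById(String id) throws NotFoundException {
        Object cached = cache.get(id);
        if (cached != null) {
            return cached;
        }
        Supplier<Object> constructor = registry.get(id);
        if (constructor == null) {
            throw new NotFoundException("No parser registered for id: " + id);
        }
        Object created = constructor.get();
        instantiations++;
        cache.put(id, created);
        return created;
    }

    int instantiationCount() {
        return instantiations;
    }

    public static void main(String[] args) throws NotFoundException {
        ParserCacheSketch factory = new ParserCacheSketch();
        factory.register("example.RssParser", Object::new);

        Object first = factory.getById("example.RssParser");
        Object second = factory.getById("example.RssParser");
        System.out.println(first == second);              // prints true
        System.out.println(factory.instantiationCount()); // prints 1

        try {
            factory.getById("example.Missing");
        } catch (NotFoundException e) {
            System.out.println("not found: " + e.getMessage());
        }
    }
}
```

The second lookup returns the same instance without constructing a new one, which is the property the caching paragraph above describes.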
-
getExtensions
protected List<Extension> getExtensions(String contentType)
Finds the best-suited parse plugin for a given contentType.
- Parameters:
- <code>contentType</code> - Content-Type for which we seek a parse plugin.
- Returns:
- a list of extensions to be used for this contentType. If none, returns <code>null</code>.