org.apache.nutch.parse

Class ParserFactory


public final class ParserFactory
extends Object

Creates and caches Parser plugins.

Field Summary

| Modifier and Type | Field and Description |
|---|---|
| <code>static String</code> | <code>DEFAULT_PLUGIN</code> Wildcard for default plugins. |
| <code>static org.slf4j.Logger</code> | <code>LOG</code> |

Constructor Summary

| Constructor and Description |
|---|
| <code>ParserFactory(org.apache.hadoop.conf.Configuration conf)</code> |

Method Summary

| Modifier and Type | Method and Description |
|---|---|
| <code>protected List&lt;Extension&gt;</code> | <code>getExtensions(String contentType)</code> Finds the best-suited parse plugin for a given contentType. |
| <code>Parser</code> | <code>getParserById(String id)</code> Function returns a Parser instance with the specified extId, representing its extension ID. |
| <code>Parser[]</code> | <code>getParsers(String contentType, String url)</code> Function returns an array of Parsers for a given content type. |

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

LOG

public static final org.slf4j.Logger LOG

DEFAULT_PLUGIN

public static final String DEFAULT_PLUGIN

Wildcard for default plugins.

  - See Also:
  - [Constant Field Values](../../../../constant-values.html#org.apache.nutch.parse.ParserFactory.DEFAULT_PLUGIN)       

Constructor Detail

ParserFactory

public ParserFactory(org.apache.hadoop.conf.Configuration conf)

Method Detail

getParsers

public Parser[] getParsers(String contentType,
                  String url)
                    throws ParserNotFound

Function returns an array of Parsers for a given content type. The function consults the ParserFactory's internal list of parse plugins to determine the list of pluginIds, then gets the corresponding extensions and instantiates them as Parsers.

  - Parameters:
  - <code>contentType</code> - The contentType to return the <code>Array</code> of [<code>Parser</code>](../../../../org/apache/nutch/parse/Parser.html)s for.
  - <code>url</code> - The url for the content that may allow us to get the type from the file suffix. 
  - Returns:
  - An <code>Array</code> of [<code>Parser</code>](../../../../org/apache/nutch/parse/Parser.html)s for the given contentType. Plugins that are mapped to a contentType via the <code>parse-plugins.xml</code> file but not enabled via the <code>plugin.includes</code> Nutch conf are skipped and won't be part of this array. So, if the ordered list of parsing plugins for <code>text/plain</code> was <code>[parse-text, parse-html, parse-rtf]</code>, and only <code>parse-html</code> and <code>parse-rtf</code> were enabled via <code>plugin.includes</code>, then this ordered Array would consist of two [<code>Parser</code>](../../../../org/apache/nutch/parse/Parser.html) implementations, <code>[parse-html, parse-rtf]</code>.
  - Throws: 
  - <code>ParserNotFound</code>       
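
The skipping behaviour described under Returns can be illustrated with a small stand-alone sketch. It does not use any Nutch classes: <code>PluginOrderSketch</code> and <code>enabledInOrder</code> are hypothetical names, and plain string IDs stand in for the entries that would come from <code>parse-plugins.xml</code> and <code>plugin.includes</code>. It only shows the ordering and filtering idea, not how ParserFactory is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PluginOrderSketch {

    /**
     * Keeps only the plugin IDs that are enabled, preserving the order
     * in which they were mapped to the content type.
     * (Hypothetical helper, not part of the Nutch API.)
     */
    public static List<String> enabledInOrder(List<String> mapped, Set<String> enabled) {
        List<String> result = new ArrayList<>();
        for (String id : mapped) {
            if (enabled.contains(id)) {
                result.add(id); // mapped and enabled: keep it
            }
            // mapped but not enabled: skipped
        }
        return result;
    }

    public static void main(String[] args) {
        // Ordered mapping for text/plain, as in the example above.
        List<String> mapped = Arrays.asList("parse-text", "parse-html", "parse-rtf");
        // Plugins assumed to be enabled via plugin.includes.
        Set<String> enabled = new HashSet<>(Arrays.asList("parse-html", "parse-rtf"));

        System.out.println(enabledInOrder(mapped, enabled)); // prints [parse-html, parse-rtf]
    }
}
```

Iterating over the mapped list, rather than over the enabled set, is what keeps the preference order intact.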

getParserById

public Parser getParserById(String id)
                     throws ParserNotFound

Function returns the Parser instance with the specified extension ID (extId). If no such Parser is found, the function throws a ParserNotFound exception. If the Parser is already in the internal PARSER_CACHE, the function returns that already instantiated Parser. Otherwise, it instantiates the Parser itself and caches it in the internal PARSER_CACHE.

  - Parameters:
  - <code>id</code> - The string extension ID (e.g., "org.apache.nutch.parse.rss.RSSParser", "org.apache.nutch.parse.rtf.RTFParseFactory") of the [<code>Parser</code>](../../../../org/apache/nutch/parse/Parser.html) implementation to return. 
  - Returns:
  - A [<code>Parser</code>](../../../../org/apache/nutch/parse/Parser.html) implementation specified by the parameter <code>id</code>. 
  - Throws: 
  - <code>ParserNotFound</code> - If the Parser is not found (i.e., not registered with the extension point), or if there is a [<code>PluginRuntimeException</code>](../../../../org/apache/nutch/plugin/PluginRuntimeException.html) while instantiating the [<code>Parser</code>](../../../../org/apache/nutch/parse/Parser.html).
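
The look-up, instantiate and cache sequence described for this method can be sketched without Nutch as follows. Every name here (<code>ParserCacheSketch</code>, <code>NotFound</code>, <code>register</code>, <code>getById</code>) is hypothetical, a <code>Supplier</code> stands in for plugin instantiation, and the synchronization is an assumption of the sketch rather than a statement about how ParserFactory handles concurrency.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class ParserCacheSketch {

    /** Hypothetical stand-in for the exception thrown when an ID is unknown. */
    public static class NotFound extends Exception {
        public NotFound(String message) {
            super(message);
        }
    }

    // Known IDs and how to build an instance for each (stands in for the extension point).
    private final Map<String, Supplier<Object>> registry = new HashMap<>();
    // Instances that have already been created (stands in for the internal cache).
    private final Map<String, Object> cache = new HashMap<>();

    public synchronized void register(String id, Supplier<Object> factory) {
        registry.put(id, factory);
    }

    /** Returns the cached instance for id, creating and caching it on first use. */
    public synchronized Object getById(String id) throws NotFound {
        Object cached = cache.get(id);
        if (cached != null) {
            return cached; // already instantiated
        }
        Supplier<Object> factory = registry.get(id);
        if (factory == null) {
            throw new NotFound("No parser registered for id: " + id);
        }
        Object created = factory.get();
        cache.put(id, created); // remember it for later calls
        return created;
    }
}
```

With this sketch, two calls to <code>getById</code> with the same registered ID return the same object, and an unregistered ID results in <code>NotFound</code>.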

getExtensions

protected List<Extension> getExtensions(String contentType)

Finds the best-suited parse plugin for a given contentType.

  - Parameters:
  - <code>contentType</code> - Content-Type for which we seek a parse plugin. 
  - Returns:
  - a list of extensions to be used for this contentType. If none, returns <code>null</code>.      
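
Because the return value may be <code>null</code> rather than an empty list, code that iterates over it needs a guard. The following stand-alone sketch shows one way to normalize such a result; <code>NullableListSketch</code> and <code>orEmpty</code> are hypothetical names and not part of Nutch.

```java
import java.util.Collections;
import java.util.List;

public class NullableListSketch {

    /** Returns the given list, or an immutable empty list if it is null. */
    public static <T> List<T> orEmpty(List<T> maybeNull) {
        return maybeNull == null ? Collections.<T>emptyList() : maybeNull;
    }

    public static void main(String[] args) {
        List<String> extensions = null; // stands in for a "none found" result
        for (String ext : orEmpty(extensions)) {
            System.out.println(ext); // never reached for a null input
        }
        System.out.println(orEmpty(extensions).size()); // prints 0
    }
}
```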

Copyright © 2014 The Apache Software Foundation