org.apache.nutch.parse.tika

Class TikaParser

All Implemented Interfaces:
    org.apache.hadoop.conf.Configurable, Parser, Pluggable

public class TikaParser
extends Object
implements Parser

Wrapper for Tika parsers. Mimics the HTMLParser, but works from the XHTML representation that Tika returns as SAX events.
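To illustrate what consuming an XHTML document as SAX events can look like, here is a minimal sketch that uses only the JDK's SAX parser. It does not use Tika or Nutch; the class `XhtmlSaxSketch`, its `Collector` handler, and the sample markup are hypothetical names chosen for this example, not part of either project.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not part of Nutch or Tika.
class XhtmlSaxSketch {

    /** Collects character data and anchor targets from XHTML SAX events. */
    static class Collector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // With a namespace-aware parser, localName is the element name without a prefix.
            if ("a".equals(localName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Character data may arrive in several chunks; append each one.
            text.append(ch, start, length);
        }
    }

    /** Parses a well-formed XHTML string and returns the populated handler. */
    static Collector collect(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            Collector collector = new Collector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
            return collector;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<p>Hello <a href=\"http://example.com/\">link</a></p></body></html>";
        Collector c = collect(xhtml);
        System.out.println(c.text);
        System.out.println(c.links);
    }
}
```

In this sketch the handler never sees the raw document, only the stream of element and character events, which is the same shape of input the class description refers to.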

Field Summary

Fields
Modifier and Type           Field and Description
static org.slf4j.Logger     LOG

Fields inherited from interface org.apache.nutch.parse.Parser

X_POINT_ID

Constructor Summary

Constructors
Constructor and Description
TikaParser()

Method Summary

Methods
Modifier and Type                        Method and Description
org.apache.hadoop.conf.Configuration     getConf()
ParseResult                              getParse(Content content)
                                         This method parses the given content and returns a map of <key, parse> pairs.
void                                     setConf(org.apache.hadoop.conf.Configuration conf)

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

LOG

public static final org.slf4j.Logger LOG

Constructor Detail

TikaParser

public TikaParser()

Method Detail

getParse

public ParseResult getParse(Content content)

Description copied from interface: Parser

This method parses the given content and returns a map of <key, parse> pairs. Parse instances will be persisted under the given key.

Note: Meta-redirects should be followed only when they come from the original URL. That is:

Assume the fetcher is in parsing mode and is currently processing foo.bar.com/redirect.html. If this URL contains a meta redirect to another URL, the fetcher should follow the redirect only if the map contains an entry of the form <"foo.bar.com/redirect.html", Parse with a ParseStatus indicating the redirect>.
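The rule in the note above can be sketched as a lookup keyed by the original URL. This is a minimal illustration under stated assumptions: `MetaRedirectSketch`, `SimpleParse`, and `shouldFollowRedirect` are hypothetical stand-ins invented for this example. They are not the Nutch `Parse`, `ParseStatus`, or `ParseResult` types, and the sketch does not claim to show how the fetcher is implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; these are NOT the Nutch Parse/ParseStatus/ParseResult types.
class MetaRedirectSketch {

    /** Simplified parse outcome: whether its status signals a redirect, and the target. */
    static final class SimpleParse {
        final boolean redirect;
        final String target;

        SimpleParse(boolean redirect, String target) {
            this.redirect = redirect;
            this.target = target;
        }
    }

    /**
     * Returns true only if the map holds an entry keyed by the original URL
     * and that entry's status indicates a redirect.
     */
    static boolean shouldFollowRedirect(String originalUrl, Map<String, SimpleParse> parses) {
        SimpleParse parse = parses.get(originalUrl);
        return parse != null && parse.redirect;
    }

    public static void main(String[] args) {
        Map<String, SimpleParse> parses = new HashMap<>();
        parses.put("foo.bar.com/redirect.html",
            new SimpleParse(true, "foo.bar.com/target.html"));
        // A redirect recorded under a different key must not be followed for this URL.
        parses.put("foo.bar.com/embedded.html",
            new SimpleParse(true, "foo.bar.com/elsewhere.html"));

        System.out.println(shouldFollowRedirect("foo.bar.com/redirect.html", parses));
        System.out.println(shouldFollowRedirect("foo.bar.com/missing.html", parses));
    }
}
```

Keying the check on the original URL means that redirects found in any other entry of the map are ignored for the page being fetched.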

Specified by:
    getParse in interface Parser
Parameters:
    content - Content to be parsed
Returns:
    a map containing <key, parse> pairs

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

Specified by:
    setConf in interface org.apache.hadoop.conf.Configurable

getConf

public org.apache.hadoop.conf.Configuration getConf()

Specified by:
    getConf in interface org.apache.hadoop.conf.Configurable

Copyright © 2014 The Apache Software Foundation