org.apache.nutch.util

Class MimeUtil


public final class MimeUtil
extends Object
This is a facade class to insulate Nutch from its underlying mime type substrate library, Apache Tika. Any mime handling code should be placed in this utility class and hidden from the Nutch classes that rely on it.

  • Since:
  • NUTCH-608
  • Author:
  • mattmann

Constructor Summary

| Constructor and Description |
| --- |
| <code>MimeUtil(org.apache.hadoop.conf.Configuration conf)</code> |

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| <code>String</code> | <code>autoResolveContentType(String typeName, String url, byte[] data)</code> A facade interface that tries all the mime type resolution strategies available within Tika. |
| <code>static String</code> | <code>cleanMimeType(String origType)</code> Cleans a MimeType name by extracting the actual MimeType from a string that may also carry optional parameters. |
| <code>String</code> | <code>forName(String name)</code> A facade interface to Tika's underlying MimeTypes.forName(String) method. |
| <code>String</code> | <code>getMimeType(File f)</code> Facade interface to Tika's underlying MimeTypes.getMimeType(File) method. |
| <code>String</code> | <code>getMimeType(String url)</code> Facade interface to Tika's underlying MimeTypes.getMimeType(String) method. |

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

MimeUtil

public MimeUtil(org.apache.hadoop.conf.Configuration conf)

Method Detail

cleanMimeType

public static String cleanMimeType(String origType)

Cleans a MimeType name by extracting the actual MimeType from a string of the form:

      <primary type>/<sub type> ; <optional params>
 
  - Parameters:
  - <code>origType</code> - The original mime type string to be cleaned. 
  - Returns:
  - The primary type and subtype, concatenated, i.e., the actual mime type.
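
The cleaning step described above can be illustrated with a small stand-alone sketch. It does not call Nutch or Tika; the class `CleanMimeTypeSketch` and the helper `stripParams` are hypothetical names introduced only for this example. Returning null for null input and trimming surrounding whitespace are assumptions of the sketch, not documented behavior of `cleanMimeType`.

```java
// Hypothetical stand-in that mirrors the documented behavior of cleanMimeType;
// it is NOT the Nutch implementation.
final class CleanMimeTypeSketch {

    // Returns the "<primary type>/<sub type>" part of a content type string,
    // dropping anything from the first ';' onward.
    static String stripParams(String origType) {
        if (origType == null) {
            return null; // assumption: null in, null out
        }
        int semi = origType.indexOf(';');
        String base = (semi >= 0) ? origType.substring(0, semi) : origType;
        return base.trim(); // assumption: surrounding whitespace is not significant
    }

    public static void main(String[] args) {
        System.out.println(stripParams("text/html; charset=UTF-8")); // prints "text/html"
    }
}
```

With the real class, the equivalent call would be the static method <code>MimeUtil.cleanMimeType(origType)</code>, as given in the signature above.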

autoResolveContentType

public String autoResolveContentType(String typeName,
                            String url,
                            byte[] data)

A facade interface that tries all the mime type resolution strategies available within Tika. First, the mime type provided in typeName is cleaned with cleanMimeType(String). The cleaned name is then looked up in the underlying Tika MimeTypes registry. If the MimeType is found, it is used; otherwise, URL resolution is used to try to determine the mime type. In addition, if mime.type.magic is enabled in NutchConfiguration, mime type magic resolution is used to try to obtain a better-than-the-default approximation of the MimeType.

  - Parameters:
  - <code>typeName</code> - The original mime type, returned from a [<code>ProtocolOutput</code>](../../../../org/apache/nutch/protocol/ProtocolOutput.html).
  - <code>url</code> - The URL that Nutch was trying to crawl.
  - <code>data</code> - The byte data, returned from the crawl, if any. 
  - Returns:
  - The automatically resolved <code>MimeType</code> name.
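
The resolution order described above can be sketched without Nutch or Tika on the classpath. Everything in this example is hypothetical: the class `ResolveOrderSketch`, the toy registry, the extension table, and the single magic check are stand-ins for Tika's registry, URL resolution, and magic detection. The description does not say exactly how a magic result is reconciled with the earlier result, so the sketch assumes that a magic match, when one exists, replaces it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the documented resolution order;
// it does NOT call Nutch or Tika.
final class ResolveOrderSketch {

    // Assumed toy registry of known mime type names.
    static final Set<String> REGISTRY =
            new HashSet<>(Arrays.asList("text/html", "text/plain", "application/pdf"));

    // Assumed toy mapping from URL file extension to mime type.
    static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("html", "text/html");
        BY_EXTENSION.put("txt", "text/plain");
        BY_EXTENSION.put("pdf", "application/pdf");
    }

    // Step 1: drop any "; params" suffix from the declared type.
    static String clean(String typeName) {
        if (typeName == null) return null;
        int semi = typeName.indexOf(';');
        return (semi >= 0 ? typeName.substring(0, semi) : typeName).trim();
    }

    // Fallback: guess from the text after the last '.' in the URL (no query-string handling).
    static String byUrl(String url) {
        if (url == null) return null;
        int dot = url.lastIndexOf('.');
        if (dot < 0 || dot == url.length() - 1) return null;
        return BY_EXTENSION.get(url.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // Toy magic check: only recognizes the "%PDF" signature.
    static String byMagic(byte[] data) {
        if (data == null || data.length < 4) return null;
        String head = new String(data, 0, 4, StandardCharsets.US_ASCII);
        return head.equals("%PDF") ? "application/pdf" : null;
    }

    static String resolve(String typeName, String url, byte[] data, boolean magicEnabled) {
        String cleaned = clean(typeName);
        // Step 2: registry lookup by cleaned name, else fall back to URL resolution.
        String result = (cleaned != null && REGISTRY.contains(cleaned)) ? cleaned : byUrl(url);
        // Step 3 (assumption): when magic is enabled and matches, prefer the magic result.
        if (magicEnabled) {
            String magic = byMagic(data);
            if (magic != null) result = magic;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/html; charset=UTF-8", "http://example.com/a", null, false));
        System.out.println(resolve("bogus/type", "http://example.com/report.pdf", null, false));
    }
}
```

With the real class, the `magicEnabled` flag has no counterpart in the method signature; according to the description above, that behavior is driven by the mime.type.magic setting in the configuration passed to the constructor.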

getMimeType

public String getMimeType(String url)

Facade interface to Tika's underlying MimeTypes.getMimeType(String) method.

  - Parameters:
  - <code>url</code> - A string representation of the document <code>URL</code> to sense the <code>MimeType</code> for. 
  - Returns:
  - An appropriate <code>MimeType</code>, identified from the given Document url in string form.       

forName

public String forName(String name)

A facade interface to Tika's underlying MimeTypes.forName(String) method.

  - Parameters:
  - <code>name</code> - The name of a valid <code>MimeType</code> in the Tika mime registry. 
  - Returns:
  - The <code>String</code> representation of the <code>MimeType</code>, if it exists in the registry, or null otherwise.

getMimeType

public String getMimeType(File f)

Facade interface to Tika's underlying MimeTypes.getMimeType(File) method.

  - Parameters:
  - <code>f</code> - The [<code>File</code>](http://java.sun.com/javase/6/docs/api/java/io/File.html?is-external=true) to sense the <code>MimeType</code> for. 
  - Returns:
  - The <code>MimeType</code> of the given [<code>File</code>](http://java.sun.com/javase/6/docs/api/java/io/File.html?is-external=true), or null if it cannot be determined.      

Copyright © 2014 The Apache Software Foundation