org.apache.nutch.crawl

Class TextProfileSignature

    • All Implemented Interfaces:
    • org.apache.hadoop.conf.Configurable

public class TextProfileSignature
extends Signature

An implementation of a page signature. It calculates an MD5 hash of a plain-text "profile" of a page. If the page has no text, it falls back to calculating the hash with MD5Signature.
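The choice between hashing the profile and falling back can be sketched as follows. This is a minimal illustration built only on the JDK's `java.security.MessageDigest`. The class `SignatureFallbackSketch` and its methods are hypothetical names, not part of Nutch. This page does not say what MD5Signature hashes, so the sketch assumes the fallback digests the raw page bytes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch only; class and method names are hypothetical, not Nutch API.
class SignatureFallbackSketch {

    // Computes a 16-byte MD5 digest with the JDK's MessageDigest.
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Hashes the text profile when there is one; otherwise hashes the raw bytes.
    // Assumption: digesting the raw content stands in for MD5Signature here.
    static byte[] signature(String profile, byte[] rawContent) {
        if (profile == null || profile.isEmpty()) {
            return md5(rawContent);
        }
        return md5(profile.getBytes(StandardCharsets.UTF_8));
    }

    // Renders a digest as lower-case hex for display.
    static String hex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = "raw page bytes".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(signature("fox 2 the 2", raw))); // profile is hashed
        System.out.println(hex(signature("", raw)));            // fallback path
    }
}
```

Both paths return a digest of the same length, so callers can compare signatures without knowing which path produced them.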

The algorithm to calculate a page "profile" takes the plain text version of a page and performs the following steps:

  • remove all characters except letters and digits, and convert all characters to lower case,
  • split the text into tokens (runs of consecutive non-whitespace characters),
  • discard tokens whose length is equal to or shorter than MIN_TOKEN_LEN (default 2 characters),
  • sort the list of tokens by decreasing frequency,
  • round down the counts of tokens to the nearest multiple of QUANT (QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default, and maxFreq is the maximum token frequency). If maxFreq is higher than 1, then QUANT is always at least 2 (which means that tokens with frequency 1 are always discarded).
  • discard tokens whose frequency after quantization falls below QUANT,
  • create a list of tokens and their quantized frequencies, separated by spaces, in order of decreasing frequency. This list is then submitted to an MD5 hash calculation.
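The steps above can be sketched in plain Java. This is an illustration, not the Nutch source, and it rests on several assumptions: `TextProfileSketch` and its methods are hypothetical names. Any character that is not a letter or digit is treated as a token separator, which is one reading of the first two steps. QUANT is rounded to an integer with a floor of 2 when maxFreq is above 1, and 1 otherwise. Ties in frequency are broken alphabetically so that the output is deterministic.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical and details are assumptions.
class TextProfileSketch {

    // Builds the profile string: "token count" pairs joined by single spaces.
    static String profile(String text, int minTokenLen, float quantRate) {
        // Keep letters and digits (lower-cased); anything else ends the current token.
        Map<String, Integer> counts = new HashMap<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' '; // sentinel flushes last token
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                if (cur.length() > minTokenLen) { // drop tokens of length <= minTokenLen
                    counts.merge(cur.toString(), 1, Integer::sum);
                }
                cur.setLength(0);
            }
        }

        int maxFreq = 0;
        for (int n : counts.values()) {
            maxFreq = Math.max(maxFreq, n);
        }

        // QUANT = QUANT_RATE * maxFreq, with an assumed integer floor (see lead-in).
        int quant = Math.round(maxFreq * quantRate);
        if (quant < 2) {
            quant = maxFreq > 1 ? 2 : 1;
        }

        // Round each count down to a multiple of QUANT; discard counts below QUANT.
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int quantized = (e.getValue() / quant) * quant;
            if (quantized >= quant) {
                kept.add(new AbstractMap.SimpleEntry<>(e.getKey(), quantized));
            }
        }

        // Decreasing frequency; alphabetical tie-break is an assumption for determinism.
        kept.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> e : kept) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(e.getKey()).append(' ').append(e.getValue());
        }
        return out.toString();
    }

    // Submits the profile string to MD5.
    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "the" occurs 3 times and "fox" twice; every other token occurs once.
        System.out.println(profile("The quick brown fox. The lazy dog; the FOX!", 2, 0.01f));
    }
}
```

In the sample text maxFreq is 3, so QUANT becomes 2 under the assumed floor. Tokens that occur once quantize to 0 and are dropped, while "the" (3) and "fox" (2) both quantize to 2. Replacing "dog" with "cat" therefore leaves the profile, and hence the hash, unchanged.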

  • Author:

  • Andrzej Bialecki ab@getopt.org

Field Summary

Fields inherited from class org.apache.nutch.crawl.Signature

conf

Constructor Summary

Constructors

Constructor and Description
TextProfileSignature()

Method Summary

Methods

Modifier and Type    Method and Description
byte[]               calculate(Content content, Parse parse)
static void          main(String[] args)

Methods inherited from class org.apache.nutch.crawl.Signature

getConf, setConf

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

TextProfileSignature

public TextProfileSignature()

Method Detail

calculate

public byte[] calculate(Content content,
                        Parse parse)

Specified by:
    calculate in class Signature

main

public static void main(String[] args)
                 throws Exception

Throws:
    Exception

Copyright © 2014 The Apache Software Foundation