
Class TextProfileSignature

All Implemented Interfaces:
    org.apache.hadoop.conf.Configurable

public class TextProfileSignature
extends Signature

An implementation of a page signature. It calculates an MD5 hash of a plain text "profile" of a page. If the page has no text, the hash is calculated with MD5Signature instead.

The algorithm to calculate a page "profile" takes the plain text version of a page and performs the following steps:

  • remove all characters except letters and digits, and bring all characters to lower case,
  • split the text into tokens (all consecutive non-whitespace characters),
  • discard tokens whose length is equal to or shorter than MIN_TOKEN_LEN (default 2 characters),
  • sort the list of tokens by decreasing frequency,
  • round down the counts of tokens to the nearest multiple of QUANT (QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default and maxFreq is the maximum token frequency). If maxFreq is higher than 1, QUANT is always at least 2, which means that tokens with frequency 1 are always discarded,
  • discard tokens whose frequency after quantization falls below QUANT,
  • create a list of tokens and their quantized frequency, separated by spaces, in the order of decreasing frequency. This list is then submitted to an MD5 hash calculation.
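
The steps above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the code of TextProfileSignature itself: the class name TextProfileSketch and its methods profile and signature are hypothetical, whitespace is assumed to survive the first step so that the text can still be split, QUANT is assumed to be rounded to an integer and raised to 2 whenever maxFreq exceeds 1, and ties in frequency are broken alphabetically only to keep the output deterministic. The fallback to MD5Signature for pages without text is not modelled.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone sketch; it does not use any Nutch classes.
class TextProfileSketch {
    static final int MIN_TOKEN_LEN = 2;      // default named in the description
    static final float QUANT_RATE = 0.01f;   // default named in the description

    /** Builds the space-separated "token frequency" list that gets hashed. */
    static String profile(String text) {
        // Keep letters, digits and (assumption) whitespace, then lower-case.
        String cleaned = text.replaceAll("[^\\p{L}\\p{Nd}\\s]", "")
                             .toLowerCase(Locale.ROOT);

        // Split into tokens, drop the short ones, count the rest.
        Map<String, Integer> counts = new HashMap<>();
        int maxFreq = 0;
        for (String token : cleaned.trim().split("\\s+")) {
            if (token.length() <= MIN_TOKEN_LEN) continue;
            maxFreq = Math.max(maxFreq, counts.merge(token, 1, Integer::sum));
        }

        // QUANT = QUANT_RATE * maxFreq, clamped (assumption) so that it is
        // at least 2 when maxFreq > 1 and 1 otherwise.
        int quant = Math.round(QUANT_RATE * maxFreq);
        if (quant < 2) quant = (maxFreq > 1) ? 2 : 1;

        // Sort by decreasing frequency; alphabetical tie-break is an assumption.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Round counts down to a multiple of QUANT and drop those below QUANT.
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : entries) {
            int quantized = (e.getValue() / quant) * quant;
            if (quantized < quant) continue;
            if (sb.length() > 0) sb.append(' ');
            sb.append(e.getKey()).append(' ').append(quantized);
        }
        return sb.toString();
    }

    /** MD5 hash of the profile string. */
    static byte[] signature(String text) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(profile(text).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // prints "the 2 cat 2"
        System.out.println(profile("The cat and the hat. The cat sat!"));
    }
}
```

In the sample sentence "the" occurs 3 times and "cat" twice, so maxFreq is 3 and QUANT is clamped to 2. Both counts round down to 2, while the tokens that occur once round down to 0 and are dropped. Two pages that differ only in punctuation, letter case, or rare words therefore produce the same profile and the same hash.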

Author:
    Andrzej Bialecki ab@getopt.org


Field Summary


Fields inherited from class org.apache.nutch.crawl.Signature


Constructor Summary

Constructors

Constructor and Description
TextProfileSignature()

Method Summary

Methods

Modifier and Type    Method and Description
byte[]               calculate(Content content, Parse parse)
static void          main(String[] args)


Methods inherited from class org.apache.nutch.crawl.Signature

getConf, setConf


Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail



public TextProfileSignature()

Method Detail



public byte[] calculate(Content content,
               Parse parse)
  Specified by:
      calculate in class Signature


public static void main(String[] args)
                 throws Exception
  Throws:
      Exception

Copyright © 2014 The Apache Software Foundation