org.apache.nutch.crawl
Class TextProfileSignature
- java.lang.Object
- org.apache.nutch.crawl.Signature
- org.apache.nutch.crawl.TextProfileSignature
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable
public class TextProfileSignature extends Signature
An implementation of a page signature. It calculates an MD5 hash of a plain text "profile" of a page. If the page has no text, it falls back to the hash calculated by MD5Signature.
The algorithm to calculate a page "profile" takes the plain text version of a page and performs the following steps:
- remove all characters except letters and digits, and convert all characters to lower case,
- split the text into tokens (all consecutive non-whitespace characters),
- discard tokens whose length is equal to or shorter than MIN_TOKEN_LEN (default 2 characters),
- sort the list of tokens by decreasing frequency,
- round down the counts of tokens to the nearest multiple of QUANT (QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default and maxFreq is the maximum token frequency). If maxFreq is higher than 1, then QUANT is always at least 2 (which means that tokens with frequency 1 are always discarded),
- discard tokens whose frequency after quantization falls below QUANT,
- create a list of tokens and their quantized frequencies, separated by spaces, in order of decreasing frequency.
This list is then submitted to an MD5 hash calculation.
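The steps above can be sketched in plain Java. This is a minimal illustration, not the Nutch source: the class TextProfileSketch and its methods profile and signature are hypothetical names. It assumes that any character other than a letter or digit ends the current token, that QUANT is rounded to an integer and floored at 2 when maxFreq is higher than 1 (and at 1 otherwise), that ties in frequency are broken alphabetically, and that the profile is laid out as "token count" pairs joined by single spaces.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TextProfileSketch {
    static final int MIN_TOKEN_LEN = 2;     // default named in the description
    static final float QUANT_RATE = 0.01f;  // default named in the description

    /** Builds the text profile: "token count" pairs in decreasing count order. */
    static String profile(String text) {
        // Keep letters and digits, lower-case them, and count tokens
        // that are longer than MIN_TOKEN_LEN.
        Map<String, Integer> counts = new HashMap<>();
        int maxFreq = 0;
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            // A trailing space acts as a sentinel that flushes the last token.
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                if (cur.length() > MIN_TOKEN_LEN) {
                    int n = counts.merge(cur.toString(), 1, Integer::sum);
                    maxFreq = Math.max(maxFreq, n);
                }
                cur.setLength(0);
            }
        }

        // QUANT = QUANT_RATE * maxFreq; the rounding and the floor are assumptions.
        int quant = Math.round(QUANT_RATE * maxFreq);
        if (quant < 2) {
            quant = maxFreq > 1 ? 2 : 1;
        }

        // Round each count down to a multiple of QUANT and drop counts below QUANT.
        Map<String, Integer> quantized = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant;
            if (q >= quant) {
                quantized.put(e.getKey(), q);
            }
        }

        // Order by decreasing quantized count; alphabetical tie-break (assumption).
        List<String> order = new ArrayList<>(quantized.keySet());
        order.sort((a, b) -> {
            int byCount = Integer.compare(quantized.get(b), quantized.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });

        StringBuilder out = new StringBuilder();
        for (String token : order) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token).append(' ').append(quantized.get(token));
        }
        return out.toString();
    }

    /** MD5 hash of the profile (16 bytes). */
    static byte[] signature(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(profile(text).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String page = "Nutch crawls the web. Nutch parses the web. Nutch, nutch: web of a!";
        System.out.println(profile(page)); // prints "nutch 4 the 2 web 2"
    }
}
```

In the sample text, "web" occurs three times and is rounded down to 2, while "crawls" and "parses" occur once and are dropped. Pages that differ only in punctuation, letter case, or rare words therefore produce the same profile and the same hash.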
Author:
- Andrzej Bialecki ab@getopt.org
Field Summary
Fields inherited from class org.apache.nutch.crawl.Signature
conf
Constructor Summary
Constructors
Constructor and Description
TextProfileSignature()
Method Summary
Methods
Modifier and Type    Method and Description
byte[]               calculate(Content content, Parse parse)
static void          main(String[] args)
Methods inherited from class org.apache.nutch.crawl.Signature
getConf, setConf
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
TextProfileSignature
public TextProfileSignature()
Method Detail
calculate
public byte[] calculate(Content content, Parse parse)
- Specified by:
- calculate in class Signature
main
public static void main(String[] args) throws Exception
- Throws:
- Exception