1、算法介绍
relevance score
算法,简单来说,就是计算出,一个索引中的文本,与搜索文本,他们之间的关联匹配程度Elasticsearch
使用的是 term frequency/inverse document frequency
算法,简称为TF/IDF
算法Term frequency
:搜索文本中的各个词条在field
文本中出现了多少次,出现次数越多,就越相关
搜索请求:hello world
doc1:hello you, and world is very good
doc2:hello, how are you
Inverse document frequency
:搜索文本中的各个词条在整个索引的所有文档中出现了多少次,出现的次数越多,就越不相关
搜索请求:hello world
doc1:hello, today is very good
doc2:hi world, how are you
比如说,在index中有1万条document
,hello
这个单词在所有的document
中,一共出现了1000次;world这个单词在所有的document
中,一共出现了100次
doc2更相关Field-length norm
:field
长度,field
越长,相关度越弱
搜索请求:hello world
doc1:{ "title": "hello article", "content": "babaaba 1万个单词" }
doc2:{ "title": "my article", "content": "blablabala 1万个单词,hi world" }
hello world在整个index中出现的次数是一样多的
doc1更相关,title field更短
2、_score是如何被计算出来的
GET /website/_explain/1
{
"query": {
"match": {
"content": "first article"
}
}
}
结果:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.4733056,
"hits" : [
{
"_shard" : "[website][0]",
"_node" : "lpzmiHgWTOmrbw3EZfoACQ",
"_index" : "website",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.4733056,
"_source" : {
"title" : "first article",
"content" : "this is my first article",
"post_date" : "2018-01-01",
"author_id" : 117
},
"_explanation" : {
"value" : 1.4733056,
"description" : "sum of:",
"details" : [
{
"value" : 1.3862942,
"description" : "weight(content:first in 2) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 1.3862942,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 1.3862944,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 5,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.08701137,
"description" : "weight(content:article in 2) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.08701137,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.087011375,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 5,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 5,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
},
{
"_shard" : "[website][0]",
"_node" : "lpzmiHgWTOmrbw3EZfoACQ",
"_index" : "website",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.08701137,
"_source" : {
"title" : "second article",
"content" : "this is my second article",
"post_date" : "2019-01-01",
"author_id" : 110
},
"_explanation" : {
"value" : 0.08701137,
"description" : "sum of:",
"details" : [
{
"value" : 0.08701137,
"description" : "weight(content:article in 3) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.08701137,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.087011375,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 5,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 5,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
},
{
"_shard" : "[website][0]",
"_node" : "lpzmiHgWTOmrbw3EZfoACQ",
"_index" : "website",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.08701137,
"_source" : {
"title" : "third article",
"content" : "this is my third article",
"post_date" : "2020-01-01",
"author_id" : 111
},
"_explanation" : {
"value" : 0.08701137,
"description" : "sum of:",
"details" : [
{
"value" : 0.08701137,
"description" : "weight(content:article in 4) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.08701137,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.087011375,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 5,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 5,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
}
]
}
}
3、分析一个document是如何被匹配上的
GET /website/_explain/1
{
"query": {
"match": {
"content": "first article"
}
}
}
结果:
{
"_index" : "website",
"_type" : "_doc",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 1.4733056,
"description" : "sum of:",
"details" : [
{
"value" : 1.3862942,
"description" : "weight(content:first in 2) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 1.3862942,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 1.3862944,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 5,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.08701137,
"description" : "weight(content:article in 2) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.08701137,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.087011375,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 5,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 5,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
}
有些类似SQL的执行计划