cross-fields search: a single identifier that spans multiple fields. For example, a person's identifier is their name, and a building's identifier is its address. The name may be scattered across several fields, such as first_name and last_name; the address may be scattered across country, province, and city.

Searching for one identifier across multiple fields, e.g. a person's name or an address, is a cross-fields search.

As a first attempt, most_fields looks like the better fit: best_fields favors whichever single field matches best, and a cross-fields identifier is by definition not a single-field problem.
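For contrast, this is roughly what the best_fields variant would look like. It scores each doc by its single best-matching field, with the optional tie_breaker mixing in a fraction of the other fields' scores, which is why it suits searching for one self-contained value rather than an identifier spread across fields (the tie_breaker value here is just an illustrative choice):

GET /forum/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "best_fields",
      "tie_breaker": 0.3,
      "fields": [ "author_first_name", "author_last_name" ]
    }
  }
}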
POST /forum/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"author_first_name" : "Peter", "author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : {"author_first_name" : "Smith", "author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"author_first_name" : "Jack", "author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"author_first_name" : "Robbin", "author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"author_first_name" : "Tonny", "author_last_name" : "Peter Smith"} }
{ "update": { "_id": "6"} }
{ "doc" : {"author_first_name" : "Tom", "author_last_name" : "Smith"} }
{ "update": { "_id": "7"} }
{ "doc" : {"author_first_name" : "June", "author_last_name" : "Lee"} }
{ "update": { "_id": "8"} }
{ "doc" : {"author_first_name" : "July", "author_last_name" : "Hoo"} }
{ "update": { "_id": "9"} }
{ "doc" : {"author_first_name" : "Peter", "author_last_name" : "Li"} }
{ "update": { "_id": "10"} }
{ "doc" : {"author_first_name" : "Summy", "author_last_name" : "Xi"} }
{ "update": { "_id": "11"} }
{ "doc" : {"author_first_name" : "Suny", "author_last_name" : "Xi"} }
GET /forum/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "most_fields",
      "fields": [ "author_first_name", "author_last_name" ]
    }
  }
}
Searching for Peter Smith, doc 2 matches Smith in author_first_name and gets the highest score. Why? Because its IDF score is high. For the IDF score to be high, the matched term (Smith) must appear rarely across all docs, and in the author_first_name field Smith appears only once. For the actual Peter Smith (doc 1), Smith sits in author_last_name, but Smith appears 3 times in author_last_name, so doc 1 gets a lower IDF score.

Don't dwell too much on whether it must always turn out this way; the point is the mechanism, not the exact numbers.
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.6931472,
"hits": [
{
"_index": "forum",
"_type": "article",
"_id": "2",
"_score": 0.6931472,
"_source": {
"articleID": "KDKE-B-9947-#kL5",
"userID": 1,
"hidden": false,
"postDate": "2017-01-02",
"tag": [
"java"
],
"tag_cnt": 1,
"view_cnt": 50,
"title": "this is java blog",
"content": "i think java is the best programming language",
"sub_title": "learned a lot of course",
"author_first_name": "Smith",
"author_last_name": "Williams"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "1",
"_score": 0.5753642,
"_source": {
"articleID": "XHDK-A-1293-#fJ3",
"userID": 1,
"hidden": false,
"postDate": "2017-01-01",
"tag": [
"java",
"hadoop"
],
"tag_cnt": 2,
"view_cnt": 30,
"title": "this is java and elasticsearch blog",
"content": "i like to write best elasticsearch article",
"sub_title": "learning more courses",
"author_first_name": "Peter",
"author_last_name": "Smith"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "5",
"_score": 0.51623213,
"_source": {
"articleID": "DHJK-B-1395-#Ky5",
"userID": 3,
"hidden": false,
"postDate": "2017-03-01",
"tag": [
"elasticsearch"
],
"tag_cnt": 1,
"view_cnt": 10,
"title": "this is spark blog",
"content": "spark is best big data solution based on scala ,an programming language similar to java",
"sub_title": "haha, hello world",
"author_first_name": "Tonny",
"author_last_name": "Peter Smith"
}
}
]
}
}
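To see exactly where a document's score comes from, Elasticsearch's _explain API breaks the score down term by term and field by field (ES 7.x path shown; older versions use the GET /forum/article/1/_explain form):

GET /forum/_explain/1
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "most_fields",
      "fields": [ "author_first_name", "author_last_name" ]
    }
  }
}

The response includes the per-field TF/IDF components (BM25 in recent versions), so you can confirm for yourself that Smith's low document frequency in author_first_name is what lifts doc 2 above doc 1.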
Problem 1: most_fields finds docs in which as many fields as possible match, not the doc in which some single field matches the query completely.

Problem 2: with most_fields, minimum_should_match cannot be used to cut off the long tail, i.e. results that match only a tiny part of the query.

Problem 3: the TF/IDF algorithm. Take Peter Smith and Smith Williams: when searching for Peter Smith, Smith rarely appears in first_name, so that query term has a low frequency across all documents and earns a high score, and Smith Williams may end up ranked above Peter Smith.

The result below is from ES 7.x; it shows that ES 7.x has improved on this:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : 2.8442469,
"hits" : [
{
"_index" : "forum",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.8442469,
"_source" : {
"articleID" : "XHDK-A-1293-#fJ3",
"userID" : 1,
"hidden" : false,
"postDate" : "2017-01-01",
"tag" : [
"java",
"hadoop"
],
"tag_cnt" : 2,
"view_cnt" : 30,
"title" : "this is java and hadoop blog",
"content" : "i like to write best elasticsearch article",
"sub_title" : "learning more courses",
"author_first_name" : "Peter",
"author_last_name" : "Smith"
}
},
{
"_index" : "forum",
"_type" : "_doc",
"_id" : "5",
"_score" : 2.4696565,
"_source" : {
"articleID" : "XHDK-A-1293-#fJ5",
"userID" : 3,
"hidden" : false,
"postDate" : "2017-01-01",
"tag" : [
"Python",
"spark",
"flink"
],
"tag_cnt" : 3,
"view_cnt" : 70,
"title" : "this is python, spark, flink blog",
"content" : "spark is best big data solution based on scala ,an programming language similar to java",
"sub_title" : "haha, hello world",
"author_first_name" : "Tonny",
"author_last_name" : "Peter Smith"
}
},
{
"_index" : "forum",
"_type" : "_doc",
"_id" : "2",
"_score" : 2.0794415,
"_source" : {
"articleID" : "KDKE-B-9947-#kL5",
"userID" : 1,
"hidden" : false,
"postDate" : "2017-01-02",
"tag" : [
"java"
],
"tag_cnt" : 1,
"view_cnt" : 50,
"title" : "this is java blog",
"content" : "i think java is the best programming language",
"sub_title" : "learned a lot of course",
"author_first_name" : "Smith",
"author_last_name" : "Williams"
}
},
{
"_index" : "forum",
"_type" : "_doc",
"_id" : "9",
"_score" : 1.5686159,
"_source" : {
"articleID" : "DHJK-B-1395-#Ky5",
"userID" : 3,
"hidden" : false,
"postDate" : "2020-06-01",
"tag" : [
"elasticsearch",
"flink"
],
"tag_cnt" : 1,
"view_cnt" : 90,
"title" : "this is elasticsearch and flink blog",
"content" : "elasticsearch and hadoop are all very good solution, i am a beginner",
"sub_title" : "the spark is good bigdata tool",
"author_first_name" : "Peter",
"author_last_name" : "Li"
}
},
{
"_index" : "forum",
"_type" : "_doc",
"_id" : "6",
"_score" : 1.2756311,
"_source" : {
"articleID" : "KDKE-B-9947-#kL1",
"userID" : 3,
"hidden" : false,
"postDate" : "2018-01-02",
"tag" : [
"java",
"flink"
],
"tag_cnt" : 5,
"view_cnt" : 50,
"title" : "this is java, flink blog",
"content" : "i like to write best elasticsearch article",
"sub_title" : "learn java",
"author_first_name" : "Tom",
"author_last_name" : "Smith"
}
}
]
}
}
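The three most_fields problems above are what the cross_fields type of multi_match is designed for: it treats the listed fields as one big combined field, blends the per-field term statistics so a term that is rare in only one field does not dominate, and applies operator and minimum_should_match per term across all fields. A sketch:

GET /forum/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "cross_fields",
      "operator": "and",
      "fields": [ "author_first_name", "author_last_name" ]
    }
  }
}

With operator set to and, every term (Peter, Smith) must appear in at least one of the fields, which trims exactly the long-tail matches that most_fields cannot filter out.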