速查手册13-搜索数据探究 - 《ElasticSearch知识梳理》

概述
折叠搜索结果
- 聚合结果
- 二级折叠不支持inner_hits
筛选搜索结果
- post_filter
- query rescorer
- Multiple Rescores
突出显示(高亮)
- 简单示例
- 参数使用示例
- 高亮优化
长时间运行的搜索
近实时搜索
- 写入流程
分页搜索结果
- 方式1：from+size
- 方式2：scroll_id滚动搜索
- 方式3：search_after
检索内部命中 inner_hits
检索所选字段
跨群集搜索
搜索多个数据流和索引
搜索分片路由
搜索模板
对搜索结果排序

概述

:::info 您可以使用搜索API搜索和聚合存储在Elasticsearch数据流或索引中的数据。API的查询请求主体参数接受在查询DSL中编写的查询。
关于Search API见：速查手册06-REST APIs
关于聚合见：速查手册14-聚合
关于QUERY DSL见：速查手册07-QUERY DSL ::: :::info 下面针对搜索数据操作做一些探究。 :::

折叠搜索结果

聚合结果

GET my-index-000001/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "test2.long_2"              #1
  },
  "sort": [
    {
      "test2.long_2": {               #2
        "order": "desc"
      }
    }
  ],
  "from": 0                                                     #3
}
explain:
#1  折叠的字段：test2.long_2
#2 使用test2.long_2排序
#3 定义第一个折叠时的偏移量

:::tips

折叠仅适用于keyword或者number类型
返回中hits.total.value 代表聚合前的数据量，聚合后的数据总数未知
如果想知道折叠后的结果，可通过cardinality聚合实现，前提：cardinality和折叠，使用同一个字段，eg：例2 #1 #2 ::: 例2： ```java GET test/_search { “_source”: [ “id”, “is_lastest”, “second”, “name” ], “aggs”: { “totalSize”: {
```
"cardinality": {                #1
  "field": "second_id"
}
```
} }, “query”: { “terms”: {
```
"second_id": [
  "2208",
  "196158"
]
```
} }, “collapse”: { #6 第一级 “field”: “second_id”, #2 “inner_hits”: { #3
```
"name": "by_location",
"collapse": {                     #4  第二级
  "field": "name"
},
"_source": [
  "id",
  "is_lastest",
  "second",
  "name"
],
"size": 3
```
} }, “sort”: [ {
```
"second_id": {
  "order": "desc"
}
```
} ], “from”: 0 }

输出结果： { “took” : 42, “timed_out” : false, “_shards” : { “total” : 1, “successful” : 1, “skipped” : 0, “failed” : 0 }, “hits” : { “total” : { “value” : 62, “relation” : “eq” }, “max_score” : null, “hits” : [ { “_index” : “test”, “_type” : “_doc”, “_id” : “53421”, “_score” : null, “_source” : { “name” : “xh2107091117”, “id” : 45897 }, “fields” : { “second_id” : [ 196158 ] }, “sort” : [ 196158 ], “inner_hits” : { “by_location” : { “hits” : { “total” : { “value” : 3, “relation” : “eq” }, “max_score” : null, “hits” : [ { “_index” : “test”, “_type” : “_doc”, “_id” : “53421”, “_score” : 1.0, “_source” : { “name” : “xh2107091117”, “id” : 45897 }, “fields” : { “name” : [ “xh2107091117” ] } }, { “_index” : “test”, “_type” : “_doc”, “_id” : “53371”, “_score” : 1.0, “_source” : { “name” : “testName”, “id” : 45868 }, “fields” : { “name” : [ “testName” ] } }, { “_index” : “test”, “_type” : “_doc”, “_id” : “53390”, “_score” : 1.0, “_source” : { “name” : “text”, “id” : 45873 }, “fields” : { “name” : [ “text” ] } } ] } } } }, { “_index” : “test”, “_type” : “_doc”, “_id” : “50735”, “_score” : null, “_source” : { “name” : “apk”, “id” : 380 }, “fields” : { “second_id” : [ 2208 ] }, “sort” : [ 2208 ], “inner_hits” : { “by_location” : { “hits” : { “total” : { “value” : 59, “relation” : “eq” }, “max_score” : null, “hits” : [ { “_index” : “test”, “_type” : “_doc”, “_id” : “50735”, “_score” : 1.0, “_source” : { “name” : “apk”, “id” : 380 }, “fields” : { “name” : [ “apk” ] } }, { “_index” : “test”, “_type” : “_doc”, “_id” : “44901”, “_score” : 1.0, “_source” : { “name” : “testName2”, “id” : 43066 }, “fields” : { “name” : [ “testName2” ] } }, { “_index” : “test”, “_type” : “_doc”, “_id” : “51399”, “_score” : 1.0, “_source” : { “name” : “testName3”, “id” : 43066 }, “fields” : { “name” : [ “testName3” ] } } ] } } } } ] }, “aggregations” : { “totalSize” : { “value” : 1 } } }

<a name="CfETO"></a>
### 折叠后的topN
:::info
如果想获取折叠后，同值字段的前几条数据，可通过inner_hits获取。见例2  #3
:::
<a name="B2Zw4"></a>
### 和search_after搭配使用
例3
```java
GET /my-index-000001/_search
{
  "query": {
    "match": {
      "message": "GET /search"
    }
  },
  "collapse": {
    "field": "user.id"
  },
  "sort": [ "user.id" ],
  "search_after": ["dd5ce1ad"]
}

二级折叠不支持inner_hits

这句的意思是，看例2#4和#6处，这里的collapse位于#6的下一级，如果这里再定义inner_hits，例如：
例4

"collapse": {
    "field": "second_id",
    "inner_hits": {
      "name": "by_location",
      "collapse": {
        "field": "name",
        "inner_hits": {   #1
          "name": "by_location"
        }
      },
      "_source": [
        "id",
        "is_lastest",
        "second",
        "name"
      ],
      "size": 3
    }
  },
# 由于#1处，inner_hits处于第二级collapse下，则报错如下
{
  "error": {
    "root_cause": [
      {
        "type": "parsing_exception",
        "reason": "Invalid token in the inner collapse",
        "line": 28,
        "col": 25
      }
    ],
    "type": "x_content_parse_exception",
    "reason": "[28:25] [collapse] failed to parse field [inner_hits]",
    "caused_by": {
      "type": "x_content_parse_exception",
      "reason": "[28:25] [inner_hits] failed to parse field [collapse]",
      "caused_by": {
        "type": "parsing_exception",
        "reason": "Invalid token in the inner collapse",
        "line": 28,
        "col": 25
      }
    }
  },
  "status": 400
}

筛选搜索结果

post_filter

GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": { "brand": "gucci" } 
      }
    }
  },
  "aggs": {
    "colors": {
      "terms": { "field": "color" } 
    },
    "color_red": {
      "filter": {
        "term": { "color": "red" } 
      },
      "aggs": {
        "models": {
          "terms": { "field": "model" } 
        }
      }
    }
  },
  "post_filter": { 
    "term": { "color": "red" }
  }
}
# post_filter是对聚合后的结果再次进行过滤，性能消耗较大，且不适用缓存，看场景而用。

query rescorer

:::info 可用于优化match_phase ::: query rescorer仅对query和post_filter阶段返回的Top-K结果执行第二个查询。每个分片上将检查的文档数量可以通过window_size参数控制，该参数默认为10。默认情况下，原始查询和rescore查询的分数线性组合，以生成每个文档的最终_score。可以分别使用query_weight和rescore_query_weight来控制原始查询和rescore查询的相对重要性。两者都默认为

POST /_search
{
   "query" : {
      "match" : {
         "message" : {
            "operator" : "or",
            "query" : "the quick brown"
         }
      }
   },
   "rescore" : {
      "window_size" : 50,
      "query" : {
         "rescore_query" : {
            "match_phrase" : {
               "message" : {
                  "query" : "the quick brown",
                  "slop" : 2
               }
            }
         },
         "query_weight" : 0.7,
         "rescore_query_weight" : 1.2
      }
   }
}

可以使用score_mode控制分数组合的方式：

分数模式	描述
total	添加原始分数和rescore查询分数。默认值。
multiply	将原始分数乘以rescore查询分数。对function query重新调整有用。
avg	平均原始分数和rescore查询分数。
max	取最大原始分数和rescore查询分数。
min	取最初得分和rescore查询得分的分钟。

Multiple Rescores

也可以按顺序执行多个重新扫描：

POST /_search
{
   "query" : {
      "match" : {
         "message" : {
            "operator" : "or",
            "query" : "the quick brown"
         }
      }
   },
   "rescore" : [ {
      "window_size" : 100,
      "query" : {
         "rescore_query" : {
            "match_phrase" : {
               "message" : {
                  "query" : "the quick brown",
                  "slop" : 2
               }
            }
         },
         "query_weight" : 0.7,
         "rescore_query_weight" : 1.2
      }
   }, {
      "window_size" : 10,
      "query" : {
         "score_mode": "multiply",
         "rescore_query" : {
            "function_score" : {
               "script_score": {
                  "script": {
                    "source": "Math.log10(doc.likes.value + 2)"
                  }
               }
            }
         }
      }
   } ]
}

突出显示(高亮)

简单示例

GET /_search
{
  "query": {
    "match": { "content": "kimchy" }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

GET ljh_text/_search
{
  "query": {
    "match": {
      "content": {
        "query": "父亲"
      }
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "content": {
            "query": "父亲",
            "slop": 1
          }
        }
      },
      "rescore_query_weight": 10
    }
  },
  "_source": false,
  "highlight": {
    "order": "score",
    "tags_schema":"styled",
    "fields": {
      "content": {
         "type": "unified",
        "fragment_size": 150,
        "number_of_fragments": 3,
        "highlight_query": {
          "bool": {
            "must": {
              "match": {
                "content": {
                  "query": "父亲"
                }
              }
            },
            "should": {
              "match_phrase": {
                "content": {
                  "query": "父亲",
                  "slop": 1,
                  "boost": 10.0
                }
              }
            },
            "minimum_should_match": 0
          }
        }
      }
    }
  }
}

参数使用示例

以上示例参数解释

参数	含义	备注
#1 no_match_size	即使字段中没有关键字命中，也可以返回一段文字，该参数表示从开始多少个字符被返回。
#2 encoder	转义HTML，默认default，不转义	比如设置为html，会被转义成：<div>
#3 number_of_fragment	返回结果最多可以包含几段不连续的文字。默认是5。
#4 fragment_size	返回的每个片段，文本长度，默认150
#5 pre_tags，post_tags	标记 highlight 的开始，结束标签
#6 order	文本中，片段是否按分数排序	如果有需求要求更匹配的段落优先展示，可设置
highlight—>type	使用的高亮模式:unified, plain, or fvh. 默认为 unified。
highlight->tags_schema	设置为使用内置标记模式的样式。	设置后，返回数据变成”””包着的数据

高亮优化

高亮查询渲染所有高亮数据，性能消耗较大，可通过rescore优化

GET ljh_text/_search
{
  "query": {
    "match": {
      "content": {
        "query": "父亲"
      }
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "content": {
            "query": "父亲",
            "slop": 1
          }
        }
      },
      "rescore_query_weight": 10
    }
  },
  "_source": false,
  "highlight": {
    "order": "score",
    "fields": {
      "content": {
        "fragment_size": 150,
        "number_of_fragments": 3,
        "highlight_query": {
          "bool": {
            "must": {
              "match": {
                "content": {
                  "query": "父亲"
                }
              }
            },
            "should": {
              "match_phrase": {
                "content": {
                  "query": "父亲",
                  "slop": 1,
                  "boost": 10.0
                }
              }
            },
            "minimum_should_match": 0
          }
        }
      }
    }
  }
}

长时间运行的搜索

:::info Elasticsearch 7.7 版本带来一个新的特性，search 过程允许异步执行，客户端发送完 search 请求后，Elasticsearch 服务端给客户端返回一个 id，以后客户端拿这个 id 来获取 search进度，并且支持返回“部分”结果，这对于 UI 交互相关的查询请求非常友好，例如绘图过程可以逐步的显示出来。使用示例 :::

近实时搜索

写入流程

:::info 1：数据写入Lucene内存和translog
2：经过设置的refresh时间，默认1秒，Lucene内存转segment，reopen，此时可搜索
3：经过设置的flush时间，默认30分钟，内存中的segment写磁盘，历史translog清空，完成一次刷新。
由于经过1秒才从Lucene内存转到可搜索的segment，所以es是近实时的搜索 :::

分页搜索结果

方式1：from+size

GET /_search
{
  "from": 5,
  "size": 20,
  "query": {
    "match": {
      "user.id": "kimchy"
    }
  }
}

方式2：scroll_id滚动搜索

:::tips

不再建议使用卷轴 API 进行深度填充。如果需要在超过 10，000 次点击时保持索引状态，请使用具有时间点（PIT）search_after参数。 :::

POST /my-index-000001/_search?scroll=1m
{
  "size": 100,
  "query": {
    "match": {
      "message": "foo"
    }
  }
}

POST /_search/scroll                                                               
{
  "scroll" : "1m",                                                                 
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" 
}

# 清除滚动
DELETE /_search/scroll
{
  "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}

方式3：search_after

GET /_search
{
  "size": 10000,
  "query": {
    "match" : {
      "user.id" : "elkbee"
    }
  },
  "pit": {
    "id":  "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", 
    "keep_alive": "1m"
  },
  "sort": [ 
    {"@timestamp": {"order": "asc", "format": "strict_date_optional_time_nanos"}},
    {"_shard_doc": "desc"}
  ]
}

返回
{
  "pit_id" : "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", 
  "took" : 17,
  "timed_out" : false,
  "_shards" : ...,
  "hits" : {
    "total" : ...,
    "max_score" : null,
    "hits" : [
      ...
      {
        "_index" : "my-index-000001",
        "_id" : "FaslK3QBySSL_rrj9zM5",
        "_score" : null,
        "_source" : ...,
        "sort" : [                                
          "2021-05-20T05:30:04.832Z",
          4294967298                              
        ]
      }
    ]
  }
}

再次请求
GET /_search
{
  "size": 10000,
  "query": {
    "match" : {
      "user.id" : "elkbee"
    }
  },
  "pit": {
    "id":  "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", 
    "keep_alive": "1m"
  },
  "sort": [
    {"@timestamp": {"order": "asc", "format": "strict_date_optional_time_nanos"}}
  ],
  "search_after": [                                
    "2021-05-20T05:30:04.832Z",
    4294967298
  ],
  "track_total_hits": false                        
}

检索内部命中 inner_hits

:::info 用于join的父子类型和nested嵌套查询时，返回具体匹配的那个对象
inner_hits中支持的参数：from，size，sort，name :::

PUT test
{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested"
      }
    }
  }
}

PUT test/_doc/1?refresh
{
  "title": "Test title",
  "comments": [
    {
      "author": "kimchy",
      "number": 1
    },
    {
      "author": "nik9000",
      "number": 2
    }
  ]
}

POST test/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": { "comments.number": 2 }
      },
      "inner_hits": {} 
    }
  }
}

检索所选字段

跨群集搜索

搜索多个数据流和索引

搜索分片路由

搜索模板

对搜索结果排序

参考：↓
博文：https://www.cnblogs.com/gmhappy/p/11864048.html
官网：https://www.elastic.co/guide/en/elasticsearch/reference/7.x/search-your-data.html