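Concepts
1. Introduction to buckets and metrics
A bucket is a group of documents produced by an aggregation. For example, suppose the sales department has employees Zhang San and Li Si, and the development department has employees Wang Wu and Zhao Liu. Aggregating by department yields two buckets: the sales bucket contains Zhang San and Li Si, and the development bucket contains Wang Wu and Zhao Liu.
A metric is a statistic computed over the documents in a bucket. In the example above, the count of 2 employees in the development department and 2 in the sales department is a metric.
Metrics come in several kinds, such as sum, maximum, minimum, and average.
In familiar SQL terms, consider: select count(*) from table group by column.
Each group produced by group by column is a bucket, and the count(*) computed for each group is a metric.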
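The bucket/metric split can be sketched locally in plain Python. This is only an illustration of the idea, not Elasticsearch code; the record layout and variable names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical employee records: (name, department)
employees = [
    ("Zhang San", "sales"),
    ("Li Si", "sales"),
    ("Wang Wu", "development"),
    ("Zhao Liu", "development"),
]

# Bucketing: the equivalent of GROUP BY department
buckets = defaultdict(list)
for name, department in employees:
    buckets[department].append(name)

# Metric: the equivalent of count(*) computed per bucket
metrics = {department: len(names) for department, names in buckets.items()}
print(metrics)
```

Each key of buckets plays the role of a bucket, and each value in metrics is the metric computed for it.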
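Preparing the sample data
Note: the remark field in the mapping below uses the ik_max_word analyzer, which is provided by the IK analysis plugin; the plugin must be installed for the index to be created.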
PUT /cars
{
"mappings": {
"properties": {
"price": {
"type": "long"
},
"color": {
"type": "keyword"
},
"brand": {
"type": "keyword"
},
"model": {
"type": "keyword"
},
"sold_date": {
"type": "date"
},
"remark": {
"type": "text",
"analyzer": "ik_max_word"
}
}
}
}
POST /cars/_bulk
{ "index": {}}
{ "price" : 258000, "color" : "金色", "brand":"大众", "model" : "大众迈腾", "sold_date" : "2021-10-28","remark" : "大众中档车" }
{ "index": {}}
{ "price" : 123000, "color" : "金色", "brand":"大众", "model" : "大众速腾", "sold_date" : "2021-11-05","remark" : "大众神车" }
{ "index": {}}
{ "price" : 239800, "color" : "白色", "brand":"标志", "model" : "标志508", "sold_date" : "2021-05-18","remark" : "标志品牌全球上市车型" }
{ "index": {}}
{ "price" : 148800, "color" : "白色", "brand":"标志", "model" : "标志408", "sold_date" : "2021-07-02","remark" : "比较大的紧凑型车" }
{ "index": {}}
{ "price" : 1998000, "color" : "黑色", "brand":"大众", "model" : "大众辉腾", "sold_date" : "2021-08-19","remark" : "大众最让人肝疼的车" }
{ "index": {}}
{ "price" : 218000, "color" : "红色", "brand":"奥迪", "model" : "奥迪A4", "sold_date" : "2021-11-05","remark" : "小资车型" }
{ "index": {}}
{ "price" : 489000, "color" : "黑色", "brand":"奥迪", "model" : "奥迪A6", "sold_date" : "2022-01-01","remark" : "政府专用?" }
{ "index": {}}
{ "price" : 1899000, "color" : "黑色", "brand":"奥迪", "model" : "奥迪A 8", "sold_date" : "2022-02-12","remark" : "很贵的大A6。。。" }
Excluding document hits from an aggregation response
Run the following query:
GET /cars/_search
{
"aggs": {
"group_by_color": {
"terms": {
"field": "color",
"order": {
"_count": "desc"
}
}
}
}
}
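Result:
Notice that hits contains the matched documents, while the aggregation results appear under aggregations. If you do not want the documents returned, add the parameter "size": 0.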
{
"took" : 21,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "cars",
"_type" : "_doc",
"_id" : "O6WJYX0BE5NC4Vl4lZFr",
"_score" : 1.0,
"_source" : {
"price" : 258000,
"color" : "金色",
"brand" : "大众",
"model" : "大众迈腾",
"sold_date" : "2021-10-28",
"remark" : "大众中档车"
}
},
// remaining hits omitted
]
},
"aggregations" : {
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "黑色",
"doc_count" : 3
},
{
"key" : "白色",
"doc_count" : 2
},
{
"key" : "金色",
"doc_count" : 2
},
{
"key" : "红色",
"doc_count" : 1
}
]
}
}
}
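Add "size": 0 so that the documents are no longer returned.
Setting size to 0 means that ES returns no documents, only the aggregation results, which speeds up the query. If you do need the documents, set size to whatever value suits your case.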
GET /cars/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color",
"order": {
"_count": "desc"
}
}
}
}
}
Result:
Notice that the hits array is now empty.
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "黑色",
"doc_count" : 3
},
{
"key" : "白色",
"doc_count" : 2
},
{
"key" : "金色",
"doc_count" : 2
},
{
"key" : "红色",
"doc_count" : 1
}
]
}
}
}
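Aggregation examples
1. Count sales grouped by color
This example only groups the documents; it computes no further statistics. The most basic aggregation in ES is terms, which corresponds to group by combined with count(*) in SQL.
By default ES sorts the buckets by doc_count in descending order. You can sort on the built-in _key to order by the bucket's field value, or on the built-in _count to order by the bucket's document count.
GET /cars/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color", // group by the color field
"order": {
"_count": "desc" // sort by document count; _count is a built-in Elasticsearch sort key, not a user-defined name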
}
}
}
}
}
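2. Average price per color
This example first groups by color and then computes a statistic over the documents in each group; that per-group statistic is the metric. Sorting works here as well: because the metric is given the name avg_by_price, the buckets can be ordered by that name.
GET /cars/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color",
"order": {
"avg_by_price": "asc" // sort ascending by the average price computed below
}
},
"aggs": { // note the nesting: this sub-aggregation (drill-down) runs inside each group_by_color bucket
"avg_by_price": { // must match the avg_by_price name used in the order clause of the enclosing terms aggregation
"avg": { // avg is the built-in average metric, not an arbitrary name
"field": "price" // average over the price field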
}
}
}
}
}
}
Query result:
{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "金色",
"doc_count" : 2,
"avg_by_price" : {
"value" : 190500.0
}
},
{
"key" : "白色",
"doc_count" : 2,
"avg_by_price" : {
"value" : 194300.0
}
},
{
"key" : "红色",
"doc_count" : 1,
"avg_by_price" : {
"value" : 218000.0
}
},
{
"key" : "黑色",
"doc_count" : 3,
"avg_by_price" : {
"value" : 1462000.0
}
}
]
}
}
}
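As a cross-check of the result above, the same grouping and averaging can be reproduced locally. The following is a minimal sketch in plain Python, assuming the eight sample documents from the bulk request; it imitates the terms + avg + order combination and is not how Elasticsearch computes it internally.

```python
from collections import defaultdict

# (price, color) pairs copied from the sample data
cars = [
    (258000, "金色"), (123000, "金色"),
    (239800, "白色"), (148800, "白色"),
    (1998000, "黑色"), (218000, "红色"),
    (489000, "黑色"), (1899000, "黑色"),
]

# Bucket step: group prices by color (the terms aggregation)
prices_by_color = defaultdict(list)
for price, color in cars:
    prices_by_color[color].append(price)

# Metric step: average per bucket, then order ascending by that average
avg_by_color = sorted(
    ((color, sum(prices) / len(prices)) for color, prices in prices_by_color.items()),
    key=lambda item: item[1],
)
for color, avg_price in avg_by_color:
    print(color, avg_price)
```

The order of the printed buckets and the averages match the avg_by_price values in the response.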
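There is no fixed rule for how deep aggs may be nested; nest as many levels as your analysis needs, keeping in mind that every extra level multiplies the number of buckets.
3. Average price by color and brand
First group by color, then group again by brand within each color bucket. This is known as drill-down analysis.
With many aggregations defined, the syntax can look messy, but aggs follows a fairly fixed structure. In short, aggs can be nested or defined side by side.
Nesting is drill-down analysis; side-by-side definition lists several independent groupings at the same level.
Prefer drill-down where it fits: the hierarchy keeps the query readable, whereas many side-by-side definitions in one DSL statement quickly become confusing.
GET /index_name/_search
{
"aggs": {
"outer_aggregation_name": {
"aggregation type, e.g. terms, avg, sum": {
"field": "field to aggregate on",
"other parameters": ""
},
"aggs": {
"sub_aggregation_name_1": {},
"sub_aggregation_name_2": {}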
}
}
}
}
GET /cars/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color",
"order": {
"avg_by_price_color": "asc"
}
},
"aggs": {
"avg_by_price_color": {
"avg": {
"field": "price"
}
},
"group_by_brand": {
"terms": {
"field": "brand",
"order": {
"avg_by_price_brand": "desc"
}
},
"aggs": {
"avg_by_price_brand": {
"avg": {
"field": "price"
}
}
}
}
}
}
}
}
Explanation:
The outer group_by_color terms aggregation groups the documents by color. Its sub-aggregation avg_by_price_color computes the average price per color, and the order clause sorts the color buckets by that average in ascending order.
The inner group_by_brand terms aggregation, defined next to avg_by_price_color inside the same aggs block, groups the documents of each color bucket by brand. Its sub-aggregation avg_by_price_brand computes the average price per brand, and the brand buckets are sorted by that average in descending order.
Result:
金色 (gold) cars: brand 大众
白色 (white) cars: brand 标志
红色 (red) cars: brand 奥迪
黑色 (black) cars: brands 大众 and 奥迪
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "金色",
"doc_count" : 2,
"avg_by_price_color" : {
"value" : 190500.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "大众",
"doc_count" : 2,
"avg_by_price_brand" : {
"value" : 190500.0
}
}
]
}
},
{
"key" : "白色",
"doc_count" : 2,
"avg_by_price_color" : {
"value" : 194300.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "标志",
"doc_count" : 2,
"avg_by_price_brand" : {
"value" : 194300.0
}
}
]
}
},
{
"key" : "红色",
"doc_count" : 1,
"avg_by_price_color" : {
"value" : 218000.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "奥迪",
"doc_count" : 1,
"avg_by_price_brand" : {
"value" : 218000.0
}
}
]
}
},
{
"key" : "黑色",
"doc_count" : 3, // three black cars
"avg_by_price_color" : { // average price 1462000.0
"value" : 1462000.0
},
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "大众",
"doc_count" : 1, // one 大众 car, average price 1998000.0
"avg_by_price_brand" : {
"value" : 1998000.0
}
},
{
"key" : "奥迪",
"doc_count" : 2, // two 奥迪 cars, average price 1194000.0
"avg_by_price_brand" : {
"value" : 1194000.0
}
}
]
}
}
]
}
}
}
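4. Maximum, minimum, and total price per color
In common business scenarios, the most frequently used aggregations are count, maximum, minimum, average, and sum. As a rough rule of thumb they make up more than 60% of aggregation workloads, and more than 85% in small projects.
The DSL below groups by color and then drills down to compute the maximum, minimum, and total price.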
GET /cars/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
},
"aggs": {
"max_price": {
"max": {
"field": "price"
}
},
"min_price": {
"min": {
"field": "price"
}
},
"sum_price": {
"sum": {
"field": "price"
}
}
}
}
}
}
Query result:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "黑色",
"doc_count" : 3,
"max_price" : {
"value" : 1998000.0
},
"min_price" : {
"value" : 489000.0
},
"sum_price" : {
"value" : 4386000.0
}
},
{
"key" : "白色",
"doc_count" : 2,
"max_price" : {
"value" : 239800.0
},
"min_price" : {
"value" : 148800.0
},
"sum_price" : {
"value" : 388600.0
}
},
{
"key" : "金色",
"doc_count" : 2,
"max_price" : {
"value" : 258000.0
},
"min_price" : {
"value" : 123000.0
},
"sum_price" : {
"value" : 381000.0
}
},
{
"key" : "红色",
"doc_count" : 1,
"max_price" : {
"value" : 218000.0
},
"min_price" : {
"value" : 218000.0
},
"sum_price" : {
"value" : 218000.0
}
}
]
}
}
}
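5. Highest-priced models per brand
After grouping, you may need to sort the documents within each bucket and keep only the top-ranked ones. The top_hits aggregation does this. Its size property sets how many documents to return per bucket (default 3);
sort sets the field and direction used to order the documents within the bucket (by default, hits are sorted by the score of the main query); _source selects which document fields appear in the result (all fields by default).
GET cars/_search
{
"size": 0,
"aggs": {
"group_by_brand": { // group by brand
"terms": {
"field": "brand"
},
"aggs": {
"top_car": { // user-defined aggregation name
"top_hits": { // aggregation type keyword
"size": 1, // return 1 document per bucket; 10 would return up to 10 per bucket
"sort": [ // sort by price, descending
{
"price": {
"order": "desc"
}
}
],
"_source": { // which source fields to return: model and price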
"includes": [
"model",
"price"
]
}
}
}
}
}
}
}
Query result:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "大众",
"doc_count" : 3, // three 大众 cars
"top_car" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "cars",
"_type" : "_doc",
"_id" : "P6WJYX0BE5NC4Vl4lZFr",
"_score" : null,
"_source" : {
"price" : 1998000,
"model" : "大众辉腾" // the most expensive 大众 model is 大众辉腾
},
"sort" : [
1998000
]
}
]
}
}
},
{
"key" : "奥迪",
"doc_count" : 3, // three 奥迪 cars
"top_car" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "cars",
"_type" : "_doc",
"_id" : "QqWJYX0BE5NC4Vl4lZFr",
"_score" : null,
"_source" : {
"price" : 1899000,
"model" : "奥迪A 8" // the most expensive 奥迪 model is 奥迪A 8
},
"sort" : [
1899000
]
}
]
}
}
},
{
"key" : "标志",
"doc_count" : 2,
"top_car" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "cars",
"_type" : "_doc",
"_id" : "PaWJYX0BE5NC4Vl4lZFr",
"_score" : null,
"_source" : {
"price" : 239800,
"model" : "标志508"
},
"sort" : [
239800
]
}
]
}
}
}
]
}
}
}
Now suppose size is set to 2:
GET cars/_search
{
"size": 0,
"aggs": {
"group_by_brand": {
"terms": {
"field": "brand"
},
"aggs": {
"top_car": {
"top_hits": {
"size": 2,
"sort": [
{
"price": {
"order": "desc"
}
}
],
"_source": {
"includes": [
"model",
"price"
]
}
}
}
}
}
}
}
Query result:
Each bucket now returns its two highest-priced documents, so every inner hits array has 2 entries.
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "大众",
"doc_count" : 3,
"top_car" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "cars",
"_type" : "_doc",
"_id" : "P6WJYX0BE5NC4Vl4lZFr",
"_score" : null,
"_source" : {
"price" : 1998000,
"model" : "大众辉腾"
},
"sort" : [
1998000
]
},
{
"_index" : "cars",
"_type" : "_doc",
"_id" : "O6WJYX0BE5NC4Vl4lZFr",
"_score" : null,
"_source" : {
"price" : 258000,
"model" : "大众迈腾"
},
"sort" : [
258000
]
}
]
}
}
},
{
"key" : "奥迪",
"doc_count" : 3,
"top_car" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "cars",
"_type" : "_doc",
"_id" : "QqWJYX0BE5NC4Vl4lZFr",
"_score" : null,
"_source" : {
"price" : 1899000,
"model" : "奥迪A 8"
},
"sort" : [
1899000
]
},
{
"_index" : "cars",
"_type" : "_doc",
"_id" : "QaWJYX0BE5NC4Vl4lZFr",
"_score" : null,
"_source" : {
"price" : 489000,
"model" : "奥迪A6"
},
"sort" : [
489000
]
}
]
}
}
},
{
"key" : "标志",
"doc_count" : 2,
"top_car" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "cars",
"_type" : "_doc",
"_id" : "PaWJYX0BE5NC4Vl4lZFr",
"_score" : null,
"_source" : {
"price" : 239800,
"model" : "标志508"
},
"sort" : [
239800
]
},
{
"_index" : "cars",
"_type" : "_doc",
"_id" : "PqWJYX0BE5NC4Vl4lZFr",
"_score" : null,
"_source" : {
"price" : 148800,
"model" : "标志408"
},
"sort" : [
148800
]
}
]
}
}
}
]
}
}
}
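The effect of top_hits with a price sort can be imitated locally. The sketch below is plain Python over the sample data, with a hypothetical helper named top_hits_by_brand; it only illustrates "sort inside each bucket, then keep the first size rows" and makes no claim about how Elasticsearch implements it.

```python
from collections import defaultdict

# (price, brand, model) copied from the sample data
cars = [
    (258000, "大众", "大众迈腾"), (123000, "大众", "大众速腾"),
    (239800, "标志", "标志508"), (148800, "标志", "标志408"),
    (1998000, "大众", "大众辉腾"), (218000, "奥迪", "奥迪A4"),
    (489000, "奥迪", "奥迪A6"), (1899000, "奥迪", "奥迪A 8"),
]

def top_hits_by_brand(rows, size):
    """Group rows by brand, sort each group by price descending, keep `size` rows."""
    by_brand = defaultdict(list)
    for price, brand, model in rows:
        by_brand[brand].append((price, model))
    return {
        brand: sorted(group, key=lambda row: row[0], reverse=True)[:size]
        for brand, group in by_brand.items()
    }

top_two = top_hits_by_brand(cars, size=2)
for brand, hits in top_two.items():
    print(brand, hits)
```

Calling the helper with size=1 gives the single most expensive model per brand, as in the first request of this section.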
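6. histogram: interval statistics
histogram resembles terms in that it also creates buckets, but it groups documents into fixed-width intervals of a numeric field.
For example, to count sales and compute the average price per 1,000,000 price range, use a histogram aggregation with field set to price and interval set to 1000000. ES then divides the prices into the ranges [0, 1000000), [1000000, 2000000), [2000000, 3000000), and so on. Like terms, histogram counts the documents in each bucket (doc_count), and nested aggs can run further analysis on each bucket.
"histogram": {
"field": "price", // bucket on the price field
"interval": 1000000 // 1,000,000 per bucket: 0 to 1,000,000, 1,000,000 to 2,000,000, and so on; an interval of 200000 would give 0 to 200,000, 200,000 to 400,000, and so on
},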
GET /cars/_search
{
"size": 0,
"aggs": {
"histogram_by_price": {
"histogram": {
"field": "price",
"interval": 1000000
},
"aggs": { // then compute the average price within each bucket
"avg_by_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
Result:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"histogram_by_price" : {
"buckets" : [
{
"key" : 0.0, // 6 cars fall in [0, 1000000); their average price is 246100.0
"doc_count" : 6,
"avg_by_price" : {
"value" : 246100.0
}
},
{
"key" : 1000000.0, // 2 cars fall in [1000000, 2000000); their average price is 1948500.0
"doc_count" : 2,
"avg_by_price" : {
"value" : 1948500.0
}
}
]
}
}
}
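The bucket key of a histogram is the value rounded down to a multiple of the interval. A minimal Python sketch of that rounding, applied to the sample prices, is shown below; the function name bucket_key and the offset parameter default are choices made for this example.

```python
import math
from collections import defaultdict

def bucket_key(value, interval, offset=0):
    """Round value down to the start of its interval."""
    return math.floor((value - offset) / interval) * interval + offset

# Prices copied from the sample data
prices = [258000, 123000, 239800, 148800, 1998000, 218000, 489000, 1899000]

# Assign every price to its bucket
buckets = defaultdict(list)
for price in prices:
    buckets[bucket_key(price, 1_000_000)].append(price)

# Per bucket: (doc_count, average price), ordered by bucket key
summary = {
    key: (len(values), sum(values) / len(values))
    for key, values in sorted(buckets.items())
}
print(summary)
```

The two keys and their counts and averages correspond to the two buckets in the response above.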
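7. date_histogram: date interval grouping
date_histogram buckets documents by time. It suits cases such as a report page with a date picker, where you need sales figures broken down over the period between time A and time B.
date_histogram performs interval grouping on a field of type date, for example sales per month or per year.
For example, to compute the number of cars sold and the total sales amount per month, use date_histogram: field names the date field to group on; interval sets the bucket width (possible values: year, quarter, month, week, day, hour, minute, second; from 7.2 on, calendar_interval is used instead, as in the first request below); format sets the date format of the bucket keys; min_doc_count sets the minimum number of documents a bucket needs in order to be returned (default 0, so empty buckets inside the range are returned too); extended_bounds sets the start and end of the bucket range (by default the range runs from the bucket containing the smallest date in the field to the bucket containing the largest). Note that extended_bounds only extends the range of buckets; it does not filter documents.
Syntax for Elasticsearch 7.2 and later (calendar_interval):
The request below buckets sales by month, with bucket bounds of 2021-01-01 to 2022-12-31, and computes the total sales amount for each month.
GET /cars/_search
{
"size": 0,
"aggs": {
"histogram_by_date": {
"date_histogram": {
"field": "sold_date", // bucket on the sold_date date field
"calendar_interval": "month", // one bucket per calendar month
"format": "yyyy-MM-dd", // date format of the bucket keys
"min_doc_count": 1, // return only buckets with at least 1 document; empty buckets are dropped
"extended_bounds": { // start and end of the bucket range
"min": "2021-01-01",
"max": "2022-12-31"
}
},
"aggs": { // total price per bucket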
"sum_by_price": {
"sum": {
"field": "price"
}
}
}
}
}
}
Syntax before Elasticsearch 7.2 (interval, deprecated in 7.2 and removed in 8.0):
GET /cars/_search
{
"aggs": {
"histogram_by_date": {
"date_histogram": {
"field": "sold_date",
"interval": "month",
"format": "yyyy-MM-dd",
"min_doc_count": 1,
"extended_bounds": {
"min": "2021-01-01",
"max": "2022-12-31"
}
},
"aggs": {
"sum_by_price": {
"sum": {
"field": "price"
}
}
}
}
}
}
Result:
{
"took" : 13,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"histogram_by_date" : {
"buckets" : [
{
"key_as_string" : "2021-05-01",
"key" : 1619827200000,
"doc_count" : 1,
"sum_by_price" : {
"value" : 239800.0
}
},
{
"key_as_string" : "2021-07-01",
"key" : 1625097600000,
"doc_count" : 1,
"sum_by_price" : {
"value" : 148800.0
}
},
{
"key_as_string" : "2021-08-01",
"key" : 1627776000000,
"doc_count" : 1,
"sum_by_price" : {
"value" : 1998000.0
}
},
{
"key_as_string" : "2021-10-01",
"key" : 1633046400000,
"doc_count" : 1,
"sum_by_price" : {
"value" : 258000.0
}
},
{
"key_as_string" : "2021-11-01",
"key" : 1635724800000,
"doc_count" : 2,
"sum_by_price" : {
"value" : 341000.0
}
},
{
"key_as_string" : "2022-01-01",
"key" : 1640995200000,
"doc_count" : 1,
"sum_by_price" : {
"value" : 489000.0
}
},
{
"key_as_string" : "2022-02-01",
"key" : 1643673600000,
"doc_count" : 1,
"sum_by_price" : {
"value" : 1899000.0
}
}
]
}
}
}
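The monthly buckets above can be reproduced with a small local sketch. The Python below assumes the sample data and imitates a month-wide calendar interval with min_doc_count of 1 by keying each sale on the first day of its month; it is an illustration, not the server-side algorithm.

```python
from collections import defaultdict
from datetime import date

# (sold_date, price) copied from the sample data
sales = [
    ("2021-10-28", 258000), ("2021-11-05", 123000),
    ("2021-05-18", 239800), ("2021-07-02", 148800),
    ("2021-08-19", 1998000), ("2021-11-05", 218000),
    ("2022-01-01", 489000), ("2022-02-12", 1899000),
]

# Bucket key: first day of the month, formatted as yyyy-MM-dd
sum_by_month = defaultdict(int)
for sold_date, price in sales:
    month_start = date.fromisoformat(sold_date).replace(day=1).isoformat()
    sum_by_month[month_start] += price

# Only months that contain at least one sale appear, like min_doc_count: 1
monthly_totals = sorted(sum_by_month.items())
for month_start, total in monthly_totals:
    print(month_start, total)
```

Months without sales never get a key here, which is why the output has seven rows just like the response.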
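8. global bucket
When aggregating, you sometimes need to compare a subset of the data against the whole.
For example, compare the average price of one brand with the average price of all cars. global defines a global bucket that ignores the query conditions and runs its sub-aggregations over all documents.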
GET /cars/_search
{
"size" : 0,
"query": {
"match": {
"brand": "大众"
}
},
"aggs": {
"volkswagen_of_avg_price": {
"avg": {
"field": "price"
}
},
"all_avg_price" : {
"global": {},
"aggs": {
"all_of_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
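The source does not show a response for this request, so the following is only a local Python sketch of what the two averages mean, computed from the sample data: one average is restricted by the query, the other ignores it, as the global bucket does.

```python
# (price, brand) copied from the sample data
cars = [
    (258000, "大众"), (123000, "大众"), (239800, "标志"), (148800, "标志"),
    (1998000, "大众"), (218000, "奥迪"), (489000, "奥迪"), (1899000, "奥迪"),
]

def average(values):
    values = list(values)
    return sum(values) / len(values)

# Scoped by the query: only documents whose brand is 大众
volkswagen_avg_price = average(price for price, brand in cars if brand == "大众")

# Global bucket: every document, regardless of the query
all_avg_price = average(price for price, _ in cars)

print(volkswagen_avg_price, all_avg_price)
```

Placing both numbers in one response is what makes the part-versus-whole comparison possible in a single request.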
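9. aggs + order
Sorting buckets by an aggregated value.
For example: compute each brand's sales count and total sales amount, sorted by total sales amount in descending order.
The request below sums price per brand and then sorts the brand buckets by that sum in descending order.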
GET /cars/_search
{
"size": 0,
"aggs": {
"group_of_brand": {
"terms": {
"field": "brand",
"order": {
"sum_of_price": "desc"
}
},
"aggs": {
"sum_of_price": {
"sum": {
"field": "price"
}
}
}
}
}
}
Result:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_of_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "奥迪",
"doc_count" : 3,
"sum_of_price" : {
"value" : 2606000.0
}
},
{
"key" : "大众",
"doc_count" : 3,
"sum_of_price" : {
"value" : 2379000.0
}
},
{
"key" : "标志",
"doc_count" : 2,
"sum_of_price" : {
"value" : 388600.0
}
}
]
}
}
}
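With multiple levels of aggs, a drill-down aggregation can also be sorted by the innermost aggregated value.
For example: compute the total sales amount for each color within each brand, sorted by total in descending order. As with sorting grouped data in SQL, the sort applies only within each parent bucket, not across buckets.
Query: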
GET /cars/_search
{
"size": 0,
"aggs": {
"group_by_brand": {
"terms": {
"field": "brand"
},
"aggs": {
"group_by_color": {
"terms": {
"field": "color",
"order": {
"sum_of_price": "desc"
}
},
"aggs": {
"sum_of_price": {
"sum": {
"field": "price"
}
}
}
}
}
}
}
}
Output:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 8,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "大众",
"doc_count" : 3,
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "黑色",
"doc_count" : 1,
"sum_of_price" : {
"value" : 1998000.0
}
},
{
"key" : "金色",
"doc_count" : 2,
"sum_of_price" : {
"value" : 381000.0
}
}
]
}
},
{
"key" : "奥迪",
"doc_count" : 3,
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "黑色",
"doc_count" : 2,
"sum_of_price" : {
"value" : 2388000.0
}
},
{
"key" : "红色",
"doc_count" : 1,
"sum_of_price" : {
"value" : 218000.0
}
}
]
}
},
{
"key" : "标志",
"doc_count" : 2,
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "白色",
"doc_count" : 2,
"sum_of_price" : {
"value" : 388600.0
}
}
]
}
}
]
}
}
}
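That the inner sort applies per parent bucket can be seen in a local sketch. The Python below, assuming the sample data, sums prices by color inside each brand and sorts the colors within every brand separately; names such as totals and ordered are illustrative.

```python
from collections import defaultdict

# (price, color, brand) copied from the sample data
cars = [
    (258000, "金色", "大众"), (123000, "金色", "大众"),
    (239800, "白色", "标志"), (148800, "白色", "标志"),
    (1998000, "黑色", "大众"), (218000, "红色", "奥迪"),
    (489000, "黑色", "奥迪"), (1899000, "黑色", "奥迪"),
]

# Outer bucket: brand; inner bucket: color; metric: sum of price
totals = defaultdict(lambda: defaultdict(int))
for price, color, brand in cars:
    totals[brand][color] += price

# The descending sort is applied inside each brand, never across brands
ordered = {
    brand: sorted(colors.items(), key=lambda item: item[1], reverse=True)
    for brand, colors in totals.items()
}
for brand, colors in ordered.items():
    print(brand, colors)
```

The 黑色 bucket of 奥迪 has a larger sum than the 黑色 bucket of 大众, yet the brand order is untouched, because only the color lists are sorted.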
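10. search + aggs
An aggregation is analogous to the group by clause in SQL, and a search query is analogous to the where clause. ES lets you combine a query with aggregations to run more complex search statistics.
For example: compute one brand's sales count and sales amount per quarter.
GET /cars/_search
{
"size": 0,
"query": { // first select the cars whose brand is 大众
"match": {
"brand": "大众"
}
},
"aggs": {
"histogram_by_date": { // aggregate only over the 大众 cars selected above
"date_histogram": {
"field": "sold_date",
"calendar_interval": "quarter", // one bucket per quarter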
"min_doc_count": 1
},
"aggs": {
"sum_by_price": {
"sum": {
"field": "price"
}
}
}
}
}
}
Search result:
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"histogram_by_date" : {
"buckets" : [
{
"key_as_string" : "2021-07-01T00:00:00.000Z",
"key" : 1625097600000,
"doc_count" : 1,
"sum_by_price" : {
"value" : 1998000.0
}
},
{
"key_as_string" : "2021-10-01T00:00:00.000Z",
"key" : 1633046400000,
"doc_count" : 2,
"sum_by_price" : {
"value" : 381000.0
}
}
]
}
}
}
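11. filter + aggs
In ES, a filter can also be combined with aggs for more complex filtered aggregation.
For example: compute the average price of cars priced between 100,000 and 500,000.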
GET /cars/_search
{
"query": {
"constant_score": {
"filter": {
"range": {
"price": {
"gte": 100000,
"lte": 500000
}
}
}
}
},
"aggs": {
"avg_by_price": {
"avg": {
"field": "price"
}
}
}
}
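12. Using filter inside an aggregation
filter can also be used inside aggs; where the filter is placed determines what it filters.
For example: compute one brand's total sales over the last year. A filter placed inside aggs applies on top of the query results and narrows only the documents seen by that aggregation and its sub-aggregations. A filter placed in the query, outside aggs, narrows the documents for the hits and for every aggregation.
- now-12M means 12 months ago; a trailing /M rounds to the month.
- now-1y means 1 year ago; a trailing /y rounds to the year.
- d is the unit for days.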
GET /cars/_search
{
"query": {
"match": {
"brand": "大众"
}
},
"aggs": {
"count_last_year": {
"filter": {
"range": {
"sold_date": {
"gte": "now-12M"
}
}
},
"aggs": {
"sum_of_price_last_year": {
"sum": {
"field": "price"
}
}
}
}
}
}
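The two scopes can be illustrated locally. The Python sketch below uses the sample data and pins "now" to a hypothetical date (2022-09-01) so that the outcome is reproducible; a real now-12M is evaluated by Elasticsearch at query time, and subtracting one calendar year here is only a stand-in for that date math.

```python
from datetime import date

# (price, brand, sold_date) copied from the sample data
cars = [
    (258000, "大众", "2021-10-28"), (123000, "大众", "2021-11-05"),
    (239800, "标志", "2021-05-18"), (148800, "标志", "2021-07-02"),
    (1998000, "大众", "2021-08-19"), (218000, "奥迪", "2021-11-05"),
    (489000, "奥迪", "2022-01-01"), (1899000, "奥迪", "2022-02-12"),
]

# Assumption: a fixed "now" for the example (not Feb 29, so replace() is safe)
now = date(2022, 9, 1)
cutoff = now.replace(year=now.year - 1)  # stand-in for now-12M

# Query scope: affects the hits and every aggregation
query_hits = [car for car in cars if car[1] == "大众"]

# Filter aggregation scope: narrows only this aggregation's documents
last_year = [car for car in query_hits if date.fromisoformat(car[2]) >= cutoff]
sum_of_price_last_year = sum(price for price, _, _ in last_year)

print(len(query_hits), len(last_year), sum_of_price_last_year)
```

The query still matches three documents, while the filtered bucket holds only the two sold on or after the cutoff.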
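Requirements for aggregation
"authName" : {
"type" : "text",
"analyzer": "standard",
"fields": {
"keyword" : {
"type" : "keyword"
}
}
},
Suppose you want to aggregate on authName. By default, fields of type text cannot be used in aggregations. So how do you aggregate on a text field such as authName?
From ES 5 onward, you can define a keyword sub-field:
"fields": {
"keyword" : {
"type" : "keyword"
}
}
The snippet above additionally indexes the complete, unanalyzed value of authName, which makes it possible to aggregate on it.
Aggregations and exact-match lookups then use authName.keyword,
which refers to the value of authName without tokenization.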
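As a sketch of how the keyword sub-field might be used, the Python below builds a terms aggregation request body that targets authName.keyword and prints it in console form. The index name articles and the aggregation name group_by_author are hypothetical; the source does not say which index holds authName.

```python
import json

# Hypothetical index name, chosen only for this example
index_name = "articles"

# Terms aggregation on the untokenized keyword sub-field
body = {
    "size": 0,
    "aggs": {
        "group_by_author": {  # hypothetical aggregation name
            "terms": {"field": "authName.keyword"}
        }
    },
}

# Print the request in the same console form used throughout this article
print(f"GET /{index_name}/_search")
print(json.dumps(body, indent=2, ensure_ascii=False))
```

Pointing the terms aggregation at authName itself would target the analyzed text field, which is the case the paragraph above warns about.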