Reposted from: https://blog.csdn.net/u013089490/article/details/84322556
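1. Basic aggregation concepts
Elasticsearch (ES) offers many kinds of aggregations. The two most commonly used are buckets and metrics.
1.1. Buckets (bucket)
A bucket groups data according to some rule; each group is called a `bucket` in ES. For example, grouping people by nationality yields a `China bucket`, a `UK bucket`, a `Japan bucket`, and so on. Grouping people by age range yields 0~10, 10~20, 20~30, 30~40, and so on.
Elasticsearch provides many ways to form buckets:
- Date Histogram Aggregation: groups by date steps; for example, with a step of one week, documents are automatically grouped week by week;
- Histogram Aggregation: groups by numeric steps, analogous to the date histogram;
- Terms Aggregation: groups by term; documents whose terms match exactly fall into the same bucket;
- Range Aggregation: groups numeric or date values into ranges; you specify the start and end of each range and documents are grouped accordingly;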
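1.2. Metrics (metrics)
Once the data is grouped, we usually run a calculation over each group, such as average, maximum, minimum, or sum. In ES these calculations are called metrics.
Some commonly used metric aggregations:
- Avg Aggregation: computes the average
- Max Aggregation: computes the maximum
- Min Aggregation: computes the minimum
- Percentiles Aggregation: computes percentiles
- Stats Aggregation: returns avg, max, min, sum, and count together
- Sum Aggregation: computes the sum
- Top hits Aggregation: returns the top matching documents
- Value Count Aggregation: counts the number of values
- ……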
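To make these metrics concrete, here is a minimal Python sketch of what the basic metric aggregations compute. It is plain Python rather than Elasticsearch code, and the `prices` list is an assumed in-memory copy of the `price` values from this article's test data.

```python
# Plain-Python illustration of the basic metrics (not the Elasticsearch API).
# `prices` is an assumed in-memory copy of the article's `price` values.
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]

stats = {
    "count": len(prices),              # Value Count
    "min": min(prices),                # Min
    "max": max(prices),                # Max
    "sum": sum(prices),                # Sum
    "avg": sum(prices) / len(prices),  # Avg
}
print(stats)
```

A Stats Aggregation returns these five values in a single response, which is roughly what the `stats` dictionary imitates.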
1.3. Test data
POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }
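2. Aggregating into buckets
First, we form buckets by the car's color, using the color field (this assumes `color` is mapped as a `keyword` field, since the mapping is not shown here).
GET /cars/_search
{
  "size": 0,
  "aggs": {
    "popular_colors": {
      "terms": {
        "field": "color"
      }
    }
  }
}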
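[Note]
(1) size: the number of hits to return. It is set to 0 here because we only care about the aggregation results, not the matching documents; skipping the hits improves efficiency;
(2) aggs: declares the aggregations to run; it is short for aggregations;
- popular_colors: the name of this aggregation, which you can choose freely;
- terms: how the buckets are formed, here by term;
- field: the field used to form the buckets.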
[Analysis of the query result]
{
  "took": 20,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": 8, "max_score": 0, "hits": [] },
  "aggregations": {
    "popular_colors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "red", "doc_count": 4 },
        { "key": "blue", "doc_count": 2 },
        { "key": "green", "doc_count": 2 }
      ]
    }
  }
}
- hits: the list of hits is empty because we set size to 0
- aggregations: the aggregation results;
- popular_colors: the aggregation name we defined;
- buckets: the buckets that were found; each distinct value of the color field forms one bucket;
- key: the value of the color field for this bucket;
- doc_count: the number of documents in this bucket;
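The bucketing that a terms aggregation performs can be imitated in a few lines of Python. This is only a sketch over an assumed in-memory list `cars` (copied from the test data), not how Elasticsearch implements it; the helper name `terms_buckets` is made up for illustration.

```python
from collections import Counter

# Assumed in-memory copy of the article's test data (only the field we need).
cars = [
    {"color": "red"}, {"color": "red"}, {"color": "green"}, {"color": "blue"},
    {"color": "green"}, {"color": "red"}, {"color": "red"}, {"color": "blue"},
]

def terms_buckets(docs, field):
    """Group docs by the exact value of `field` and count each group."""
    counts = Counter(doc[field] for doc in docs)
    # Sort by doc_count descending; ties are broken by key here so the
    # order matches the response shown above.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": n} for key, n in ordered]

buckets = terms_buckets(cars, "color")
print(buckets)
```

Each dictionary in `buckets` corresponds to one entry of the `buckets` array in the response: one bucket per distinct color, with its document count.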
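3. Metrics inside buckets
The previous example tells us how many documents each bucket contains, which is useful. Usually, though, an application needs more sophisticated metrics over the documents. For example, what is the average price of the cars of each color? To answer that, we need to tell Elasticsearch which field to use and which metric to compute. This information is nested inside the bucket, and the metric is computed over the documents in that bucket. Let's add an average-price metric to the previous aggregation:
```bash
GET /cars/_search
{
  "size": 0,
  "aggs": {
    "popular_colors": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
```
- aggs: we add a new aggs inside the previous aggregation (popular_colors). A metric is therefore also an aggregation, one that runs inside a bucket
- avg_price: the name of the aggregation
- avg: the metric type, here the average
- field: the field the metric is computed on
[Result]
```json
{
  "took": 13,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": 8, "max_score": 0, "hits": [] },
  "aggregations": {
    "popular_colors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "red", "doc_count": 4, "avg_price": { "value": 32500 } },
        { "key": "blue", "doc_count": 2, "avg_price": { "value": 20000 } },
        { "key": "green", "doc_count": 2, "avg_price": { "value": 21000 } }
      ]
    }
  }
}
```
As you can see, each bucket has its own `avg_price` field, which holds the result of the metric aggregation.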
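The per-bucket averages in the response can be checked by hand. The following Python sketch groups an assumed in-memory copy of the test data by color and averages the prices in each group; `avg_by_bucket` is a hypothetical helper, not an Elasticsearch function.

```python
from collections import defaultdict

# Assumed in-memory copy of the article's test data (color and price only).
cars = [
    {"color": "red", "price": 10000}, {"color": "red", "price": 20000},
    {"color": "green", "price": 30000}, {"color": "blue", "price": 15000},
    {"color": "green", "price": 12000}, {"color": "red", "price": 20000},
    {"color": "red", "price": 80000}, {"color": "blue", "price": 25000},
]

def avg_by_bucket(docs, bucket_field, metric_field):
    """Bucket docs by `bucket_field`, then average `metric_field` per bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[bucket_field]].append(doc[metric_field])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

avg_price = avg_by_bucket(cars, "color", "price")
print(avg_price)
```

The key point the sketch mirrors is the order of operations: documents are bucketed first, and the metric is then computed separately inside each bucket.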
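4. Nesting buckets inside buckets
In the previous example we nested a metric calculation inside a bucket. In fact, a bucket can contain not only metrics but also other buckets. In other words, each group can be split into further groups. For example, suppose we want to know which manufacturers the cars of each color come from; we bucket again by the make field.
GET /cars/_search
{
  "size": 0,
  "aggs": {
    "popular_colors": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "maker": {
          "terms": {
            "field": "make"
          }
        }
      }
    }
  }
}
- The original color bucket and the avg calculation are unchanged
- maker: a new bucket added under the nested aggs, named maker
- terms: the bucket type is still terms
- field: here the buckets are formed on the make field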
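Since no response is shown for the nested query, here is a small Python sketch of the grouping it describes: an outer bucket per color, and inside each one an inner bucket per manufacturer. The `cars` list is an assumed in-memory copy of the test data and `nested_terms` is a name made up for this illustration.

```python
from collections import Counter, defaultdict

# Assumed in-memory copy of the article's test data (color and make only).
cars = [
    {"color": "red", "make": "honda"}, {"color": "red", "make": "honda"},
    {"color": "green", "make": "ford"}, {"color": "blue", "make": "toyota"},
    {"color": "green", "make": "toyota"}, {"color": "red", "make": "honda"},
    {"color": "red", "make": "bmw"}, {"color": "blue", "make": "ford"},
]

def nested_terms(docs, outer_field, inner_field):
    """Count docs per inner_field value inside each outer_field bucket."""
    result = defaultdict(Counter)
    for doc in docs:
        result[doc[outer_field]][doc[inner_field]] += 1
    return {outer: dict(inner) for outer, inner in result.items()}

makers = nested_terms(cars, "color", "make")
print(makers)
```

For this data the red bucket splits into three honda cars and one bmw, while the blue and green buckets each hold one ford and one toyota.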
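5. Ways of forming buckets
Common bucket types:
- Date Histogram Aggregation: groups by date steps; for example, with a step of one week, documents are automatically grouped week by week
- Histogram Aggregation: groups by numeric steps, analogous to the date histogram
- Terms Aggregation: groups by term; documents whose terms match exactly fall into the same bucket
- Range Aggregation: groups numeric or date values into ranges; you specify the start and end of each range and documents are grouped accordingly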
5.1. Step bucketing with histogram
A histogram groups a numeric field into buckets of a fixed step size. You set the size of each step with the interval parameter.
GET /cars/_search
{
  "size": 0,
  "aggs": {
    "price": {
      "histogram": {
        "field": "price",
        "interval": 5000
      }
    }
  }
}
Result:
{
  "took": 21,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": { "total": 8, "max_score": 0, "hits": [] },
  "aggregations": {
    "price": {
      "buckets": [
        { "key": 10000, "doc_count": 2 },
        { "key": 15000, "doc_count": 1 },
        { "key": 20000, "doc_count": 2 },
        { "key": 25000, "doc_count": 1 },
        { "key": 30000, "doc_count": 1 },
        { "key": 35000, "doc_count": 0 },
        { "key": 40000, "doc_count": 0 },
        { "key": 45000, "doc_count": 0 },
        { "key": 50000, "doc_count": 0 },
        { "key": 55000, "doc_count": 0 },
        { "key": 60000, "doc_count": 0 },
        { "key": 65000, "doc_count": 0 },
        { "key": 70000, "doc_count": 0 },
        { "key": 75000, "doc_count": 0 },
        { "key": 80000, "doc_count": 1 }
      ]
    }
  }
}
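Notice that the middle of the result contains many buckets with a document count of 0, which clutters the output. We can add the parameter min_doc_count with a value of 1 to require at least one document per bucket, so that empty buckets are filtered out. Example:
GET /cars/_search
{
  "size": 0,
  "aggs": {
    "price": {
      "histogram": {
        "field": "price",
        "interval": 5000,
        "min_doc_count": 1
      }
    }
  }
}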
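The effect of interval and min_doc_count can be reproduced with a short Python sketch. It assumes integer values and an offset of 0, so each value lands in the bucket whose key is the value rounded down to a multiple of the interval; `histogram` here is a hypothetical helper over an assumed in-memory list of the test prices, not the Elasticsearch implementation.

```python
# Assumed in-memory copy of the article's `price` values.
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]

def histogram(values, interval, min_doc_count=0):
    """Return (key, doc_count) pairs for fixed-width buckets of size `interval`."""
    counts = {}
    for v in values:
        key = (v // interval) * interval  # round down to the bucket's lower edge
        counts[key] = counts.get(key, 0) + 1
    if not counts:
        return []
    lo, hi = min(counts), max(counts)
    # Include empty buckets between the first and last key, then filter.
    buckets = [(k, counts.get(k, 0)) for k in range(lo, hi + interval, interval)]
    return [(k, n) for k, n in buckets if n >= min_doc_count]

all_buckets = histogram(prices, 5000)           # empty buckets included
non_empty = histogram(prices, 5000, min_doc_count=1)
print(len(all_buckets), non_empty)
```

With the default of 0 the sketch yields the same 15 buckets as the first response, and with `min_doc_count=1` only the six non-empty ones remain.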
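5.2. Range bucketing with range
Range bucketing is similar to step bucketing in that it also groups numbers into segments, except that with range you specify the start and end of each group yourself.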
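As an illustration of range bucketing, the Python sketch below counts prices in hand-picked ranges. The boundaries (20000 and 40000) and the helper `range_buckets` are assumptions made for this example; each range includes its `from` value and excludes its `to` value, which is how the range aggregation treats its bounds.

```python
# Assumed in-memory copy of the article's `price` values.
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]

def range_buckets(values, ranges):
    """Count values per range; `from` is inclusive, `to` is exclusive."""
    buckets = []
    for r in ranges:
        lo, hi = r.get("from"), r.get("to")
        n = sum(
            1 for v in values
            if (lo is None or v >= lo) and (hi is None or v < hi)
        )
        buckets.append({"from": lo, "to": hi, "doc_count": n})
    return buckets

# Hypothetical boundaries chosen only for this example.
ranges = [{"to": 20000}, {"from": 20000, "to": 40000}, {"from": 40000}]
price_ranges = range_buckets(prices, ranges)
print(price_ranges)
```

Unlike the histogram, the buckets here need not have equal width, and an open-ended range (no `from` or no `to`) catches everything below or above a threshold.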
