简介
那么metric聚合又如何理解呢?我认为从两个角度:
- 从分类看:Metric聚合分析分为单值分析和多值分析两类
- 从功能看:根据具体的应用场景设计了一些分析api, 比如地理位置,百分数等等
融合上述两个方面,我们可以梳理出大致的一个mind图:
- 单值分析: 只输出一个分析结果
- 标准stat型
- avg 平均值
- max 最大值
- min 最小值
- sum 和
- value_count 数量
- 其它类型
- cardinality 基数(distinct去重)
- weighted_avg 带权重的avg
- median_absolute_deviation 中位值
- 标准stat型
- 多值分析: 单值之外的
- stats型
- stats 包含avg,max,min,sum和count
- matrix_stats 针对矩阵模型
- extended_stats
- string_stats 针对字符串
- 百分数型
- percentiles 百分数范围
- percentile_ranks 百分数排行
- 地理位置型
- geo_bounds Geo bounds
- geo_centroid Geo-centroid
- geo_line Geo-Line
- Top型
- stats型
POST /book/_search { “size”: 0, “aggs”: { “max_price”: { “max”: { “field”: “price” } } } }
**注意下类型**<br />把类型改为 number (integer, double etc.)
```json
"mlf16_txservnum": {
"type": "integer"
}
如果您的数据很大并且无法重新索引,您可以在运行时更改 value 的数据类型
"aggregations" : {
"Sum_Service_Rate_Numerator" : {
"sum" : {
"field" : 'Integer.parseInt(doc["mlf16_txservnum"].value)'
}
},
"Sum_Service_Rate_Denominator" : {
"sum" : {
"field" : 'Integer.parseInt(doc["mlf16_txservden"].value)'
}
}
}
文档计数count
示例: 统计price大于100的文档数量
POST /book/_count
{
"query": {
"range": {
"price": {
"gt": 100
}
}
}
}
value_count 统计某字段有值的文档数
# 有值的有几个
POST /book/_search?size=0
{
"aggs": {
"price_count": {
"value_count": {
"field": "price"
}
}
}
}
单值分析
weighted_avg 带权重的avg
POST /exams/_search
{
"size": 0,
"aggs": {
"weighted_grade": {
"weighted_avg": {
"value": {
"field": "grade"
},
"weight": {
"field": "weight"
}
}
}
}
}
cardinality值去重计数 基数
类似sql中的distinct count概念
POST /book/_search?size=0
{
"aggs": {
"_id_count": {
"cardinality": {
"field": "_id"
}
},
"price_count": {
"cardinality": {
"field": "price"
}
}
}
}
median_absolute_deviation 中位值
GET reviews/_search
{
"size": 0,
"aggs": {
"review_average": {
"avg": {
"field": "rating"
}
},
"review_variability": {
"median_absolute_deviation": {
"field": "rating"
}
}
}
}
非单值分析:stats型
stats 统计 count max min avg sum 5个值
POST /book/_search?size=0
{
"aggs": {
"price_stats": {
"stats": {
"field": "price"
}
}
}
}
matrix_stats 针对矩阵模型
以下示例说明了使用矩阵统计量来描述收入与贫困之间的关系。
GET /_search
{
"aggs": {
"statistics": {
"matrix_stats": {
"fields": [ "poverty", "income" ]
}
}
}
}
返回
{
...
"aggregations": {
"statistics": {
"doc_count": 50,
"fields": [ {
"name": "income",
"count": 50,
"mean": 51985.1,
"variance": 7.383377037755103E7,
"skewness": 0.5595114003506483,
"kurtosis": 2.5692365287787124,
"covariance": {
"income": 7.383377037755103E7,
"poverty": -21093.65836734694
},
"correlation": {
"income": 1.0,
"poverty": -0.8352655256272504
}
}, {
"name": "poverty",
"count": 50,
"mean": 12.732000000000001,
"variance": 8.637730612244896,
"skewness": 0.4516049811903419,
"kurtosis": 2.8615929677997767,
"covariance": {
"income": -21093.65836734694,
"poverty": 8.637730612244896
},
"correlation": {
"income": -0.8352655256272504,
"poverty": 1.0
}
} ]
}
}
}
string_stats 针对字符串
用于计算从聚合文档中提取的字符串值的统计信息。这些值可以从特定的关键字字段中检索
POST /my-index-000001/_search?size=0
{
"aggs": {
"message_stats": { "string_stats": { "field": "message.keyword" } }
}
}
返回
{
...
"aggregations": {
"message_stats": {
"count": 5,
"min_length": 24,
"max_length": 30,
"avg_length": 28.8,
"entropy": 3.94617750050791
}
}
}
Extended stats高级统计
比stats多4个统计结果: 平方和、方差、标准差、平均值加/减两个标准差的区间
POST /book/_search?size=0
{
"aggs": {
"price_stats": {
"extended_stats": {
"field": "price"
}
}
}
}
非单值分析:Top型
Top Hits求top
terms分桶聚合, 然后再agg子查询, top_hits是关键词
GET /book/_search
{
"size": 0,
"aggs": {
"jobs": {
"terms": {
"field": "price",
"size": 10
},
"aggs": {
"top_price": {
"top_hits": {
"size": 10,
"sort": [
{
"price": {
"order": "desc"
}
}
]
}
}
}
}
}
}
top_metrics
POST /test/_bulk?refresh
{"index": {}}
{"s": 1, "m": 3.1415}
{"index": {}}
{"s": 2, "m": 1.0}
{"index": {}}
{"s": 3, "m": 2.71828}
POST /test/_search?filter_path=aggregations
{
"aggs": {
"tm": {
"top_metrics": {
"metrics": {"field": "m"},
"sort": {"s": "desc"}
}
}
}
}
非单值分析:百分数型
Percentiles 占比百分位对应的值统计
POST /book/_search?size=0
{
"aggs": {
"price_percents": {
"percentiles": {
"field": "price"
}
}
}
}
指定分位值
只给我75%, 99%的数据
POST /book/_search?size=0
{
"aggs": {
"price_percents": {
"percentiles": {
"field": "price",
"percents": [
75,
99
]
}
}
}
}
Percentiles rank 统计值小于等于指定值的文档占比
统计price小于100和200的文档的占比
POST /book/_search?size=0
{
"aggs": {
"gge_perc_rank": {
"percentile_ranks": {
"field": "price",
"values": [
100,
200
]
}
}
}
}
非单值分析:地理位置型
geo_bounds
PUT /museums
{
"mappings": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
POST /museums/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d'Orsay"}
POST /museums/_search?size=0
{
"query": {
"match": { "name": "musée" }
},
"aggs": {
"viewport": {
"geo_bounds": {
"field": "location",
"wrap_longitude": true
}
}
}
}
上面的汇总展示了如何针对具有商店业务类型的所有文档计算位置字段的边界框
{
...
"aggregations": {
"viewport": {
"bounds": {
"top_left": {
"lat": 48.86111099738628,
"lon": 2.3269999679178
},
"bottom_right": {
"lat": 48.85999997612089,
"lon": 2.3363889567553997
}
}
}
}
}
Geo-centroid
PUT /museums
{
"mappings": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
POST /museums/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "city": "Amsterdam", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "city": "Amsterdam", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "city": "Amsterdam", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "city": "Antwerp", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "city": "Paris", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "city": "Paris", "name": "Musée d'Orsay"}
POST /museums/_search?size=0
{
"aggs": {
"centroid": {
"geo_centroid": {
"field": "location"
}
}
}
}
上面的汇总显示了如何针对所有具有犯罪类型的盗窃文件计算位置字段的质心。
{
...
"aggregations": {
"centroid": {
"location": {
"lat": 51.00982965203002,
"lon": 3.9662131341174245
},
"count": 6
}
}
}
Geo-Line
PUT test
{
"mappings": {
"dynamic": "strict",
"_source": {
"enabled": false
},
"properties": {
"my_location": {
"type": "geo_point"
},
"group": {
"type": "keyword"
},
"@timestamp": {
"type": "date"
}
}
}
}
POST /test/_bulk?refresh
{"index": {}}
{"my_location": {"lat":37.3450570, "lon": -122.0499820}, "@timestamp": "2013-09-06T16:00:36"}
{"index": {}}
{"my_location": {"lat": 37.3451320, "lon": -122.0499820}, "@timestamp": "2013-09-06T16:00:37Z"}
{"index": {}}
{"my_location": {"lat": 37.349283, "lon": -122.0505010}, "@timestamp": "2013-09-06T16:00:37Z"}
POST /test/_search?filter_path=aggregations
{
"aggs": {
"line": {
"geo_line": {
"point": {"field": "my_location"},
"sort": {"field": "@timestamp"}
}
}
}
}
将存储桶中的所有geo_point值聚合到由所选排序字段排序的LineString中。
{
"aggregations": {
"line": {
"type" : "Feature",
"geometry" : {
"type" : "LineString",
"coordinates" : [
[
-122.049982,
37.345057
],
[
-122.050501,
37.349283
],
[
-122.049982,
37.345132
]
]
},
"properties" : {
"complete" : true
}
}
}
}