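Introduction

How should metric aggregations be understood? I find two angles useful:

  • By category: metric aggregations split into single-value analysis and multi-value analysis
  • By function: some analysis APIs are designed for specific use cases, such as geo location and percentiles

Combining the two, we can sketch a rough mind map: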

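  • Single-value analysis: outputs exactly one result
    • Standard stat type
      • avg: average
      • max: maximum
      • min: minimum
      • sum: sum
      • value_count: number of values
    • Other types
      • cardinality: distinct-value count (like DISTINCT)
      • weighted_avg: weighted average
      • median_absolute_deviation: median absolute deviation
  • Multi-value analysis: everything that returns more than one value
    • stats type
      • stats: avg, max, min, sum, and count together
      • matrix_stats: statistics over a matrix of fields
      • extended_stats
      • string_stats: statistics for string values
    • Percentile type
      • percentiles: values at given percentiles
      • percentile_ranks: percentile rank of given values
    • Geo type
      • geo_bounds: Geo bounds
      • geo_centroid: Geo-centroid
      • geo_line: Geo-Line
    • Top type
      • top_hits: top hits within each bucket
      • top_metrics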

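Single-value analysis

max min sum avg

Example: find the price of the most expensive book

# "size" controls how many documents are returned; 0 means return no documents, only the aggregation result
# "aggs" declares the aggregations; max_price is a name of our own choosing
POST /book/_search
{
  "size": 0,
  "aggs": {
    "max_price": {
      "max": {
        "field": "price"
      }
    }
  }
}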
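To make the four basic metrics concrete, here is a minimal Python sketch of what max, min, sum, and avg compute over one numeric field. The sample prices and the helper name `single_value_metrics` are made up for illustration; Elasticsearch computes these per shard and merges the results, which this sketch does not model.

```python
# Hypothetical sample prices; the real values would come from the "book" index.
prices = [35.0, 128.0, 79.5, 210.0]

def single_value_metrics(values):
    """Return the four basic single-value metrics for a list of numbers."""
    return {
        "max": max(values),
        "min": min(values),
        "sum": sum(values),
        "avg": sum(values) / len(values),
    }

result = single_value_metrics(prices)
print(result)
```

Each entry corresponds to one single-value aggregation: a request asking only for `max` would return just the first of these numbers.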

Mind the field type: max, min, sum, and avg operate on numeric values, so the field should be mapped as a numeric type (integer, double, etc.):

"mlf16_txservnum": {
  "type": "integer"
}

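If the index is large and cannot be reindexed, the value can instead be converted at query time with a script. The "field" parameter only accepts a field name, not an expression, so the conversion goes into "script". This assumes the field is indexed as keyword so that doc values are available:

"aggregations": {
  "Sum_Service_Rate_Numerator": {
    "sum": {
      "script": {
        "source": "Integer.parseInt(doc['mlf16_txservnum'].value)"
      }
    }
  },
  "Sum_Service_Rate_Denominator": {
    "sum": {
      "script": {
        "source": "Integer.parseInt(doc['mlf16_txservden'].value)"
      }
    }
  }
}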

Document count

Example: count the documents whose price is greater than 100

POST /book/_count
{
  "query": {
    "range": {
      "price": {
        "gt": 100
      }
    }
  }
}

value_count: counts the values of a field across the matching documents

# how many values does the field have
POST /book/_search?size=0
{
  "aggs": {
    "price_count": {
      "value_count": {
        "field": "price"
      }
    }
  }
}
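The difference between counting documents and counting values is easy to miss, so here is a small Python sketch of the value_count semantics: a document without the field contributes nothing, and a multi-valued field contributes one count per value. The documents and the helper name `value_count` are hypothetical.

```python
# Hypothetical documents: one single-valued, one multi-valued, one without a price.
docs = [
    {"name": "A", "price": 120},
    {"name": "B", "price": [80, 95, 60]},  # multi-valued field: three values
    {"name": "C"},                          # no price field at all
]

def value_count(documents, field):
    """Count the values of `field`, not the documents that contain it."""
    total = 0
    for doc in documents:
        value = doc.get(field)
        if value is None:
            continue  # documents without the field are skipped
        total += len(value) if isinstance(value, list) else 1
    return total

price_count = value_count(docs, "price")
print(len(docs), price_count)
```

Here there are 3 documents but 4 price values, which is why value_count and a plain document count can disagree.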

Single-value analysis: other types

weighted_avg: weighted average

POST /exams/_search
{
  "size": 0,
  "aggs": {
    "weighted_grade": {
      "weighted_avg": {
        "value": {
          "field": "grade"
        },
        "weight": {
          "field": "weight"
        }
      }
    }
  }
}
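A weighted average is the sum of value times weight divided by the sum of the weights. The following Python sketch shows that formula on made-up (grade, weight) pairs; the data and the function name `weighted_avg` are illustrative only.

```python
def weighted_avg(pairs):
    """Compute sum(value * weight) / sum(weight) for (value, weight) pairs."""
    numerator = sum(value * weight for value, weight in pairs)
    denominator = sum(weight for _, weight in pairs)
    return numerator / denominator

# Hypothetical exam records as (grade, weight).
exams = [(90, 2), (80, 1), (70, 1)]
weighted_grade = weighted_avg(exams)
print(weighted_grade)
```

The grade of 90 counts twice as much as the others, so the result (82.5) sits above the plain average of 80.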

cardinality: distinct-value count

Similar to COUNT(DISTINCT ...) in SQL.

POST /book/_search?size=0 
{
  "aggs": {
    "_id_count": {
      "cardinality": {
        "field": "_id"
      }
    },
    "price_count": {
      "cardinality": {
        "field": "price"
      }
    }
  }
}
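Conceptually the result is the number of distinct values. The Python sketch below computes it exactly with a set on made-up prices. Note that Elasticsearch's cardinality aggregation is approximate (it is based on the HyperLogLog++ algorithm), so on large, high-cardinality fields its result can differ slightly from an exact count like this one.

```python
def exact_cardinality(values):
    """Exact distinct count; Elasticsearch returns an approximation instead."""
    return len(set(values))

# Hypothetical prices with duplicates.
prices = [35.0, 128.0, 35.0, 79.5, 128.0]
distinct_prices = exact_cardinality(prices)
print(distinct_prices)
```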

median_absolute_deviation: median absolute deviation (a robust measure of variability, not the median itself)

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_average": {
      "avg": {
        "field": "rating"
      }
    },
    "review_variability": {
      "median_absolute_deviation": {
        "field": "rating" 
      }
    }
  }
}
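The median absolute deviation is the median of the absolute distances between each value and the overall median. Here is a minimal exact version in Python on made-up ratings; Elasticsearch approximates the medians internally, so its output on large data may differ slightly.

```python
from statistics import median

def median_absolute_deviation(values):
    """median(|x - median(values)|) computed exactly."""
    center = median(values)
    return median(abs(v - center) for v in values)

# Hypothetical review ratings.
ratings = [1, 2, 2, 3, 5]
variability = median_absolute_deviation(ratings)
print(variability)
```

With these ratings the median is 2, the absolute deviations are 1, 0, 0, 1, 3, and their median is 1. The outlier 5 barely moves the result, which is the point of using this measure alongside avg.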

Multi-value analysis: stats type

stats: returns five values: count, max, min, avg, and sum

POST /book/_search?size=0 
{
  "aggs": {
    "price_stats": {
      "stats": {
        "field": "price"
      }
    }
  }
}

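matrix_stats: statistics over a matrix of fields

The following example uses matrix statistics to describe the relationship between income and poverty.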

GET /_search
{
  "aggs": {
    "statistics": {
      "matrix_stats": {
        "fields": [ "poverty", "income" ]
      }
    }
  }
}

Response

{
  ...
  "aggregations": {
    "statistics": {
      "doc_count": 50,
      "fields": [ {
          "name": "income",
          "count": 50,
          "mean": 51985.1,
          "variance": 7.383377037755103E7,
          "skewness": 0.5595114003506483,
          "kurtosis": 2.5692365287787124,
          "covariance": {
            "income": 7.383377037755103E7,
            "poverty": -21093.65836734694
          },
          "correlation": {
            "income": 1.0,
            "poverty": -0.8352655256272504
          }
        }, {
          "name": "poverty",
          "count": 50,
          "mean": 12.732000000000001,
          "variance": 8.637730612244896,
          "skewness": 0.4516049811903419,
          "kurtosis": 2.8615929677997767,
          "covariance": {
            "income": -21093.65836734694,
            "poverty": 8.637730612244896
          },
          "correlation": {
            "income": -0.8352655256272504,
            "poverty": 1.0
          }
        } ]
    }
  }
}
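The correlation values in this response follow from the covariance and the two variances: correlation = covariance / sqrt(variance_x * variance_y). The short Python check below plugs in the numbers from the response above; the function name is just for illustration.

```python
import math

def correlation(covariance, variance_x, variance_y):
    """Pearson correlation derived from a covariance and two variances."""
    return covariance / math.sqrt(variance_x * variance_y)

# Numbers copied from the matrix_stats response above.
r = correlation(
    covariance=-21093.65836734694,
    variance_x=7.383377037755103e7,   # income variance
    variance_y=8.637730612244896,     # poverty variance
)
print(r)
```

The result is about -0.835, matching the "correlation" entry: higher income goes with lower poverty in this data set.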

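string_stats: statistics for strings

Computes statistics over the string values extracted from the aggregated documents. The values can be read from a specific keyword field.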

POST /my-index-000001/_search?size=0
{
  "aggs": {
    "message_stats": { "string_stats": { "field": "message.keyword" } }
  }
}

Response

{
  ...

  "aggregations": {
    "message_stats": {
      "count": 5,
      "min_length": 24,
      "max_length": 30,
      "avg_length": 28.8,
      "entropy": 3.94617750050791
    }
  }
}
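As a rough illustration of these fields, the Python sketch below computes the same kind of statistics locally. It assumes that entropy means Shannon entropy in bits over the characters of all values taken together; that is my reading of the field, not something the response itself confirms. The sample strings are made up.

```python
import math
from collections import Counter

def string_stats(strings):
    """Length statistics plus character-level Shannon entropy (assumed definition)."""
    lengths = [len(s) for s in strings]
    char_counts = Counter("".join(strings))
    total_chars = sum(char_counts.values())
    entropy = -sum(
        (n / total_chars) * math.log2(n / total_chars)
        for n in char_counts.values()
    )
    return {
        "count": len(strings),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "entropy": entropy,
    }

# Hypothetical values: two characters, equally frequent, so entropy is 1 bit.
message_stats = string_stats(["abab", "ab"])
print(message_stats)
```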

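Extended stats: advanced statistics

Adds four groups of results on top of stats: sum of squares, variance, standard deviation, and the interval of the mean plus/minus two standard deviations.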

POST /book/_search?size=0 
{
  "aggs": {
    "price_stats": {
      "extended_stats": {
        "field": "price"
      }
    }
  }
}
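The extra values can be derived from the same inputs as stats. The Python sketch below uses the population variance (dividing by n) and a fixed width of two standard deviations for the bounds; the sample values and key names are illustrative and simplified compared with the real response.

```python
import math

def extended_stats(values):
    """Sum of squares, population variance, std deviation, and mean +/- 2 sigma."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std_deviation = math.sqrt(variance)
    return {
        "sum_of_squares": sum(v * v for v in values),
        "variance": variance,
        "std_deviation": std_deviation,
        "bounds_upper": mean + 2 * std_deviation,
        "bounds_lower": mean - 2 * std_deviation,
    }

# Hypothetical prices with mean 5 and standard deviation 2.
price_stats = extended_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(price_stats)
```

Values outside the lower and upper bounds (here 1 and 9) are the ones a simple outlier check would flag.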

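Multi-value analysis: Top type

Top Hits: top documents per bucket

Bucket the documents with a terms aggregation first, then add a sub-aggregation inside each bucket; top_hits is the keyword.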

GET /book/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "price",
        "size": 10
      },
      "aggs": {
        "top_price": {
          "top_hits": {
            "size": 10,
            "sort": [
              {
                "price": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
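The bucket-then-top pattern can be pictured as a group-by followed by a per-group sort and cut. The Python sketch below is only a model of that idea: the `category` field, the sample books, and the helper name are hypothetical, and it ignores how the terms aggregation itself orders and limits its buckets.

```python
from collections import defaultdict

def top_hits_per_bucket(docs, bucket_field, sort_field, size):
    """Group docs by bucket_field, then keep the `size` highest by sort_field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc)
    return {
        key: sorted(group, key=lambda d: d[sort_field], reverse=True)[:size]
        for key, group in buckets.items()
    }

# Hypothetical books.
books = [
    {"title": "A", "category": "tech", "price": 120},
    {"title": "B", "category": "tech", "price": 80},
    {"title": "C", "category": "tech", "price": 200},
    {"title": "D", "category": "art", "price": 60},
]
top = top_hits_per_bucket(books, "category", "price", size=2)
print([d["title"] for d in top["tech"]])
```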

top_metrics

POST /test/_bulk?refresh
{"index": {}}
{"s": 1, "m": 3.1415}
{"index": {}}
{"s": 2, "m": 1.0}
{"index": {}}
{"s": 3, "m": 2.71828}

POST /test/_search?filter_path=aggregations
{
  "aggs": {
    "tm": {
      "top_metrics": {
        "metrics": {"field": "m"},
        "sort": {"s": "desc"}
      }
    }
  }
}
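In other words, top_metrics picks the document that sorts first and returns the requested metric from it. A minimal Python model using the three documents indexed above (the function name is illustrative):

```python
# The three documents from the bulk request above.
docs = [
    {"s": 1, "m": 3.1415},
    {"s": 2, "m": 1.0},
    {"s": 3, "m": 2.71828},
]

def top_metric(documents, sort_field, metric_field):
    """Return metric_field from the document with the largest sort_field."""
    best = max(documents, key=lambda d: d[sort_field])
    return best[metric_field]

tm = top_metric(docs, sort_field="s", metric_field="m")
print(tm)
```

Sorting by `s` descending selects the document with `s` = 3, so the returned metric is its `m` value. The value Elasticsearch reports may show floating-point rounding depending on how the field is mapped.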

Multi-value analysis: percentile type

Percentiles: the value found at each percentile

POST /book/_search?size=0 
{
  "aggs": {
    "price_percents": {
      "percentiles": {
        "field": "price"
      }
    }
  }
}

Specifying the percentiles

Return only the 75th and 99th percentiles

POST /book/_search?size=0 
{
  "aggs": {
    "price_percents": {
      "percentiles": {
        "field": "price",
        "percents": [
          75,
          99
        ]
      }
    }
  }
}
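To show what a percentile value means, here is an exact Python computation using linear interpolation between neighbouring sorted values. This is one common definition, chosen for illustration; Elasticsearch computes percentiles approximately, so its numbers can differ, especially on small or skewed data. The sample prices are made up.

```python
import math

def percentile(values, percent):
    """Exact percentile with linear interpolation between sorted neighbours."""
    ordered = sorted(values)
    rank = (percent / 100) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Hypothetical prices.
prices = [10, 20, 30, 40, 50]
p75 = percentile(prices, 75)
p99 = percentile(prices, 99)
print(p75, p99)
```

Read the result as: 75% of the prices are at or below `p75`, and 99% are at or below `p99`.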

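Percentile ranks: the percentage of values less than or equal to a given value

Compute the percentage of documents whose price is at most 100 and at most 200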

POST /book/_search?size=0 
{
  "aggs": {
    "gge_perc_rank": {
      "percentile_ranks": {
        "field": "price",
        "values": [
          100,
          200
        ]
      }
    }
  }
}
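percentile_ranks is the inverse question of percentiles: given a value, what share of the data lies at or below it? An exact Python sketch on made-up prices follows; Elasticsearch estimates this, so its result may not match exactly.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to threshold."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)

# Hypothetical prices.
prices = [50, 80, 100, 150, 200, 300, 120, 90]
ranks = {value: percentile_rank(prices, value) for value in (100, 200)}
print(ranks)
```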

Multi-value analysis: geo type

geo_bounds

PUT /museums
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}

POST /museums/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d'Orsay"}

POST /museums/_search?size=0
{
  "query": {
    "match": { "name": "musée" }
  },
  "aggs": {
    "viewport": {
      "geo_bounds": {
        "field": "location",    
        "wrap_longitude": true  
      }
    }
  }
}

The aggregation above computes the bounding box of the location field for all documents whose name matches "musée".

{
  ...
  "aggregations": {
    "viewport": {
      "bounds": {
        "top_left": {
          "lat": 48.86111099738628,
          "lon": 2.3269999679178
        },
        "bottom_right": {
          "lat": 48.85999997612089,
          "lon": 2.3363889567553997
        }
      }
    }
  }
}
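The box is simply the extreme latitudes and longitudes of the matching points. The Python sketch below reproduces it for the two Paris museums matched by the query. It assumes the points do not straddle the antimeridian, so it does not model `wrap_longitude`.

```python
def geo_bounds(points):
    """Bounding box of (lat, lon) pairs; no antimeridian handling."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return {
        "top_left": {"lat": max(lats), "lon": min(lons)},
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }

# The two documents whose name matches "musée" (Louvre and Orsay).
paris = [(48.861111, 2.336389), (48.860000, 2.327000)]
viewport = geo_bounds(paris)
print(viewport)
```

The small differences from the response above (for example 48.86111099738628 instead of 48.861111) come from how the index stores coordinates, not from the bounding-box logic.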

Geo-centroid

PUT /museums
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}

POST /museums/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "city": "Amsterdam", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "city": "Amsterdam", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "city": "Amsterdam", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "city": "Antwerp", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "city": "Paris", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "city": "Paris", "name": "Musée d'Orsay"}

POST /museums/_search?size=0
{
  "aggs": {
    "centroid": {
      "geo_centroid": {
        "field": "location" 
      }
    }
  }
}

The aggregation above computes the centroid of the location field across all documents in the museums index.

{
  ...
  "aggregations": {
    "centroid": {
      "location": {
        "lat": 51.00982965203002,
        "lon": 3.9662131341174245
      },
      "count": 6
    }
  }
}
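The numbers in this response are consistent with a plain arithmetic mean of the latitudes and of the longitudes. The Python check below uses the six museum coordinates from the bulk request; treat the averaging as an observation about this example rather than a statement of the exact internal algorithm.

```python
def geo_centroid(points):
    """Arithmetic mean of (lat, lon) pairs, plus the number of points."""
    count = len(points)
    lat = sum(p[0] for p in points) / count
    lon = sum(p[1] for p in points) / count
    return {"location": {"lat": lat, "lon": lon}, "count": count}

# Coordinates of the six museums indexed above.
museums = [
    (52.374081, 4.912350),
    (52.369219, 4.901618),
    (52.371667, 4.914722),
    (51.222900, 4.405200),
    (48.861111, 2.336389),
    (48.860000, 2.327000),
]
centroid = geo_centroid(museums)
print(centroid)
```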

Geo-Line

PUT test
{
    "mappings": {
        "dynamic": "strict",
        "_source": {
            "enabled": false
        },
        "properties": {
            "my_location": {
                "type": "geo_point"
            },
            "group": {
                "type": "keyword"
            },
            "@timestamp": {
                "type": "date"
            }
        }
    }
}

POST /test/_bulk?refresh
{"index": {}}
{"my_location": {"lat":37.3450570, "lon": -122.0499820}, "@timestamp": "2013-09-06T16:00:36"}
{"index": {}}
{"my_location": {"lat": 37.3451320, "lon": -122.0499820}, "@timestamp": "2013-09-06T16:00:37Z"}
{"index": {}}
{"my_location": {"lat": 37.349283, "lon": -122.0505010}, "@timestamp": "2013-09-06T16:00:37Z"}

POST /test/_search?filter_path=aggregations
{
  "aggs": {
    "line": {
      "geo_line": {
        "point": {"field": "my_location"},
        "sort": {"field": "@timestamp"}
      }
    }
  }
}

geo_line aggregates all geo_point values in the bucket into a LineString ordered by the chosen sort field.

{
  "aggregations": {
    "line": {
      "type" : "Feature",
      "geometry" : {
        "type" : "LineString",
        "coordinates" : [
          [
            -122.049982,
            37.345057
          ],
          [
            -122.050501,
            37.349283
          ],
          [
            -122.049982,
            37.345132
          ]
        ]
      },
      "properties" : {
        "complete" : true
      }
    }
  }
}
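The shape of this result can be modelled as "sort the points, then emit [lon, lat] pairs". The Python sketch below does that with hypothetical, distinct timestamps; when two points share a timestamp, as two of the documents above do, their relative order is not determined by the sort field alone.

```python
from datetime import datetime

def geo_line(points):
    """Order points by timestamp and emit GeoJSON-style [lon, lat] pairs."""
    ordered = sorted(points, key=lambda p: datetime.fromisoformat(p["timestamp"]))
    return {
        "type": "LineString",
        "coordinates": [[p["lon"], p["lat"]] for p in ordered],
    }

# Hypothetical track points, deliberately given out of order.
track = [
    {"lat": 37.349283, "lon": -122.050501, "timestamp": "2013-09-06T16:00:38"},
    {"lat": 37.345057, "lon": -122.049982, "timestamp": "2013-09-06T16:00:36"},
    {"lat": 37.345132, "lon": -122.049982, "timestamp": "2013-09-06T16:00:37"},
]
line = geo_line(track)
print(line["coordinates"])
```

Note the coordinate order: GeoJSON puts longitude first, which is why the response lists -122.x before 37.x even though the documents were indexed as lat/lon objects.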