Memory Leak

Create Detection

image.png

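Flow

Intelligent Inspection - Figure 2

Granularity

Object: host

Data Source and Detection Frequency

Based on the host object (hostobject) and the memory metric (mem)
Data point granularity (frequency): 1 minute
Detection data range: 6 hours
Scheduled execution interval: once every 6 hours

DQL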

M::mem:(avg(used_percent))[6h:1m:1m] by cluster_name, host, host_ip

Data Format

-----------------[ r1.mem.s1 ]-----------------
          active 4669968384
        available 4744269824
available_percent 59.913079
        buffered 194924544
          cached 3762335744
    cluster_name 'minik8s-istio'
    commit_limit 3959291904
    committed_as 13459935232
            dirty 61440
            free 1107689472
        high_free 0
      high_total 0
            host 'minikube'
          host_ip '192.168.49.2'
  huge_pages_free 0
  huge_pages_size 2097152
 huge_pages_total 0
        inactive 1683832832
    instancename <nil>
          interip <nil>
        low_free 0
        low_total 0
          mapped 590659584
            owner <nil>
      page_tables 19881984
          project <nil>
          shared 19841024
            slab 208162816
    sreclaimable 150298624
      sunreclaim 57864192
      swap_cached 0
        swap_free 0
      swap_total 0
            time 2022-02-07 08:00:00 +0800 CST
            total 7918587904
            used 2853638144
    used_percent 36.037210
    vmalloc_chunk 35184287895552
    vmalloc_total 35184372087808
    vmalloc_used 20488192
            wired <nil>
      write_back 0
  write_back_tmp 0
---------


{'statement_id': 0, 'series': [{'name': 'mem', 'columns': ['time', 'active', 'available', 'available_percent', 'buffered', 'cached', 'cluster_name', 'commit_limit', 'committed_as', 'dirty', 'free', 'high_free', 'high_total', 'host', 'host_ip', 'huge_pages_free', 'huge_pages_size', 'huge_pages_total', 'inactive', 'instancename', 'interip', 'low_free', 'low_total', 'mapped', 'owner', 'page_tables', 'project', 'shared', 'slab', 'sreclaimable', 'sunreclaim', 'swap_cached', 'swap_free', 'swap_total', 'total', 'used', 'used_percent', 'vmalloc_chunk', 'vmalloc_total', 'vmalloc_used', 'wired', 'write_back', 'write_back_tmp'], 'values': [[1644192000423, 4669968384, 4744269824, 59.91307896706529, 194924544, 3762335744, 'minik8s-istio', 3959291904, 13459935232, 61440, 1107689472, 0, 0, 'minikube', '192.168.49.2', 0, 2097152, 0, 1683832832, None, None, 0, 0, 590659584, None, 19881984, None, 19841024, 208162816, 150298624, 57864192, 0, 0, 0, 7918587904, 2853638144, 36.03720989898353, 35184287895552, 35184372087808, 20488192, None, 0, 0], [1644192000558, 1053151232, 5973762048, 75.97008573498778, 0, 2907250688, None, 3931652096, 5958238208, 1544192, 422965248, 0, 0, 'solrserver.lianglab.cn', None, 0, 2097152, 0, 6162206720, None, None, 0, 0, 111165440, None, 17469440, None, 2043904, 111661056, 69386240, 42274816, 0, 0, 0, 7863308288, 4533092352, 57.648666260711664, 0, 35184372087808, 0, None, 0, 0]]}]}
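
The result above pairs one shared columns list with one values array per row. The following is a minimal sketch, assuming the result is already available as a Python dict, of turning it into one record per row. The helper name rows_from_series is hypothetical, and the sample is trimmed to three of the columns shown above.

```python
from typing import Any, Dict, List


def rows_from_series(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a columns/values result into one dict per row."""
    rows = []
    for series in result.get("series", []):
        columns = series["columns"]
        for values in series["values"]:
            # Pair each column name with the value at the same position.
            rows.append(dict(zip(columns, values)))
    return rows


# Trimmed sample: only three of the columns from the result above.
sample = {
    "statement_id": 0,
    "series": [{
        "name": "mem",
        "columns": ["time", "host", "used_percent"],
        "values": [
            [1644192000423, "minikube", 36.03720989898353],
            [1644192000558, "solrserver.lianglab.cn", 57.648666260711664],
        ],
    }],
}

rows = rows_from_series(sample)
for row in rows:
    print(row["host"], round(row["used_percent"], 2))
```

Keying each row by column name avoids relying on column order, which differs between the sample results in this document.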

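Trigger Logic

Track the used_percent trend; trigger an event when it is about to approach the critical value.
Track the used_percent trend; trigger an event when it rises rapidly.
Track the used_percent trend; trigger an event when it reaches the critical value.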
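
The three trigger conditions above can be sketched as follows. This is only an illustration: an ordinary least-squares line stands in for the forecasting models, and the function names, the 90% critical value, the 60-point look-ahead, and the slope limit are all assumptions, not product settings.

```python
from typing import List, Tuple


def linear_fit(ys: List[float]) -> Tuple[float, float]:
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept); slope is in percentage points per data point.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def check_triggers(ys: List[float], critical: float = 90.0,
                   horizon: int = 60, max_slope: float = 0.1) -> List[str]:
    """Return the trigger reasons that apply to a used_percent series.

    critical, horizon (points to look ahead) and max_slope are
    illustrative assumptions.
    """
    slope, intercept = linear_fit(ys)
    forecast = intercept + slope * (len(ys) - 1 + horizon)
    reasons = []
    if ys[-1] >= critical:
        reasons.append("critical_reached")
    elif forecast >= critical:
        reasons.append("approaching_critical")
    if slope >= max_slope:
        reasons.append("rapid_rise")
    return reasons


# 360 one-minute points (6 hours) rising by 0.12 points per minute.
rising = [40 + 0.12 * i for i in range(360)]
print(check_triggers(rising))
```

With 1-minute data points, a horizon of 60 means the check looks one hour ahead.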

Algorithm Logic

Trend forecasting uses algorithms and toolkits such as ADTK, Prophet, LSTM, ARIMA, and Holt-Winters.
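
As an illustration of what trend forecasting means here, the sketch below implements Holt's linear trend method (double exponential smoothing), which is the non-seasonal core of Holt-Winters. It is a hand-written example under stated assumptions, not the product's implementation: the function name and the alpha and beta defaults are hypothetical.

```python
from typing import List


def holt_forecast(ys: List[float], alpha: float = 0.5, beta: float = 0.5,
                  steps: int = 1) -> List[float]:
    """Holt's linear trend method (double exponential smoothing).

    alpha smooths the level and beta smooths the trend; both defaults are
    illustrative. Returns forecasts for the next `steps` points.
    """
    if len(ys) < 2:
        raise ValueError("need at least two data points")
    # Initialise the level with the first point and the trend with the
    # first difference.
    level, trend = ys[0], ys[1] - ys[0]
    for y in ys[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # The k-step-ahead forecast extends the last level along the last trend.
    return [level + k * trend for k in range(1, steps + 1)]


history = [36.0, 36.5, 37.0, 37.5, 38.0]
print(holt_forecast(history, steps=3))
```

A forecast produced this way can be compared with the critical value to decide whether used_percent is about to approach it.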

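Event Content

Inspection event time, event_id, status, inspection category, inspection type, inspection object, tags, anomaly, alert information

Field                   Description
date                    Time the intelligent inspection event was generated. Unix timestamp, in ms.
df_event_id             Event ID. Note: the same event exists in two states, ongoing and resolved.
df_status               Status. Values: ongoing, resolved
df_watchdog_category    Inspection category. Value: infrastructure
df_watchdog_type        Inspection type. Value: mem_leak
df_watchdog_object      Inspection object. Value: Host (host)
df_watchdog_tags        Fixed tags. Values:
                        Project (project)
                        Cloud provider (cloud_provider)
                        Label attribute (df_label)
df_title                Fixed format:
                        Host {#host} has a memory leak
df_message              Fixed format:
                        - Memory usage trend chart
                        - Memory usage data points for the last 6 hours
                        - Top 10 processes by memory usage on the inspection object
                          - process_name
                          - pid
                          - mem_usage
                        - Top 10 Pods by memory usage on the inspection object
                          - pod_name
                          - mem_usage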
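
The fields above can be assembled into an event record along these lines. This is a sketch only: the function name, the use of a plain dict, the uuid-based event ID, and passing tags as a dict are assumptions, since the document specifies the fields but not their encoding.

```python
import time
import uuid
from typing import Any, Dict, Optional


def build_mem_leak_event(host: str, tags: Dict[str, Any], message: str,
                         status: str = "ongoing",
                         event_id: Optional[str] = None,
                         now_ms: Optional[int] = None) -> Dict[str, Any]:
    """Assemble a memory-leak inspection event (hypothetical layout)."""
    if status not in ("ongoing", "resolved"):
        raise ValueError("status must be 'ongoing' or 'resolved'")
    return {
        # Unix timestamp in milliseconds.
        "date": now_ms if now_ms is not None else int(time.time() * 1000),
        # Reuse the same ID when the event later moves to 'resolved'.
        "df_event_id": event_id or uuid.uuid4().hex,
        "df_status": status,
        "df_watchdog_category": "infrastructure",
        "df_watchdog_type": "mem_leak",
        "df_watchdog_object": "host",
        "df_watchdog_tags": tags,
        "df_title": f"Host {host} has a memory leak",
        "df_message": message,
    }


opened = build_mem_leak_event("minikube", {"project": None},
                              "memory usage trend ...")
closed = build_mem_leak_event("minikube", {"project": None},
                              "memory usage trend ...",
                              status="resolved",
                              event_id=opened["df_event_id"])
print(opened["df_title"], opened["df_status"], closed["df_status"])
```

Passing the original df_event_id when resolving reflects the note that one event exists in both the ongoing and resolved states.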

Analysis Flow

image.png

Insufficient Disk Space

Create Detection

image.png

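Flow

Intelligent Inspection - Figure 5

Granularity

Object: host + device

Data Source and Detection Frequency

Based on the host object (hostobject) and the disk metric (disk)
Data point granularity (frequency): 1 hour
Detection data range: 14 days
Scheduled execution interval: once a day

DQL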

M::disk:(avg(used_percent))[14d:1h:1h]  by host,device,path

Data Format

-----------------[ r10.disk.s1 ]-----------------
cluster_name <nil>
      device 'disk1s4'
        free 195231932416
      fstype 'apfs'
        host 'jays-MacBook-Pro.local'
    host_ip <nil>
 inodes_free 1906561840
inodes_total 1906561842
 inodes_used 2
instancename <nil>
    interip <nil>
        mode 'rw'
      owner <nil>
        path '/System/Volumes/VM'
    project <nil>
        time 2022-03-08 10:03:24 +0800 CST
      total 499588771840
        used 304356839424
used_percent 60.921473
---------


{'statement_id': 0, 'series': [{'name': 'disk', 'columns': ['time', 'cluster_name', 'device', 'free', 'fstype', 'host', 'host_ip', 'inodes_free', 'inodes_total', 'inodes_used', 'instancename', 'interip', 'mode', 'owner', 'path', 'project', 'total', 'used', 'used_percent'], 'values': [[1646705001642, None, 'disk1s1', 195231948800, 'apfs', 'jays-MacBook-Pro.local', None, 1906562000, 1911106183, 4544183, None, None, 'rw', None, '/System/Volumes/Data', None, 499588771840, 304356823040, 60.921469855906686], [1646705001642, None, 'disk1s2', 195231948800, 'apfs', 'jays-MacBook-Pro.local', None, 1906562000, 1906563169, 1169, None, None, 'rw', None, '/System/Volumes/Preboot', None, 499588771840, 304356823040, 60.921469855906686]]}]}
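
Because the disk check works at host + device granularity, rows like the ones above need to be grouped before any trend is computed. The following is a minimal sketch, assuming the result is already a Python dict; the function name series_by_device is hypothetical and the sample is trimmed to four columns with made-up later timestamps and values for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Key = Tuple[str, str]      # (host, device)
Point = Tuple[int, float]  # (time in ms, used_percent)


def series_by_device(result: Dict[str, Any]) -> Dict[Key, List[Point]]:
    """Group rows into one time-ordered series per (host, device)."""
    grouped = defaultdict(list)
    for series in result.get("series", []):
        columns = series["columns"]
        for values in series["values"]:
            row = dict(zip(columns, values))
            if row.get("used_percent") is None:
                continue  # skip rows without a usable value
            grouped[(row["host"], row["device"])].append(
                (row["time"], row["used_percent"]))
    for points in grouped.values():
        points.sort(key=lambda point: point[0])  # order by timestamp
    return dict(grouped)


# Illustrative sample: the second timestamp and its values are invented.
sample = {
    "statement_id": 0,
    "series": [{
        "name": "disk",
        "columns": ["time", "device", "host", "used_percent"],
        "values": [
            [1646708601642, "disk1s1", "jays-MacBook-Pro.local", 61.0],
            [1646705001642, "disk1s1", "jays-MacBook-Pro.local", 60.92],
            [1646705001642, "disk1s2", "jays-MacBook-Pro.local", 60.92],
        ],
    }],
}

grouped = series_by_device(sample)
for key, points in grouped.items():
    print(key, len(points))
```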

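Trigger Logic

Track the used_percent trend; trigger an event when it is about to approach the critical value.
Track the used_percent trend; trigger an event when it reaches the critical value.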
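
One way to express "about to approach the critical value" for disk usage is to estimate how long the current trend needs to reach it. The sketch below fits a least-squares line to hourly used_percent points and extrapolates; the function name and the 90% critical value are assumptions, and the straight-line fit is a stand-in for the forecasting models, not the product's method.

```python
from typing import List, Optional


def hours_until_critical(ys: List[float],
                         critical: float = 90.0) -> Optional[float]:
    """Estimate hours until hourly used_percent points reach `critical`.

    Returns 0.0 if the latest point is already at or above the critical
    value, and None if the fitted trend is flat or falling.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two data points")
    if ys[-1] >= critical:
        return 0.0
    # Least-squares line through (0, ys[0]), (1, ys[1]), ...
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    hit_index = (critical - intercept) / slope
    # Measure from the latest point, one index per hour.
    return max(0.0, hit_index - (n - 1))


# 14 days of hourly points (336) growing by 0.05 points per hour.
usage = [60 + 0.05 * i for i in range(336)]
remaining = hours_until_critical(usage)
print(round(remaining / 24, 1), "days")
```

An inspection could raise an event when the estimate drops below a chosen lead time; what that lead time should be is not stated in this document.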

Algorithm Logic

Trend forecasting uses algorithms and toolkits such as ADTK, Prophet, LSTM, ARIMA, and Holt-Winters.

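Event Content

Inspection event time, event_id, status, inspection category, inspection type, inspection object, tags, anomaly, alert information

Field                   Description
date                    Time the intelligent inspection event was generated. Unix timestamp, in ms.
df_event_id             Event ID. Note: the same event exists in two states, ongoing and resolved.
df_status               Status. Values: ongoing, resolved
df_watchdog_category    Inspection category. Value: infrastructure
df_watchdog_type        Inspection type. Value: disk_usage
df_watchdog_object      Inspection object. Values:
                        Host (host)
                        Disk (device)
df_watchdog_tags        Fixed tags. Values:
                        Project (project)
                        Cloud provider (cloud_provider)
                        Label attribute (df_label)
df_title                Fixed format:
                        Host {#host} has {#N} disks whose usage keeps rising
df_message              Fixed format:
                        - Disk usage trend chart
                        - Disk usage data points for the last 6 hours
                        - Current disk mount point
                          - Obtain the mount path of the problem device
                        - Recommendations
                          - To be completed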

Analysis Flow

image.png

APM Inspection

Create Detection

image.png

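Flow

Intelligent Inspection - Figure 8

Granularity

Granularity: all service + resource combinations under the specified project, env, and version; specific services and resources can also be excluded.

Data Source and Detection Frequency

Per-minute metrics for each service
Topology between services
Per-minute metrics for each service + resource
service + resource + span data
service + resource + span + rum data
service + resource + span + log data
service + resource + span + host data
service + resource + span + host + middleware data

APM Request Rate

Detection granularity: resource
Data point granularity (frequency): 1 minute
Detection data range: 6 hours
Scheduled execution interval: once every 6 hours

APM Latency

Detection granularity: resource
Data point granularity (frequency): 1 minute
Detection data range: 6 hours
Scheduled execution interval: once every 6 hours

APM Error Rate

Detection granularity: resource
Data point granularity (frequency): 1 minute
Detection data range: 6 hours
Scheduled execution interval: once every 6 hours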

DQL

# Request count. In production, filter conditions (project, env, version) are appended after the source condition.
T::re(`.*`):(count(__docid)){ `source` != 'service_map'}[6h:1m:1m] by service,resource
# Latency
T::re(`.*`):(avg(duration)){ `source` != 'service_map'}[6h:1m:1m] by service,resource
# Error count
T::re(`.*`):(count(__docid)){ `source` != 'service_map' and status = 'error'}[6h:1m:1m] by service,resource
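
The queries above return a request count and an error count per service, resource, and minute; the error rate used by the inspection is the ratio of the two. A minimal sketch, assuming both results have already been reduced to dicts keyed by (service, resource, minute); the function name, key layout, and sample counts are hypothetical.

```python
from typing import Dict, Tuple

Key = Tuple[str, str, int]  # (service, resource, minute timestamp in ms)


def error_rates(request_counts: Dict[Key, int],
                error_counts: Dict[Key, int]) -> Dict[Key, float]:
    """Error rate per key: errors divided by requests.

    Keys with no requests are skipped; keys missing from error_counts
    are treated as having zero errors.
    """
    rates = {}
    for key, total in request_counts.items():
        if total > 0:
            rates[key] = error_counts.get(key, 0) / total
    return rates


beat = "PUT /nacos/v1/ns/instance/beat"
# Invented counts for one minute, for illustration only.
requests = {
    ("demo-k8s-system", beat, 1647221880000): 20,
    ("demo-k8s-auth", beat, 1647221880000): 10,
}
errors = {
    ("demo-k8s-system", beat, 1647221880000): 3,
}

rates = error_rates(requests, errors)
for key, rate in rates.items():
    print(key[0], f"{rate:.0%}")
```

The error-count query only returns rows where errors occurred, which is why a missing key is read as zero errors rather than as missing data.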

Data Format

service map:

-----------------[ r1..s1 ]-----------------
      __docid 'T_c8n9nmhaahlf101n1q1g'
  call_counts 6
  create_time 1647221722778
      date_ns 0
          env 'dev'
          host 'k8s-node1'
        source 'service_map'
source_service 'demo-k8s-system'
target_service 'demo-k8s-system'
          time 2022-03-14 09:35:00 +0800 CST
          type 'web'
---------
1 rows, 1 series, cost 72ms


{'statement_id':0,'series':[{'columns':['__docid','time','env','source_service','target_service','type','call_counts','create_time','date_ns','host','source'],'values':[['T_c8n9o9k5jjqsb5dg4pe0',1647221760000,'dev','demo-k8s-auth','demo-k8s-auth','web',5,1647221798024,0,'k8s-node1','service_map'],['T_c8n9o9k5jjqsb5dg4peg',1647221760000,'dev','demo-k8s-gateway','demo-k8s-gateway','web',7,1647221798024,0,'k8s-node1','service_map']]}]}

span:

-----------------[ r1..s1 ]-----------------
        __docid 'T_c8n9ovs5jjql3a1t6ou0'
    cluster_name 'k8s-prod'
    create_time 1647221887505
        date_ns 4459
        duration 847
            env 'dev'
            host 'k8s-node1'
        host_ip '172.16.0.230'
    http_method 'PUT'
http_status_code '200'
        message '{"service":"demo-k8s-system","name":"http.request","resource":"PUT /nacos/v1/ns/instance/beat","trace_id":7379685012312364829,"span_id":4567185458459732621,"parent_id":0,"start":1647221885724004459,"duration":847984,"error":0,"meta":{"component":"http-url-connection","env":"dev","http.method":"PUT","http.status_code":"200","http.url":"http://172.16.0.229:8848/nacos/v1/ns/instance/beat","language":"jvm","node_ip":"172.16.0.230","peer.hostname":"172.16.0.229","runtime-id":"df764944-6e79-4cee-9883-c4975a887e56","span.kind":"client","thread.name":"com.alibaba.nacos.naming.beat.sender"},"metrics":{"_dd.agent_psr":1,"_dd.top_level":1,"_sampling_priority_v1":1,"peer.port":8848,"thread.id":83},"type":"http"}'
        node_ip '172.16.0.230'
      operation 'http.request'
      parent_id '0'
        resource 'PUT /nacos/v1/ns/instance/beat'
        service 'demo-k8s-system'
          source 'ddtrace'
        span_id '4567185458459732621'
      span_type 'entry'
          start 1647221885724004
          status 'ok'
            time 2022-03-14 09:38:05 +0800 CST
        trace_id '7379685012312364829'
            type 'web'
---------
1 rows, 1 series, cost 1.301s


{'statement_id':0,'series':[{'columns':['host','http_status_code','node_ip','operation','status','__docid','cluster_name','date_ns','trace_id','message','resource','service','span_type','type','create_time','duration','source','http_method','parent_id','span_id','start','time','env','host_ip'],'values':[['k8s-node1','200','172.16.0.230','http.request','ok','T_c8n9p9s5jjqsb5dkviug','k8s-prod',4258,'7237931248846529090','{"service":"demo-k8s-system","name":"http.request","resource":"PUT /nacos/v1/ns/instance/beat","trace_id":7237931248846529090,"span_id":3553429628118875994,"parent_id":0,"start":1647221925732004258,"duration":794942,"error":0,"meta":{"component":"http-url-connection","env":"dev","http.method":"PUT","http.status_code":"200","http.url":"http://172.16.0.229:8848/nacos/v1/ns/instance/beat","language":"jvm","node_ip":"172.16.0.230","peer.hostname":"172.16.0.229","runtime-id":"df764944-6e79-4cee-9883-c4975a887e56","span.kind":"client","thread.name":"com.alibaba.nacos.naming.beat.sender"},"metrics":{"_dd.agent_psr":1,"_dd.top_level":1,"_sampling_priority_v1":1,"peer.port":8848,"thread.id":83},"type":"http"}','PUT /nacos/v1/ns/instance/beat','demo-k8s-system','entry','web',1647221927516,794,'ddtrace','PUT','0','3553429628118875994',1647221925732004,1647221925732,'dev','172.16.0.230'],['k8s-node1','200','172.16.0.230','http.request','ok','T_c8n9p9s5jjqsb5dkviu0','k8s-prod',5238,'7812574577044643721','{"service":"demo-k8s-auth","name":"http.request","resource":"PUT /nacos/v1/ns/instance/beat","trace_id":7812574577044643721,"span_id":822861332187929136,"parent_id":0,"start":1647221925081005238,"duration":821220,"error":0,"meta":{"component":"http-url-connection","env":"dev","http.method":"PUT","http.status_code":"200","http.url":"http://172.16.0.229:8848/nacos/v1/ns/instance/beat","language":"jvm","node_ip":"172.16.0.230","peer.hostname":"172.16.0.229","runtime-id":"c6244f19-6ddf-4ff3-afbd-a49d7751816b","span.kind":"client","thread.name":"com.alibaba.nacos.naming.beat.sender"},"metrics":{"_dd.agent_psr":1,"_dd.top_level":1,"_sampling_priority_v1":1,"peer.port":8848,"thread.id":56},"type":"http"}','PUT /nacos/v1/ns/instance/beat','demo-k8s-auth','entry','web',1647221927516,821,'ddtrace','PUT','0','822861332187929136',1647221925081005,1647221925081,'dev','172.16.0.230']]}]}

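Trigger Logic

Track thresholds: p99 greater than 15 seconds, error rate > 10%.
Track the request count, latency, and error rate trends; trigger an event when the data changes sharply.
Track the request count, latency, and error rate trends; trigger an event when a critical value is reached, for example p99 greater than 15 seconds or error_rate > 10%.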
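
The two example thresholds above (p99 greater than 15 seconds, error rate above 10%) can be checked as in the sketch below. It uses the nearest-rank definition of a percentile and assumes span durations are in microseconds, which is what the sample span suggests (duration 847 next to 847984 in the raw message) but is not stated explicitly. The function names are hypothetical.

```python
import math
from typing import List


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least
    q percent of the data is at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def check_apm_thresholds(durations_us: List[float], error_count: int,
                         request_count: int, p99_limit_s: float = 15.0,
                         error_rate_limit: float = 0.10) -> List[str]:
    """Return which of the two example thresholds are exceeded.

    durations_us is assumed to hold span durations in microseconds.
    """
    reasons = []
    if durations_us and percentile(durations_us, 99) / 1_000_000 > p99_limit_s:
        reasons.append("p99_latency")
    if request_count and error_count / request_count > error_rate_limit:
        reasons.append("error_rate")
    return reasons


# 98 fast requests plus two slow ones (16 s and 20 s), 5 errors in 100.
durations = [800_000] * 98 + [16_000_000, 20_000_000]
print(check_apm_thresholds(durations, error_count=5, request_count=100))
```

Both comparisons are strict, matching the "greater than" wording of the thresholds.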

Algorithm Logic

Trend forecasting uses algorithms and toolkits such as ADTK, Prophet, LSTM, ARIMA, and Holt-Winters.

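Event Content

Inspection event time, event_id, status, inspection category, inspection type, inspection object, tags, anomaly, alert information

Field                   Description
date                    Time the intelligent inspection event was generated. Unix timestamp, in ms.
df_event_id             Event ID. Note: the same event exists in two states, ongoing and resolved.
df_status               Status. Values: ongoing, resolved
df_watchdog_category    Inspection category. Value: apm
df_watchdog_type        Inspection type. Value: apm_request_rate
df_watchdog_object      Inspection object. Value:
                        Resource (resource)
df_watchdog_tags        Fixed tags. Values:
                        Service (service)
                        Environment (env)
                        Version (version)
                        Project (project)
df_title                Fixed format:
                        Service {#service} has {#N} resources with abnormal request rate, error rate, or latency
df_message              Fixed format:
                        - Resource request rate trend chart
                        - Resource request rate data points for the last 6 hours
                        - Resource request rate anomaly results
                          - Resource (resource)
                          - Request rate (request_rate)
                        - List of spans associated with the abnormal resources
                        - List of trace_ids associated with the abnormal resources
                        - service_map based on the abnormal trace_ids (obtained by a front-end query)
                        - Front-end applications, user counts, and page URLs associated with the abnormal trace_ids
                          - Query RUM View data to obtain the specific values and counts of app_id, userid, and view_url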

Analysis Flow

image.png
image.png