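Memory Leak
Create Check
Workflow
Granularity
Data Source and Detection Frequency
Based on the host object (hostobject) and memory metrics (mem)
Data point granularity (frequency): 1 minute
Detection data range: 6 hours
Scheduled execution interval: once every 6 hours
DQL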
M::mem:(avg(used_percent))[6h:1m:1m] by cluster_name,host, host_ip
Data Format
-----------------[ r1.mem.s1 ]-----------------
active 4669968384
available 4744269824
available_percent 59.913079
buffered 194924544
cached 3762335744
cluster_name 'minik8s-istio'
commit_limit 3959291904
committed_as 13459935232
dirty 61440
free 1107689472
high_free 0
high_total 0
host 'minikube'
host_ip '192.168.49.2'
huge_pages_free 0
huge_pages_size 2097152
huge_pages_total 0
inactive 1683832832
instancename <nil>
interip <nil>
low_free 0
low_total 0
mapped 590659584
owner <nil>
page_tables 19881984
project <nil>
shared 19841024
slab 208162816
sreclaimable 150298624
sunreclaim 57864192
swap_cached 0
swap_free 0
swap_total 0
time 2022-02-07 08:00:00 +0800 CST
total 7918587904
used 2853638144
used_percent 36.037210
vmalloc_chunk 35184287895552
vmalloc_total 35184372087808
vmalloc_used 20488192
wired <nil>
write_back 0
write_back_tmp 0
---------
{'statement_id': 0, 'series': [{'name': 'mem', 'columns': ['time', 'active', 'available', 'available_percent', 'buffered', 'cached', 'cluster_name', 'commit_limit', 'committed_as', 'dirty', 'free', 'high_free', 'high_total', 'host', 'host_ip', 'huge_pages_free', 'huge_pages_size', 'huge_pages_total', 'inactive', 'instancename', 'interip', 'low_free', 'low_total', 'mapped', 'owner', 'page_tables', 'project', 'shared', 'slab', 'sreclaimable', 'sunreclaim', 'swap_cached', 'swap_free', 'swap_total', 'total', 'used', 'used_percent', 'vmalloc_chunk', 'vmalloc_total', 'vmalloc_used', 'wired', 'write_back', 'write_back_tmp'], 'values': [[1644192000423, 4669968384, 4744269824, 59.91307896706529, 194924544, 3762335744, 'minik8s-istio', 3959291904, 13459935232, 61440, 1107689472, 0, 0, 'minikube', '192.168.49.2', 0, 2097152, 0, 1683832832, None, None, 0, 0, 590659584, None, 19881984, None, 19841024, 208162816, 150298624, 57864192, 0, 0, 0, 7918587904, 2853638144, 36.03720989898353, 35184287895552, 35184372087808, 20488192, None, 0, 0], [1644192000558, 1053151232, 5973762048, 75.97008573498778, 0, 2907250688, None, 3931652096, 5958238208, 1544192, 422965248, 0, 0, 'solrserver.lianglab.cn', None, 0, 2097152, 0, 6162206720, None, None, 0, 0, 111165440, None, 17469440, None, 2043904, 111661056, 69386240, 42274816, 0, 0, 0, 7863308288, 4533092352, 57.648666260711664, 0, 35184372087808, 0, None, 0, 0]]}]}
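The response above is column-oriented: each series carries a `columns` list and a `values` list of rows. A minimal sketch of regrouping it into per-host time series, assuming the payload is already available as a Python dict. The function name `rows_by_host` is illustrative and not part of any SDK, and the sample is trimmed to three of the columns shown above.

```python
from collections import defaultdict


def rows_by_host(result, value_field="used_percent"):
    """Group (time_ms, value) points by host from a column-oriented result."""
    grouped = defaultdict(list)
    for series in result.get("series", []):
        columns = series["columns"]
        for row in series["values"]:
            record = dict(zip(columns, row))
            if record.get(value_field) is None:
                continue  # skip rows without a measurement
            grouped[record["host"]].append((record["time"], record[value_field]))
    for points in grouped.values():
        points.sort()  # order each host's points by timestamp
    return dict(grouped)


# Trimmed copy of the response shown above (only three columns kept).
sample = {
    "statement_id": 0,
    "series": [{
        "name": "mem",
        "columns": ["time", "host", "used_percent"],
        "values": [
            [1644192000423, "minikube", 36.03720989898353],
            [1644192000558, "solrserver.lianglab.cn", 57.648666260711664],
        ],
    }],
}
print(rows_by_host(sample))
```

Keying on `host` alone mirrors the sample; a real check would probably key on the full `by` clause (cluster_name, host, host_ip).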
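Trigger Logic
Track the used_percent trend and trigger an event when it is about to approach the critical value.
Track the used_percent trend and trigger an event when it rises rapidly.
Track the used_percent trend and trigger an event when it reaches the critical value.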
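The three conditions above could be sketched as follows. This is only an illustration: the document does not state the critical value, the look-ahead horizon, or what counts as a rapid rise, so `critical=90.0`, `horizon=360`, `rapid_rise=10.0`, and `window=30` are all assumed placeholders, and a plain least-squares line stands in for the real trend model.

```python
def fit_line(values):
    """Least-squares slope and intercept over equally spaced points."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def check_triggers(values, critical=90.0, horizon=360, rapid_rise=10.0, window=30):
    """Return the names of the conditions that fire for a 1-minute used_percent series.

    All thresholds are assumed values, not taken from the document.
    """
    fired = []
    slope, intercept = fit_line(values)
    # Extrapolate the fitted line `horizon` points past the last sample.
    projected = intercept + slope * (len(values) - 1 + horizon)
    if values[-1] >= critical:
        fired.append("critical_reached")
    elif slope > 0 and projected >= critical:
        fired.append("approaching_critical")
    # Rapid rise: growth over the last `window` points exceeds `rapid_rise`.
    if len(values) > window and values[-1] - values[-1 - window] >= rapid_rise:
        fired.append("rapid_rise")
    return fired


steady_growth = [40 + 0.1 * i for i in range(360)]  # 6 hours of 1-minute points
print(check_triggers(steady_growth))
```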
Algorithm Logic
Trend forecasting uses algorithms such as ADTK, Prophet, LSTM, ARIMA, and Holt-Winters.
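To make the idea of trend forecasting concrete, here is a dependency-free sketch of Holt's linear method (double exponential smoothing), which is the non-seasonal core of Holt-Winters. It is not the implementation used by the check. The smoothing factors `alpha` and `beta` are assumed values, and a production version would more likely rely on a library.

```python
def holt_forecast(values, steps, alpha=0.5, beta=0.3):
    """Forecast `steps` points ahead with Holt's linear (double exponential) smoothing."""
    if len(values) < 2:
        raise ValueError("need at least two points")
    level = values[0]
    trend = values[1] - values[0]
    for y in values[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step-ahead estimate.
        level = alpha * y + (1 - alpha) * (level + trend)
        # Update the trend from the change in level.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(steps)]


history = [10 + 2 * i for i in range(20)]  # perfectly linear toy series
print(holt_forecast(history, steps=3))
```

The forecast can then be compared against the critical value to decide whether an "approaching" event should fire.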
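Event Content
Inspection event time, event_id, status, inspection category, inspection type, inspection object, tags, anomaly, alert information
| Field | Description |
|---|---|
| date | Time the intelligent inspection event was generated. Unix timestamp in ms |
| df_event_id | Event ID. Note: the same event has two states, ongoing and resolved. |
| df_status | Status. Values: ongoing, resolved |
| df_watchdog_category | Inspection category. Value: infrastructure |
| df_watchdog_type | Inspection type. Value: mem_leak |
| df_watchdog_object | Inspection object. Value: host |
| df_watchdog_tags | Fixed tags. Values: project, cloud provider (cloud_provider), Label attribute (df_label) |
| df_title | Fixed format: Host {#host} has a memory leak problem |
| df_message | Fixed format: memory usage trend chart; memory usage data points for the last 6 hours; list of the top 10 processes by memory usage on the inspected object (process_name, pid, mem_usage); list of the top 10 Pods by memory usage on the inspected object (pod_name, mem_usage) |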
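A minimal sketch of assembling an event with the fields in the table. The field names and fixed values come from the table. Everything else is an assumption for illustration: the function name, the English title text, representing tags as a dict, and leaving `df_message` empty.

```python
import time
import uuid

# Assumed English rendering of the fixed title format; {#host} is the placeholder.
TITLE_TEMPLATE = "Host {#host} has a memory leak problem"


def build_mem_leak_event(host, status="ongoing", tags=None, event_id=None, now_ms=None):
    """Build a memory-leak inspection event as a plain dict (hypothetical helper)."""
    if status not in ("ongoing", "resolved"):
        raise ValueError("status must be 'ongoing' or 'resolved'")
    return {
        "date": now_ms if now_ms is not None else int(time.time() * 1000),  # Unix ms
        "df_event_id": event_id or uuid.uuid4().hex,
        "df_status": status,
        "df_watchdog_category": "infrastructure",
        "df_watchdog_type": "mem_leak",
        "df_watchdog_object": "host",
        "df_watchdog_tags": dict(tags or {}),  # project, cloud_provider, df_label
        "df_title": TITLE_TEMPLATE.replace("{#host}", host),
        "df_message": "",  # trend chart, data points and TOP 10 lists omitted here
    }


opened = build_mem_leak_event("minikube", tags={"project": "demo"})
# The resolved event reuses the same df_event_id, as the table notes.
closed = build_mem_leak_event("minikube", status="resolved",
                              event_id=opened["df_event_id"])
print(opened["df_title"], closed["df_status"])
```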
Analysis Workflow

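Insufficient Disk Space
Create Check
Workflow
Granularity
Data Source and Detection Frequency
Based on the host object (hostobject) and disk metrics (disk)
Data point granularity (frequency): 1 hour
Detection data range: 14 days
Scheduled execution interval: once a day
DQL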
M::disk:(avg(used_percent))[14d:1h:1h] by host,device,path
Data Format
-----------------[ r10.disk.s1 ]-----------------
cluster_name <nil>
device 'disk1s4'
free 195231932416
fstype 'apfs'
host 'jays-MacBook-Pro.local'
host_ip <nil>
inodes_free 1906561840
inodes_total 1906561842
inodes_used 2
instancename <nil>
interip <nil>
mode 'rw'
owner <nil>
path '/System/Volumes/VM'
project <nil>
time 2022-03-08 10:03:24 +0800 CST
total 499588771840
used 304356839424
used_percent 60.921473
---------
{'statement_id': 0, 'series': [{'name': 'disk', 'columns': ['time', 'cluster_name', 'device', 'free', 'fstype', 'host', 'host_ip', 'inodes_free', 'inodes_total', 'inodes_used', 'instancename', 'interip', 'mode', 'owner', 'path', 'project', 'total', 'used', 'used_percent'], 'values': [[1646705001642, None, 'disk1s1', 195231948800, 'apfs', 'jays-MacBook-Pro.local', None, 1906562000, 1911106183, 4544183, None, None, 'rw', None, '/System/Volumes/Data', None, 499588771840, 304356823040, 60.921469855906686], [1646705001642, None, 'disk1s2', 195231948800, 'apfs', 'jays-MacBook-Pro.local', None, 1906562000, 1906563169, 1169, None, None, 'rw', None, '/System/Volumes/Preboot', None, 499588771840, 304356823040, 60.921469855906686]]}]}
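Trigger Logic
Track the used_percent trend and trigger an event when it is about to approach the critical value.
Track the used_percent trend and trigger an event when it reaches the critical value.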
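For disk usage, "about to approach the critical value" can be read as a time-to-threshold estimate. The sketch below fits a straight line to the hourly points and reports how many hours remain. The critical value of 90% is an assumption, since the document does not give one, and the linear fit is a stand-in for the forecasting algorithms listed in this document.

```python
def hours_until(values, critical=90.0):
    """Estimate hours until an hourly used_percent series reaches `critical`.

    Returns 0.0 if the latest point is already at or above `critical`,
    and None if the fitted trend is flat or falling.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    if values[-1] >= critical:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx  # percentage points per hour
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    fitted_now = intercept + slope * (n - 1)
    return max((critical - fitted_now) / slope, 0.0)


two_weeks = [60 + 0.05 * i for i in range(14 * 24)]  # 14 days of hourly points
print(hours_until(two_weeks))
```

An event could then be raised when the estimate drops below a chosen lead time, for example a few days.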
Algorithm Logic
Trend forecasting uses algorithms such as ADTK, Prophet, LSTM, ARIMA, and Holt-Winters.
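Event Content
Inspection event time, event_id, status, inspection category, inspection type, inspection object, tags, anomaly, alert information
| Field | Description |
|---|---|
| date | Time the intelligent inspection event was generated. Unix timestamp in ms |
| df_event_id | Event ID. Note: the same event has two states, ongoing and resolved. |
| df_status | Status. Values: ongoing, resolved |
| df_watchdog_category | Inspection category. Value: infrastructure |
| df_watchdog_type | Inspection type. Value: disk_usage |
| df_watchdog_object | Inspection object. Values: host, disk (device) |
| df_watchdog_tags | Fixed tags. Values: project, cloud provider (cloud_provider), Label attribute (df_label) |
| df_title | Fixed format: Host {#host} has {#N} disks whose usage keeps rising |
| df_message | Fixed format: disk usage trend chart; disk usage data points for the last 6 hours; current disk mount point (obtain the mount path of the problem device); recommendations (to be completed) |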
Analysis Workflow

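APM Inspection
Create Check
Workflow
Granularity
Granularity: all service + resource combinations under the specified project, env, and version; specific services and resources can also be excluded
Data Source and Detection Frequency
Per-minute metrics for each service; topology between services; per-minute metrics for each service + resource; service + resource + span data; service + resource + span + RUM data; service + resource + span + log data; service + resource + span + host data; service + resource + span + host + middleware data
APM Request Rate
Detection granularity: resource
Data point granularity (frequency): 1 minute
Detection data range: 6 hours
Scheduled execution interval: once every 6 hours
APM Latency
Detection granularity: resource
Data point granularity (frequency): 1 minute
Detection data range: 6 hours
Scheduled execution interval: once every 6 hours
APM Error Rate
Detection granularity: resource
Data point granularity (frequency): 1 minute
Detection data range: 6 hours
Scheduled execution interval: once every 6 hours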
DQL
# Request count. In production, filter conditions (project, env, version) are appended after source
T::re(`.*`):(count(__docid)){ `source` != 'service_map'}[6h:1m:1m] by service,resource
# Latency
T::re(`.*`):(avg(duration)){ `source` != 'service_map'}[6h:1m:1m] by service,resource
# Error count
T::re(`.*`):(count(__docid)){ `source` != 'service_map' and status = 'error'}[6h:1m:1m] by service,resource
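The note about appending project, env, and version filters can be sketched as a small query builder. With no filters it reproduces the three queries above verbatim. The helper name, the metric keys, and the exact syntax of the appended conditions (`` `key` = 'value' `` joined with `and`) are assumptions modelled on the existing `source` and `status` conditions. Values containing a single quote are rejected rather than escaped, because escaping rules are not given here.

```python
TARGETS = {
    "request_count": "count(__docid)",
    "latency": "avg(duration)",
    "error_count": "count(__docid)",
}


def build_apm_dql(metric, filters=None, window="6h", interval="1m"):
    """Build one of the three APM queries, appending tag filters (hypothetical helper)."""
    if metric not in TARGETS:
        raise ValueError(f"unknown metric: {metric}")
    conditions = ["`source` != 'service_map'"]
    if metric == "error_count":
        conditions.append("status = 'error'")
    for key, value in (filters or {}).items():
        if "'" in str(value):
            raise ValueError("filter values must not contain single quotes")
        conditions.append(f"`{key}` = '{value}'")  # assumed filter syntax
    where = " and ".join(conditions)
    return (
        f"T::re(`.*`):({TARGETS[metric]}){{ {where}}}"
        f"[{window}:{interval}:{interval}] by service,resource"
    )


print(build_apm_dql("error_count", {"project": "demo", "env": "dev"}))
```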
Data Format
service map:
-----------------[ r1..s1 ]-----------------
__docid 'T_c8n9nmhaahlf101n1q1g'
call_counts 6
create_time 1647221722778
date_ns 0
env 'dev'
host 'k8s-node1'
source 'service_map'
source_service 'demo-k8s-system'
target_service 'demo-k8s-system'
time 2022-03-14 09:35:00 +0800 CST
type 'web'
---------
1 rows, 1 series, cost 72ms
{'statement_id':0,'series':[{'columns':['__docid','time','env','source_service','target_service','type','call_counts','create_time','date_ns','host','source'],'values':[['T_c8n9o9k5jjqsb5dg4pe0',1647221760000,'dev','demo-k8s-auth','demo-k8s-auth','web',5,1647221798024,0,'k8s-node1','service_map'],['T_c8n9o9k5jjqsb5dg4peg',1647221760000,'dev','demo-k8s-gateway','demo-k8s-gateway','web',7,1647221798024,0,'k8s-node1','service_map']]}]}
span:
-----------------[ r1..s1 ]-----------------
__docid 'T_c8n9ovs5jjql3a1t6ou0'
cluster_name 'k8s-prod'
create_time 1647221887505
date_ns 4459
duration 847
env 'dev'
host 'k8s-node1'
host_ip '172.16.0.230'
http_method 'PUT'
http_status_code '200'
message '{"service":"demo-k8s-system","name":"http.request","resource":"PUT /nacos/v1/ns/instance/beat","trace_id":7379685012312364829,"span_id":4567185458459732621,"parent_id":0,"start":1647221885724004459,"duration":847984,"error":0,"meta":{"component":"http-url-connection","env":"dev","http.method":"PUT","http.status_code":"200","http.url":"http://172.16.0.229:8848/nacos/v1/ns/instance/beat","language":"jvm","node_ip":"172.16.0.230","peer.hostname":"172.16.0.229","runtime-id":"df764944-6e79-4cee-9883-c4975a887e56","span.kind":"client","thread.name":"com.alibaba.nacos.naming.beat.sender"},"metrics":{"_dd.agent_psr":1,"_dd.top_level":1,"_sampling_priority_v1":1,"peer.port":8848,"thread.id":83},"type":"http"}'
node_ip '172.16.0.230'
operation 'http.request'
parent_id '0'
resource 'PUT /nacos/v1/ns/instance/beat'
service 'demo-k8s-system'
source 'ddtrace'
span_id '4567185458459732621'
span_type 'entry'
start 1647221885724004
status 'ok'
time 2022-03-14 09:38:05 +0800 CST
trace_id '7379685012312364829'
type 'web'
---------
1 rows, 1 series, cost 1.301s
{'statement_id':0,'series':[{'columns':['host','http_status_code','node_ip','operation','status','__docid','cluster_name','date_ns','trace_id','message','resource','service','span_type','type','create_time','duration','source','http_method','parent_id','span_id','start','time','env','host_ip'],'values':[['k8s-node1','200','172.16.0.230','http.request','ok','T_c8n9p9s5jjqsb5dkviug','k8s-prod',4258,'7237931248846529090','{"service":"demo-k8s-system","name":"http.request","resource":"PUT /nacos/v1/ns/instance/beat","trace_id":7237931248846529090,"span_id":3553429628118875994,"parent_id":0,"start":1647221925732004258,"duration":794942,"error":0,"meta":{"component":"http-url-connection","env":"dev","http.method":"PUT","http.status_code":"200","http.url":"http://172.16.0.229:8848/nacos/v1/ns/instance/beat","language":"jvm","node_ip":"172.16.0.230","peer.hostname":"172.16.0.229","runtime-id":"df764944-6e79-4cee-9883-c4975a887e56","span.kind":"client","thread.name":"com.alibaba.nacos.naming.beat.sender"},"metrics":{"_dd.agent_psr":1,"_dd.top_level":1,"_sampling_priority_v1":1,"peer.port":8848,"thread.id":83},"type":"http"}','PUT /nacos/v1/ns/instance/beat','demo-k8s-system','entry','web',1647221927516,794,'ddtrace','PUT','0','3553429628118875994',1647221925732004,1647221925732,'dev','172.16.0.230'],['k8s-node1','200','172.16.0.230','http.request','ok','T_c8n9p9s5jjqsb5dkviu0','k8s-prod',5238,'7812574577044643721','{"service":"demo-k8s-auth","name":"http.request","resource":"PUT 
/nacos/v1/ns/instance/beat","trace_id":7812574577044643721,"span_id":822861332187929136,"parent_id":0,"start":1647221925081005238,"duration":821220,"error":0,"meta":{"component":"http-url-connection","env":"dev","http.method":"PUT","http.status_code":"200","http.url":"http://172.16.0.229:8848/nacos/v1/ns/instance/beat","language":"jvm","node_ip":"172.16.0.230","peer.hostname":"172.16.0.229","runtime-id":"c6244f19-6ddf-4ff3-afbd-a49d7751816b","span.kind":"client","thread.name":"com.alibaba.nacos.naming.beat.sender"},"metrics":{"_dd.agent_psr":1,"_dd.top_level":1,"_sampling_priority_v1":1,"peer.port":8848,"thread.id":56},"type":"http"}','PUT /nacos/v1/ns/instance/beat','demo-k8s-auth','entry','web',1647221927516,821,'ddtrace','PUT','0','822861332187929136',1647221925081005,1647221925081,'dev','172.16.0.230']]}]}
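Trigger Logic
Tracked thresholds: p99 greater than 15 seconds, error rate > 10%
Track the request count, latency, and error rate trends and trigger an event when the data changes sharply.
Track the request count, latency, and error rate trends and trigger an event when a critical value is reached, for example p99 greater than 15 seconds or error_rate > 10%.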
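The two example thresholds can be checked as below. This is a sketch under assumptions: `duration` is treated as microseconds (the sample span shows `duration 847` next to a raw value of 847984 in the message, which suggests nanoseconds divided by 1000), the percentile uses the nearest-rank definition, and the function names are illustrative.

```python
def p99(durations):
    """Nearest-rank 99th percentile of a non-empty list."""
    if not durations:
        raise ValueError("durations must not be empty")
    ordered = sorted(durations)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def breaches(durations_us, request_count, error_count,
             p99_limit_us=15_000_000, error_rate_limit=0.10):
    """Return which of the example thresholds are exceeded for one resource."""
    fired = []
    if durations_us and p99(durations_us) > p99_limit_us:  # p99 > 15 s
        fired.append("p99_latency")
    if request_count > 0 and error_count / request_count > error_rate_limit:
        fired.append("error_rate")
    return fired


# Durations taken from the sample spans in this document, with assumed counts.
print(breaches([847, 794, 821], request_count=100, error_count=5))
```

Note that the latency query in this document aggregates with avg(duration), so a p99 check would need either raw span durations, as assumed here, or a percentile aggregation in the query.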
Algorithm Logic
Trend forecasting uses algorithms such as ADTK, Prophet, LSTM, ARIMA, and Holt-Winters.
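Event Content
Inspection event time, event_id, status, inspection category, inspection type, inspection object, tags, anomaly, alert information
| Field | Description |
|---|---|
| date | Time the intelligent inspection event was generated. Unix timestamp in ms |
| df_event_id | Event ID. Note: the same event has two states, ongoing and resolved. |
| df_status | Status. Values: ongoing, resolved |
| df_watchdog_category | Inspection category. Value: apm |
| df_watchdog_type | Inspection type. Value: apm_request_rate |
| df_watchdog_object | Inspection object. Value: resource |
| df_watchdog_tags | Fixed tags. Values: service, environment (env), version, project |
| df_title | Fixed format: Service {#service} has {#N} resources with abnormal request rate, error rate, or latency |
| df_message | Fixed format: resource request rate trend chart; resource request rate data points for the last 6 hours; resource request rate anomaly results (resource, request_rate); list of spans associated with the abnormal resources; list of trace_ids associated with the abnormal resources; service_map based on the abnormal trace_ids (obtained by a front-end query); front-end applications, user counts, and page URLs associated with the abnormal trace_ids (query RUM View data to obtain the specific values and counts of app_id, userid, and view_url) |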
Analysis Workflow


