Alertmanager receives the alerts that Prometheus sends. It supports a rich set of notification channels and makes it easy to deduplicate, silence, and group alert notifications, which makes it a very capable alerting system.

1. Installation

(1) Create the ConfigMap manifest:
alertmanager-config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-ops
data:
  alertmanager.yml: |-
    global:
      # how long to wait before declaring an alert resolved once it stops firing
      resolve_timeout: 5m
      # email (SMTP) delivery settings
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'xxxs@163.com'
      smtp_auth_username: 'xxx@163.com'
      smtp_auth_password: 'xxxx'
      smtp_hello: '163.com'
      smtp_require_tls: false
    # the root route: every incoming alert enters here, and it defines the dispatch policy
    route:
      # labels used to regroup incoming alerts; e.g. all alerts carrying
      # cluster=A and alertname=LatencyHigh are aggregated into one group
      group_by: ['alertname', 'cluster']
      # when a new alert group is created, wait at least group_wait before
      # sending the initial notification, so that multiple alerts for the
      # same group can be collected and fired together
      group_wait: 30s
      # after the first notification has been sent, wait group_interval
      # before notifying about new alerts added to the group
      group_interval: 5m
      # if an alert has already been sent successfully, wait repeat_interval
      # before resending it
      repeat_interval: 5m
      # default receiver: used when an alert matches no child route
      receiver: default
      # all of the attributes above are inherited by child routes and can be
      # overridden per route
      routes:
      - receiver: email
        group_wait: 10s
        match:
          team: node
    receivers:
    - name: 'default'
      email_configs:
      - to: 'baidjay@163.com'
        send_resolved: true
    - name: 'email'
      email_configs:
      - to: '565361785@qq.com'
        send_resolved: true
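
Before creating the ConfigMap, the embedded alertmanager.yml can be sanity-checked with amtool, which ships with Alertmanager. A quick sketch, assuming the data block above has been saved as a standalone alertmanager.yml:

# validate the configuration syntax
amtool check-config alertmanager.yml

# check which receiver a given label set would be routed to;
# an alert labeled team=node should hit the 'email' receiver
amtool config routes test --config.file=alertmanager.yml team=node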

Create the ConfigMap object:

# kubectl apply -f alertmanager-config.yaml
configmap/alertmanager-config created

Next, configure the Alertmanager container:
alertmanager-deploy.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-ops
spec:
  selector:
    matchLabels:
      app: alertmanager
  replicas: 2
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.19.0
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - name: alert-config
          mountPath: /etc/alertmanager
        ports:
        - name: http
          containerPort: 9093
      volumes:
      - name: alert-config
        configMap:
          name: alertmanager-config

Configure the Service:
alertmanager-svc.yaml

apiVersion: v1
kind: Service
metadata:
  name: alertmanager-svc
  namespace: kube-ops
  annotations:
    prometheus.io/scrape: "true"
spec:
  selector:
    app: alertmanager
  ports:
  - name: http
    port: 9093
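
Once these manifests are applied, a quick sketch for checking that Alertmanager is up and reachable (the label, Service name, and namespace come from the manifests above):

kubectl -n kube-ops get pods -l app=alertmanager
kubectl -n kube-ops port-forward svc/alertmanager-svc 9093:9093
# then open http://localhost:9093 in a browser to see the Alertmanager UI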

Point Prometheus at the Alertmanager address:
prom-configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-ops
data:
  prometheus.yaml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager-svc:9093"]
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    ......

Then apply the updated ConfigMap and reload Prometheus:

# kubectl apply -f prom-configmap.yaml
# curl -X POST "http://10.68.254.74:9090/-/reload"
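
To confirm that Prometheus has picked up the Alertmanager, you can query its API at the same address used for the reload above:

# lists the Alertmanager instances Prometheus currently knows about
curl -s "http://10.68.254.74:9090/api/v1/alertmanagers"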

2. Configuring Alert Rules

So far we have only deployed the alerting component itself; no alert rules have been configured, so nothing actually fires yet. Alert rules let you define alerting conditions as Prometheus expression-language expressions, and send notifications to an external receiver whenever a rule fires.

First, define the alert rules. We again use a ConfigMap, placing the rules directly in Prometheus's ConfigMap:
prom-configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-ops
data:
  prometheus.yaml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager-svc:9093"]
    rule_files:
    - /etc/prometheus/rules.yaml
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'redis'
      static_configs:
      - targets: ['redis.kube-ops.svc.cluster.local:9121']
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: "kubernetes-kubelet"
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: "kubernetes_cAdvisor"
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: '(.+)'
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        target_label: __metrics_path__
    - job_name: "kubernetes-apiserver"
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: "kubernetes-scheduler"
      kubernetes_sd_configs:
      - role: endpoints
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
  rules.yaml: |
    groups:
    - name: test-rule
      rules:
      - alert: NodeMemoryUsage
        expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes) * 100 > 5
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High Memory usage detected"
          description: "{{$labels.instance}}: Memory usage is above 5% (current value is: {{ $value }})"

The above defines an alert rule named NodeMemoryUsage, in which:

  • The for clause makes Prometheus hold the alert in the pending state for the given duration; the alert only fires if the expression stays true for that entire time.
  • The labels clause attaches an extra set of labels to the alert.
  • The annotations clause specifies another set of labels that are not treated as part of the alert's identity; they typically carry additional information, such as text for displaying the alert.

Then update the ConfigMap and reload again:

# kubectl apply -f prom-configmap.yaml
# curl -X POST "http://10.68.140.137:9090/-/reload"
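
To confirm the rule file was actually loaded, the rules API is handy:

# lists all loaded rule groups together with the current state of each rule
curl -s "http://10.68.140.137:9090/api/v1/rules"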

[screenshot: the new rule listed on the Prometheus Alerts page]
The rule we just defined now shows up on the Alerts page, along with its current state. Over its lifecycle an alert is in one of three states:

  • inactive: the alert is currently neither pending nor firing
  • pending: the alert expression is true, but has not yet held for the configured for duration
  • firing: the alert expression has held for longer than the configured for duration
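
These transitions can also be watched from PromQL: Prometheus exposes every pending or firing alert as a synthetic ALERTS time series whose alertstate label carries the current state. For example:

# show the current state of the NodeMemoryUsage alert
curl -sG "http://10.68.140.137:9090/api/v1/query" \
  --data-urlencode 'query=ALERTS{alertname="NodeMemoryUsage"}'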

Shortly afterwards the alert notification arrives by email:
[screenshot: the alert notification email]

3. Webhook Alerts

3.1 Python

Write a simple DingTalk alerting program with Flask:

#!/usr/bin/python
# -*- coding:utf-8 -*-
import os
import json
import requests
from flask import Flask
from flask import request

app = Flask(__name__)


@app.route('/', methods=['POST', 'GET'])
def send():
    """Receive the Alertmanager webhook POST and forward it to DingTalk."""
    if request.method == 'POST':
        post_data = request.get_data()
        post_data = format_message(bytes2json(post_data))
        print(post_data)
        send_alert(post_data)
        return 'success'
    else:
        return 'welcome to use prometheus alertmanager dingtalk webhook server!'


def bytes2json(data_bytes):
    # Alertmanager sends valid JSON, so decode and parse it directly
    return json.loads(data_bytes.decode('utf8'))


def format_message(post_data):
    """Turn the Alertmanager payload into a DingTalk markdown title/message."""
    EXCLUDE_LIST = ['prometheus', 'endpoint']
    message_list = []
    message_list.append('### Alert status: {}'.format(post_data['status']))
    # message_list.append('**alertname:** {}'.format(post_data['alerts'][0]['labels']['alertname']))
    message_list.append('> **startsAt: **{}'.format(post_data['alerts'][0]['startsAt']))
    message_list.append('#### Labels:')
    for label in post_data['alerts'][0]['labels'].keys():
        if label in EXCLUDE_LIST:
            continue
        message_list.append('> **{}: **{}'.format(label, post_data['alerts'][0]['labels'][label]))
    message_list.append('#### Annotations:')
    for annotation in post_data['alerts'][0]['annotations'].keys():
        message_list.append('> **{}: **{}'.format(annotation, post_data['alerts'][0]['annotations'][annotation]))
    message = ' \n\n '.join(message_list)
    title = post_data['alerts'][0]['labels']['alertname']
    return {'title': title, 'message': message}


def send_alert(data):
    """Post the formatted message to the DingTalk robot webhook."""
    token = os.getenv('ROBOT_TOKEN')
    if not token:
        print('you must set ROBOT_TOKEN env')
        return
    url = 'https://oapi.dingtalk.com/robot/send?access_token=%s' % token
    send_data = {
        'msgtype': 'markdown',
        'markdown': {
            'title': data['title'],
            'text': data['message']
        }
    }
    req = requests.post(url, json=send_data)
    result = req.json()
    print(result)
    if result['errcode'] != 0:
        print('notify dingtalk error: %s' % result['errcode'])


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

The code is straightforward: the group robot's token is passed in through the ROBOT_TOKEN environment variable, and the payload that the Alertmanager webhook posts is reformatted and forwarded to the group robot as a markdown message.
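
Before building the image, you can exercise the app locally by posting a minimal, hand-crafted payload in the shape Alertmanager sends; the token and all field values below are placeholders:

# start the server with a dummy robot token
ROBOT_TOKEN=xxxxxx python app.py &

# post a minimal Alertmanager-style body and watch it get formatted
curl -s -X POST http://localhost:5000/ \
  -H 'Content-Type: application/json' \
  -d '{"status": "firing", "alerts": [{"startsAt": "2019-11-01T00:00:00Z", "labels": {"alertname": "TestAlert", "team": "node"}, "annotations": {"summary": "test alert"}}]}'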

The Dockerfile:

FROM python:3.6.4
# set working directory
WORKDIR /src
# add app
ADD . /src
# install requirements
RUN pip install -r requirements.txt
# run server
CMD python app.py

requirements.txt

certifi==2018.10.15
chardet==3.0.4
Click==7.0
Flask==1.0.2
idna==2.7
itsdangerous==1.1.0
Jinja2==2.10
MarkupSafe==1.1.0
requests==2.20.1
urllib3==1.24.1
Werkzeug==0.14.1
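
With app.py, the Dockerfile, and requirements.txt in one directory, build and push the image. The tag below matches the one referenced by the Deployment in the next step; substitute your own registry if needed:

docker build -t registry.cn-hangzhou.aliyuncs.com/joker_kubernetes/dingtalk-hook:v0.3 .
docker push registry.cn-hangzhou.aliyuncs.com/joker_kubernetes/dingtalk-hook:v0.3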

Now deploy the service in the cluster.
First, create a Secret holding the token of the DingTalk custom robot:

# kubectl create secret generic dingtalk-secret --from-literal=token=xxxxxx -n kube-ops
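
To double-check what was stored:

# decode the stored token and make sure it matches the robot's access_token
kubectl -n kube-ops get secret dingtalk-secret -o jsonpath='{.data.token}' | base64 -d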

dingtalk-hook.yaml

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: dingtalk-hook
  namespace: kube-ops
spec:
  template:
    metadata:
      labels:
        app: dingtalk-hook
    spec:
      containers:
      - name: dingtalk-hook
        image: registry.cn-hangzhou.aliyuncs.com/joker_kubernetes/dingtalk-hook:v0.3
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5000
          name: http
        env:
        - name: ROBOT_TOKEN
          valueFrom:
            secretKeyRef:
              name: dingtalk-secret
              key: token
        resources:
          requests:
            cpu: 50m
            memory: 100Mi
          limits:
            cpu: 50m
            memory: 100Mi
---
apiVersion: v1
kind: Service
metadata:
  name: dingtalk-hook
  namespace: kube-ops
spec:
  selector:
    app: dingtalk-hook
  ports:
  - name: hook
    port: 5000
    targetPort: http

Apply the manifests:

# kubectl apply -f dingtalk-hook.yaml
deployment.extensions/dingtalk-hook created
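
To confirm the hook is reachable inside the cluster, hit it from a throwaway Pod (a GET request returns the welcome string):

kubectl -n kube-ops run -it --rm --restart=Never test --image=busybox -- \
  wget -qO- http://dingtalk-hook.kube-ops.svc:5000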

Next, modify the Alertmanager ConfigMap, adding a new child route under routes: and a matching webhook receiver:
alertmanager-config.yaml

......
- receiver: webhook
  group_wait: 10s
  match:
    filesystem: node
receivers:
- name: 'webhook'
  webhook_configs:
  - url: "http://dingtalk-hook.kube-ops.svc:5000"
    send_resolved: true
......

Update the config and recreate the Alertmanager Deployment:

# kubectl apply -f alertmanager-config.yaml
# kubectl delete -f alertmanager-deploy.yaml
deployment.extensions "alertmanager" deleted
# kubectl apply -f alertmanager-deploy.yaml
deployment.extensions/alertmanager created
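
To make sure the new route is active, port-forward to Alertmanager again and read back the running configuration from its v2 API; the webhook receiver should appear in the response:

kubectl -n kube-ops port-forward svc/alertmanager-svc 9093:9093 &
curl -s "http://localhost:9093/api/v2/status"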

Then add the corresponding rule to Prometheus:

- alert: NodeFilesystemUsage
  expr: (sum(node_filesystem_size_bytes{device="rootfs"}) - sum(node_filesystem_free_bytes{device="rootfs"})) / sum(node_filesystem_size_bytes{device="rootfs"}) * 100 > 10
  for: 2m
  labels:
    filesystem: node
  annotations:
    summary: "{{$labels.instance}}: High Filesystem usage detected"
    description: "{{$labels.instance}}: Filesystem usage is above 10% (current value is: {{ $value }})"
Update the Prometheus ConfigMap again and reload:

# kubectl apply -f prom-configmap.yaml
# curl -X POST "http://10.68.140.137:9090/-/reload"

We can then see that the alert has fired, and the message arrives in the DingTalk group:
[screenshot: the NodeFilesystemUsage alert firing in Prometheus]

[screenshot: the alert message in the DingTalk group]

3.2 Go

A well-built ready-made alternative: https://github.com/timonwong/prometheus-webhook-dingtalk