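Alertmanager receives the alerts that Prometheus sends and forwards them to notification channels such as email, WeChat, DingTalk, and Slack. It also makes it easy to deduplicate alerts, group them, and cut down on noise, which makes it a very practical alert notification system.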

Installing Alertmanager

#1. Deploy Alertmanager

Official download page: https://prometheus.io/download/

    mkdir -p /usr/local/monitor/ && cd /usr/local/monitor/
    wget https://github.com/prometheus/alertmanager/releases/download/v0.22.2/alertmanager-0.22.2.linux-amd64.tar.gz
    tar xf alertmanager-0.22.2.linux-amd64.tar.gz


#2. Manage Alertmanager with systemctl

    # Create the systemd unit file
    vim /etc/systemd/system/alertmanager.service
    [Unit]
    Description=alertmanager
    After=network.target
    [Service]
    Type=simple
    User=root
    ExecStart=/usr/local/monitor/alertmanager-0.22.2.linux-amd64/alertmanager --config.file=/usr/local/monitor/alertmanager-0.22.2.linux-amd64/alertmanager.yml
    Restart=on-failure
    [Install]
    WantedBy=multi-user.target

#3. Edit the Alertmanager configuration file

    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'web.hook'
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://localhost:8060/dingtalk/ops_dingding/send'
        send_resolved: true
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
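
The receiver URL in this configuration carries the name of the DingTalk target: prometheus-webhook-dingtalk serves each target under /dingtalk/<name>/send, so `ops_dingding` has to match a name defined on the plugin side. The sketch below spells out that mapping and adds a syntax check with amtool, which ships in the Alertmanager tarball. Both function names are hypothetical helpers, and the default paths simply reuse the ones from this article.

```shell
#!/bin/sh
# Hypothetical helper: build the URL Alertmanager should post to for a
# given prometheus-webhook-dingtalk target. The listen address defaults
# to the one used in this article.
dingtalk_webhook_url() {
    target="$1"
    listen="${2:-localhost:8060}"
    printf 'http://%s/dingtalk/%s/send\n' "$listen" "$target"
}

# Hypothetical helper: validate alertmanager.yml with amtool before
# restarting the service. The directory defaults to the install path above.
check_alertmanager_config() {
    am_dir="${1:-/usr/local/monitor/alertmanager-0.22.2.linux-amd64}"
    "$am_dir/amtool" check-config "$am_dir/alertmanager.yml"
}

# Usage:
#   dingtalk_webhook_url ops_dingding
#   check_alertmanager_config
```

Running the check before every restart catches indentation mistakes in the YAML before they take the service down.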

#4. Reload the configuration and start the service

    systemctl daemon-reload
    systemctl enable alertmanager.service
    systemctl restart alertmanager.service
    systemctl status alertmanager.service

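Alertmanager listens on ports 9093 and 9094 by default: 9093 serves the HTTP API and web UI, and 9094 is used for cluster communication.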
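
Besides `systemctl status`, the HTTP port can be probed directly, since Alertmanager exposes a `/-/healthy` endpoint. This is a minimal sketch that assumes curl is installed and that the default address is in use. The function names are made up for illustration.

```shell
#!/bin/sh
# Build the health-check URL for an Alertmanager instance.
# The address defaults to the HTTP port mentioned above.
alertmanager_health_url() {
    printf 'http://%s/-/healthy\n' "${1:-localhost:9093}"
}

# Return 0 if Alertmanager answers the health check, non-zero otherwise.
# -f makes curl fail on HTTP errors; -sS hides progress but keeps errors.
alertmanager_is_healthy() {
    curl -fsS --max-time 5 "$(alertmanager_health_url "${1:-}")" >/dev/null
}

# Usage:
#   alertmanager_is_healthy && echo "alertmanager is up"
```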

Deploying prometheus-webhook-dingtalk, the DingTalk alert plugin

1.1. Binary deployment

    wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz

1.2. Configure prometheus-webhook-dingtalk

    $ cp config.example.yml config.yml
    ## Request timeout
    # timeout: 5s
    ## Customizable templates path
    templates:  # uncomment this line
      - contrib/templates/legacy/template.tmpl  # uncomment this line
    ## You can also override default template using `default_message`
    ## The following example to use the 'legacy' template from v0.3.0
    # default_message:
    #   title: '{{ template "legacy.title" . }}'
    #   text: '{{ template "legacy.content" . }}'
    ## Targets, previously was known as "profiles"
    targets:
      webhook1:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        # secret for signature
        secret: SEC000000000000000000000
      webhook2:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
      webhook_legacy:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        # Customize template content
        message:
          # Use legacy template
          title: '{{ template "legacy.title" . }}'
          text: '{{ template "legacy.content" . }}'
      webhook_mention_all:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        mention:
          all: true
      webhook_mention_users:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        mention:
          mobiles: ['156xxxx8827', '189xxxx8325']

1.3. Configure the alert template

    $ vim contrib/templates/legacy/template.tmpl
    {{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
    {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}
    {{ define "__text_alert_list" }}{{ range . }}
    **Labels**
    {{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}
    **Annotations**
    {{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}
    **Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
    {{ end }}{{ end }}
    {{ define "___text_alert_list" }}{{ range . }}
    ---
    **Alert name:** {{ .Labels.alertname | upper }}
    **Severity:** {{ .Labels.severity | upper }}
    **Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
    **Details:** {{ range .Annotations.SortedPairs }} {{ .Value | markdown | html }}
    {{ end }}
    **Labels:**
    {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}{{ end }}
    {{ end }}
    {{ end }}
    {{ define "___text_alertresovle_list" }}{{ range . }}
    ---
    **Alert name:** {{ .Labels.alertname | upper }}
    **Severity:** {{ .Labels.severity | upper }}
    **Started at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
    **Ended at:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}
    **Details:** {{ range .Annotations.SortedPairs }} {{ .Value | markdown | html }}
    {{ end }}
    **Labels:**
    {{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
    {{ end }}{{ end }}
    {{ end }}
    {{ end }}
    {{/* Default */}}
    {{ define "_default.title" }}{{ template "__subject" . }}{{ end }}
    {{ define "_default.content" }} [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
    {{ if gt (len .Alerts.Firing) 0 -}}
    ![alert icon](https://duojia-lemei.oss-cn-beijing.aliyuncs.com/ERROR.jpg)
    **========Alerts firing========**
    {{ template "___text_alert_list" .Alerts.Firing }}
    {{- end }}
    {{ if gt (len .Alerts.Resolved) 0 -}}
    ![resolved icon](https://duojia-lemei.oss-cn-beijing.aliyuncs.com/OK.jpg)
    **========Alerts resolved========**
    {{ template "___text_alertresovle_list" .Alerts.Resolved }}
    {{- end }}
    {{- end }}
    {{/* Legacy */}}
    {{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
    {{ define "legacy.content" }} [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
    {{ template "__text_alert_list" .Alerts.Firing }}
    {{- end }}
    {{/* Following names for compatibility */}}
    {{ define "_ding.link.title" }}{{ template "_default.title" . }}{{ end }}
    {{ define "_ding.link.content" }}{{ template "_default.content" . }}{{ end }}
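
The layout string `2006.01.02 15:04:05` passed to `dateInZone` in the template is written in Go's reference-time notation, not in strftime codes. For comparison, here is a sketch of the same format produced with `date`. It assumes a `date` that accepts `-d @SECONDS` (GNU coreutils or BusyBox) and defaults to UTC so that it does not depend on the time zone data installed on the host. `format_alert_time` is a made-up helper name.

```shell
#!/bin/sh
# Format a Unix timestamp the way the template formats .StartsAt:
# Go layout "2006.01.02 15:04:05" corresponds to strftime "%Y.%m.%d %H:%M:%S".
# The zone defaults to UTC; pass Asia/Shanghai to mirror the template
# (this needs time zone data on the host).
format_alert_time() {
    epoch="$1"
    zone="${2:-UTC}"
    TZ="$zone" date -d "@$epoch" '+%Y.%m.%d %H:%M:%S'
}

# Usage:
#   format_alert_time 0
#   format_alert_time 0 Asia/Shanghai
```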

1.4. Get the DingTalk webhook URL

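  1. Set up a DingTalk robot
  2. Create a DingTalk group; this may require administrator rights
  3. Open Group Settings → Intelligent Group Assistant → Add Robot
  4. In the dialog that opens, choose Add Robot, select Custom, then click Add
  5. Give the robot a name
  6. Pick at least one option under Security Settings (this article uses a custom keyword), then click Finish
  7. Copy the webhook URL for later use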

1.5. Start the service

    $ vim /etc/systemd/system/prometheus-webhook-dingtalk.service
    [Unit]
    Description=prometheus-webhook-dingtalk
    After=network-online.target
    [Service]
    Restart=on-failure
    ExecStart=/usr/local/monitor/prometheus-webhook-dingtalk-1.4.0/prometheus-webhook-dingtalk --ding.profile="ops_dingding=https://oapi.dingtalk.com/robot/send?access_token=88c328c8ac98*******95a15fe9c9"
    [Install]
    WantedBy=multi-user.target

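Remember to replace the webhook URL with your own. Each `--ding.profile` value has the form `name=webhook-url`, and the flag can be given several times on the command line so that alerts go to more than one DingTalk custom robot at once. Note that `--ding.profile` dates from the pre-1.0 releases of the plugin, whereas the config.yml prepared in step 1.2 is loaded with `--config.file`; run the binary with `--help` to confirm which flags your version accepts.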

    systemctl daemon-reload
    systemctl enable prometheus-webhook-dingtalk.service
    systemctl start prometheus-webhook-dingtalk.service
    systemctl status prometheus-webhook-dingtalk.service

prometheus-webhook-dingtalk listens on port 8060 by default.
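
To exercise the plugin without waiting for a real alert, a hand-made payload in Alertmanager's webhook format can be posted to the local target. The following is a sketch under assumptions: the target is called `ops_dingding` as in the Alertmanager receiver URL, curl is installed, and the payload carries only a minimal set of fields. `TestAlert`, the label values, and both function names are placeholders invented for the example.

```shell
#!/bin/sh
# Print a minimal payload in the shape Alertmanager sends to webhook
# receivers. "TestAlert" and the label values are illustrative placeholders.
sample_alert_payload() {
    cat <<'EOF'
{
  "version": "4",
  "status": "firing",
  "receiver": "web.hook",
  "groupLabels": {"alertname": "TestAlert"},
  "commonLabels": {"alertname": "TestAlert", "severity": "warning"},
  "commonAnnotations": {"summary": "manual test"},
  "externalURL": "http://localhost:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "TestAlert", "severity": "warning"},
      "annotations": {"summary": "manual test"},
      "startsAt": "2021-09-01T08:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://localhost:9090/graph"
    }
  ]
}
EOF
}

# Post the sample payload to a prometheus-webhook-dingtalk target
# on the default port.
send_sample_alert() {
    target="${1:-ops_dingding}"
    sample_alert_payload | curl -fsS -H 'Content-Type: application/json' \
        --data-binary @- "http://localhost:8060/dingtalk/$target/send"
}

# Usage:
#   send_sample_alert ops_dingding
```

If the message reaches the group, the path from the plugin to DingTalk works, and any remaining problem lies between Prometheus and Alertmanager.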

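Note: sometimes everything is configured, yet DingTalk receives no alerts, while running the ExecStart command above by hand delivers them. This is related to the ownership of the downloaded prometheus-webhook-dingtalk binary. Setting its owner and group to root fixes it:

    chown root:root /usr/local/monitor/prometheus-webhook-dingtalk-1.4.0/prometheus-webhook-dingtalk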
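
Before changing ownership, it can help to look at what the service will actually find on disk. The sketch below checks that the file exists and carries the execute bit, then prints its owner and mode. `check_binary` is a hypothetical helper, and the journalctl line in the usage comment is an extra suggestion for reading the service log.

```shell
#!/bin/sh
# Report whether a binary can be started by a service: it must exist as a
# regular file and have the execute bit set.
check_binary() {
    bin="$1"
    if [ ! -f "$bin" ]; then
        echo "missing: $bin"
        return 1
    fi
    if [ ! -x "$bin" ]; then
        echo "not executable: $bin"
        return 1
    fi
    ls -l "$bin"   # shows owner, group and mode
}

# Usage:
#   check_binary /usr/local/monitor/prometheus-webhook-dingtalk-1.4.0/prometheus-webhook-dingtalk
#   journalctl -u prometheus-webhook-dingtalk.service -n 50   # recent service logs
```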

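Configuring Prometheus

For the complete Prometheus installation and configuration, see this article. Adding DingTalk alerting only requires changes to the Prometheus configuration file.

Example: prometheus.yml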

    ......
    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ['localhost:9093']
          #- "localhost:9093"
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "/usr/local/monitor/prometheus-2.29.2/rules/*.yml"
      #- "second_rules.yml"
    .....

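In the Alertmanager configuration section (`alerting`), specify the address and port of Alertmanager, then list the alert rule files under `rule_files`.

After editing prometheus.yml, restart Prometheus for the changes to take effect.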
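
The rule files themselves are not shown in this article. As an illustration only, the sketch below writes one alerting rule into a rules directory. The rule name `InstanceDown`, its expression, the file name, and the function name are all assumptions; the `severity` label and `summary` annotation were chosen because the template in step 1.3 reads them.

```shell
#!/bin/sh
# Write an illustrative alerting rule into the given rules directory.
# The rule name, expression and file name are examples, not part of the
# original setup.
write_sample_rule() {
    rules_dir="$1"
    mkdir -p "$rules_dir"
    cat > "$rules_dir/instance_down.yml" <<'EOF'
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
EOF
}

# Usage (promtool ships in the Prometheus tarball; the location of
# prometheus.yml below is an assumption):
#   write_sample_rule /usr/local/monitor/prometheus-2.29.2/rules
#   /usr/local/monitor/prometheus-2.29.2/promtool check config /usr/local/monitor/prometheus-2.29.2/prometheus.yml
```

`promtool check config` also validates the rule files referenced from prometheus.yml, so one command covers both before the restart.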

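It is common for an alert rule to fire without any notification arriving. One likely cause is that the custom keyword set in the robot's security settings (step 1.4) does not appear in the alert message.
The following command tests whether the DingTalk robot can receive messages; the message content must contain that keyword.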

    curl 'https://oapi.dingtalk.com/robot/send?access_token=f0c2dd678*****2f75' \
      -H 'Content-Type: application/json' \
      -d '{"msgtype": "text",
           "text": {
             "content": "Alert shooter: DingTalk robot group message test"
           }
          }'
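
Because a robot secured with a custom keyword does not deliver messages that lack it, a local check before sending can save a round trip. This is a minimal sketch: the keyword is whatever was set in the robot's security settings, `DINGTALK_TOKEN` is a hypothetical environment variable holding the access token, and both function names are made up for illustration.

```shell
#!/bin/sh
# Return 0 if the message contains the robot's custom keyword.
contains_keyword() {
    case "$1" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Send a text message only if it would pass the keyword filter.
# The message is placed into the JSON body as-is, so it must not contain
# double quotes or backslashes in this simplified version.
send_text() {
    msg="$1"
    keyword="$2"
    if ! contains_keyword "$msg" "$keyword"; then
        echo "message lacks keyword '$keyword'; the robot would not deliver it" >&2
        return 1
    fi
    curl -fsS "https://oapi.dingtalk.com/robot/send?access_token=$DINGTALK_TOKEN" \
        -H 'Content-Type: application/json' \
        -d "{\"msgtype\": \"text\", \"text\": {\"content\": \"$msg\"}}"
}

# Usage:
#   DINGTALK_TOKEN=... send_text "Alert: test message" "Alert"
```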