1. Configure and start alertmanager

  ## Create the alertmanager data directory
  mkdir -p /data/alertmanager/
  chmod -R 777 /data/alertmanager

  ## Edit the alertmanager configuration file
  # For a detailed description of the options, see the official documentation: https://prometheus.io/docs/alerting/configuration/
  # Write the configuration file /etc/alertmanager/config.yml based on the template in Appendix 1

  ## Start alertmanager with docker
  docker run -d -p 9093:9093 \
    -v /etc/alertmanager/config.yml:/etc/alertmanager/config.yml \
    -v /data/alertmanager:/data/alertmanager \
    --name alertmanager \
    --restart=always \
    quay.io/prometheus/alertmanager \
    --config.file=/etc/alertmanager/config.yml \
    --storage.path=/data/alertmanager
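
Once the container is running, a quick sanity check (a sketch, assuming alertmanager is reachable on localhost:9093) is to hit its health endpoint:

  # Alertmanager exposes /-/healthy and /-/ready; both should return HTTP 200 when the service is up
  curl -i http://localhost:9093/-/healthy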

2. Add the alertmanager configuration to prometheus

Add the following to the prometheus configuration file:

  # Alertmanager configuration
  alerting:
    alertmanagers:
      - static_configs:
          - targets: ["localhost:9093"] # the address and port alertmanager listens on, i.e. the endpoint prometheus uses to talk to alertmanager
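
Before restarting, the merged configuration can be validated with promtool (a sketch, assuming the container is named prometheus and the config is mounted at /etc/prometheus/prometheus.yml):

  # promtool ships inside the official prometheus image; this checks prometheus.yml and any referenced rule files
  docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
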
  # Restart prometheus
  docker restart prometheus
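
After the restart, you can confirm that prometheus has picked up the alertmanager endpoint (assuming prometheus listens on localhost:9090):

  # the active alertmanagers listed in the response should include the localhost:9093 target configured above
  curl http://localhost:9090/api/v1/alertmanagers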

3. Configure rules in prometheus

1. Add the alert rules configuration to the prometheus configuration file:

  # alert rules
  rule_files:
    - "/etc/prometheus/rules/*.yml"

2. Add the rule definitions:
vim /etc/prometheus/rules/testAlert.yml

  groups:
  - name: ServiceStatus                # rule group name
    rules:
    - alert: ServiceStatusAlert        # name of this rule
      expr: up == 0                    # alert condition; up is 1 when the instance is online, 0 when it is down
      for: 10s                         # how long the condition must hold before the alert fires
      labels:                          # custom labels
        project: zhidaoAPP
      annotations:                     # alert body
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  - name: hostStatsAlert
    rules:
    - alert: hostCpuUsageAlert
      expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
      for: 1m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} CPU usage high"
        description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
    - alert: hostMemUsageAlert
      expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
      for: 1m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} MEM usage high"
        description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
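
The rule file syntax can be checked before restarting prometheus (a sketch, again assuming the file is visible inside the prometheus container at the same path):

  # reports the number of rules found, or a parse error with the offending line
  docker exec prometheus promtool check rules /etc/prometheus/rules/testAlert.yml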

3. Restart prometheus

  docker restart prometheus

View the alert rules in prometheus:
http://{YOUR_prometheus_IP}:9090/alerts
View the alert status in alertmanager:
http://{YOUR_alertmanager_IP}:9093/#/alerts
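
The same information is also available over the HTTP APIs, which is handy for scripted checks (assuming the default ports and a reasonably recent alertmanager that serves API v2):

  # pending/firing alerts as seen by prometheus
  curl http://{YOUR_prometheus_IP}:9090/api/v1/alerts
  # alerts currently held by alertmanager (use /api/v1/alerts on older versions)
  curl http://{YOUR_alertmanager_IP}:9093/api/v2/alerts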

4. Add DingTalk alerting

  1. Start the dingtalk plugin (a manual webhook test is sketched after this list):

    docker run -d --name dingtalk \
      -p 8060:8060 \
      --restart=always \
      docker.io/timonwong/prometheus-webhook-dingtalk \
      --ding.profile="ops_dingding=https://oapi.dingtalk.com/robot/send?access_token=139c51c0c3f8dabf9d0ea50b042ef6593bea61340a7d116ef2ce51e4e538b8a9" \
      --ding.profile="webhook2=https://oapi.dingtalk.com/robot/send?access_token=yyyyyyyyyyy" # multiple profiles may be specified
  2. Configure alertmanager: add a webhook receiver that points at the dingtalk plugin (see the webhook_configs entry in the Appendix 1 template).
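
To test step 1 end to end without waiting for a real alert, the plugin endpoint can be called directly with a minimal alertmanager-style webhook payload (a sketch; field names follow the alertmanager webhook format, values are placeholders, and fields omitted here may simply render as blanks in the DingTalk message):

    curl -H "Content-Type: application/json" \
      -d '{"version":"4","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"TestAlert","instance":"manual-test"},"annotations":{"summary":"manual test message"}}]}' \
      http://localhost:8060/dingtalk/ops_dingding/send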

References:

  1. https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/alert/prometheus-alert-rule


Appendix:

1. Alertmanager configuration template

Example configuration: https://raw.githubusercontent.com/prometheus/alertmanager/master/doc/examples/simple.yml

vim /etc/alertmanager/config.yml

  global:
    # SMTP (email) settings
    smtp_smarthost: smtp.ym.163.com:587 # for an enterprise mailbox be sure to use port 587; mail sent via port 465 fails
    smtp_from: alert@xxx.com
    smtp_auth_username: alert@xxx.com
    smtp_auth_identity: alert@xxx.com
    smtp_auth_password: XXXXXXX
  route:
    ## default receiver
    receiver: 'default'
    group_wait: 30s
    group_interval: 1m
    repeat_interval: 4h
    group_by: ['cluster','alertname']
    routes:
    - receiver: webhook
      group_wait: 10s
      match: # use match_re for regular-expression matching
        alertname: ServiceStatusAlert # label(s) an alert must carry to be routed to this receiver
  receivers:
  - name: default
    email_configs:
    - to: 152xxxx8332@163.com
      send_resolved: true
  # webhook -- DingTalk
  - name: webhook
    webhook_configs:
    - url: http://{prometheus-webhook-dingtalk_IP}:8060/dingtalk/ops_dingding/send
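
After editing, the file can be validated before (re)starting alertmanager (a sketch, assuming the container from step 1 is running with the config mounted at /etc/alertmanager/config.yml):

  # amtool ships in the alertmanager image; this reports the receivers and routes found, or a parse error
  docker exec alertmanager amtool check-config /etc/alertmanager/config.yml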