I. Deploy Alertmanager

1. Download the package

This guide installs from the binary release. Download page: https://prometheus.io/download/

    wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz

2. Extract and move the package

    tar -xf alertmanager-0.21.0.linux-amd64.tar.gz
    # Move the extracted directory (not the tarball) into place
    mv alertmanager-0.21.0.linux-amd64 /usr/local/alertmanager

3. Create the data directory and the service user

    mkdir /usr/local/alertmanager/data
    useradd prometheus
    chown -R prometheus:prometheus /usr/local/alertmanager/

4. Manage the service with systemd

vim /usr/lib/systemd/system/alertmanager.service

    [Unit]
    Description=Alertmanager
    After=network.target

    [Service]
    Type=simple
    User=prometheus
    ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

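Start the service:

    systemctl start alertmanager.service

Check the running status with systemctl status alertmanager.service, then open the web UI at http://ip:9093, where ip is the address of the Alertmanager host.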

II. Configure Alertmanager to send email

1. Edit the Alertmanager configuration file

vim alertmanager.yml

    # Global settings
    global:
      resolve_timeout: 5m  # resolve timeout, default 5m
      # The next five lines configure the SMTP service
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'XXXX@163.com'
      smtp_auth_username: 'XXXX@163.com'
      smtp_auth_password: 'JEYIXXXXXXAUO'
      smtp_require_tls: false
    # Routing
    route:
      group_by: ['alertname']  # labels used to group alerts
      group_wait: 10s          # wait before the first notification for a new group
      group_interval: 10s      # wait before sending updates for an existing group
      repeat_interval: 30m     # interval for re-sending a notification that is still firing
      receiver: 'email'        # default receiver
    # Receivers
    receivers:
    - name: 'email'  # receiver name
      email_configs:
      - send_resolved: true  # also notify when the alert resolves
        to: '196XXXX096@qq.com'  # recipient address
        html: '{{ template "email.html" . }}'  # template to use; the default template is loaded if this is omitted
        headers: { Subject: "[WARN] Prometheus alert email" }  # email subject
    templates:  # template paths; relative paths and globs such as template/*.tmpl also work
    - '/usr/local/alertmanager/template/email.tmpl'
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'dev', 'instance']
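The inhibit_rules entry is easier to reason about with a small model. The following is a minimal Python sketch of the matching logic as the Alertmanager documentation describes it, not Alertmanager's actual code. The alert name `HighCPU` and the instance `node1:9100` are hypothetical. It also shows that label values are compared exactly, so a rule that labels alerts `Warning` (capitalized, as the rule file later in this guide does) would not match `severity: 'warning'`.

```python
def is_inhibited(target, firing, rule):
    """Return True if `target` is muted by any alert in `firing` under `rule`."""
    # The target must carry every label/value pair listed in target_match.
    if any(target.get(k) != v for k, v in rule["target_match"].items()):
        return False
    for source in firing:
        if source is target:
            continue
        # The source must carry every pair listed in source_match ...
        if any(source.get(k) != v for k, v in rule["source_match"].items()):
            continue
        # ... and both alerts must agree on every label named in `equal`.
        # A label missing on both sides counts as equal (both are empty).
        if all(source.get(k, "") == target.get(k, "") for k in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}

# Hypothetical alerts, represented only by their label sets.
critical = {"alertname": "HighCPU", "instance": "node1:9100", "severity": "critical"}
warning = {"alertname": "HighCPU", "instance": "node1:9100", "severity": "warning"}
capitalized = {"alertname": "HighCPU", "instance": "node1:9100", "severity": "Warning"}

print(is_inhibited(warning, [critical, warning], RULE))          # True
print(is_inhibited(capitalized, [critical, capitalized], RULE))  # False
```

In practice this means the severity values in the rule file and in inhibit_rules need to use the same spelling and case for inhibition to take effect.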

2. Create the alert template

mkdir /usr/local/alertmanager/template
vim /usr/local/alertmanager/template/email.tmpl

    {{ define "email.html" }}
    {{ range .Alerts }}
    <pre>
    Instance: {{ .Labels.instance }}
    Summary: {{ .Annotations.summary }}
    Description: {{ .Annotations.description }}
    Severity: {{ .Labels.severity }}
    Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
    </pre>
    {{ end }}
    {{ end }}
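To preview roughly what this template emits, here is a Python imitation. It is only a sketch: Alertmanager renders the real template with Go's text/template, the English field names below are this example's own wording, and the sample alert is made up. The one detail worth noting is that the Go layout string `2006-01-02 15:04:05` corresponds to the strftime pattern `%Y-%m-%d %H:%M:%S`.

```python
from datetime import datetime, timezone


def render_email(alerts):
    """Imitate email.tmpl: emit one <pre> block per alert."""
    blocks = []
    for alert in alerts:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # Go layout "2006-01-02 15:04:05" maps to this strftime pattern.
        started = alert["startsAt"].strftime("%Y-%m-%d %H:%M:%S")
        lines = [
            "<pre>",
            f"Instance: {labels.get('instance', '')}",
            f"Summary: {annotations.get('summary', '')}",
            f"Description: {annotations.get('description', '')}",
            f"Severity: {labels.get('severity', '')}",
            f"Started at: {started}",
            "</pre>",
        ]
        blocks.append("\n".join(lines))
    return "\n".join(blocks)


# Hypothetical alert data, shaped like the fields the template reads.
sample = [{
    "labels": {"instance": "192.168.196.11:9100", "severity": "Error"},
    "annotations": {
        "summary": "client is not online",
        "description": "192.168.196.11:9100 client is offline",
    },
    "startsAt": datetime(2021, 1, 5, 8, 30, 0, tzinfo=timezone.utc),
}]

print(render_email(sample))
```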

3. Edit the Prometheus configuration

1) Edit prometheus.yml

vim prometheus.yml

    # Add the Alertmanager host
    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 192.168.196.11:9093
    # Add the path of the alert rule file. A relative path is used here,
    # so rule.yml must exist at that relative location.
    rule_files:
      - rule.yml

2) Create the rule file rule.yml

vim rule.yml

The CPU rule below uses the metric name node_cpu. node_exporter 0.16 and later expose this metric as node_cpu_seconds_total, so adjust the expression to match the version in use.

    groups:
    - name: node_cpu
      rules:
      - alert: "CPU usage too high"
        expr: (1- (sum(increase(node_cpu{mode="idle"}[1m])) by (instance) / sum(increase(node_cpu[1m])) by (instance))) * 100 > 80
        for: 3s
        labels:
          severity: Warning
        annotations:
          summary: "{{ $labels.instance }} total CPU usage is too high"
          description: "{{ $labels.instance }} current CPU usage: {{ $value }}%"
    - name: node_up
      rules:
      - alert: "Client offline"
        expr: up == 0
        for: 3s
        labels:
          severity: Error
        annotations:
          summary: "client is not online"
          description: "{{ $labels.instance }} client is offline"
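The CPU expression is plain arithmetic once the two `increase(...)` sums are known: usage is one minus the idle share, scaled to a percentage, and the alert condition is that value exceeding 80. A small worked example in Python, with made-up numbers (a hypothetical 4-core host over one minute, so 240 CPU-seconds in total, 30 of them idle):

```python
def cpu_usage_percent(idle_increase, total_increase):
    """(1 - idle / total) * 100, the same shape as the rule expression."""
    if total_increase <= 0:
        raise ValueError("total_increase must be positive")
    return (1 - idle_increase / total_increase) * 100


def should_alert(usage_percent, threshold=80.0):
    """Mirror the `> 80` comparison at the end of the expression."""
    return usage_percent > threshold


# Hypothetical sample: 4 cores * 60 s = 240 CPU-seconds, 30 of them idle.
usage = cpu_usage_percent(idle_increase=30.0, total_increase=240.0)
print(usage)                # 87.5
print(should_alert(usage))  # True
```

The value printed here is what `{{ $value }}` would carry into the description annotation for that instance.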

4. Restart Alertmanager and Prometheus

    systemctl restart alertmanager
    systemctl restart prometheus

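5. Simulate a node outage and receive the email

Stop node_exporter on one node to simulate an outage:

    systemctl stop node_exporter

The Alerts page in Prometheus then shows that the alert has fired, and the recipient's mailbox receives the alert email.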
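Stopping node_exporter exercises the whole chain. To test only the Alertmanager-to-email leg, an alert can also be posted straight to Alertmanager's v2 API (`POST /api/v2/alerts`). The sketch below uses only the Python standard library; the alert name `TestAlert`, the annotation text, and the default base URL are assumptions made for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # RFC 3339 timestamp in UTC


def build_test_alert(instance, minutes=5):
    """Build a one-element alert list in the shape the v2 API accepts."""
    now = datetime.now(timezone.utc)
    return [{
        # Hypothetical labels; `alertname` drives grouping in this guide's route.
        "labels": {
            "alertname": "TestAlert",
            "instance": instance,
            "severity": "warning",
        },
        "annotations": {
            "summary": "manual test alert",
            "description": instance + " test alert sent by hand",
        },
        "startsAt": now.strftime(TIME_FORMAT),
        # After endsAt passes, Alertmanager treats the alert as resolved.
        "endsAt": (now + timedelta(minutes=minutes)).strftime(TIME_FORMAT),
    }]


def post_alerts(alerts, base_url="http://localhost:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


alerts = build_test_alert("192.168.196.11:9100")
print(json.dumps(alerts, indent=2))
```

Calling `post_alerts(alerts)` on the Alertmanager host sends the alert; it is left out of the script body so the example can be run without a live server.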

III. Send alert notifications through DingTalk

1. Download the plugin

Project page: https://github.com/timonwong/prometheus-webhook-dingtalk/releases

    wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz

2. Extract into the installation directory

    tar xf /root/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz -C /usr/local/alertmanager
    cd /usr/local/alertmanager && mv prometheus-webhook-dingtalk-1.4.0.linux-amd64/ prometheus-webhook-dingtalk

3. Get the Webhook URL from DingTalk

Copy the Webhook URL of the group robot from DingTalk; it is needed in the next step.

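4. Start webhook-dingtalk

Run the following from /usr/local/alertmanager/prometheus-webhook-dingtalk:

    ./prometheus-webhook-dingtalk --ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token=75b269eeXXXXXXX19ac0aXXXXXe2bb87ef2e7f2f3cbXXXXXXX57fd"

Note: the value after webhook1= is the Webhook URL copied from DingTalk.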

5. Edit alertmanager.yml

    global:
      resolve_timeout: 5m
      smtp_require_tls: false
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 30m
      receiver: 'webhook'  # receiver to use
    receivers:  # receiver settings
    - name: 'webhook'
      webhook_configs:
      - url: http://localhost:8060/dingtalk/webhook1/send
        send_resolved: true  # also notify when the alert resolves
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'dev', 'instance']
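With this receiver, Alertmanager POSTs a JSON document to the webhook URL and the DingTalk bridge turns it into a chat message. The sketch below builds a payload in Alertmanager's webhook format (version "4") and could be used to test the bridge on its own. The alert name, annotations, and timestamps are invented for illustration, and whether the bridge accepts a hand-built payload exactly like this one is an assumption to verify against your installed version.

```python
import json
import urllib.request


def build_webhook_payload(instance, status="firing"):
    """Build a payload shaped like Alertmanager's webhook message (version 4)."""
    # Hypothetical labels and annotations for a single test alert.
    labels = {"alertname": "TestAlert", "instance": instance, "severity": "warning"}
    annotations = {
        "summary": "manual webhook test",
        "description": instance + " test message for the DingTalk bridge",
    }
    return {
        "version": "4",
        "groupKey": '{}:{alertname="TestAlert"}',
        "status": status,  # "firing" or "resolved"
        "receiver": "webhook",
        "groupLabels": {"alertname": "TestAlert"},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "http://localhost:9093",
        "alerts": [{
            "status": status,
            "labels": labels,
            "annotations": annotations,
            "startsAt": "2021-01-05T08:30:00Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "",
        }],
    }


def post_payload(payload, url="http://localhost:8060/dingtalk/webhook1/send"):
    """POST the payload to the bridge and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_webhook_payload("192.168.196.11:9100")
print(json.dumps(payload, indent=2))
```

Calling `post_payload(payload)` on the host running the bridge sends the message. The path segment `webhook1` must match the profile name given when the bridge was started.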

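6. Simulate a node outage and receive the DingTalk robot message

Stop node_exporter on one node to simulate an outage:

    systemctl stop node_exporter

The Alerts page in Prometheus then shows that the alert has fired, and the alert arrives in the DingTalk group.

Start node_exporter on that node again to simulate recovery:

    systemctl start node_exporter

The DingTalk group then receives the resolved notification.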