Installing Alertmanager

Download: https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
Upload the package to the server, then extract it to /usr/local/ and rename the directory:

[root@Prometheus software]# tar -zxvf alertmanager-0.24.0.linux-amd64.tar.gz -C /usr/local/
[root@Prometheus software]# cd /usr/local/
[root@Prometheus local]# mv alertmanager-0.24.0.linux-amd64 alertmanager

Create the systemd unit file:
[root@Prometheus local]# vim /usr/lib/systemd/system/alertmanager.service

[Unit]
Description=alertmanager System
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager --config.file /usr/local/alertmanager/alertmanager.yml
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target

Start the service and enable it at boot:
[root@Prometheus local]# systemctl daemon-reload
[root@Prometheus local]# systemctl start alertmanager
[root@Prometheus local]# systemctl enable --now alertmanager
[root@Prometheus local]# systemctl status alertmanager
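
Besides `systemctl status`, Alertmanager exposes a `/-/healthy` HTTP endpoint on its listen port (9093 by default). The following is a minimal sketch of a probe, assuming the service listens on the default port; the helper names `health_url` and `is_healthy` are made up for illustration.

```python
import urllib.error
import urllib.request


def health_url(base_url):
    """Build the URL of Alertmanager's health endpoint."""
    return base_url.rstrip("/") + "/-/healthy"


def is_healthy(base_url, timeout=3.0):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status.
        return False
```

Calling `is_healthy("http://192.168.10.111:9093")` from a machine that can reach the server should return True once the service is running.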

Integrate Prometheus with Alertmanager:
[root@Prometheus local]# mkdir -p /usr/local/prometheus/rules
[root@Prometheus local]# vim /usr/local/prometheus/prometheus.yml

# 1. Edit the alerting section of prometheus.yml
# 2. Edit the rule_files section of prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.10.111:9093 # Alertmanager address
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml" # alerting rule files
  # - "first_rules.yml"
  # - "second_rules.yml"

Alertmanager email alerts

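Enable SMTP on the 163 mailbox

In the 163 mailbox settings, enable the SMTP service and generate an authorization code. The authorization code is what goes into smtp_auth_password in the configuration below.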

Configure email sending

[root@Prometheus local]# vim /usr/local/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25' # send mail through the 163 SMTP server
  smtp_from: 'muyaobin@163.com' # sender; use your own 163 address
  smtp_auth_username: 'muyaobin@163.com' # same as smtp_from
  smtp_auth_password: 'LKPZVCYSLHVEHGRI' # authorization code of your 163 mailbox
  smtp_require_tls: false # do not require TLS
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h # resend a still-firing alert every hour
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'muyaobin@163.com'
        send_resolved: true # also send an email when the alert is resolved
inhibit_rules: # alert inhibition rules
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname','dev','instance']
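
The inhibition rule at the end of this file is the least obvious part of the configuration. The following Python sketch models what it is meant to do, assuming the label key is `severity` (the label that alerting rules normally set): a `warning` alert is muted while a `critical` alert with the same `alertname`, `dev` and `instance` values is firing. The function names are hypothetical, and this is a simplified model, not Alertmanager's actual implementation.

```python
def matches(labels, matcher):
    """True if every label in matcher has exactly that value in labels."""
    return all(labels.get(name) == value for name, value in matcher.items())


def is_inhibited(target, firing_alerts, rule):
    """Decide whether `target` is muted by any alert in `firing_alerts`."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # A label missing on both sides counts as equal (both read as "").
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


# Same values as the inhibit_rules entry in alertmanager.yml.
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}
```

Note that neither alert in this setup carries a `dev` label; in this model a label absent from both alerts is treated as equal, so the rule still applies.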

Add alerting rules

[root@Prometheus local]# vim /usr/local/prometheus/rules/host_monitor.yml

groups:
  - name: node-down
    rules:
      - alert: node-down
        expr: up == 0
        for: 5s # optional hold time: the alert is sent only after the condition has held this long; until then a new alert is in the pending state
        labels: # custom labels attached to the alert
          severity: 1
          team: node
        annotations:
          summary: "{{$labels.instance}}"
          description: "{{$labels.instance}}: job {{$labels.job}} has been down for more than 5 seconds"
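
To make the effect of the `for` clause concrete, here is a small Python simulation of the pending and firing states across successive rule evaluations. It is a sketch under stated assumptions: the 15-second evaluation interval is assumed (use whatever `evaluation_interval` your prometheus.yml sets), and the function name `simulate_states` is hypothetical.

```python
def simulate_states(up_values, for_seconds, interval_seconds):
    """Return the alert state after each evaluation of `up == 0`.

    up_values holds one sample of the `up` metric per evaluation.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for step, up in enumerate(up_values):
        now = step * interval_seconds
        if up == 0:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # Condition no longer holds: the alert is dropped.
            active_since = None
            states.append("inactive")
    return states
```

With `for: 5s` and a 15-second interval, the first failed evaluation leaves the alert pending and the next one promotes it to firing, so in practice the delay is governed by the evaluation interval rather than by the 5 seconds.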

Validate the configuration file

[root@Prometheus local]# /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml

Restart Prometheus

[root@Prometheus local]# cd /usr/local/prometheus
[root@Prometheus prometheus]# pkill prometheus
[root@Prometheus prometheus]# lsof -i:9090
[root@Prometheus prometheus]# ./prometheus &

Trigger an alert

When node_exporter is killed, an alert email is sent.
When node_exporter is restarted, a recovery email is sent.
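
The email path can also be exercised without stopping node_exporter, by posting a synthetic alert to Alertmanager's `/api/v2/alerts` HTTP endpoint. The following is a minimal sketch; the helper names `build_alert` and `send_alerts` are hypothetical, and the label values are illustrative assumptions that mirror the rule above.

```python
import json
import urllib.request


def build_alert(alertname, instance, severity="1", summary=""):
    """Return one alert in the shape accepted by POST /api/v2/alerts."""
    return {
        "labels": {
            "alertname": alertname,
            "instance": instance,
            "severity": severity,
        },
        "annotations": {"summary": summary or instance},
    }


def send_alerts(base_url, alerts, timeout=5.0):
    """POST a list of alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

For example, `send_alerts("http://192.168.10.111:9093", [build_alert("node-down", "192.168.10.112:9100")])` submits one test alert; the instance address here is made up. The notification should arrive after `group_wait` has elapsed.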

Improve the alert template

Create a template file:
[root@Prometheus ~]# vim /usr/local/alertmanager/email.tmpl

{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} <br>
Started at: {{ .StartsAt }} <br>
=========end==========<br>
{{ end }}
{{ end }}

Modify the configuration file to use the template:
[root@Prometheus ~]# vim /usr/local/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25' # send mail through the 163 SMTP server
  smtp_from: 'muyaobin@163.com' # sender; use your own 163 address
  smtp_auth_username: 'muyaobin@163.com' # same as smtp_from
  smtp_auth_password: 'LKPZVCYSLHVEHGRI' # authorization code of your 163 mailbox
  smtp_require_tls: false # do not require TLS
templates:
  - '/usr/local/alertmanager/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h # resend a still-firing alert every hour
  receiver: 'email' # must match a name under receivers below
receivers:
  - name: 'email'
    email_configs:
      - to: 'muyaobin@163.com'
        html: '{{ template "email.to.html" . }}' # render the email body from the template
        send_resolved: true # also send an email when the alert is resolved
inhibit_rules: # alert inhibition rules
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname','dev','instance']
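
Because the template iterates over `.Alerts`, one email can carry several alerts. Which alerts end up together is decided by `group_by` in the route. The following Python sketch models that batching as grouping on the listed label values; the function name `group_alerts` is hypothetical, and the model ignores the timing controlled by `group_wait` and `group_interval`.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert label sets by their values for the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts with identical values for every group_by label share a key.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)
```

With `group_by: ['alertname']`, two hosts that go down at about the same time produce a single node-down email listing both instances, rather than two separate emails.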

Check the alertmanager.yml configuration file:
[root@Prometheus ~]# /usr/local/alertmanager/amtool check-config /usr/local/alertmanager/alertmanager.yml
[root@Prometheus ~]# systemctl restart alertmanager

Simulate a host outage:
When node_exporter is killed, an alert email is sent.
Modify the template to add recovery information:
[root@Prometheus ~]# vim /usr/local/alertmanager/email.tmpl

{{ define "email.to.html" }}
{{ if gt (len .Alerts.Firing) 0 }}{{ range .Alerts.Firing }}
@Alert: <br>
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} (stopped working) <br>
Started at: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }} <br>
{{ end }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}{{ range .Alerts.Resolved }}
@Recovery: <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} is back to normal <br>
Recovered at: {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }} <br>
{{ end }}
{{ end }}
{{ end }}
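
The string "2006-01-02 15:04:05" in the template is not a sample date but Go's reference time layout, and `.Local` converts the timestamp to the Alertmanager host's time zone. As a rough illustration, the Python sketch below produces the same output format and separates firing from resolved alerts. The function names are hypothetical, the UTC+8 zone in the usage note is an assumed example, and timestamps are assumed to be RFC 3339 strings without fractional seconds.

```python
from datetime import datetime, timedelta, timezone

# strftime equivalent of the Go layout "2006-01-02 15:04:05".
STRFTIME_LAYOUT = "%Y-%m-%d %H:%M:%S"


def format_like_template(rfc3339, local_tz):
    """Render a UTC timestamp the way `.Local.Format` would in local_tz."""
    parsed = datetime.fromisoformat(rfc3339.replace("Z", "+00:00"))
    return parsed.astimezone(local_tz).strftime(STRFTIME_LAYOUT)


def split_by_status(alerts):
    """Return (firing, resolved) lists, like .Alerts.Firing and .Alerts.Resolved."""
    firing = [a for a in alerts if a["status"] == "firing"]
    resolved = [a for a in alerts if a["status"] == "resolved"]
    return firing, resolved
```

For instance, `format_like_template("2022-05-01T00:30:00Z", timezone(timedelta(hours=8)))` gives "2022-05-01 08:30:00" on a host assumed to run in UTC+8.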

Simulate a host outage:
When node_exporter is killed, an alert email is sent.
When node_exporter is restarted, a recovery email is sent.