Alertmanager installation
Download URL: https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
Upload the installation package to the server, then unpack it:
[root@Prometheus software]# tar -zxvf alertmanager-0.24.0.linux-amd64.tar.gz -C /usr/local/
[root@Prometheus software]# cd /usr/local/
[root@Prometheus local]# mv alertmanager-0.24.0.linux-amd64 alertmanager
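Before unpacking, it can be worth checking that the uploaded archive is intact. The following is a minimal sketch, assuming a checksum list in the usual `sha256sum` format (hash, whitespace, file name) is available next to the archive; the function name `verify_tarball` and the file name `sha256sums.txt` are illustrative assumptions, not part of the original steps.

```shell
# verify_tarball TARBALL SUMS_FILE
# Succeeds only if the SHA-256 of TARBALL matches its entry in SUMS_FILE.
# (Hypothetical helper; SUMS_FILE is assumed to be in `sha256sum` output format.)
verify_tarball() {
    tarball=$1
    sums=$2
    name=$(basename "$tarball")
    # Look up the expected hash for this file name in the checksum list.
    expected=$(awk -v f="$name" '$2 == f { print $1 }' "$sums")
    if [ -z "$expected" ]; then
        echo "no checksum entry for $name" >&2
        return 1
    fi
    # Compute the actual hash of the file on disk and compare.
    actual=$(sha256sum "$tarball" | awk '{ print $1 }')
    [ "$expected" = "$actual" ]
}
```

Example call: verify_tarball alertmanager-0.24.0.linux-amd64.tar.gz sha256sums.txt && echo "checksum OK"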
Create the systemd unit file:
[root@Prometheus local]# vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager System
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager --config.file /usr/local/alertmanager/alertmanager.yml
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
Start the service:
[root@Prometheus local]# systemctl daemon-reload
[root@Prometheus local]# systemctl start alertmanager
[root@Prometheus local]# systemctl enable --now alertmanager
[root@Prometheus local]# systemctl status alertmanager
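Besides `systemctl status`, the service can be probed over HTTP. The sketch below polls Alertmanager's `/-/ready` endpoint until it answers; the function name `wait_for_alertmanager`, the retry count, and the one-second delay are illustrative choices, and the base URL is whatever address your instance listens on (port 9093 by default).

```shell
# wait_for_alertmanager BASE_URL [ATTEMPTS]
# Polls BASE_URL/-/ready once per second until it responds successfully.
# (Hypothetical helper; ATTEMPTS defaults to 10.)
wait_for_alertmanager() {
    base_url=$1
    attempts=${2:-10}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -f: non-zero exit on HTTP errors; -s: no progress output.
        if curl -fs "$base_url/-/ready" >/dev/null; then
            echo "alertmanager is ready"
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    echo "alertmanager not ready after $attempts attempts" >&2
    return 1
}
```

Example call: wait_for_alertmanager http://192.168.10.111:9093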
Integrate Prometheus with Alertmanager:
[root@Prometheus local]# mkdir -p /usr/local/prometheus/rules
[root@Prometheus local]# vim /usr/local/prometheus/prometheus.yml
# 1. Edit the alerting section of prometheus.yml
# 2. Edit the rule_files section of prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.10.111:9093   # Alertmanager address
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"   # alert rule files
  # - "first_rules.yml"
  # - "second_rules.yml"
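Once Prometheus has been restarted with this configuration, its HTTP API can confirm that the Alertmanager target was picked up. The sketch below inspects the JSON body returned by `/api/v1/alertmanagers`; the helper name `has_active_alertmanager` is an assumption, and the `sed` extraction is a deliberately crude parse that assumes compact JSON (a tool such as `jq` would be more robust if it is installed).

```shell
# has_active_alertmanager JSON TARGET
# Succeeds when TARGET (host:port) appears in the "activeAlertmanagers"
# array of the given /api/v1/alertmanagers response body.
# (Hypothetical helper; assumes compact JSON without spaces after colons.)
has_active_alertmanager() {
    json=$1
    target=$2
    # Keep only the contents of the activeAlertmanagers array, then search it.
    printf '%s\n' "$json" \
        | sed -n 's/.*"activeAlertmanagers":\[\([^]]*\)].*/\1/p' \
        | grep -qF "$target"
}
```

Example call: has_active_alertmanager "$(curl -s http://localhost:9090/api/v1/alertmanagers)" 192.168.10.111:9093 && echo "Alertmanager registered"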
Alertmanager email alerts
Enable SMTP for the 163 mailbox
Configure email delivery
[root@Prometheus local]# vim /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'        # send mail through the 163 SMTP server
  smtp_from: 'muyaobin@163.com'            # sender; use your own 163 address
  smtp_auth_username: 'muyaobin@163.com'   # same as smtp_from
  smtp_auth_password: 'LKPZVCYSLHVEHGRI'   # authorization code of your 163 mailbox
  smtp_require_tls: false                  # do not require TLS
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h   # resend a still-firing alert every hour
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'muyaobin@163.com'
        send_resolved: true   # also send an email when the alert is resolved
inhibit_rules:   # alert inhibition rules
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Add an alerting rule
[root@Prometheus local]# vim /usr/local/prometheus/rules/host_monitor.yml
groups:
  - name: node-down
    rules:
      - alert: node-down
        expr: up == 0
        for: 5s   # optional pending duration: the alert fires only after the condition has held this long; until then it is in the pending state
        labels:   # custom labels attached to the alert
          severity: 1
          team: node
        annotations:
          summary: "{{$labels.instance}}"
          description: "{{$labels.instance}}: job {{$labels.job}} has been down for more than 5 seconds"
Validate the configuration file
[root@Prometheus local]# /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
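Rule files can also be checked individually with `promtool check rules`, which helps narrow down which file is at fault when several exist. The loop below is a small sketch; the function name `check_rule_files` and the `PROMTOOL` override variable are assumptions introduced for illustration, and the default path is the one used in this guide.

```shell
# check_rule_files DIR
# Runs `promtool check rules` on every *.yml file in DIR and
# returns non-zero if any file fails the check.
# (Hypothetical helper; set PROMTOOL to use a different promtool binary.)
check_rule_files() {
    dir=$1
    status=0
    for f in "$dir"/*.yml; do
        # Skip the literal pattern when the glob matched nothing.
        [ -e "$f" ] || continue
        "${PROMTOOL:-/usr/local/prometheus/promtool}" check rules "$f" || status=1
    done
    return "$status"
}
```

Example call: check_rule_files /usr/local/prometheus/rules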
Restart Prometheus
[root@Prometheus local]# cd /usr/local/prometheus
[root@Prometheus prometheus]# pkill prometheus
[root@Prometheus prometheus]# lsof -i:9090
[root@Prometheus prometheus]# ./prometheus &
Trigger an alert
Killing node_exporter causes an alert email to be sent.
Restarting node_exporter causes a recovery email to be sent.
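The email path can also be exercised without stopping node_exporter by posting a synthetic alert straight to Alertmanager's v2 API. This is a minimal sketch under stated assumptions: the helper names `build_test_alert` and `send_test_alert`, the alert name, and the annotation text are all made up for illustration, and the arguments are inserted into the JSON without escaping, so they must not contain quotes or backslashes.

```shell
# build_test_alert ALERTNAME INSTANCE SEVERITY
# Prints a JSON array holding one alert, in the shape accepted by
# POST /api/v2/alerts. (Hypothetical helper; no JSON escaping is done.)
build_test_alert() {
    printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"%s"},' "$1" "$2" "$3"
    printf '"annotations":{"summary":"%s","description":"synthetic test alert"}}]\n' "$2"
}

# send_test_alert BASE_URL ALERTNAME INSTANCE SEVERITY
# Posts the synthetic alert to the Alertmanager at BASE_URL.
send_test_alert() {
    build_test_alert "$2" "$3" "$4" \
        | curl -fsS -X POST -H 'Content-Type: application/json' \
               --data-binary @- "$1/api/v2/alerts"
}
```

Example call: send_test_alert http://192.168.10.111:9093 TestAlert test-host:9100 1
With the route above, the email should go out after the `group_wait` delay. Because the alert is not refreshed, Alertmanager treats it as resolved once `resolve_timeout` has passed.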
Improve the alert template
Create a template file:
[root@Prometheus ~]# vim /usr/local/alertmanager/email.tmpl
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alert source: prometheus_alert <br>
Severity: level {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt }} <br>
=========end==========<br>
{{ end }}
{{ end }}
Edit the configuration file to use the template:
[root@Prometheus ~]# vim /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'        # send mail through the 163 SMTP server
  smtp_from: 'muyaobin@163.com'            # sender; use your own 163 address
  smtp_auth_username: 'muyaobin@163.com'   # same as smtp_from
  smtp_auth_password: 'LKPZVCYSLHVEHGRI'   # authorization code of your 163 mailbox
  smtp_require_tls: false                  # do not require TLS
templates:
  - '/usr/local/alertmanager/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h   # resend a still-firing alert every hour
  receiver: 'email'     # must match a name under receivers
receivers:
  - name: 'email'
    email_configs:
      - to: 'muyaobin@163.com'
        html: '{{ template "email.to.html" . }}'   # render the email body from the template
        send_resolved: true                        # also send an email when the alert is resolved
inhibit_rules:   # alert inhibition rules
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Check the alertmanager.yml configuration file, then restart the service:
[root@Prometheus ~]# /usr/local/alertmanager/amtool check-config /usr/local/alertmanager/alertmanager.yml
[root@Prometheus ~]# systemctl restart alertmanager
Simulate an outage alert:
Killing node_exporter causes an alert email to be sent.
Modify the template to add recovery information:
[root@Prometheus ~]# vim /usr/local/alertmanager/email.tmpl
{{ define "email.to.html" }}
{{ if gt (len .Alerts.Firing) 0 }}{{ range .Alerts.Firing }}
@Alert: <br>
Alert source: prometheus_alert <br>
Severity: level {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} (stopped working) <br>
Triggered at: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }} <br>
{{ end }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}{{ range .Alerts.Resolved }}
@Recovery: <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} is back to normal <br>
Recovered at: {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }} <br>
{{ end }}
{{ end }}
{{ end }}
Simulate an outage alert:
Killing node_exporter causes an alert email to be sent.
Restarting node_exporter causes a recovery email to be sent.