I. Deploying Alertmanager
1. Download the package
This is a binary installation. Package download page: https://prometheus.io/download/
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
2. Extract and move the package
tar -xf alertmanager-0.21.0.linux-amd64.tar.gz
mv alertmanager-0.21.0.linux-amd64 /usr/local/alertmanager
3. Create the data directory and a service user
mkdir /usr/local/alertmanager/data
useradd prometheus
chown -R prometheus:prometheus /usr/local/alertmanager/
4. Manage the service with systemd
vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
Reload systemd so that it picks up the new unit, then start the service:
systemctl daemon-reload
systemctl start alertmanager.service
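To confirm that the service actually came up, one option is to poll Alertmanager's `/-/healthy` endpoint. The following is a minimal sketch, not part of the original setup: the function name `wait_for_alertmanager` is hypothetical, and the base URL assumes Alertmanager is listening on its default port 9093 on the local machine.

```shell
# Poll Alertmanager's /-/healthy endpoint until it answers or the retries run out.
# $1: base URL (assumed default: http://localhost:9093)
# $2: number of attempts, one second apart (default: 10)
wait_for_alertmanager() {
  base_url="${1:-http://localhost:9093}"
  attempts="${2:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors, --max-time bounds each try
    if curl -fsS --max-time 2 "${base_url}/-/healthy" >/dev/null 2>&1; then
      echo "healthy"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "unreachable"
  return 1
}

# Example: wait_for_alertmanager http://localhost:9093 10
```

If the function reports "unreachable", `systemctl status alertmanager.service` and `journalctl -u alertmanager.service` are the usual places to look next.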
II. Configuring Alertmanager to send email
1. Edit the Alertmanager configuration file
vim alertmanager.yml
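# Global settings
global:
  resolve_timeout: 5m # timeout, 5 minutes by default
  # The next 5 lines configure the SMTP service
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'XXXX@163.com'
  smtp_auth_username: 'XXXX@163.com'
  smtp_auth_password: 'JEYIXXXXXXAUO'
  smtp_require_tls: false
# Routing
route:
  group_by: ['alertname'] # labels used to group alerts
  group_wait: 10s # how long to wait before the first notification for a group
  group_interval: 10s # how long to wait before sending an updated notification for a group
  repeat_interval: 30m # how often a still-firing alert is re-sent
  receiver: 'email' # default receiver
# Receivers
receivers:
- name: 'email' # receiver name
  email_configs:
  - send_resolved: true # notify on recovery
    to: '196XXXX096@qq.com' # recipient address
    html: '{{ template "email.html" . }}' # template to use; the default template is loaded if none is specified
    headers: { Subject: "[WARN] Prometheus alert email" } # mail subject
templates: # paths to the mail templates; relative paths and globs such as template/*.tmpl also work
- '/usr/local/alertmanager/template/email.tmpl'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']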
2. Create the alert template
mkdir /usr/local/alertmanager/template
vim /usr/local/alertmanager/template/email.tmpl
{{ define "email.html" }}
{{ range .Alerts }}
<pre>
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
</pre>
{{ end }}
{{ end }}
3. Edit the Prometheus configuration
1) Edit prometheus.yml
vim prometheus.yml
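# Add the Alertmanager host
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.196.11:9093
# Add the rule file path. This is a relative path, so rule.yml must sit in the same directory as prometheus.yml
rule_files:
- rule.yml
2) Create the rule.yml rules file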
vim rule.yml
groups:
- name: node_cpu
  rules:
  - alert: "High CPU usage"
    # node_exporter 0.16 and later expose this metric as node_cpu_seconds_total
    expr: (1 - (sum(increase(node_cpu{mode="idle"}[1m])) by (instance) / sum(increase(node_cpu[1m])) by (instance))) * 100 > 80
    for: 3s
    labels:
      severity: Warning
    annotations:
      summary: "{{ $labels.instance }} total CPU usage is too high"
      description: "{{ $labels.instance }} current CPU usage: {{ $value }}%"
- name: node_up
  rules:
  - alert: "Client offline"
    expr: up == 0
    for: 3s
    labels:
      severity: Error
    annotations:
      summary: "client is not online"
      description: "{{ $labels.instance }} client is offline"
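The CPU rule computes (1 - idle / total) * 100 per instance and fires when the result is above 80. The arithmetic can be checked by hand with a small sketch like the one below. The function names and the sample numbers (a 60-second window on a hypothetical single-core host) are illustrative assumptions and do not come from a real node.

```shell
# Reproduce the rule's arithmetic: (1 - idle / total) * 100.
# $1: increase of idle CPU seconds over the window
# $2: increase of CPU seconds across all modes over the same window
cpu_usage_percent() {
  awk -v idle="$1" -v total="$2" 'BEGIN { printf "%.1f\n", (1 - idle / total) * 100 }'
}

# Succeed (exit status 0) only when the usage value is above the rule's threshold of 80.
exceeds_threshold() {
  awk -v usage="$1" 'BEGIN { exit !(usage > 80) }'
}

cpu_usage_percent 15 60   # idle for 15 of 60 CPU-seconds
cpu_usage_percent 3 60    # idle for 3 of 60 CPU-seconds
```

The two calls print 75.0 and 95.0. Only the second value is above 80, so only that case would satisfy the rule's expression.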
4. Restart Alertmanager and Prometheus
systemctl restart alertmanager
systemctl restart prometheus
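A syntax error in either file keeps the service from starting, so it can help to validate both before restarting. The release tarballs ship `amtool` (with Alertmanager) and `promtool` (with Prometheus), which provide `check-config` and `check rules` subcommands. The wrapper below is a sketch: `run_check` is a hypothetical helper, and `/usr/local/prometheus` is an assumed install directory that the steps above do not state.

```shell
# Run a validation tool if it exists; report a missing tool instead of failing obscurely.
# $1: path to the checker binary; remaining arguments are passed to it unchanged
run_check() {
  tool="$1"
  shift
  if [ ! -x "$tool" ]; then
    echo "skip: $tool not found"
    return 2
  fi
  "$tool" "$@"
}

# Example usage (paths are assumptions, adjust to the actual install locations):
# run_check /usr/local/alertmanager/amtool check-config /usr/local/alertmanager/alertmanager.yml \
#   && run_check /usr/local/prometheus/promtool check rules /usr/local/prometheus/rule.yml \
#   && systemctl restart alertmanager prometheus
```

Chaining with `&&` means the restart only happens when both checks pass.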
5. Simulate a node failure and receive the email
Stop node_exporter on one node to simulate a failure:
systemctl stop node_exporter
The Alerts page in Prometheus then shows that the alert has fired.
Check the recipient's mailbox: the alert has been delivered by email.
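The mail path can also be exercised without stopping a real exporter, by posting a hand-made alert to Alertmanager's v2 API (`POST /api/v2/alerts`). This is a rough sketch under stated assumptions: the function names, the alert name `ManualTest`, and the label values are all made up for illustration, and the base URL assumes the default port 9093.

```shell
# Build a one-alert JSON payload for Alertmanager's v2 API.
# $1: instance label, $2: severity label (both illustrative values)
build_test_alert() {
  printf '[{"labels":{"alertname":"ManualTest","instance":"%s","severity":"%s"},' "$1" "$2"
  printf '"annotations":{"summary":"manual test alert","description":"sent with curl"}}]\n'
}

# POST the payload to Alertmanager.
# $1: base URL (assumed default: http://localhost:9093), $2: instance, $3: severity
send_test_alert() {
  base_url="${1:-http://localhost:9093}"
  build_test_alert "${2:-test-host:9100}" "${3:-warning}" |
    curl -fsS -X POST -H 'Content-Type: application/json' --data-binary @- "${base_url}/api/v2/alerts"
}

# Example: send_test_alert http://localhost:9093 test-host:9100 warning
```

Because the payload carries no end time, Alertmanager should treat the alert as resolved once `resolve_timeout` passes without it being re-sent, and with `send_resolved: true` a recovery mail should follow.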
III. Sending alert notifications through DingTalk
1. Download the plugin
Project page: https://github.com/timonwong/prometheus-webhook-dingtalk/releases
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
2. Extract into the installation directory
tar xf /root/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz -C /usr/local/alertmanager
cd /usr/local/alertmanager && mv prometheus-webhook-dingtalk-1.4.0.linux-amd64/ prometheus-webhook-dingtalk
3. Obtain a Webhook URL from DingTalk
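Once the robot exists, its webhook URL can be checked directly before anything else is wired to it. DingTalk custom robots accept a JSON body with `msgtype` and a matching content object. The sketch below sends a plain text message. The function names are hypothetical, the token in the usage line is a placeholder, and the message text is not JSON-escaped, so it should not contain quotes or backslashes.

```shell
# Build a DingTalk text-message payload.
# $1: message text (this sketch does not JSON-escape it)
build_dingtalk_text() {
  printf '{"msgtype":"text","text":{"content":"%s"}}\n' "$1"
}

# POST the payload to the robot.
# $1: full webhook URL including access_token, $2: message text
send_dingtalk_text() {
  build_dingtalk_text "$2" |
    curl -fsS -H 'Content-Type: application/json' --data-binary @- "$1"
}

# Example (placeholder token):
# send_dingtalk_text 'https://oapi.dingtalk.com/robot/send?access_token=XXXX' 'Prometheus test message'
```

Depending on the security setting chosen when the robot was created (for example a required keyword), DingTalk may reject messages that do not satisfy it, so the test text may need to include that keyword.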
4. Start webhook-dingtalk
./prometheus-webhook-dingtalk --ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token=75b269eeXXXXXXX19ac0aXXXXXe2bb87ef2e7f2f3cbXXXXXXX57fd"
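Started this way, the process runs in the foreground and stops when the shell exits. One way to keep it running is a systemd unit modelled on the Alertmanager unit from part I. The helper below only prints such a unit to stdout. It is a sketch: the function name is hypothetical, the command line is copied from the step above, and running as the `prometheus` user assumes that user can read and execute the extracted files.

```shell
# Print a systemd unit for prometheus-webhook-dingtalk to stdout.
# $1: install directory, $2: DingTalk webhook URL (including access_token)
print_dingtalk_unit() {
  cat <<EOF
[Unit]
Description=prometheus-webhook-dingtalk
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=$1/prometheus-webhook-dingtalk --ding.profile="webhook1=$2"
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Example (placeholder token):
# print_dingtalk_unit /usr/local/alertmanager/prometheus-webhook-dingtalk \
#   'https://oapi.dingtalk.com/robot/send?access_token=XXXX' \
#   > /usr/lib/systemd/system/prometheus-webhook-dingtalk.service
```

After writing the file, `systemctl daemon-reload` followed by `systemctl start prometheus-webhook-dingtalk.service` would bring the service up in the same way as Alertmanager.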
5. Edit alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 30m
  receiver: 'webhook' # the receiver to use
receivers: # receiver settings
- name: 'webhook'
  webhook_configs:
  - url: http://localhost:8060/dingtalk/webhook1/send
    send_resolved: true # notify on recovery
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
6. Simulate a node failure and receive the DingTalk bot message
Stop node_exporter on one node to simulate a failure:
systemctl stop node_exporter
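The Alerts page in Prometheus then shows that the alert has fired.
Check DingTalk: the alert has been pushed to the DingTalk group.
Start node_exporter on the node again to simulate recovery: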
systemctl start node_exporter
Check DingTalk again: the recovery notice has been pushed to the DingTalk group.

