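Preface: Yesterday's post gave a rough overview of how to set up Prometheus + Grafana. Today's post continues with how to configure alerting in Prometheus, which is handled through Alertmanager.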
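1. Installing Alertmanager
Alertmanager is installed in much the same way as an exporter: unpack the archive, edit the configuration file, and start it. Download link: https://github.com/prometheus/alertmanager/releases/download/v0.15.1/alertmanager-0.15.1.linux-amd64.tar.gz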
2. Configuration file
Alertmanager's configuration file sits in the unpacked directory. On my machine the path is /usr/local/alertmanager/alertmanager.yml, and its contents are as follows:
[root@db-monitor-01 alertmanager]# cat alertmanager.yml
global:
  smtp_smarthost: 'xxx:25'
  smtp_from: 'xxx@8531.cn'
  smtp_auth_username: 'xxx@8531.cn'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
templates:
  - '/usr/local/alertmanager/template/*.tmpl'
route:
  group_by: ['alertname']
  repeat_interval: 1m
  receiver: xucl
receivers:
- name: 'xucl'
  email_configs:
  - to: 'xxx@8531.cn'
    html: '{{ template "alert.html" . }}'
    headers: { Subject: " {{ .CommonAnnotations.summary }}" }
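To make the group_by: ['alertname'] setting more concrete, here is a minimal Python sketch of the grouping idea: alerts that share the same values for the group_by labels end up in the same notification. The function name group_alerts and the sample alerts are hypothetical, and this only illustrates the concept; it is not how Alertmanager is actually implemented.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # In this sketch a missing label is treated as an empty string.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical sample alerts (instances and addresses are made up).
sample_alerts = [
    {"labels": {"alertname": "NodeMemoryUsage", "instance": "10.0.0.1:9100"}},
    {"labels": {"alertname": "NodeMemoryUsage", "instance": "10.0.0.2:9100"}},
    {"labels": {"alertname": "InstanceDown", "instance": "10.0.0.3:9100"}},
]

grouped = group_alerts(sample_alerts, ["alertname"])
for key, members in grouped.items():
    print(key, [a["labels"]["instance"] for a in members])
```

With this grouping, the two NodeMemoryUsage alerts would be batched together, while InstanceDown forms its own group.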
3. Alerting rules
memory_over.yml
[root@db-monitor-01 rules]# cat /usr/local/prometheus/rules/memory_over.yml
groups:
- name: NodeMemoryUsage
  rules:
  - alert: NodeMemoryUsage
    expr: round((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100) > 80
    for: 1m
    labels:
      user: xucl
    annotations:
      summary: "{{$labels.instance}}: High Memory usage detected"
      description: "{{$labels.instance}}: Memory usage is above 80% (current value is:{{ $value }})"
      value: "{{ $value }}"
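The arithmetic inside expr is easier to follow with real numbers. The sketch below redoes the same calculation in Python for a made-up host; the function name memory_usage_percent and the sample figures are my own assumptions, not values from a real machine.

```python
import math


def memory_usage_percent(total, free, buffers, cached):
    """Mirror the rule: (total - (free + buffers + cached)) / total * 100, rounded."""
    used = total - (free + buffers + cached)
    # PromQL's round() resolves ties upward, so floor(x + 0.5) is used here
    # instead of Python's built-in round(), which rounds ties to even.
    return math.floor(used / total * 100 + 0.5)


GIB = 1024 ** 3
# Hypothetical host: 16 GiB total, 1 GiB free, 0.5 GiB buffers, 1 GiB cached.
usage = memory_usage_percent(16 * GIB, 1 * GIB, GIB // 2, 1 * GIB)
print(usage, usage > 80)
```

For this host 13.5 of 16 GiB count as used, which is 84.375% and rounds to 84, so the > 80 condition holds.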
node_down.yml
[root@db-monitor-01 rules]# cat /usr/local/prometheus/rules/node_down.yml
groups:
- name: InstanceDown
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      user: xucl
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
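Both rules use for: 1m, which means the condition has to hold continuously for a minute before the alert actually fires; until then it is only pending. Here is a small Python sketch of that behaviour. The function alert_states and the 15-second evaluation interval are assumptions for illustration, and the real evaluation loop in Prometheus has more details than this.

```python
def alert_states(samples, for_seconds):
    """Return (timestamp, state) pairs for a series of rule evaluations.

    samples: list of (timestamp_seconds, condition_is_true), in time order.
    """
    states = []
    active_since = None  # when the condition first became true in the current run
    for ts, is_true in samples:
        if not is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append((ts, "inactive"))
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append((ts, "firing" if held >= for_seconds else "pending"))
    return states


# Hypothetical evaluations every 15 s: the target is down from t=15 onwards.
samples = [(0, False)] + [(t, True) for t in range(15, 106, 15)]
states = alert_states(samples, for_seconds=60)
for ts, state in states:
    print(ts, state)
```

In this run the alert is pending from t=15 to t=60 and switches to firing at t=75, once the condition has held for 60 seconds.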
4. Modifying the Prometheus configuration file
Append the following to the end of the existing prometheus.yml:
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]
rule_files:
  - "rules/memory_over.yml"
  - "rules/node_down.yml"
5. Template file
[root@db-monitor-01 template]# cat /usr/local/alertmanager/template/alert.tmpl
{{ define "alert.html" }}
<style type="text/css">
table
{
border-collapse: collapse;
margin: 0 auto;
text-align: center;
}
table td, table th
{
border: 1px solid #cad9ea;
color: #666;
height: 30px;
}
table thead th
{
background-color: #CCE8EB;
width: 100px;
}
table tr:nth-child(odd)
{
background: #fff;
}
table tr:nth-child(even)
{
background: #F5FAFA;
}
</style>
<table width="90%" class="table">
<tr><td>Alert</td>
<td>Host</td>
<td>Alert value</td>
<td>Start time</td>
</tr>
{{ range $i, $alert := .Alerts }}
<tr><td>{{ index $alert.Labels "alertname" }}</td>
<td>{{ index $alert.Labels "instance" }}</td>
<td>{{ index $alert.Annotations "value" }}</td>
<td>{{ $alert.StartsAt }}</td>
</tr>
{{ end }}
</table>
{{ end }}
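To see roughly what the range loop in this template produces, here is a Python sketch that builds the same table rows from a list of alerts. The function render_rows and the sample data are hypothetical, and the real rendering is done by Alertmanager's Go templates, not by this code. One thing it makes visible: the InstanceDown rule defines no value annotation, so I would expect that cell to come out empty for such alerts.

```python
from html import escape


def render_rows(alerts):
    """Build one <tr> per alert with the same four cells as the template."""
    rows = []
    for alert in alerts:
        cells = [
            alert["labels"].get("alertname", ""),
            alert["labels"].get("instance", ""),
            alert["annotations"].get("value", ""),  # empty if the rule sets no "value"
            alert["startsAt"],
        ]
        tds = "".join(f"<td>{escape(str(c))}</td>" for c in cells)
        rows.append(f"<tr>{tds}</tr>")
    return "\n".join(rows)


# Hypothetical alerts; timestamps and addresses are made up.
sample = [
    {
        "labels": {"alertname": "NodeMemoryUsage", "instance": "10.0.0.1:9100"},
        "annotations": {"value": "84"},
        "startsAt": "2018-08-01T10:00:00Z",
    },
    {
        "labels": {"alertname": "InstanceDown", "instance": "10.0.0.3:9100"},
        "annotations": {},
        "startsAt": "2018-08-01T10:05:00Z",
    },
]

rendered = render_rows(sample)
print(rendered)
```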
6. Start and test
Start Alertmanager first, then start Prometheus:
./alertmanager --log.level=debug
./prometheus --config.file=prometheus.yml --storage.tsdb.path="/storage/data" --storage.tsdb.retention=30d
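If you want to check the mail path without waiting for a real rule to fire, one option is to push a hand-made alert straight to Alertmanager over HTTP. The sketch below is a minimal version of that idea using only the Python standard library. The helper names, the alert labels, and the /api/v1/alerts path are assumptions on my part: that path matches the v1 API of releases from this era, while newer Alertmanager releases expose /api/v2/alerts instead, so adjust it to your version.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(alertname, instance, summary):
    """Build a one-element alert list in the JSON shape Alertmanager accepts."""
    return [{
        "labels": {"alertname": alertname, "instance": instance},
        "annotations": {"summary": summary, "value": "n/a"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def send_alerts(alerts, base_url="http://localhost:9093", path="/api/v1/alerts"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


# Hypothetical test alert; nothing is sent until send_alerts() is called.
payload = build_test_alert("TestAlert", "localhost:9100", "Test alert from script")
print(json.dumps(payload, indent=2))
# Uncomment once Alertmanager is running on localhost:9093:
# print(send_alerts(payload))
```

The summary annotation is set because the mail subject in the configuration above is taken from .CommonAnnotations.summary.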
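Log in to the Prometheus web UI to check.
You can see that the alerting rules have taken effect.
Next, take a look at the alerts.
You can try lowering the threshold a little to trigger one. The alert email I received looks like this: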