### I. Three AlertManager concepts

#### 1. Grouping

Grouping is how AlertManager collects alerts of the same kind into one group and merges them into a single notification.
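As a rough illustration of the idea, the sketch below buckets alerts by the values of their `group_by` labels, so that each bucket would become one notification. It is a simplified model and not AlertManager's implementation; the alert names, label values, and the `group_alerts` helper are made up for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Assumption of this sketch: a missing label counts as an empty string.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts, each represented only by its label set.
alerts = [
    {"alertname": "HostDown", "cluster": "c1", "instance": "10.0.0.1"},
    {"alertname": "HostDown", "cluster": "c1", "instance": "10.0.0.2"},
    {"alertname": "DiskFull", "cluster": "c1", "instance": "10.0.0.1"},
]

groups = group_alerts(alerts, ["alertname", "cluster"])
for key, members in groups.items():
    print(key, "->", len(members), "alert(s) merged into one notification")
```

The two `HostDown` alerts share the same `alertname` and `cluster`, so they land in one group, while `DiskFull` forms its own.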

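#### 2. Inhibition

Inhibition suppresses the follow-on alerts caused by an alert that has already been sent.
For example, once a host-level alert has fired, the "service down" alerts for services on that host should be suppressed.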

#### 3. Silences

Silences are a simple mechanism for muting alerts by label: if an alert's labels match a silence, no notification is sent for it.
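The label matching behind a silence can be sketched as follows. This is a minimal model that assumes equality matchers only; the `is_silenced` helper and the label values are hypothetical.

```python
def is_silenced(alert_labels, silences):
    """Return True if every matcher of at least one silence matches the alert."""
    for matchers in silences:
        if all(alert_labels.get(name) == value for name, value in matchers.items()):
            return True
    return False


# One hypothetical silence: mute warnings coming from a single instance.
silences = [{"instance": "10.0.0.1", "severity": "warning"}]

muted = {"alertname": "HighLoad", "instance": "10.0.0.1", "severity": "warning"}
not_muted = {"alertname": "HighLoad", "instance": "10.0.0.2", "severity": "warning"}

print(is_silenced(muted, silences))
print(is_silenced(not_muted, silences))
```

A silence only applies when all of its matchers hold, which is why the alert from the second instance still notifies.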

  • Silences are configured in the web UI, whereas grouping and inhibition are configured in the configuration file.

### II. Route matching rules

#### 1. Route example

    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1m
      receiver: 'mail'  # default route: alerts that match none of the routes below go to mail
      routes:
        - receiver: 'test-1'  # regex match on one label: {test1="mysql1"} or {test1="db"} goes to test-1
          group_wait: 10s
          match_re:
            test1: mysql1|db
        - receiver: 'test-2'  # alerts labelled {severity="warning"} go to test-2
          match:
            severity: warning
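  • With only a few routes, the matching rules are easy to write by hand; with many routes they quickly become confusing. The resulting behaviour can be checked in the official routing tree editor.

  • Paste the alertmanager.yml configuration into the editor and click Draw Routing Tree to draw the tree. To see how an alert is routed, enter the labels it carries, for example {test1="db"}, and click Match Label Set to see which receiver it is sent to.
  • Routing tree editor: https://www.prometheus.io/webtools/alerting/routing-tree-editor/
  • The routing tree for the example above looks like this:

[Image: 企业微信截图_1619509728325.png]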
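The matching that the editor visualises for this example can be approximated in a few lines. The sketch assumes the default behaviour in which the first matching child route wins and an alert that matches no child falls back to the root receiver; regular expressions are treated as anchored, so the whole label value has to match. The `ROOT` dictionary mirrors the example configuration, and the helper names are invented for the sketch.

```python
import re


def matches(route, labels):
    """True if every match / match_re condition of the route holds."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        # The regex must cover the whole label value, hence fullmatch.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def pick_receiver(root, labels):
    """Return the receiver of the first matching child, else the root receiver."""
    for child in root.get("routes", []):
        if matches(child, labels):
            return child["receiver"]
    return root["receiver"]


ROOT = {
    "receiver": "mail",
    "routes": [
        {"receiver": "test-1", "match_re": {"test1": "mysql1|db"}},
        {"receiver": "test-2", "match": {"severity": "warning"}},
    ],
}

for labels in ({"test1": "db"}, {"severity": "warning"}, {"severity": "critical"}):
    print(labels, "->", pick_receiver(ROOT, labels))
```

An alert that carries both `test1="mysql1"` and `severity="warning"` goes to `test-1`, because that route is listed first.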

#### 2. Multi-branch routes

  • Routes can be made more elaborate by subdividing them further.
  • The routes above are essentially single-branch and can still be followed without the official routing tree editor, but large route trees are much harder to trace by hand, for example:

```
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'mail'
  routes:
    - receiver: 'test-1'
      group_wait: 10s
      match_re:
        test1: mysql1
      routes:
        - receiver: 'test-1-1'
          match:
            status: prod
        - receiver: 'test-1-2'
          match:
            status: dev
        - receiver: 'test-1-3'
          match:
            status: test
    - receiver: 'test-2'
      match:
        severity: warning
      routes:
        - receiver: 'test-2-1'
          match:
            status: prod
        - receiver: 'test-2-2'
          match:
            status: dev
        - receiver: 'test-2-3'
          match:
            status: test
    - receiver: 'test-3'
      group_wait: 10s
      match_re:
        test3: mysql3
    - receiver: 'test-4'
      group_wait: 10s
      match_re:
        test4: mysql4
```
  • Routing tree:

![image.png](https://cdn.nlark.com/yuque/0/2021/png/1682072/1619517646244-3eb37fe8-2902-4bf5-a5cb-8d482d975313.png#clientId=u5a45c9cb-42c2-4&from=paste&height=604&id=ufbd6f814&margin=%5Bobject%20Object%5D&name=image.png&originHeight=1207&originWidth=1286&originalType=binary&size=64803&status=done&style=none&taskId=ud76bc652-2eef-418d-a939-769d604dc90&width=643)
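To follow an alert through a nested tree like this one, a recursive walk is enough. The sketch below returns the chain of receivers visited; the last entry is the one that would be notified. It assumes the first matching child wins at each level and that an alert matching no child stays at the current node. `TREE` mirrors the multi-branch configuration, and `env_children` is a hypothetical helper that builds the prod/dev/test sub-routes.

```python
import re


def matches(route, labels):
    """True if all match (exact) and match_re (anchored regex) conditions hold."""
    exact = all(labels.get(k, "") == v for k, v in route.get("match", {}).items())
    regex = all(
        re.fullmatch(p, labels.get(k, "")) for k, p in route.get("match_re", {}).items()
    )
    return exact and regex


def resolve(route, labels):
    """Descend into the first matching child and return the receivers visited."""
    path = [route["receiver"]]
    for child in route.get("routes", []):
        if matches(child, labels):
            return path + resolve(child, labels)
    return path


def env_children(prefix):
    """Build the three status sub-routes: <prefix>-1 (prod), -2 (dev), -3 (test)."""
    return [
        {"receiver": f"{prefix}-{i}", "match": {"status": status}}
        for i, status in enumerate(["prod", "dev", "test"], start=1)
    ]


TREE = {
    "receiver": "mail",
    "routes": [
        {"receiver": "test-1", "match_re": {"test1": "mysql1"}, "routes": env_children("test-1")},
        {"receiver": "test-2", "match": {"severity": "warning"}, "routes": env_children("test-2")},
        {"receiver": "test-3", "match_re": {"test3": "mysql3"}},
        {"receiver": "test-4", "match_re": {"test4": "mysql4"}},
    ],
}

print(resolve(TREE, {"test1": "mysql1", "status": "dev"}))
print(resolve(TREE, {"test1": "mysql1", "status": "staging"}))
```

The second call stops at `test-1`: the alert matches that branch, but none of its `status` sub-routes, so the parent route's receiver is used.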
<a name="vlrFV"></a>
### III. Receivers
<a name="iNk0L"></a>
#### 1. Receiver configuration example

```
templates:
  - '/data/server/alertmanage/email.tmpl'

receivers:
  - name: 'mail'
    email_configs:
      - to: 'all@test.com'
        send_resolved: true
        html: '{{ template "email.html" . }}'
  - name: 'test-1'
    email_configs:
      - to: 'ops@test.com'
        send_resolved: true
        html: '{{ template "email.html" . }}'
  - name: 'test-2'
    email_configs:
      - to: 'web@test.com,web1@test.com'  # separate multiple recipients with commas
        send_resolved: true  # also send a notification when the alert is resolved
        html: '{{ template "email.html" . }}'  # template used for the message body
```
  • Whichever receiver the routes above select for an alert is the receiver used to send its notification.
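Because routes refer to receivers by name, it helps to check that every name used in the route tree is actually defined under `receivers`. The sketch below does this on a configuration that has already been loaded into plain dictionaries; `CONFIG` is a trimmed, illustrative version of the examples, and the helper names are invented.

```python
def referenced_receivers(route):
    """Collect every receiver name used anywhere in the route tree."""
    names = set()
    if "receiver" in route:
        names.add(route["receiver"])
    for child in route.get("routes", []):
        names |= referenced_receivers(child)
    return names


def undefined_receivers(config):
    """Receiver names used by routes but missing from the receivers list."""
    defined = {receiver["name"] for receiver in config.get("receivers", [])}
    return referenced_receivers(config["route"]) - defined


CONFIG = {
    "route": {
        "receiver": "mail",
        "routes": [
            {"receiver": "test-1", "match_re": {"test1": "mysql1|db"}},
            {"receiver": "test-2", "match": {"severity": "warning"}},
            {"receiver": "test-3", "match_re": {"test3": "mysql3"}},
        ],
    },
    "receivers": [{"name": "mail"}, {"name": "test-1"}, {"name": "test-2"}],
}

print(sorted(undefined_receivers(CONFIG)))
```

Here `test-3` is reported, since the receiver list only defines `mail`, `test-1`, and `test-2`.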

#### 2. Template configuration example

    {{ define "email.html" }}
    {{ range .Alerts }}
    =========start==========<br>
    Alert source: prometheus_alert <br>
    Severity: {{ .Labels.severity }} <br>
    Alert name: {{ .Labels.alertname }} <br>
    Instance: {{ .Labels.instance }} <br>
    Summary: {{ .Annotations.summary }} <br>
    Description: {{ .Annotations.description }} <br>
    Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    =========end==========<br>
    {{ end }}
    {{ end }}

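Two points to note:

  • 1. In {{ .StartsAt.Format "2006-01-02 15:04:05" }}, the layout string must be written with the components of Go's reference time, Mon Jan 2 15:04:05 MST 2006. Go does not use placeholders such as YYYY or HH; each component of the reference time stands for the corresponding field, so a layout built from any other date is not interpreted as intended and the output is wrong. The reference time was chosen because its fields read 1 2 3 4 5 6 when written as 01/02 03:04:05PM '06. AlertManager templates are Go templates, so every time format in them follows this convention.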

  • 2. The values of the template variables come from the alerting rule. Take the following Prometheus rule:

    - alert: "内存使用率过高a"  # "memory usage too high"
      expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.9  # fires when available memory is below 90% of total
      for: 1m
      labels:
        severity: warning
        test1: mysql1
      annotations:
        summary: High memory usage
        description: "Memory usage is high"

  • With this rule, {{ .Labels.severity }} evaluates to warning and {{ .Annotations.summary }} evaluates to High memory usage.
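To make the mapping between rule and template concrete, here is a small Python approximation of what the template produces for one alert. It is only a sketch: AlertManager renders Go templates, not Python, and the `instance` value and start time are invented. The `strftime` pattern `%Y-%m-%d %H:%M:%S` plays the role of the Go layout `"2006-01-02 15:04:05"`.

```python
from datetime import datetime

# Python counterpart of the Go layout "2006-01-02 15:04:05".
PY_LAYOUT = "%Y-%m-%d %H:%M:%S"


def render_alert(alert):
    """Approximate the email.html template for a single alert."""
    labels, annotations = alert["labels"], alert["annotations"]
    lines = [
        "=========start==========",
        f"Severity: {labels['severity']}",
        f"Alert name: {labels['alertname']}",
        f"Instance: {labels['instance']}",
        f"Summary: {annotations['summary']}",
        f"Description: {annotations['description']}",
        f"Started at: {alert['starts_at'].strftime(PY_LAYOUT)}",
        "=========end==========",
    ]
    return "<br>\n".join(lines) + "<br>"


alert = {
    # severity and test1 come from the rule's labels; alertname is the rule's alert name.
    # The instance value is hypothetical; it comes from the series that triggered the rule.
    "labels": {
        "alertname": "内存使用率过高a",
        "severity": "warning",
        "test1": "mysql1",
        "instance": "10.0.0.1:9100",
    },
    "annotations": {"summary": "High memory usage", "description": "Memory usage is high"},
    "starts_at": datetime(2021, 4, 27, 16, 30, 0),  # hypothetical firing time
}

output = render_alert(alert)
print(output)
```

Labels set under `labels:` in the rule show up as `.Labels.*`, and entries under `annotations:` show up as `.Annotations.*`.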

### IV. inhibit_rules

#### 1. Inhibit configuration example

    inhibit_rules:
      # node-exporter: level 1 inhibits level 2
      - source_match:
          job: 'node-exporter'
          level: '1'
        target_match:
          job: 'node-exporter'
          level: '2'
        equal: ['metric', 'instance']
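  • In this example, within the node-exporter job, a level 1 alert inhibits level 2 alerts. The equal field restricts this to alerts whose metric and instance labels have the same values.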
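The rule can be read as a three-part check, sketched below: the target alert matches `target_match`, some other firing alert matches `source_match`, and both agree on every label listed in `equal`. This is a simplified model with exact matching only; the `metric` and `instance` values are hypothetical, and treating a missing label as an empty value is an assumption of the sketch.

```python
def label_match(labels, required):
    """True if the alert carries every required label with the given value."""
    return all(labels.get(name) == value for name, value in required.items())


def is_inhibited(target, firing, rule):
    """True if some firing source alert inhibits the target under the rule."""
    if not label_match(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not label_match(source, rule["source_match"]):
            continue
        # Assumption: a label missing on both sides counts as equal (empty == empty).
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"job": "node-exporter", "level": "1"},
    "target_match": {"job": "node-exporter", "level": "2"},
    "equal": ["metric", "instance"],
}

level1 = {"job": "node-exporter", "level": "1", "metric": "cpu", "instance": "10.0.0.1"}
level2_same = {"job": "node-exporter", "level": "2", "metric": "cpu", "instance": "10.0.0.1"}
level2_other = {"job": "node-exporter", "level": "2", "metric": "cpu", "instance": "10.0.0.2"}
firing = [level1, level2_same, level2_other]

print(is_inhibited(level2_same, firing, RULE))
print(is_inhibited(level2_other, firing, RULE))
```

Only the level 2 alert on the same instance and metric is muted; the one on `10.0.0.2` still notifies because no level 1 alert shares its `instance`.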