
```bash
wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
tar zxf alertmanager-0.20.0.linux-amd64.tar.gz && mv alertmanager-0.20.0.linux-amd64 /opt/alertmanager && rm -rf alertmanager-0.20.0.linux-amd64*
# Alertmanager stores its data on local disk, by default under data/,
# so create that directory before starting Alertmanager.
mkdir -p /opt/alertmanager/data
```

Start alertmanager at boot

```bash
cat >/usr/lib/systemd/system/alertmanager.service<<EOF
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/docs
Wants=network-online.target
After=network-online.target
[Service]
User=root
Group=root
Type=simple
ExecStart=/opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml --storage.path=/opt/alertmanager/data
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd, enable the unit at boot, then start it
systemctl daemon-reload
systemctl enable alertmanager
systemctl start alertmanager && systemctl status alertmanager
```

Update the prometheus configuration to load alertmanager and the alertmanager rules

```yaml
# prometheus.yml
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  scrape_timeout: 10s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 172.21.1.30:9093
rule_files:
  - 'rules/*.rules'
scrape_configs:
  - job_name: 'kxl_promethes'
    file_sd_configs:
      - files:
          - /opt/prometheus/sd_config/data.yml
        refresh_interval: 5s
  - job_name: 'kxl_docker'
    file_sd_configs:
      - files:
          - /opt/prometheus/sd_config/docker.yml
        refresh_interval: 5s
  - job_name: 'kxl_vm'
    file_sd_configs:
      - files:
          - /opt/prometheus/sd_config/vm.yml
        refresh_interval: 5s
  - job_name: 'kxl_mysql'
    file_sd_configs:
      - files:
          - /opt/prometheus/sd_config/mysql.yml
        refresh_interval: 5s
```

```bash
# Restart prometheus
systemctl restart prometheus
```
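A restart with a broken configuration leaves Prometheus down, so it can be worth validating the file first. The sketch below wraps `promtool check config`, which also checks the rule files the configuration references. It assumes `promtool` (shipped in the Prometheus release tarball) is on `PATH` or in `/opt/prometheus`, and that the configuration lives at `/opt/prometheus/prometheus.yml`. Both paths and the function name `check_prometheus_config` are assumptions, not something this setup guarantees.

```bash
# Validate prometheus.yml and the rule files it references before a restart.
# Assumed locations: adjust to where Prometheus is actually installed.
check_prometheus_config() {
  cfg="${1:-/opt/prometheus/prometheus.yml}"
  tool="$(command -v promtool || true)"
  [ -n "$tool" ] || tool=/opt/prometheus/promtool
  if [ ! -x "$tool" ]; then
    echo "promtool not found; skipping check" >&2
    return 2
  fi
  "$tool" check config "$cfg"
}

# Only suggest the restart when the check passes.
if check_prometheus_config; then
  echo "config OK, safe to run: systemctl restart prometheus"
else
  echo "config check failed or was skipped; not restarting" >&2
fi
```

Returning a distinct status when `promtool` is missing keeps "not checked" separate from "checked and invalid".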

Create the alert rules

node rules
```yaml
mkdir -p /opt/prometheus/rules

cat >/opt/prometheus/rules/node.rules<<'EOF'
groups:
- name: kxl_Instances
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    # Prometheus templates apply here in the annotation and label fields of the alert.
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
      summary: 'Instance {{ $labels.instance }} down'
  - alert: High memory usage
    expr: 100-(node_memory_Buffers_bytes+node_memory_Cached_bytes+node_memory_MemFree_bytes)/node_memory_MemTotal_bytes*100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage is too high"
      description: "{{ $labels.instance }} of job {{ $labels.job }}: memory usage is above 80%, current value [{{ $value }}]."
  - alert: High CPU usage
    expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance, job)*100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage is too high"
      description: "{{ $labels.instance }} of job {{ $labels.job }}: CPU usage is above 80%, current value [{{ $value }}]."
EOF
```
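Because the rule files are written through a shell heredoc, the quoting of the delimiter matters. With an unquoted `EOF`, the shell expands `$labels` and `$value` before the file is written, so the Prometheus template placeholders are silently emptied. With `'EOF'`, the body is written literally. A minimal sketch of the difference, using a throwaway directory and a one-line sample rather than the real rule files:

```bash
# Work in a throwaway directory so nothing real is overwritten.
tmp="$(mktemp -d)"

# In an interactive shell $labels is normally unset; it is set to empty here
# so the demo behaves the same way under `set -u`.
labels=""

# Unquoted delimiter: the shell expands $labels before writing the file.
cat >"$tmp/unquoted.rules" <<EOF
summary: 'Instance {{ $labels.instance }} down'
EOF

# Quoted delimiter: the body is written exactly as typed.
cat >"$tmp/quoted.rules" <<'EOF'
summary: 'Instance {{ $labels.instance }} down'
EOF

unquoted="$(cat "$tmp/unquoted.rules")"
quoted="$(cat "$tmp/quoted.rules")"
echo "unquoted: $unquoted"   # placeholder reduced to {{ .instance }}
echo "quoted:   $quoted"     # placeholder preserved
rm -rf "$tmp"
```

The same applies to the `.tmpl` files, where `$i` and `$alert` would be expanded in the same way.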
MySQL rules
```yaml
cat > /opt/prometheus/rules/mysql.rules <<'EOF'
groups:
- name: MySQLStatsAlert
  rules:
  - alert: MySQL is down
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} MySQL is down"
      description: "MySQL database is down. This requires immediate action!"
  - alert: open files high
    expr: mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} open files high"
      description: "Open files is high. Please consider increasing open_files_limit."
  - alert: Read buffer size is bigger than max. allowed packet size
    expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
      description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet). This can break your replication."
  - alert: Sort buffer possibly misconfigured
    expr: mysql_global_variables_innodb_sort_buffer_size < 256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Sort buffer possibly misconfigured"
      description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
  - alert: Thread stack size is too small
    expr: mysql_global_variables_thread_stack < 196608
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is too small"
      description: "Thread stack size is too small. This can cause problems when you use Stored Language constructs, for example. A typical value for thread_stack_size is 256k."
  - alert: Used more than 80% of max connections limited
    expr: mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Used more than 80% of max connections limited"
      description: "Used more than 80% of max connections limited"
  - alert: InnoDB Force Recovery is enabled
    expr: mysql_global_variables_innodb_force_recovery != 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
      description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
  - alert: InnoDB Log File size is too small
    expr: mysql_global_variables_innodb_log_file_size < 16777216
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
      description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
  - alert: InnoDB Flush Log at Transaction Commit
    expr: mysql_global_variables_innodb_flush_log_at_trx_commit != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Flush Log at Transaction Commit"
      description: "InnoDB Flush Log at Transaction Commit is set to a value != 1. This can lead to a loss of committed transactions in case of a power failure."
  - alert: Table definition cache too small
    expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table definition cache too small"
      description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
  - alert: Table open cache too small
    expr: mysql_global_status_open_tables > mysql_global_variables_table_open_cache * 99/100
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table open cache too small"
      description: "Your Table Open Cache is possibly too small (old name Table Cache). If it is much too small this can have significant performance impacts!"
  - alert: Thread stack size is possibly too small
    expr: mysql_global_variables_thread_stack < 262144
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
      description: "Thread stack size is possibly too small. This can cause problems when you use Stored Language constructs, for example. A typical value for thread_stack_size is 256k."
  - alert: InnoDB Buffer Pool Instances is too small
    expr: mysql_global_variables_innodb_buffer_pool_instances == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Buffer Pool Instances is too small"
      description: "If you are using MySQL 5.5 and higher you should use several InnoDB Buffer Pool Instances for performance reasons. Some rules are: InnoDB Buffer Pool Instance should be at least 1 Gbyte in size. InnoDB Buffer Pool Instances you can set equal to the number of cores of your machine."
  - alert: InnoDB Plugin is enabled
    expr: mysql_global_variables_ignore_builtin_innodb == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
      description: "InnoDB Plugin is enabled"
  - alert: Binary Log is disabled
    expr: mysql_global_variables_log_bin != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Binary Log is disabled"
      description: "Binary Log is disabled. This prevents Point in Time Recovery (PiTR)."
  - alert: Binlog Cache size too small
    expr: mysql_global_variables_binlog_cache_size < 1048576
    for: 1m
    labels:
      severity: page
    annotations:
      env: "{{ $labels.env }}"
      summary: "Instance {{ $labels.instance }} Binlog Cache size too small"
      description: "Binlog Cache size is possibly too small. A value of 1 Mbyte or higher is OK."
  - alert: Binlog Statement Cache size too small
    expr: mysql_global_variables_binlog_stmt_cache_size < 1048576 and mysql_global_variables_binlog_stmt_cache_size > 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Statement Cache size too small"
      description: "Binlog Statement Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
  - alert: Binlog Transaction Cache size too small
    expr: mysql_global_variables_binlog_cache_size < 1048576
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Transaction Cache size too small"
      description: "Binlog Transaction Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
  - alert: Sync Binlog is enabled
    expr: mysql_global_variables_sync_binlog == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Sync Binlog is enabled"
      description: "Sync Binlog is enabled. This leads to higher data security but at the cost of write performance."
  - alert: IO thread stopped
    expr: mysql_slave_status_slave_io_running != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} IO thread stopped"
      description: "IO thread has stopped. This is usually because it cannot connect to the Master any more."
  - alert: SQL thread stopped
    expr: mysql_slave_status_slave_sql_running == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} SQL thread stopped"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: SQL thread stopped
    expr: mysql_slave_status_slave_sql_running != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} SQL thread stopped"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: Slave lagging behind Master
    expr: rate(mysql_slave_status_seconds_behind_master[1m]) > 30
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
      description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"
  - alert: Slave is NOT read only(Please ignore this warning indicator.)
    expr: mysql_global_variables_read_only != 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Slave is NOT read only"
      description: "Slave is NOT set to read only. You can accidentally manipulate data on the slave and get inconsistencies..."
EOF
```
Configure the alerting policy
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.exmail.qq.com:465'
  smtp_from: 'zxc@xxlaila.cn.com'
  smtp_auth_username: 'zxc@xxlaila.cn.com'
  smtp_auth_password: '123456'
  smtp_require_tls: true
  hipchat_api_url: 'https://hipchat.foobar.org/'
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'KJfj93r21389usdas0i—234jsnjkhf23sjkfjsfs' # WeChat Work Secret
  wechat_api_corp_id: 'wwa98423u9skdnkjahs' # WeChat Work CorpId

templates:
- 'template/*.tmpl' # alert message templates

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  # receiver: 'web.hook'
  receiver: default
  routes:
  - receiver: 'wechat'
    continue: true

receivers:
# - name: 'web.hook'
- name: 'default'
  email_configs:
  - to: 'cq_xxlaila@163.com'
    html: '{{ template "test.html" . }}'
    headers: { Subject: "[WARN] email" }
    send_resolved: true
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    to_user: '@all' # recipients: everyone
    to_party: '4' # ID of the receiving department
    agent_id: '1000002' # ID of the custom WeChat Work application
    corp_id: 'wwa98457kdsnkdnsadmsdnas' # WeChat Work CorpId
    message: '{{ template "test_wechat.html" . }}' # template used for the message

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
```
Alertmanager is mainly responsible for handling, in one place, the alerts that Prometheus generates, so an Alertmanager configuration generally contains the following main parts:

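- Global configuration (global): defines shared global parameters, such as the global SMTP settings and Slack settings.
- Templates (templates): defines the templates used for alert notifications, such as HTML templates and email templates.
- Alert routing (route): decides, by matching labels, how the current alert should be handled.
- Receivers (receivers): a receiver is an abstract concept. It can be a mailbox, WeChat, Slack, a webhook, and so on, and is generally used together with alert routing.
- Inhibit rules (inhibit_rules): well-chosen inhibit rules reduce the number of junk alerts.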
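The sections listed above can be sanity-checked in one pass with `amtool check-config`, which validates the configuration file and the templates it references. The sketch below assumes `amtool` (shipped in the Alertmanager release tarball) is on `PATH` or sits in `/opt/alertmanager`, and that the configuration is `/opt/alertmanager/alertmanager.yml`. The function name `check_alertmanager_config` is only illustrative.

```bash
# Validate alertmanager.yml before (re)starting the service.
# Assumed locations: adjust to where Alertmanager is actually installed.
check_alertmanager_config() {
  cfg="${1:-/opt/alertmanager/alertmanager.yml}"
  tool="$(command -v amtool || true)"
  [ -n "$tool" ] || tool=/opt/alertmanager/amtool
  if [ ! -x "$tool" ]; then
    echo "amtool not found; skipping check" >&2
    return 2
  fi
  # Validates the config file and the templates it references.
  "$tool" check-config "$cfg"
}

if check_alertmanager_config; then
  echo "alertmanager.yml looks valid"
else
  echo "check failed or was skipped" >&2
fi
```

Running the check after editing routes or receivers catches mistakes such as a route that names a receiver that is not defined.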
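Configuring the .tmpl templates
```bash
# Create the directory that holds the .tmpl templates
mkdir /opt/alertmanager/template && cd /opt/alertmanager/template

# WeChat Work
cat >test_wechat.tmpl <<'EOF'
{{ define "test_wechat.html" }}
{{ range $i, $alert := .Alerts.Firing }}
[Alert]: {{ index $alert.Labels "alertname" }}
[Environment]: {{ index $alert.Labels "env" }}
[Instance]: {{ index $alert.Labels "instance" }}
[Severity]: {{ index $alert.Labels "severity" }}
[Threshold]: {{ index $alert.Annotations "summary" }}
[Description]: {{ index $alert.Annotations "description" }}
[Start time]: {{ $alert.StartsAt }}
{{ end }}
{{ end }}
EOF

# Email alerts (the table markup and the range loop were lost from the
# original listing and are reconstructed here as an assumption)
cat >test.tmpl <<'EOF'
{{ define "test.html" }}
<table>
  <tr>
    <td>Alert</td>
    <td>Environment</td>
    <td>Instance</td>
    <td>Severity</td>
    <td>Threshold</td>
    <td>Description</td>
    <td>Start time</td>
  </tr>
  {{ range $i, $alert := .Alerts }}
  <tr>
    <td>{{ index $alert.Labels "alertname" }}</td>
    <td>{{ index $alert.Labels "env" }}</td>
    <td>{{ index $alert.Labels "instance" }}</td>
    <td>{{ index $alert.Labels "severity" }}</td>
    <td>{{ index $alert.Annotations "summary" }}</td>
    <td>{{ index $alert.Annotations "description" }}</td>
    <td>{{ $alert.StartsAt }}</td>
  </tr>
  {{ end }}
</table>
{{ end }}
EOF

# Restart alertmanager
systemctl restart alertmanager
```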
