How it works

Prometheus in Practice - Figure 1

Installation (using Linux as an example)

Download page: https://prometheus.io/download/

Install Prometheus (in a directory of your own or under /opt)

    ## Download
    wget https://github.com/prometheus/prometheus/releases/download/v2.18.1/prometheus-2.18.1.linux-amd64.tar.gz
    ## Extract
    tar xvfz prometheus-2.18.1.linux-amd64.tar.gz
    ## Enter the extracted directory
    cd prometheus-2.18.1.linux-amd64
    ## Run
    ./prometheus --config.file=prometheus.yml

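Example: open http://<server IP>:9090 in a browser on your own machine. If the page cannot be reached, check the firewall configuration. If the web UI loads, Prometheus is running.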
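The same reachability check can be done from the command line. The following is a minimal sketch, not part of the original steps: `prom_url` and `check_prometheus` are hypothetical helper names, and `/-/healthy` is the health endpoint that Prometheus serves on its web port.

```shell
# prom_url HOST [PORT] [PATH]: print the URL of a Prometheus HTTP endpoint.
# Port 9090 is the default web port used in this guide.
prom_url() {
    printf 'http://%s:%s%s\n' "$1" "${2:-9090}" "${3:-/-/healthy}"
}

# check_prometheus HOST [PORT]: succeed only if the health endpoint answers
# with a successful HTTP status within 5 seconds.
check_prometheus() {
    curl -fsS --max-time 5 "$(prom_url "$1" "${2:-9090}")" >/dev/null
}

# Example (needs a running server):
#   check_prometheus 192.168.15.167 && echo "Prometheus is reachable"
```

If `check_prometheus` fails while the process is running, the firewall is the first thing to inspect, as noted above.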

Install node_exporter (monitors server CPU, disk, network, and similar host metrics)

    ## Download (rc marks a release candidate, i.e. a pre-release build)
    wget https://github.com/prometheus/node_exporter/releases/download/v1.0.0-rc.1/node_exporter-1.0.0-rc.1.linux-amd64.tar.gz
    ## Extract
    tar xvfz node_exporter-1.0.0-rc.1.linux-amd64.tar.gz
    ## Enter the extracted directory
    cd node_exporter-1.0.0-rc.1.linux-amd64
    ## Run
    ./node_exporter
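To confirm that node_exporter is actually exporting data, its metrics page can be inspected. The sketch below assumes the exporter listens on port 9100 (the port scraped in the Prometheus configuration of this guide); `count_samples` is a hypothetical helper that counts the sample lines of one metric in the text exposition format.

```shell
# count_samples NAME: read Prometheus text-format metrics on stdin and print
# how many sample lines belong to metric NAME (comment lines are skipped).
count_samples() {
    awk -v name="$1" '
        /^#/ { next }
        {
            metric = $1
            sub(/[{].*/, "", metric)   # drop the {label="..."} part
            if (metric == name) n++
        }
        END { print n + 0 }
    '
}

# Example (needs a running exporter):
#   curl -s http://localhost:9100/metrics | count_samples node_cpu_seconds_total
```

A count of zero for a metric you expect usually means the exporter is not running or the wrong port is being queried.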

Install alertmanager (alert handling; Alibaba Cloud SMS is not officially supported)

    ## Download
    wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
    ## Extract
    tar xvfz alertmanager-0.20.0.linux-amd64.tar.gz
    ## Enter the extracted directory
    cd alertmanager-0.20.0.linux-amd64
    ## Run
    ./alertmanager --log.level=debug

Install mysqld_exporter (MySQL collector)

    ## Download
    wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz
    ## Extract
    tar xvfz mysqld_exporter-0.12.1.linux-amd64.tar.gz
    ## Enter the extracted directory
    cd mysqld_exporter-0.12.1.linux-amd64
    ## Set a temporary environment variable: username, password, IP, port
    export DATA_SOURCE_NAME='exporter:123456@(192.168.15.167:3306)/'
    ## Run
    ./mysqld_exporter
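The DATA_SOURCE_NAME value follows the pattern user:password@(host:port)/. As a small sketch, it can be assembled from its parts so that the account details live in one place; `mysql_dsn` is a hypothetical helper, and the values are the example ones used above. The password must match the one given to the exporter account when it is created in MySQL.

```shell
# mysql_dsn USER PASSWORD HOST [PORT]: print a DSN in the
# user:password@(host:port)/ form expected in DATA_SOURCE_NAME.
mysql_dsn() {
    printf '%s:%s@(%s:%s)/\n' "$1" "$2" "$3" "${4:-3306}"
}

# Build and export the DSN for the current shell session.
DATA_SOURCE_NAME=$(mysql_dsn exporter 123456 192.168.15.167)
export DATA_SOURCE_NAME
```

Because the variable is only exported in the current shell, it has to be set again in any new session before starting the exporter.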

Install redis_exporter (Redis collector)

    ## Download
    wget https://github.com/oliver006/redis_exporter/releases/download/v1.6.1/redis_exporter-v1.6.1.linux-amd64.tar.gz
    ## Extract
    tar xvfz redis_exporter-v1.6.1.linux-amd64.tar.gz
    ## Enter the extracted directory
    cd redis_exporter-v1.6.1.linux-amd64
    ## Run, passing the Redis address and password
    ./redis_exporter -redis.addr=192.168.15.167 -redis.password=123456

Install mongodb_exporter (MongoDB collector)

    ## Download
    wget https://github.com/percona/mongodb_exporter/releases/download/v0.11.0/mongodb_exporter-0.11.0.linux-amd64.tar.gz
    ## Extract
    tar xvfz mongodb_exporter-0.11.0.linux-amd64.tar.gz
    ## Enter the extracted directory
    cd mongodb_exporter-0.11.0.linux-amd64
    ## Standalone: temporary environment variables (IP, port, username, password)
    export MONGODB_URI='mongodb://192.168.15.167:27017'
    export HTTP_AUTH='admin:123456'
    ## Cluster: use this URI instead of the standalone one
    export MONGODB_URI='mongodb://mongodb_exporter:s3cr3tpassw0rd@localhost:10011'
    ## Run
    ./mongodb_exporter
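Every run command above keeps the process in the foreground, so it stops when the terminal closes. One possible way to keep the components running is a small wrapper around nohup. This is a sketch under that assumption, not the original author's setup; `start_bg` and `stop_bg` are hypothetical helper names, and the log and PID files are written to the current directory.

```shell
# start_bg NAME CMD [ARGS...]: run CMD in the background, detached from the
# terminal, logging to NAME.log and recording the process ID in NAME.pid.
start_bg() {
    name=$1
    shift
    nohup "$@" >"$name.log" 2>&1 &
    echo $! >"$name.pid"
}

# stop_bg NAME: stop a process started by start_bg and remove its PID file.
stop_bg() {
    kill "$(cat "$1.pid")" && rm -f "$1.pid"
}

# Example:
#   start_bg node_exporter ./node_exporter
#   stop_bg node_exporter
```

For a long-lived deployment a service manager is usually a better fit; the wrapper is only meant for trying the components out.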

Configuration

Setup configuration (prometheus.yml)

    # Prometheus global configuration
    global:
      scrape_interval: 60s     # How often to scrape targets (default: 1m)
      evaluation_interval: 60s # How often to evaluate rules (default: 1m)
      # scrape_timeout is set to the global default (10s).
    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['localhost:9093'] # Address Prometheus uses to reach Alertmanager, i.e. the IP and port Alertmanager listens on
    # Rule files: loaded at startup, then evaluated every evaluation_interval
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
      - "rules.yml"
    # Scrape configuration
    scrape_configs:
      # job_name is added to each time series as the job label and can be used in queries
      - job_name: 'node'
        scrape_interval: 1s # Scrape interval for this job; falls back to the global value if omitted
        static_configs: # Static target list
          - targets: ['localhost:9100'] # Address Prometheus scrapes; it becomes the instance label
            labels:
              group: 'nodes'
      - job_name: 'mysql'
        scrape_interval: 1s # Scrape interval for this job; falls back to the global value if omitted
        static_configs: # Static target list
          - targets: ['localhost:9104'] # Address Prometheus scrapes; it becomes the instance label
      - job_name: 'redis'
        scrape_interval: 15s
        static_configs:
          - targets: ['localhost:9121']
      - job_name: 'mongodb'
        scrape_interval: 15s
        static_configs:
          - targets: ['192.168.10.69:9216']
        basic_auth: # Username and password set for mongodb_exporter (HTTP_AUTH)
          username: admin
          password: '123456'
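YAML indentation mistakes are easy to make in this file, so it is worth validating it before starting or reloading Prometheus. The sketch below wraps `promtool check config`, which also checks the rule files the configuration references. It assumes the `promtool` binary from the Prometheus tarball is in the current directory; `check_prom_config` and the `PROMTOOL` variable are hypothetical names.

```shell
# PROMTOOL may point at the promtool binary shipped in the Prometheus tarball.
PROMTOOL=${PROMTOOL:-./promtool}

# check_prom_config [FILE]: validate a Prometheus configuration file (and the
# rule files it references). Returns promtool's exit status.
check_prom_config() {
    "$PROMTOOL" check config "${1:-prometheus.yml}"
}

# Example:
#   check_prom_config prometheus.yml && echo "configuration is valid"
```

After a successful check, a running Prometheus can pick up the new file by being sent SIGHUP or by being restarted.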

Alert rule configuration (rules.yml)

    groups:
      - name: node # node_exporter alerts
        rules:
          - alert: server_status # Alert name
            expr: up{group="nodes"} == 0 # Alert condition, written as a PromQL expression
            for: 15s # How long the condition must hold before the alert is sent
            labels:
              severity: warning
            annotations: # Descriptive fields that explain the alert
              summary: "Host {{ $labels.instance }} is down"
      - name: mysql # MySQL alerts
        rules:
          - alert: mysql_server_status
            expr: mysql_up{job="mysql"} == 0
            for: 10s
            annotations:
              summary: "Instance {{ $labels.instance }} MySQL is down"
              description: "MySQL database is down. This requires immediate action!"
      - name: mysql_qps
        rules:
          - alert: mysql_high_QPS
            expr: rate(mysql_global_status_questions{job="mysql"}[5m]) > 500
            for: 10s
            annotations:
              summary: "{{ $labels.instance }}: Mysql_High_QPS detected"
              description: "{{ $labels.instance }}: MySQL is handling more than 500 queries per second (current value: {{ $value }})"
      - name: mysql_connections
        rules:
          - alert: MySQL_Number_of_Connections
            expr: mysql_global_status_max_used_connections{job="mysql"} > 300
            for: 10s
            annotations:
              summary: "{{ $labels.instance }}: Mysql_number_of_Connections"
              description: "{{ $labels.instance }}: connection count is above 300"
      - name: mysql_slow
        rules:
          - alert: MySQL_slow_queries
            expr: rate(mysql_global_status_slow_queries{job="mysql"}[5m]) > 3
            for: 10s
            annotations:
              summary: "{{ $labels.instance }}: Mysql_slow_queries"
              description: "{{ $labels.instance }}: MySQL slow queries exceed 3 per second (current value: {{ $value }})"
      - name: mysql_innodb_buffer
        rules:
          - alert: MySQL_inodb_buffer
            expr: (1 - mysql_global_status_innodb_buffer_pool_reads{job="mysql"} / mysql_global_status_innodb_buffer_pool_read_requests{job="mysql"}) * 100 > 95
            for: 10s
            annotations:
              summary: "{{ $labels.instance }}: global_status_innodb_buffer_pool"
              description: "{{ $labels.instance }}: InnoDB buffer pool hit ratio is above 95%"
      - name: mysql_behind
        rules:
          - alert: Mysql_behind_master
            expr: mysql_slave_status_seconds_behind_master{job="mysql"} > 60
            for: 10s
            annotations:
              summary: "{{ $labels.instance }}: Mysql_Behind_Master"
              description: "{{ $labels.instance }}: replication lag is above 60s"
      - name: redis
        rules:
          - alert: Redis_CPU
            expr: redis_cpu_sys_seconds_total{job="redis"} + redis_cpu_user_seconds_total{job="redis"} > 80
            for: 10s
            annotations:
              summary: "{{ $labels.instance }}: Redis_CPU"
              description: "{{ $labels.instance }}: Redis CPU usage is above 80%"
      - name: redis_connectionUsage
        rules:
          - alert: Redis_ConnectionUsage
            expr: redis_connections_received_total{job="redis"}/100 > 80
            for: 10s
            annotations:
              summary: "{{ $labels.instance }}: Redis_ConnectionUsage"
              description: "{{ $labels.instance }}: connection usage is above 80%"
      - name: redis_up
        rules:
          - alert: Redis_up
            expr: redis_up{job="redis"} == 0
            for: 10s
            annotations:
              summary: "Instance {{ $labels.instance }} Redis is down"
              description: "{{ $labels.instance }}: Redis is down"
      - name: mongodb
        rules:
          - alert: MongoDB_connections_number
            expr: mongodb_connections_metrics_created_total{job="mongodb"} > 200
            for: 10s
            annotations:
              summary: "{{ $labels.instance }} MongoDB_connections_number"
              description: "{{ $labels.instance }}: connection count is above 200"
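Before relying on a rule, its expr can be tried against the running server through the Prometheus HTTP API. The following is a sketch: `prom_query`, `CURL`, and `PROM_URL` are hypothetical names, and the server is assumed to listen on localhost:9090. The `/api/v1/query` endpoint evaluates an instant query, and curl's `-G --data-urlencode` sends the expression as a URL-encoded GET parameter.

```shell
CURL=${CURL:-curl}
PROM_URL=${PROM_URL:-http://localhost:9090}

# prom_query EXPR: run an instant query against the Prometheus HTTP API and
# print the JSON response.
prom_query() {
    "$CURL" -fsS -G --data-urlencode "query=$1" "$PROM_URL/api/v1/query"
}

# Example (needs a running server):
#   prom_query 'up{group="nodes"} == 0'
```

An empty result list in the response means no series currently satisfies the condition, so the alert would not be pending or firing.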

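alertmanager configuration

    global:
      resolve_timeout: 5m # Declare an alert resolved if it has not been updated for this long
      wechat_api_corp_id: 'wwf19fbf8843e4e994' # Company ID (My Company --> CorpId, at the bottom of the page)
      wechat_api_secret: '4k_lpyXAne3i4jUQT3jX1y1r1G2KOokd7U5eyMwOLs0' # WeChat Work (WeChat Work --> custom app --> Secret)
    templates:
      - 'template/*.tmpl' # Template files
    # Routing
    route:
      group_by: ['alertname'] # Labels used to group alerts
      group_wait: 1s # How long to wait before sending the first notification for a new group
      group_interval: 5m # How long to wait before sending notifications about new alerts in a group
      repeat_interval: 60m # How often to resend a notification for alerts that are still firing. For email, do not set this too low, or the SMTP server may reject the volume of mail
      receiver: 'wechat' # Receiver to notify; must match a name under receivers below
    # Receivers
    receivers:
      - name: 'wechat' # Receiver name
        wechat_configs: # WeChat Work notification settings
          - send_resolved: true
            to_party: '1' # ID of the receiving group
            agent_id: '1000003' # (WeChat Work --> custom app --> AgentId)
            corp_id: 'wwf19fbf8843e4e994' # Company ID (My Company --> CorpId, at the bottom of the page)
            api_secret: '4k_lpyXAne3i4jUQT3jX1y1r1G2KOokd7U5eyMwOLs0' # WeChat Work (WeChat Work --> custom app --> Secret)
    # Inhibit rules suppress target alerts while a matching source alert is firing.
    # For example, when a host goes down, the services, databases, and middleware on it
    # may raise their own alerts. If those follow-up alerts add no information,
    # inhibition lets Alertmanager send only the host-down alert.
    inhibit_rules:
      - source_match: # While this alert is firing, other alerts are inhibited
          severity: 'critical'
        target_match: # Alerts to be inhibited
          severity: 'warning'
        # The labels listed here must have equal values in the source and target alerts.
        # If a label is missing from both, the target alert is still inhibited.
        equal: ['id', 'instance']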
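The Alertmanager configuration can be validated in the same spirit with `amtool check-config`, which parses the file and the templates it references. This is a sketch with assumptions: the file is taken to be named alertmanager.yml (the default that `./alertmanager` loads when no `--config.file` is given), the `amtool` binary from the Alertmanager tarball is assumed to be in the current directory, and `check_am_config` and `AMTOOL` are hypothetical names.

```shell
# AMTOOL may point at the amtool binary shipped in the Alertmanager tarball.
AMTOOL=${AMTOOL:-./amtool}

# check_am_config [FILE]: validate an Alertmanager configuration file.
# Returns amtool's exit status.
check_am_config() {
    "$AMTOOL" check-config "${1:-alertmanager.yml}"
}

# Example:
#   check_am_config alertmanager.yml && ./alertmanager --log.level=debug
```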

Alert template configuration

    {{ define "wechat.default.message" }}
    {{ range .Alerts }}
    ========start=========
    Alert source: prometheus_alert
    Severity: {{ .Labels.severity }}
    Alert type: {{ .Labels.alertname }}
    Affected host: {{ .Labels.instance }}
    Summary: {{ .Annotations.summary }}
    Details: {{ .Annotations.description }}
    Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
    =========end===========
    {{ end }}
    {{ end }}

mysqld_exporter configuration

    ## Enter the MySQL container with docker
    $ docker exec -it e009208d7844 /bin/bash
    ## Create the exporter user (run these statements in the mysql client)
    mysql> GRANT REPLICATION CLIENT, PROCESS ON *.* TO 'exporter'@'%' IDENTIFIED BY '8Wua5uNbIY9E';
    mysql> GRANT SELECT ON performance_schema.* TO 'exporter'@'%';
    mysql> FLUSH PRIVILEGES;
mongodb_exporter configuration

    ## Open the mongo shell
    $ mongo --port 10011
    ## Switch to the admin database
    > use admin
    ## Authenticate as the admin user
    > db.auth('root','123456')
    ## Create the cluster monitoring account (required when monitoring a cluster)
    > db.getSiblingDB("admin").createUser({
        user: "mongodb_exporter",
        pwd: "s3cr3tpassw0rd",
        roles: [
          { role: "clusterMonitor", db: "admin" },
          { role: "read", db: "local" }
        ]
      })

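Verification

Verify node_exporter: as soon as the node goes offline, an alert fires (through the node alert rules in the rules file).
(Screenshot: 企业微信截图_38f18970-272a-44ea-910a-0b1f2883ad09.png)
Verify MySQL: as soon as the MySQL service goes down, an alert fires and a notification is sent to WeChat Work.
(Screenshot: 企业微信截图_758c270c-2cf1-42b3-9078-bca178997bd7.png)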
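The notification path can also be exercised without taking a service down by posting a hand-made alert straight to Alertmanager. The sketch below is an assumption-laden illustration rather than part of the original verification: it uses Alertmanager's `/api/v2/alerts` endpoint on localhost:9093, and `alert_json`, `send_test_alert`, `CURL`, and `AM_URL` are hypothetical names.

```shell
CURL=${CURL:-curl}
AM_URL=${AM_URL:-http://localhost:9093}

# alert_json NAME INSTANCE SUMMARY: print a one-element JSON array describing
# an alert. Arguments must not contain double quotes or backslashes.
alert_json() {
    printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"warning"},"annotations":{"summary":"%s"}}]\n' "$1" "$2" "$3"
}

# send_test_alert NAME INSTANCE SUMMARY: post the alert to Alertmanager,
# which then routes it to the configured receiver.
send_test_alert() {
    alert_json "$@" | "$CURL" -fsS -H 'Content-Type: application/json' --data-binary @- "$AM_URL/api/v2/alerts"
}

# Example (needs a running Alertmanager):
#   send_test_alert manual_test localhost:9100 'test alert'
```

Because the alert carries no end time, Alertmanager treats it as resolved once resolve_timeout passes without the alert being posted again.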

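Function reference

rate(): rate(v range-vector) computes the per-second average rate of increase of the time series in the range vector v over the time window. It is meant for counters, such as the MySQL query counters used in the alert rules.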
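The idea behind rate() can be illustrated with a simplified calculation: the increase between the first and last sample in the window, divided by the elapsed seconds. This is only a sketch of the concept. The real function additionally handles counter resets and extrapolates to the window boundaries, and `simple_rate` is a hypothetical helper name.

```shell
# simple_rate: read "TIMESTAMP_SECONDS VALUE" pairs (oldest first) on stdin and
# print (last value - first value) / (last timestamp - first timestamp).
simple_rate() {
    awk '
        NR == 1 { t0 = $1; v0 = $2 }
        { t1 = $1; v1 = $2 }
        END {
            if (NR < 2 || t1 == t0) {
                print "need at least two samples" > "/dev/stderr"
                exit 1
            }
            print (v1 - v0) / (t1 - t0)
        }
    '
}

# Three samples of a counter taken 60 seconds apart: it grew by 600 in 120 s.
printf '%s\n' '0 100' '60 400' '120 700' | simple_rate   # prints 5
```

With this reading, a rule such as rate(mysql_global_status_questions[5m]) > 500 fires when the counter has grown by more than 500 per second on average over the last five minutes.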