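In the previous three articles we installed prometheus, node_exporter, blackbox_exporter, and alertmanager. In this article we use those components and exporters to monitor three kinds of metrics: host resources, URLs, and SSL certificates.
The four installed components are laid out as follows: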
$ tree -L 2 /usr/local/monitor/
/usr/local/monitor/
├── alertmanager-0.22.2.linux-amd64
│   ├── alertmanager
│   ├── alertmanager.yml
│   ├── amtool
│   ├── LICENSE
│   └── NOTICE
├── blackbox_exporter-0.19.0.linux-amd64
│   ├── blackbox_exporter
│   ├── blackbox.yml
│   ├── LICENSE
│   └── NOTICE
├── node_exporter-1.1.2.linux-amd64
│   ├── LICENSE
│   ├── node_exporter
│   ├── nohup.out
│   └── NOTICE
└── prometheus-2.27.1.linux-amd64
    ├── console_libraries
    ├── consoles
    ├── data
    ├── LICENSE
    ├── nohup.out
    ├── NOTICE
    ├── prometheus
    ├── prometheus.yml
    └── promtool

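Next, we create a rules directory inside prometheus-2.27.1.linux-amd64 to hold all of the alerting rule files.

I. Create the alerting rule files

Everything in this part is done on the Prometheus server.

Reference: https://awesome-prometheus-alerts.grep.to/rules

#1. Create the rules directory

    $ mkdir -p /usr/local/monitor/prometheus-2.27.1.linux-amd64/rules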

#2. Monitor host hardware resources such as CPU, memory, and disk

    $ cd /usr/local/monitor/prometheus-2.27.1.linux-amd64/rules
    $ vim host-alerts.yml

Contents of host-alerts.yml:

    groups:
      - name: host-alerts
        rules:
          # Grafana-style selectors such as {instance=~"$instance"} are not
          # expanded in rule files, so the expression matches all instances.
          - alert: LowMemory
            expr: ((node_memory_MemFree_bytes / node_memory_MemTotal_bytes) * 100) < 3
            for: 1m
            labels:
              severity: warning
              at_mobiles: "10086"
            annotations:
              text: '{{ $labels.instance }} has had less than 3% free memory for more than 1 minute.'
          - alert: InstanceDown
            expr: up == 0
            for: 1m
            labels:
              severity: critical
              at_mobiles: "10086"
            annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} has been down for more than 1 minute."
          - alert: hostCpuUsageAlert
            expr: sum by (instance) (avg without (cpu) (irate(node_cpu_seconds_total{mode!="idle"}[5m]))) > 0.80
            for: 1m
            labels:
              severity: warning
              at_mobiles: "10086"
            annotations:
              summary: "Instance {{ $labels.instance }} CPU usage high"
              description: "{{ $labels.instance }} CPU usage above 80% (current value: {{ $value }})"
          - alert: hostMemUsageAlert
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.80
            for: 1m
            labels:
              severity: warning
              at_mobiles: "10086"
            annotations:
              summary: "Instance {{ $labels.instance }} MEM usage high"
              description: "{{ $labels.instance }} MEM usage above 80% (current value: {{ $value }})"
          - alert: hostFileSystemUsageAlert
            expr: 100 - (node_filesystem_free_bytes{fstype!~"rootfs|tmpfs|ext3"} / node_filesystem_size_bytes{fstype!~"rootfs|tmpfs|ext3"} * 100) > 80
            for: 1m
            labels:
              severity: warning
              at_mobiles: "10086"
            annotations:
              summary: "Instance {{ $labels.instance }} filesystem usage high"
              description: "{{ $labels.instance }} filesystem usage above 80% (current value: {{ $value }})"
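
To make the thresholds in hostMemUsageAlert and hostFileSystemUsageAlert easier to reason about, here is a minimal Python sketch that recomputes the same arithmetic outside Prometheus. The function names and the sample byte counts are hypothetical and are chosen only for illustration; the real values come from node_exporter.

```python
# Illustrative re-computation of two rule expressions from host-alerts.yml.
# Function names and sample numbers are assumptions, not part of Prometheus.

GIB = 1024 ** 3


def mem_usage_ratio(mem_total_bytes: float, mem_available_bytes: float) -> float:
    """(MemTotal - MemAvailable) / MemTotal, as in hostMemUsageAlert."""
    return (mem_total_bytes - mem_available_bytes) / mem_total_bytes


def fs_usage_percent(free_bytes: float, size_bytes: float) -> float:
    """100 - (free / size * 100), as in hostFileSystemUsageAlert."""
    return 100 - (free_bytes / size_bytes * 100)


# A 16 GiB host with 2 GiB available: the ratio exceeds the 0.80 threshold.
ratio = mem_usage_ratio(16 * GIB, 2 * GIB)
print(ratio, ratio > 0.80)  # 0.875 True

# A 100 GiB filesystem with 25 GiB free: 75% used, below the threshold of 80.
pct = fs_usage_percent(25 * GIB, 100 * GIB)
print(pct, pct > 80)  # 75.0 False
```

Note that the memory rule compares a ratio against 0.80, while the filesystem rule compares a percentage against 80. The two scales are easy to mix up when adjusting thresholds.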

#3. Monitor internal or external HTTP endpoints

    $ vim URL-alerts.yml

Contents of URL-alerts.yml:

    groups:
      - name: url-alerts
        rules:
          - alert: EndpointDown
            expr: probe_success == 0
            for: 3s
            labels:
              severity: "critical"
              at_mobiles: "10086"
            annotations:
              text: '{{ $labels.instance }} has been down for more than 3 seconds.'
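
The `for: 3s` clause interacts with the rule evaluation interval: rules are only checked every `evaluation_interval` (15s in the prometheus.yml used in Part II), so the hold duration can only elapse on an evaluation tick. The sketch below is a simplified model of that behaviour, not Prometheus's actual implementation; `first_firing_time` and its inputs are hypothetical names.

```python
from typing import List, Optional


def first_firing_time(probe_results: List[int],
                      eval_interval_s: int,
                      for_s: int) -> Optional[int]:
    """Return the evaluation time (in seconds) at which the alert first fires.

    probe_results[i] is the probe_success value seen at evaluation i.
    Simplified model: the alert turns pending when `probe_success == 0`
    first holds, and fires once it has held for at least `for_s` seconds
    over consecutive evaluations. Returns None if it never fires.
    """
    active_since: Optional[int] = None
    for i, success in enumerate(probe_results):
        now = i * eval_interval_s
        if success == 0:
            if active_since is None:
                active_since = now  # alert becomes pending
            if now - active_since >= for_s:
                return now  # alert becomes firing
        else:
            active_since = None  # condition cleared, pending state is dropped
    return None


# Probe fails from the second evaluation onwards: pending at 15s, firing at 30s.
print(first_firing_time([1, 0, 0, 0], eval_interval_s=15, for_s=3))  # 30

# A single failed evaluation followed by a success never fires.
print(first_firing_time([1, 0, 1, 0], eval_interval_s=15, for_s=3))  # None
```

Under this model, any `for` value between 1s and 15s behaves the same with a 15s evaluation interval: the alert needs two consecutive failing evaluations before it fires.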

#4. Monitor SSL certificate expiry

    $ vim ssl-alerts.yml

Contents of ssl-alerts.yml:

    groups:
      - name: ssl-alerts
        rules:
          - alert: SSLCertExpiringSoon
            expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 3
            for: 10m
            labels:
              severity: "critical"
              at_mobiles: "10086"
            annotations:
              # text: '{{ $labels.instance }} certificate will expire in 30 days.'
              text: '{{ $labels.instance }} certificate expires in less than 3 days.'
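
The rule expression subtracts the current Unix time from the certificate's expiry timestamp and compares the remaining seconds against `86400 * 3` (three days). A small Python sketch of the same arithmetic follows; the helper names and the sample dates are made up for illustration.

```python
from datetime import datetime, timezone

# Same constant as in the rule: 86400 seconds per day, times 3 days.
THRESHOLD_SECONDS = 86400 * 3


def seconds_until_expiry(cert_expiry_ts: float, now_ts: float) -> float:
    """Mirror of `probe_ssl_earliest_cert_expiry - time()`."""
    return cert_expiry_ts - now_ts


def expiring_soon(cert_expiry_ts: float, now_ts: float) -> bool:
    """True when fewer than 3 days remain, i.e. when the rule condition holds."""
    return seconds_until_expiry(cert_expiry_ts, now_ts) < THRESHOLD_SECONDS


# Hypothetical timestamps, all in UTC.
now = datetime(2021, 6, 1, tzinfo=timezone.utc).timestamp()
expires_in_2_days = datetime(2021, 6, 3, tzinfo=timezone.utc).timestamp()
expires_in_30_days = datetime(2021, 7, 1, tzinfo=timezone.utc).timestamp()

print(THRESHOLD_SECONDS)                       # 259200
print(expiring_soon(expires_in_2_days, now))   # True
print(expiring_soon(expires_in_30_days, now))  # False
```

To warn earlier, for example 30 days ahead, only the multiplier changes: `86400 * 30`.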

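II. Modify the main Prometheus configuration file

Edit /usr/local/monitor/prometheus-2.27.1.linux-amd64/prometheus.yml so that it loads the rule files created in Part I, then add the host IPs and the URL endpoints you want to monitor. The complete configuration file looks like this: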

    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["localhost:9093"] # added: the local Alertmanager
            # - alertmanager:9093

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - /usr/local/monitor/prometheus-2.27.1.linux-amd64/rules/*.yml

    #------------------------------ Host hardware monitoring ------------------------------
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
            labels:
              instance: prometheus
      - job_name: 'hz-p-inner'
        static_configs:
          - targets: ['172.17.3.194:9100']
            labels:
              instance: hz-p-inner
      #------------------------------ SSL certificate expiry ------------------------------
      - job_name: 'blackbox'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - https://p.coh.123.com
              - http://p.test.123.com/wechat
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9115
      #------------------------------ HTTP 200 endpoint checks ------------------------------
      - job_name: 'blackbox_http_2xx_post'
        metrics_path: /probe
        params:
          module: [http_post_2xx]
        static_configs:
          - targets:
              - https://www.123.com/appfi/receive
              - http://coa.123.com/mini/ver_code
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 127.0.0.1:9115
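
The three `relabel_configs` steps in the blackbox jobs are easy to misread, so here is a rough Python sketch of what they do to a single target. It only mimics the label rewriting for illustration and is not how Prometheus implements relabeling; `relabel` and `scrape_url` are hypothetical helper names.

```python
from urllib.parse import urlencode

# Address of blackbox_exporter, taken from the `replacement` field in the config.
BLACKBOX_ADDR = "127.0.0.1:9115"


def relabel(target: str) -> dict:
    """Apply the three relabel steps from the blackbox jobs to one target."""
    labels = {"__address__": target}
    # Step 1: copy the original target into the `target` URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed URL as the `instance` label on the metrics.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at blackbox_exporter instead.
    labels["__address__"] = BLACKBOX_ADDR
    return labels


def scrape_url(labels: dict, module: str) -> str:
    """Build the URL that ends up being scraped (metrics_path is /probe)."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}"


labels = relabel("https://p.coh.123.com")
print(labels["instance"])
# https://p.coh.123.com
print(scrape_url(labels, "http_2xx"))
# http://127.0.0.1:9115/probe?module=http_2xx&target=https%3A%2F%2Fp.coh.123.com
```

In other words, Prometheus never contacts the listed URLs itself. It asks blackbox_exporter on port 9115 to probe them, while the `instance` label still shows the URL that was probed, which is what the alert annotations print.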