Overview

Monitoring is generally divided into two kinds: white-box monitoring and black-box monitoring.

[Figure 1: Black-box monitoring with blackbox_exporter]

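Black-box monitoring focuses on symptoms, that is, what is happening right now as seen from the user's side: an alert fires, or a business API stops responding correctly. Its purpose is to alert on failures that are in progress.

White-box monitoring focuses on causes, using metrics that the system exposes internally. For example, the output of the Redis info command may show that a Redis slave is down. Black-box monitoring might only tell you that Redis is down, while the internal information shows that connections to the Redis port are being refused.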

Blackbox Exporter

Blackbox Exporter is the official black-box monitoring solution from the Prometheus community. It lets you probe endpoints over HTTP, HTTPS, DNS, TCP, and ICMP.

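1. HTTP probes

  • Define request headers
  • Check the HTTP status code, response headers, and response body

2. TCP probes

  • Check that the ports of business components are listening
  • Define and check application-layer protocols

3. ICMP probes

  • Host liveness checks

4. POST probes

  • API connectivity

5. SSL certificate expiry time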

Installing Blackbox Exporter

(1) Create the YAML manifest (blackbox-deploymeny.yaml)

    # Service that exposes the exporter inside the cluster
    apiVersion: v1
    kind: Service
    metadata:
      name: blackbox
      namespace: monitoring
      labels:
        app: blackbox
    spec:
      selector:
        app: blackbox
      ports:
      - port: 9115
        targetPort: 9115
    ---
    # Probe modules, mounted into the exporter as blackbox.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: blackbox-config
      namespace: monitoring
    data:
      blackbox.yaml: |-
        modules:
          http_2xx:
            prober: http
            timeout: 10s
            http:
              # Go reports HTTP/2 responses as "HTTP/2.0"
              valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
              valid_status_codes: [200]
              method: GET
              preferred_ip_protocol: "ip4"
          http_post_2xx:
            prober: http
            timeout: 10s
            http:
              valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
              valid_status_codes: [200]
              method: POST
              preferred_ip_protocol: "ip4"
          tcp_connect:
            prober: tcp
            timeout: 10s
          ping:
            prober: icmp
            timeout: 5s
            icmp:
              preferred_ip_protocol: "ip4"
          dns:
            prober: dns
            dns:
              transport_protocol: "tcp"
              preferred_ip_protocol: "ip4"
              query_name: "kubernetes.default.svc.cluster.local"
    ---
    # The exporter itself
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: blackbox
      namespace: monitoring
    spec:
      selector:
        matchLabels:
          app: blackbox
      template:
        metadata:
          labels:
            app: blackbox
        spec:
          containers:
          - name: blackbox
            image: prom/blackbox-exporter:v0.18.0
            args:
            - "--config.file=/etc/blackbox_exporter/blackbox.yaml"
            - "--log.level=error"
            ports:
            - containerPort: 9115
            volumeMounts:
            - name: config
              mountPath: /etc/blackbox_exporter
          volumes:
          - name: config
            configMap:
              name: blackbox-config

(2) Apply the manifest

    kubectl apply -f blackbox-deploymeny.yaml
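
Before wiring the exporter into Prometheus, it can help to confirm that the Deployment actually came up. The following is only a sketch: check_blackbox is a hypothetical helper name and the 120s timeout is an arbitrary choice, while the resource names and namespace come from the manifest above.

```shell
# Hypothetical helper: wait for the blackbox Deployment to finish rolling out,
# then list its Service. Names and namespace are taken from the manifest.
check_blackbox() {
  kubectl -n monitoring rollout status deployment/blackbox --timeout=120s &&
    kubectl -n monitoring get svc blackbox
}

# Usage (requires access to the cluster):
#   check_blackbox
```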

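Configuring monitoring

Because Prometheus in this cluster was deployed with the Prometheus Operator, the scrape jobs are added as additional scrape configs.

(1) Create a file named prometheus-additional.yaml with the following content: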

    - job_name: "ingress-endpoint-status"
      metrics_path: /probe
      params:
        module: [http_2xx]  # Look for an HTTP 200 response.
      static_configs:
      - targets:
        - http://172.17.100.134/healthz
        labels:
          group: nginx-ingress
      relabel_configs:
      # Pass the original target to the exporter as the "target" URL parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the original target as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the Blackbox Exporter instead of the target itself
      - target_label: __address__
        replacement: blackbox.monitoring:9115
    - job_name: "kubernetes-service-dns"
      metrics_path: /probe
      params:
        module: [dns]
      static_configs:
      - targets:
        - kube-dns.kube-system:53
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox.monitoring:9115
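
The net effect of the three relabel rules is that Prometheus scrapes the exporter's /probe endpoint and passes the original target and the module as URL parameters. The sketch below builds that URL so a probe can be reproduced by hand; probe_url is a hypothetical helper name, and the target is not URL-encoded, which is an assumption that holds for simple targets like the ones in this article.

```shell
# Hypothetical helper: print the URL that Prometheus effectively requests
# after relabeling.
#   $1 = exporter address (the rewritten __address__)
#   $2 = module name      (params.module)
#   $3 = original target  (copied into __param_target)
probe_url() {
  printf 'http://%s/probe?module=%s&target=%s\n' "$1" "$2" "$3"
}

# Usage (from a pod inside the cluster, where the Service name resolves):
#   curl -s "$(probe_url blackbox.monitoring:9115 http_2xx http://172.17.100.134/healthz)"
```

Appending &debug=true to the URL makes the exporter return a log of the probe along with the metrics, which is useful when a target unexpectedly reports as down.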

(2) Create the secret

    kubectl -n monitoring create secret generic additional-config --from-file=prometheus-additional.yaml

(3) Edit the Prometheus configuration file, prometheus-prometheus.yaml, and add the following three lines under spec:

    additionalScrapeConfigs:
      name: additional-config
      key: prometheus-additional.yaml

The complete configuration is as follows:

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      labels:
        prometheus: k8s
      name: k8s
      namespace: monitoring
    spec:
      alerting:
        alertmanagers:
        - name: alertmanager-main
          namespace: monitoring
          port: web
      baseImage: quay.io/prometheus/prometheus
      nodeSelector:
        kubernetes.io/os: linux
      podMonitorNamespaceSelector: {}
      podMonitorSelector: {}
      replicas: 2
      resources:
        requests:
          memory: 400Mi
      ruleSelector:
        matchLabels:
          prometheus: k8s
          role: alert-rules
      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
      additionalScrapeConfigs:
        name: additional-config
        key: prometheus-additional.yaml
      serviceAccountName: prometheus-k8s
      serviceMonitorNamespaceSelector: {}
      serviceMonitorSelector: {}
      version: v2.11.0
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: managed-nfs-storage
            resources:
              requests:
                storage: 10Gi

(4) Re-apply the configuration

    kubectl apply -f prometheus-prometheus.yaml

(5) Reload Prometheus
First find the IP of the Service:

    # kubectl get svc -n monitoring -l prometheus=k8s
    NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
    prometheus-k8s   ClusterIP   10.99.93.157   <none>        9090/TCP   33m

Then trigger a reload with:

    curl -X POST "http://10.99.93.157:9090/-/reload"

After any later change to the configuration file, these three commands are enough:

    kubectl delete secret additional-config -n monitoring
    kubectl -n monitoring create secret generic additional-config --from-file=prometheus-additional.yaml
    curl -X POST "http://10.99.93.157:9090/-/reload"
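
Deleting and recreating the secret leaves a brief window in which it does not exist. One common alternative, sketched here, is to render the secret with a client-side dry run and pipe it to kubectl apply, which updates it in place. update_scrape_config is a hypothetical helper name; --dry-run=client assumes a reasonably recent kubectl, and the Prometheus URL is whatever address you found in step (5).

```shell
# Hypothetical helper: update the additional-config secret in place, then ask
# Prometheus to reload.
#   $1 = Prometheus base URL, $2 = scrape config file (optional)
update_scrape_config() {
  local prom_url="$1" file="${2:-prometheus-additional.yaml}"
  # Render the secret locally and apply it, so an existing secret is updated
  # rather than deleted and recreated.
  kubectl -n monitoring create secret generic additional-config \
    --from-file="$file" --dry-run=client -o yaml |
    kubectl -n monitoring apply -f - &&
    curl -fsS -X POST "${prom_url}/-/reload"
}

# Usage (requires access to the cluster):
#   update_scrape_config http://10.99.93.157:9090
```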

After a short wait, the new targets appear on the Targets page of the Prometheus web UI.

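ICMP monitoring

The ICMP probe checks whether a target host is reachable, in the same way the ping command does.
The configuration is as follows: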

    - job_name: "node-icmp-status"
      metrics_path: /probe
      params:
        module: [ping]  # ICMP echo (ping) probe.
      static_configs:
      - targets:
        - 172.17.100.134
        - 172.17.100.50
        - 172.17.100.135
        - 172.17.100.136
        - 172.17.100.137
        - 172.17.100.138
        labels:
          group: k8s-node-ping
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox.monitoring:9115

Then reload the configuration:

    kubectl delete secret additional-config -n monitoring
    kubectl -n monitoring create secret generic additional-config --from-file=prometheus-additional.yaml
    curl -X POST "http://10.99.93.157:9090/-/reload"

Once the reload completes, the hosts show up in Prometheus as successfully monitored targets.

HTTP monitoring

An HTTP probe checks whether an application is healthy by sending it a GET or POST request.
The example here uses GET.

    - job_name: "check-web-status"
      metrics_path: /probe
      params:
        module: [http_2xx]  # Look for an HTTP 200 response.
      static_configs:
      - targets:
        - https://www.coolops.cn
        - https://www.baidu.com
        labels:
          group: web-url
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox.monitoring:9115

After the configuration is reloaded, the new targets appear in Prometheus.

TCP monitoring

A TCP probe checks a port by attempting to connect to it, much as Telnet would. The configuration is as follows:

    - job_name: "check-middleware-tcp"
      metrics_path: /probe
      params:
        module: [tcp_connect]  # Succeeds if a TCP connection can be established.
      static_configs:
      - targets:
        - 172.17.100.135:80
        - 172.17.100.74:3306
        - 172.17.100.25:3306
        - 172.17.100.8:3306
        - 172.17.100.75:3306
        - 172.17.100.72:3306
        - 172.17.100.73:3306
        labels:
          group: middleware-tcp
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox.monitoring:9115

After the configuration is reloaded, the new targets appear in Prometheus.

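Alerting rules

1. Service availability

  • For ICMP, TCP, HTTP, and POST probes, the probe_success metric shows whether the probe succeeded
  • probe_success == 0 ## connectivity failure
  • probe_success == 1 ## connectivity OK
  • The alert checks whether this metric equals 0 and fires when it does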
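
The same check can be made by hand against a single probe response, which is handy for confirming an alert. This is a minimal sketch that assumes the exporter's response is in the usual Prometheus text format; probe_ok is a hypothetical helper name.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# succeed only if a probe_success sample is present and equals 1.
probe_ok() {
  awk '$1 == "probe_success" { found = 1; ok = ($2 == 1) }
       END { exit !(found && ok) }'
}

# Usage (from a pod inside the cluster):
#   curl -s "http://blackbox.monitoring:9115/probe?module=http_2xx&target=https://www.baidu.com" \
#     | probe_ok && echo "probe succeeded" || echo "probe failed"
```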


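2. The http module also exposes the certificate expiry time, so alerts can be based on it

probe_ssl_earliest_cert_expiry: the expiry time of the certificate, as a Unix timestamp in seconds.
Subtracting the current time and converting seconds to days gives the number of days remaining: (probe_ssl_earliest_cert_expiry - time())/86400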
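
To make the unit conversion concrete, here is the same arithmetic as a small shell sketch. cert_days_left is a hypothetical helper, and the timestamps in the usage example are made-up values chosen so the result is easy to check. Unlike PromQL, shell arithmetic is integer-only, so the result is truncated to whole days.

```shell
# Hypothetical helper: whole days between now and a certificate's expiry.
#   $1 = expiry as a Unix timestamp in seconds
#   $2 = current time as a Unix timestamp (defaults to the system clock)
cert_days_left() {
  local expiry="$1" now="${2:-$(date +%s)}"
  echo $(( (expiry - now) / 86400 ))
}

# Usage with made-up timestamps exactly 30 days apart:
#   cert_days_left 1702592000 1700000000   # prints 30
```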
3. Combining the above, we can define the following alerting rules

    groups:
    - name: blackbox_network_stats
      rules:
      - alert: blackbox_network_stats
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint/host/port {{ $labels.instance }} is unreachable"
          description: "Endpoint/host/port {{ $labels.instance }} is unreachable"

SSL check

    groups:
    - name: check_ssl_status
      rules:
      - alert: "SSL certificate expiry warning"
        # Fires when fewer than 30 days remain before the certificate expires
        expr: (probe_ssl_earliest_cert_expiry - time())/86400 < 30
        for: 1h
        labels:
          severity: warn
        annotations:
          description: 'The certificate for {{ $labels.instance }} expires in {{ printf "%.1f" $value }} days; please renew it soon'
          summary: "SSL certificate expiry warning"

Grafana dashboard

Import dashboard ID 12559 into Grafana and it works as is.