Black-box monitoring with Prometheus

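Most exporters are installed on the machine they monitor. When you need to check TCP ports or the state of a layer-7 application instead, black-box monitoring is the tool: the blackbox exporter probes targets from the outside, so nothing has to be installed on them.

Its HTTP endpoint listens on port 9115 by default. Its configuration file, blackbox.yml, defines probing modules built on the basic probers (http, dns, tcp, icmp), and the Prometheus server configuration declares which module probes which targets. The setup below starts a set of instances with docker-compose. Docker networks come with built-in DNS, so every address is written as a service name rather than an IP.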

docker-compose.yml

```yaml
version: '3.4'
services:
  prometheus:
    image: prom/prometheus:v2.15.1
    hostname: prometheus
    volumes:
      - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules:/etc/prometheus/alert.rules
      - prometheus_data:/prometheus
    command:
      - '--web.enable-lifecycle'
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - '9090:9090'
    networks:
      prometheus:
        aliases:
          - prometheus
    logging:
      driver: json-file
      options:
        max-file: '3'
        max-size: 100m
  node-exporter:
    image: prom/node-exporter:v0.18.1
    hostname: node-exporter
    volumes:
      - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
    ports:
      - '9100:9100'
    networks:
      prometheus:
        aliases:
          - exporter
    logging:
      driver: json-file
      options:
        max-file: '3'
        max-size: 100m
  black-exporter:
    image: prom/blackbox-exporter:v0.16.0
    hostname: black-exporter
    volumes:
      - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
      - ./blackbox.yml:/config/blackbox.yml
    command:
      - '--config.file=/config/blackbox.yml'
    ports:
      - '9115:9115'
    networks:
      prometheus:
        aliases:
          - black-exporter
    logging:
      driver: json-file
      options:
        max-file: '3'
        max-size: 100m
  grafana:
    image: grafana/grafana:6.5.2
    hostname: grafana
    volumes:
      - /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=pass
    depends_on:
      - prometheus
    ports:
      - '3000:3000'
    networks:
      prometheus:
        aliases:
          - grafana
    logging:
      driver: json-file
      options:
        max-file: '3'
        max-size: 100m
networks:
  prometheus:
    driver: bridge
volumes:
  grafana_data: {}
  prometheus_data: {}
```

prometheus.yml

```yaml
global:
  scrape_interval: 5s
  external_labels:
    monitor: 'my-monitor'
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']
  - job_name: 'balck_box'
    scrape_interval: 10s
    static_configs:
      - targets: ['black-exporter:9115']
  - job_name: 'balck_test'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - 120.52.137.xxx:81
          - xxxxxx:123
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: black-exporter:9115
```

blackbox.yml

```yaml
modules:
  http_2xx_example: # module name; any valid name works
    prober: http # prober type
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      valid_status_codes: [] # Defaults to 2xx
      method: GET
      headers:
        Host: vhost.example.com
        Accept-Language: en-US
      no_follow_redirects: false
      fail_if_ssl: false
      fail_if_not_ssl: false
      fail_if_matches_regexp:
        - "Could not connect to database"
      fail_if_not_matches_regexp:
        - "Download the latest version here"
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      ip_protocol_fallback: false # no fallback to "ip6"
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}'
  http_basic_auth_example:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Host: "login.example.com"
      basic_auth:
        username: "username"
        password: "mysecret"
  http_custom_ca_example:
    prober: http
    http:
      method: GET
      tls_config:
        ca_file: "/certs/my_cert.crt"
  tls_connect_tls:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
  tcp_connect:
    prober: tcp
    timeout: 5s
  imap_starttls:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - expect: "OK.*STARTTLS"
        - send: ". STARTTLS"
        - expect: "OK"
        - starttls: true
        - send: ". capability"
        - expect: "CAPABILITY IMAP4rev1"
  smtp_starttls:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - expect: "^220 ([^ ]+) ESMTP (.+)$"
        - send: "EHLO prober"
        - expect: "^250-STARTTLS"
        - send: "STARTTLS"
        - expect: "^220"
        - starttls: true
        - send: "EHLO prober"
        - expect: "^250-AUTH"
        - send: "QUIT"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-"
  irc_banner_example:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  icmp_example:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
      source_ip_address: "127.0.0.1"
  dns_udp_example:
    prober: dns
    timeout: 5s
    dns:
      query_name: "www.prometheus.io"
      query_type: "A"
      valid_rcodes:
        - NOERROR
      validate_answer_rrs:
        fail_if_matches_regexp:
          - ".*127.0.0.1"
        fail_if_not_matches_regexp:
          - "www.prometheus.io.\t300\tIN\tA\t127.0.0.1"
      validate_authority_rrs:
        fail_if_matches_regexp:
          - ".*127.0.0.1"
      validate_additional_rrs:
        fail_if_matches_regexp:
          - ".*127.0.0.1"
  dns_soa:
    prober: dns
    dns:
      query_name: "prometheus.io"
      query_type: "SOA"
  dns_tcp_example:
    prober: dns
    dns:
      transport_protocol: "tcp" # defaults to "udp"
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      query_name: "www.prometheus.io"
```

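The probe definitions above follow the official demo. The probing job in the Prometheus configuration above is the final version; for a simple probe you can start by writing it like this: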

```yaml
- job_name: 'balck_test'
  metrics_path: /probe
  params:
    module: [tcp_connect]
    target:
      - 120.52.137.xxx:81
      - xxxx:44
  static_configs:
    - targets: ['black-exporter:9115']
```
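The entries under `params` are sent as query parameters to the blackbox exporter's HTTP interface: Prometheus calls the `/probe` route on black-exporter:9115 with `module` and the `target` to probe. That means the same request can be reproduced with curl (a Prometheus scrape is exactly this kind of HTTP request) to see the metrics it returns. Note that in this simple form the results carry the exporter's own address as their `instance` label, and the exporter reads a single `target` value per request, so it is best suited to one target per job. The example below uses curl to have the blackbox exporter check 192.168.1.2 with the `ping` module: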
```shell
$ curl "http://127.0.0.1:9115/probe?module=ping&target=192.168.1.2"
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 2.6453e-05
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.000351649
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
```
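The curl call asks for a module named `ping`, which is not among the modules in the blackbox.yml listed earlier. A minimal sketch of such a module, assuming it simply wraps the icmp prober the way `icmp_example` does, might look like the following; the module name `ping` and the timeout value are assumptions chosen to match the URL, not taken from the original setup:

```yaml
modules:
  ping:                # assumed name, matching module=ping in the curl URL
    prober: icmp       # same prober as icmp_example above
    timeout: 5s        # assumed value
    icmp:
      preferred_ip_protocol: "ip4"
```

If the requested module is not defined in the loaded configuration, the exporter rejects the probe request instead of returning metrics, so the name in the URL has to match a key under `modules`.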
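The file I provided uses relabeling instead. There the job lists the probed hosts as its targets, but relabeling rewrites the address so that the request is finally sent to the blackbox exporter's port. These are the two common ways to write a probe job. The `params` form, however, gives no way to attach labels to individual targets, since `params` does not support labels, so Prometheus relabeling is what makes that possible. This is the relevant part of the file I provided: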
```yaml
- job_name: 'balck_test'
  metrics_path: /probe
  params:
    module: [tcp_connect]
  static_configs:
    - targets:
        - 120.52.137.xxx:81
        - xxxxxx:123
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: black-exporter:9115
```
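  • Step 1: take each target's `__address__` value and write it into `__param_target`. For any label of the form `__param_<name>`, the name and its value are added as a URL query parameter on the request sent to the blackbox exporter; for example, `__param_module` corresponds to `module` under `params`.
  • Step 2: take the value of `__param_target` and write it to the `instance` label.
  • Step 3: overwrite the target's `__address__` label with the address of the Blackbox Exporter instance.
  • Step 4: send the request to black-exporter:9115 to fetch the metrics for that target.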
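With relabeling in place, extra labels can be attached per target group through the `labels` field of `static_configs`. The following is a minimal sketch of that idea rather than part of the original setup; the label names `env` and `service` and their values are made up for illustration:

```yaml
- job_name: 'balck_test'
  metrics_path: /probe
  params:
    module: [tcp_connect]
  static_configs:
    - targets:
        - 120.52.137.xxx:81
      labels:              # hypothetical labels, applied to every target in this group
        env: prod
        service: gateway
    - targets:
        - xxxxxx:123
      labels:
        env: test
        service: internal
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: black-exporter:9115
```

Each probe result then carries `instance` (the probed address) plus the group's labels, which makes it possible to filter or route alerts by them.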

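In addition, while monitoring our SUSE hosts directly we found that layer 4 stays reachable even when the kernel hangs: both ssh and telnet still get the OpenSSH banner back, so the `ssh_banner` module considers the host alive. We therefore decided to monitor the application layer instead. A colleague described the failure symptom as an error when logging in with the SAP client, so I captured his login with tcpdump, loaded the capture into Wireshark, and turned the HTTP request headers into a module. Since then, kernel hangs have been alerted on promptly.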

```yaml
# blackbox.yml, under modules:
http_post_sap:
  prober: http
  timeout: 3s
  http:
    method: POST
    headers:
      # headers copied from the captured SAP client login request
      POST: '/SAPControl HTTP/1.1'
      Accept: 'text/xml, text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2'
      Content-Type: 'text/xml; charset=utf-8'
      Cache-Control: 'no-cache'
      Pragma: 'no-cache'
      User-Agent: 'Java/1.8.0_172'
      Connection: 'keep-alive'
      Content-Length: '200'
    body: |
      <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:q0="urn:SAPControl"><SOAP-ENV:Header/><SOAP-ENV:Body><q0:GetInstanceProperties/></SOAP-ENV:Body></SOAP-ENV:Envelope>
```

The matching scrape job in prometheus.yml:

```yaml
- job_name: 'hana_up'
  scrape_interval: 4s
  metrics_path: /probe
  params:
    module: ['http_post_sap']
  static_configs:
    - targets:
        - "http://10.20.4.14:50013/SAPControl"
        - "http://10.20.4.4:50013/SAPControl"
        - "http://10.20.4.9:50013/SAPControl"
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: black-exporter:9115
```
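The alert rule that fires on these probes is not shown here. A minimal sketch of what it could look like, assuming the `hana_up` job above and the standard `probe_success` metric; the group name, alert name, `for` duration, and severity label are all hypothetical:

```yaml
groups:
  - name: sap_probe.rules            # hypothetical group name
    rules:
      - alert: SAPControlProbeFailed # hypothetical alert name
        # probe_success is 0 when the blackbox probe fails or times out
        expr: probe_success{job="hana_up"} == 0
        for: 30s                     # assumed; tune against the 4s scrape interval
        labels:
          severity: critical         # assumed label
        annotations:
          summary: "SAPControl probe failed for {{ $labels.instance }}"
```

A short `for` keeps the alert fast while still tolerating a single failed scrape.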

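Monitoring SSL certificate expiry
An HTTP GET probe already returns the certificate expiry time as a metric, `probe_ssl_earliest_cert_expiry`, as long as the target is served over HTTPS, so the main work is the alert expression.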

```yaml
# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      preferred_ip_protocol: "ip4" # set this when probing over IPv4; IPv6 is still rarely used in China
```

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx] # Look for a HTTP 200 response.
    static_configs:
      - targets:
          - example.com # Target to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: black-exporter:9115
```

Alert rule

```yaml
groups:
  - name: ssl_expiry.rules
    rules:
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30
        for: 20m
```
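For Prometheus to evaluate the rule, the file has to be loaded through `rule_files`. The compose file above mounts `./alert.rules` at `/etc/prometheus/alert.rules`, but the prometheus.yml shown earlier has no `rule_files` entry. A sketch of the missing piece, assuming the rule group above is saved in that file:

```yaml
# prometheus.yml, top level, next to global and scrape_configs
rule_files:
  - /etc/prometheus/alert.rules   # container path from the compose volume mount
```

Because the compose file starts Prometheus with `--web.enable-lifecycle`, the configuration can be reloaded afterwards with an HTTP POST to `/-/reload` on port 9090 instead of restarting the container.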