title: Deploying Prometheus to monitor a Kubernetes cluster
tags: prometheus
date: 2020-09-21
categories: monitoring

Notes on deploying Prometheus, Alertmanager, node-exporter and Grafana on Kubernetes, on pulling data from cAdvisor, and finally on monitoring the cluster's own components: controller-manager, scheduler and etcd.

My Kubernetes cluster version is 1.18.3.

Deploying node-exporter

```bash
$ kubectl create ns prom    # create the namespace
# write the manifest
$ cat > node-export.yaml << EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: prom
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.0.1
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9100
        resources:
          requests:
            cpu: 0.15
        securityContext:
          privileged: true
        args:
        - --path.procfs
        - /host/proc
        - --path.sysfs
        - /host/sys
        - --collector.filesystem.ignored-mount-points
        - '"^/(sys|proc|dev|host|etc)($|/)"'
        volumeMounts:
        - name: dev
          mountPath: /host/dev
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: rootfs
          mountPath: /rootfs
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: dev
        hostPath:
          path: /dev
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /
EOF
$ kubectl apply -f node-export.yaml    # apply the manifest
$ kubectl get pods -n prom -o wide     # confirm node-exporter runs on every node, including the master
NAME                  READY   STATUS    RESTARTS   AGE   IP             NODE     NOMINATED NODE   READINESS GATES
node-exporter-nrjrr   1/1     Running   0          30h   192.168.20.3   node01   <none>           <none>
node-exporter-tmlnz   1/1     Running   0          30h   192.168.20.4   node02   <none>           <none>
node-exporter-v89dk   1/1     Running   0          30h   192.168.20.2   master   <none>           <none>
```

Open any node's IP on port 9100 in a browser:

[Figure 1]

If you see data like the following, node-exporter is working:

[Figure 2]

Creating a user to fetch cAdvisor data

cAdvisor is Google's open-source tool for container resource monitoring and performance analysis, built specifically for containers. In Kubernetes there is nothing extra to install: cAdvisor is built into the kubelet, so we can collect every container-related metric from it directly.

```bash
$ kubectl create serviceaccount monitor -n prom    # create the service account
# bind it to a cluster role
$ kubectl create clusterrolebinding monitor-clusterrolebinding -n prom --clusterrole=cluster-admin --serviceaccount=prom:monitor
$ kubectl get secret -n prom    # list the secrets
NAME                  TYPE                                  DATA   AGE
default-token-smhrm   kubernetes.io/service-account-token   3      30h
monitor-token-lvzt9   kubernetes.io/service-account-token   3      19s
$ kubectl describe secret monitor-token-lvzt9 -n prom    # view the token
.......... # copy the token value
token: eyJhbGciOiJSUzI1NiIsImtpZCI6ImplanRmbkswLXFmZ1J5QXlweXR3Y0RTengwR3c5S2x6R3pPVTJVR0llWGcifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJwcm9tIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6Im1vbml0b3ItdG9rZW4tbHZ6dDkiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoibW9uaXRvciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjFlYjdiYjg1LTA5N2EtNGNiZS04NTg3LWJkMjRlZjEzYmI3MCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpwcm9tOm1vbml0b3IifQ.L4AtiCKbApzYQ_tyXLfr-dZcbYMnDCWZil0hihzsl9t1EVbhywp0XaVS3ju6UTOuJ00M_JJ2U5pwcuDagReLrGexHJZ23BkDNm3vU99ruovMxRLvOHWz3491T_Rdl7TNc9s47RxD0S-JFp9-orB0NbYjKIrzCLVG0MW0W9-okTzCiN4OvF4rPssk8kbF8-ux4NU22hgf4hu_pQ5EYnzGsuEmHYBBExiWHiqBKgzX02pa3qn00L-UGJTf4TDtN1zRSQ54GO-MoiyxYBLZGpy9kCycwrwQ_SoS6ongqzoZsCO6jDPnR5KBPe283wcAahqASHgvgZSbBj7gw1a65pRSDA
# save the metrics to a file (replace the token with the one you copied above)
$ curl https://127.0.0.1:10250/metrics/cadvisor -k -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImplanRmbkswLXFmZ1J5QXlweXR3Y0RTengwR3c5S2x6R3pPVTJVR0llWGcifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJwcm9tIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6Im1vbml0b3ItdG9rZW4tbHZ6dDkiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoibW9uaXRvciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjFlYjdiYjg1LTA5N2EtNGNiZS04NTg3LWJkMjRlZjEzYmI3MCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpwcm9tOm1vbml0b3IifQ.L4AtiCKbApzYQ_tyXLfr-dZcbYMnDCWZil0hihzsl9t1EVbhywp0XaVS3ju6UTOuJ00M_JJ2U5pwcuDagReLrGexHJZ23BkDNm3vU99ruovMxRLvOHWz3491T_Rdl7TNc9s47RxD0S-JFp9-orB0NbYjKIrzCLVG0MW0W9-okTzCiN4OvF4rPssk8kbF8-ux4NU22hgf4hu_pQ5EYnzGsuEmHYBBExiWHiqBKgzX02pa3qn00L-UGJTf4TDtN1zRSQ54GO-MoiyxYBLZGpy9kCycwrwQ_SoS6ongqzoZsCO6jDPnR5KBPe283wcAahqASHgvgZSbBj7gw1a65pRSDA" > metrics.txt
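The saved file is in the Prometheus text exposition format: one sample per line, with optional `{label="value"}` pairs. As a rough sketch of what that format looks like, here is a minimal stdlib-only parser; the sample lines below are fabricated, not taken from a real cluster.

```python
import re

def parse_metrics(text):
    """Parse Prometheus text-exposition lines into (name, labels, value) tuples.

    Minimal sketch: ignores # HELP / # TYPE comments and timestamps.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)', line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples

# A fabricated sample in the same shape as the metrics/cadvisor output:
sample = '''
# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container="prometheus",cpu="total"} 12.5
'''
print(parse_metrics(sample))
```

In practice you would rarely parse this by hand (Prometheus does it for you), but it makes the structure of `metrics.txt` easy to see.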

The cAdvisor metrics look like this:

[Figure 3]

Typical metrics exposed by cAdvisor

| Metric | Type | Meaning |
| --- | --- | --- |
| container_cpu_load_average_10s | gauge | average CPU load of the container over the last 10 seconds |
| container_cpu_usage_seconds_total | counter | cumulative CPU time consumed per core (seconds) |
| container_cpu_system_seconds_total | counter | cumulative system CPU time (seconds) |
| container_cpu_user_seconds_total | counter | cumulative user CPU time (seconds) |
| container_fs_usage_bytes | gauge | filesystem usage inside the container (bytes) |
| container_fs_limit_bytes | gauge | total filesystem space available to the container (bytes) |
| container_fs_reads_bytes_total | counter | cumulative bytes read by the container |
| container_fs_writes_bytes_total | counter | cumulative bytes written by the container |
| container_memory_max_usage_bytes | gauge | maximum memory usage of the container (bytes) |
| container_memory_usage_bytes | gauge | current memory usage of the container (bytes) |
| container_spec_memory_limit_bytes | gauge | memory limit of the container (bytes) |
| machine_memory_bytes | gauge | total memory of the host (bytes) |
| container_network_receive_bytes_total | counter | cumulative bytes received over the network |
| container_network_transmit_bytes_total | counter | cumulative bytes transmitted over the network |

Metric types

There are four metric types: Counter, Gauge, Summary and Histogram. They mean the following:

- Counter: a counter only ever increases; never use it for a value that can decrease. Counters accumulate, e.g. total HTTP requests, total errors, total processing time (such as cumulative CPU time), total API requests, completed tasks.
- Gauge: a single numeric value that can go up and down arbitrarily. Used for CPU usage, memory usage, disk usage, temperature and the like, and also for counts that rise and fall, such as concurrent requests.
- Histogram: samples the distribution of observations over a time window, such as request durations or response sizes, and also exposes the sum of the observed values; the data can be rendered as a histogram.
  A histogram consists of `_bucket{le="..."}`, `_bucket{le="+Inf"}`, `_sum` and `_count` series, e.g.:
  apiserver_request_latencies_sum
  apiserver_request_latencies_count
  apiserver_request_latencies_bucket
- Summary: similar to a Histogram; it samples observations over a period (typically request durations or response sizes) and exposes the sum, the count, and quantiles of the observed values, consisting of `{quantile="<φ>"}`, `_sum` and `_count` series.

Of the four types, gauge and counter are by far the most common.
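Because a counter only accumulates, dashboards never plot its raw value; they derive a per-second rate from successive samples, which is the idea behind PromQL's `rate()`. A minimal sketch (the sample values are invented):

```python
def per_second_rate(samples):
    """Average per-second increase of a counter between consecutive
    (timestamp, value) samples -- the idea behind PromQL's rate().
    Counter resets (value decreasing) are skipped for simplicity."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 >= v0 and t1 > t0:
            rates.append((v1 - v0) / (t1 - t0))
    return rates

# Fabricated 15s samples of a counter such as container_cpu_usage_seconds_total:
samples = [(0, 100.0), (15, 103.0), (30, 109.0)]
print(per_second_rate(samples))
```

A gauge, by contrast, is meaningful as-is, which is why the alert rules below apply `rate()` only to `_total` counters and compare gauges like `process_open_fds` directly.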

Deploying Prometheus and Alertmanager

Generating the Prometheus configuration file

`prometheus-cfg.yaml` stores the Prometheus configuration in a ConfigMap. The jobs below use Kubernetes service discovery, which makes them more complex up front but easier to maintain later.

```yaml
---
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: prom
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kubernetes-node-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: kube-proxy CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 80%"
      - alert: kube-proxy CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 90%"
      - alert: scheduler CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 80%"
      - alert: scheduler CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 90%"
      - alert: controller-manager CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 80%"
      - alert: controller-manager CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 90%"
      - alert: apiserver CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 80%"
      - alert: apiserver CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 90%"
      - alert: etcd CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 80%"
      - alert: etcd CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.job}} component on {{$labels.instance}} is above 90%"
      - alert: kube-state-metrics CPU usage above 80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.k8s_app}} component on {{$labels.instance}} is above 80%"
          value: "{{ $value }}%"
          threshold: "80%"
      - alert: kube-state-metrics CPU usage above 90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.k8s_app}} component on {{$labels.instance}} is above 90%"
          value: "{{ $value }}%"
          threshold: "90%"
      - alert: coredns CPU usage above 80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 80
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{$labels.k8s_app}} component on {{$labels.instance}} is above 80%"
          value: "{{ $value }}%"
          threshold: "80%"
      - alert: coredns CPU usage above 90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{$labels.k8s_app}} component on {{$labels.instance}} is above 90%"
          value: "{{ $value }}%"
          threshold: "90%"
      - alert: kube-proxy open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kube-proxy open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-schedule open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-schedule"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-schedule open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-schedule"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-controller-manager"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-controller-manager"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-apiserver"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-apiserver"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-etcd open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-etcd"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-etcd open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-etcd"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "{{$labels.job}} on {{$labels.instance}} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: coredns
        expr: process_open_fds{k8s_app=~"kube-dns"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "plugin {{$labels.k8s_app}} ({{$labels.instance}}): more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: coredns
        expr: process_open_fds{k8s_app=~"kube-dns"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "plugin {{$labels.k8s_app}} ({{$labels.instance}}): more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kube-proxy
        expr: process_virtual_memory_bytes{job=~"kubernetes-kube-proxy"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): virtual memory usage above 2G"
          value: "{{ $value }}"
      - alert: scheduler
        expr: process_virtual_memory_bytes{job=~"kubernetes-schedule"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): virtual memory usage above 2G"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager
        expr: process_virtual_memory_bytes{job=~"kubernetes-controller-manager"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): virtual memory usage above 2G"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver
        expr: process_virtual_memory_bytes{job=~"kubernetes-apiserver"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): virtual memory usage above 2G"
          value: "{{ $value }}"
      - alert: kubernetes-etcd
        expr: process_virtual_memory_bytes{job=~"kubernetes-etcd"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): virtual memory usage above 2G"
          value: "{{ $value }}"
      - alert: kube-dns
        expr: process_virtual_memory_bytes{k8s_app=~"kube-dns"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "plugin {{$labels.k8s_app}} ({{$labels.instance}}): virtual memory usage above 2G"
          value: "{{ $value }}"
      - alert: HttpRequestsAvg
        expr: sum(rate(rest_client_requests_total{job=~"kubernetes-kube-proxy|kubernetes-kubelet|kubernetes-schedule|kubernetes-control-manager|kubernetes-apiservers"}[1m])) > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): TPS above 1000"
          value: "{{ $value }}"
          threshold: "1000"
      - alert: Pod_waiting
        expr: kube_pod_container_status_waiting_reason{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "namespace {{$labels.namespace}} ({{$labels.instance}}): container {{$labels.container}} in pod {{$labels.pod}} is stuck waiting at startup"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Pod_terminated
        expr: kube_pod_container_status_terminated_reason{namespace=~"kube-system|default|prom"} == 1
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "namespace {{$labels.namespace}} ({{$labels.instance}}): container {{$labels.container}} in pod {{$labels.pod}} has been terminated"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Etcd_leader
        expr: etcd_server_has_leader{job="kubernetes-etcd"} == 0
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): currently has no leader"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_leader_changes
        expr: rate(etcd_server_leader_changes_seen_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): the leader has changed"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_failed
        expr: rate(etcd_server_proposals_failed_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): proposal failures detected"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_db_total_size
        expr: etcd_debugging_mvcc_db_total_size_in_bytes{job="kubernetes-etcd"} > 10000000000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "component {{$labels.job}} ({{$labels.instance}}): db size above 10G"
          value: "{{ $value }}"
          threshold: "10G"
      - alert: Endpoint_ready
        expr: kube_endpoint_address_not_ready{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "namespace {{$labels.namespace}} ({{$labels.instance}}): endpoint {{$labels.endpoint}} is not available"
          value: "{{ $value }}"
          threshold: "1"
    - name: node-status-alerts
      rules:
      - alert: node CPU usage
        expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)*100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} CPU usage is too high"
          description: "CPU usage on {{ $labels.instance }} is above 90% (current: {{ $value }}); please investigate"
      - alert: node memory usage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} memory usage is too high"
          description: "memory usage on {{ $labels.instance }} is above 90% (current: {{ $value }}); please investigate"
      - alert: InstanceDown
        expr: up == 0
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: server is down"
          description: "{{ $labels.instance }}: target has been unreachable for more than 2 minutes"
      - alert: node disk I/O
        expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 60
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}} disk I/O usage is too high!"
          description: "{{$labels.mountpoint}} disk I/O is above 60% (current: {{$value}})"
      - alert: inbound network bandwidth
        expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}} inbound network bandwidth is too high!"
          description: "{{$labels.mountpoint}} inbound bandwidth has been above 100M for 5 minutes. RX usage: {{$value}}"
      - alert: outbound network bandwidth
        expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}} outbound network bandwidth is too high!"
          description: "{{$labels.mountpoint}} outbound bandwidth has been above 100M for 5 minutes. TX usage: {{$value}}"
      - alert: TCP sessions
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}} too many TCP_ESTABLISHED connections!"
          description: "{{$labels.mountpoint}} TCP_ESTABLISHED count is above 1000 (current: {{$value}})"
      - alert: disk capacity
        expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) > 80
        for: 2s
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.mountpoint}} partition usage is too high!"
          description: "{{$labels.mountpoint}} partition usage is above 80% (current: {{$value}}%)"
```
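The trickiest part of the configuration above is the relabel rule that joins `__address__` with the `prometheus.io/port` annotation using the regex `([^:]+)(?::\d+)?;(\d+)` and replacement `$1:$2`. As a sketch of what that rule actually does, here is the same transformation reproduced in Python (the addresses are fabricated examples):

```python
import re

# Same pattern as the relabel_configs rule: host, optional original port, ';', annotation port.
RELABEL_RE = re.compile(r'([^:]+)(?::\d+)?;(\d+)')

def rewrite_address(address, annotation_port):
    """Mimic the relabel rule: concatenate __address__ and the
    prometheus.io/port annotation with ';', then rewrite to host:port."""
    joined = f"{address};{annotation_port}"
    m = RELABEL_RE.match(joined)
    return f"{m.group(1)}:{m.group(2)}" if m else address

# Fabricated examples: the annotation port replaces any existing port.
print(rewrite_address("10.244.1.7:8080", "9100"))
print(rewrite_address("10.244.1.7", "9100"))
```

This is why annotating a service or pod with `prometheus.io/port` is enough to redirect the scrape to the right port, whether or not discovery supplied a port of its own.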

Generating the Alertmanager configuration file

```bash
$ cat > alertmanager-cm.yaml << EOF
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: prom
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 1m
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'lv916551516@163.com'
      smtp_auth_username: 'lv916551516@163.com'
      smtp_auth_password: 'QKRIUAMMLHGGYEGB'
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 10m
      receiver: default-receiver
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '916551516@qq.com'    # the mailbox alerts are sent to
        send_resolved: true       # also send a mail when an alert resolves
EOF
```

Preparing the deployment manifests

`prometheus-deploy.yaml` is the manifest for deploying Prometheus. For convenience I store Prometheus's data on a hostPath volume pinned to a named node; this is not appropriate for production, where NFS or Ceph would be a better choice.

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: prom
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      nodeName: node02             # change node02 to any worker node in your cluster (a master is not recommended)
      serviceAccountName: monitor  # the service account created earlier for cAdvisor access
      containers:
      - name: prometheus
        image: prom/prometheus:v2.20.1
        imagePullPolicy: IfNotPresent
        command:
        - prometheus
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.retention=30d
        - --web.enable-lifecycle
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus
          name: prometheus-config
        - mountPath: /prometheus/
          name: prometheus-storage-volume
        - name: localtime
          mountPath: /etc/localtime
      - name: alertmanager
        image: prom/alertmanager:v0.21.0
        imagePullPolicy: IfNotPresent
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--log.level=info"
        ports:
        - containerPort: 9093
          protocol: TCP
          name: alertmanager
        volumeMounts:
        - name: alertmanager-config
          mountPath: /etc/alertmanager
        - name: alertmanager-storage
          mountPath: /alertmanager
        - name: localtime
          mountPath: /etc/localtime
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: prometheus-storage-volume
        hostPath:
          path: /data
          type: Directory
      - name: alertmanager-config
        configMap:
          name: alertmanager
      - name: alertmanager-storage
        hostPath:
          path: /data/alertmanager
          type: DirectoryOrCreate
      - name: localtime
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: prom
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
  - port: 9090
    targetPort: 9090
    nodePort: 30090
    protocol: TCP
  selector:
    app: prometheus
    component: server
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheus
    kubernetes.io/cluster-service: 'true'
  name: alertmanager
  namespace: prom
spec:
  ports:
  - name: alertmanager
    nodePort: 30066
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    app: prometheus
  sessionAffinity: None
  type: NodePort
```

```bash
# On node02, create the directory the pod will mount and fix its ownership
$ mkdir /data
$ chown 65534.65534 /data    # prometheus runs as UID 65534 by default; without this the container cannot start
# Back on the master node, apply the manifests
$ kubectl apply -f prometheus-cfg.yaml
$ kubectl apply -f alertmanager-cm.yaml
$ kubectl apply -f prometheus-deploy.yaml
```

Accessing Prometheus

Browse to any cluster node's IP on port 30090.

[Figure 4]

Follow the prompt shown above; when every target reads "UP", the deployment is complete:

[Figure 5]
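Rather than eyeballing the targets page, you can also query Prometheus's HTTP API at `/api/v1/targets` and check the `health` field of each active target. A minimal sketch of that check, run here against a canned response (the URL, IPs and payload are illustrative, not from a real cluster):

```python
import json

def down_targets(payload):
    """Return the scrape URLs of any targets that are not 'up'
    in a /api/v1/targets response body."""
    data = json.loads(payload)
    return [t["scrapeUrl"]
            for t in data["data"]["activeTargets"]
            if t["health"] != "up"]

# A trimmed, fabricated response; in a real cluster you would fetch
# http://<node-ip>:30090/api/v1/targets with urllib or curl.
payload = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"scrapeUrl": "http://192.168.20.3:9100/metrics", "health": "up"},
        {"scrapeUrl": "http://192.168.20.4:9100/metrics", "health": "down"},
    ]}
})
print(down_targets(payload))
```

An empty list means everything the server is scraping is healthy, which is the same signal as an all-"UP" targets page.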

Accessing Alertmanager

Browse to any node's IP on port 30066; seeing the following page is enough:

[Figure 6]

Deploying Grafana

See this post: installing and configuring Grafana on Kubernetes.

Monitoring the kube-scheduler component

Both a dynamic-discovery and a static configuration are written down here; pick whichever suits you.

Endpoints-based dynamic discovery

Because the Prometheus configuration already includes endpoints-based service discovery, creating a Service for kube-scheduler is all it takes for Prometheus to pick it up.

```bash
$ cat > kube-scheduler-service.yaml << EOF
apiVersion: v1
kind: Service
metadata:
  annotations:    # these two annotations are required for Prometheus to discover this service
    prometheus.io/port: "32107"
    prometheus.io/scrape: "true"
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  clusterIP:
  ports:
  - name: metrics
    port: 10251
    protocol: TCP
    targetPort: 10251
    nodePort: 32107
  selector:
    component: kube-scheduler
    tier: control-plane
  type: NodePort
EOF
# create the service
$ kubectl apply -f kube-scheduler-service.yaml
```

After applying this, wait a couple of seconds and refresh the Prometheus targets page; you should see something like the screenshot below, with kube-scheduler added to the list of monitored targets.

[Figure 7]
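Conceptually, the `kubernetes-service-endpoints` job's `keep` relabel action does nothing more than filter discovered services on their `prometheus.io/scrape` annotation. A toy sketch of that filter (the service objects are fabricated):

```python
def scrapeable(services):
    """Keep services annotated prometheus.io/scrape=true, returning
    (name, port) pairs -- the same filtering the 'keep' relabel action does."""
    result = []
    for svc in services:
        ann = svc.get("annotations", {})
        if ann.get("prometheus.io/scrape") == "true":
            result.append((svc["name"], ann.get("prometheus.io/port")))
    return result

services = [
    {"name": "kube-scheduler",
     "annotations": {"prometheus.io/scrape": "true", "prometheus.io/port": "32107"}},
    {"name": "kube-dns", "annotations": {}},
]
print(scrapeable(services))
```

This is why creating the annotated Service above is the only step needed: everything else is the generic discovery job doing its filtering.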

Monitoring kube-scheduler with a static configuration

If you followed the dynamic-discovery steps above, delete the kube-scheduler Service first, or you will scrape duplicate data.

```bash
$ cat prometheus-cfg.yaml    # append the following to the config file
# replace the IP in targets with the IP of the node running kube-scheduler
............    # other configuration omitted
    - job_name: 'kubernetes-schedule'
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.20.2:10251']
# regenerate the Prometheus configuration
$ kubectl apply -f prometheus-cfg.yaml
# recreate the pod
$ kubectl delete -f prometheus-deploy.yaml
$ kubectl apply -f prometheus-deploy.yaml
```

Then check the Prometheus targets list; if you see the following, the configuration is correct.

[Figure 8]

Monitoring the kube-controller-manager component

Again, both the dynamic and the static configuration are shown.

Endpoints-based dynamic discovery

  1. # 编写service配置文件
  2. $ cat > kube-controller-manager.yaml << EOF
  3. apiVersion: v1
  4. kind: Service
  5. metadata:
  6. annotations: # 同样,以下两个注解是必不可少的
  7. prometheus.io/port: "32108"
  8. prometheus.io/scrape: "true"
  9. labels:
  10. component: kube-controller-manager
  11. tier: control-plane
  12. name: kube-controller
  13. namespace: kube-system
  14. spec:
  15. clusterIP:
  16. ports:
  17. - name: metrics
  18. port: 10252
  19. protocol: TCP
  20. targetPort: 10252
  21. nodePort: 32108
  22. selector:
  23. component: kube-controller-manager
  24. tier: control-plane
  25. type: NodePort
  26. EOF
  27. # The targetPort above is the port kube-controller-manager listens on
  28. # You can confirm it on the master with:
  29. $ ss -lnput | grep kube-controller
  30. tcp LISTEN 0 128 127.0.0.1:10257 *:* users:(("kube-controller",pid=7906,fd=6))
  31. tcp LISTEN 0 128 [::]:10252 [::]:* users:(("kube-controller",pid=7906,fd=5))
  32. # Create the service
  33. $ kubectl apply -f kube-controller-manager.yaml

After a few seconds, if Prometheus shows the following, kube-controller-manager is being monitored.

Deploying Prometheus to monitor a k8s cluster - Figure 9

Monitoring kube-controller-manager with a static configuration

If you followed the dynamic-discovery steps above, delete the kube-controller-manager service first; otherwise Prometheus will collect duplicate data.

  1. $ cat prometheus-cfg.yaml # Append the following to the config file
  2. # Replace the IP in targets with the IP of the node running your kube-controller-manager
  3. ............ # some configuration omitted
  4. - job_name: 'kubernetes-controller-manager'
  5. scrape_interval: 5s
  6. static_configs:
  7. - targets: ['192.168.20.2:10252']
  8. # Regenerate the Prometheus config
  9. $ kubectl apply -f prometheus-cfg.yaml
  10. # Recreate the Prometheus pod
  11. $ kubectl delete -f prometheus.yaml
  12. $ kubectl apply -f prometheus.yaml

The target list should now look like this:

Deploying Prometheus to monitor a k8s cluster - Figure 10

Monitoring the kube-proxy component

By default, kube-proxy serves metrics only on 127.0.0.1:10249. To change the listening address so that Prometheus can scrape it from outside the node, do the following:

  1. $ kubectl edit configmap kube-proxy -n kube-system
  2. # Change the following line
  3. metricsBindAddress: "0.0.0.0:10249"
  4. # Restart the kube-proxy pods
  5. $ kubectl get pods -n kube-system | grep kube-proxy |awk '{print $1}' | xargs kubectl delete pods -n kube-system
  6. # Confirm that port 10249 is now listening on this host
  7. $ ss -antulp |grep :10249
  8. tcp LISTEN 0 128 [::]:10249 [::]:* users:(("kube-proxy",pid=32899,fd=17))

Configuring Prometheus

  1. $ vim prometheus-cfg.yaml # Add the following configuration
  2. # The targets list holds the kube-proxy metrics address of every node in the cluster
  3. - job_name: 'kubernetes-kube-proxy'
  4. scrape_interval: 5s
  5. static_configs:
  6. - targets: ['192.168.20.2:10249','192.168.20.3:10249','192.168.20.4:10249']
  7. $ kubectl apply -f prometheus-cfg.yaml
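Since this job scrapes the same port on every node, the targets list is just each node IP joined with :10249. A one-line sketch, using this article's example node IPs (substitute your own):

```python
# Node IPs from this article's example cluster; replace with yours.
nodes = ["192.168.20.2", "192.168.20.3", "192.168.20.4"]
proxy_targets = [f"{ip}:10249" for ip in nodes]
print(proxy_targets)
```

When nodes are added or removed, regenerating the list this way is less error-prone than editing the YAML by hand.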

At this point, after restarting Prometheus yourself, kube-proxy shows up as monitored in the target list, as below:

Deploying Prometheus to monitor a k8s cluster - Figure 11

Monitoring the etcd component

  1. # On the k8s master, create a secret that stores the required etcd certificates in the secret object etcd-certs:
  2. $ kubectl -n prom create secret generic etcd-certs \
  3. --from-file=/etc/kubernetes/pki/etcd/server.key \
  4. --from-file=/etc/kubernetes/pki/etcd/server.crt \
  5. --from-file=/etc/kubernetes/pki/etcd/ca.crt
  6. # Verify the secret
  7. $ kubectl get secret -n prom|grep etcd-certs
  8. etcd-certs Opaque 3 60s
  9. # Edit the prometheus-deploy.yaml file
  10. # The point is to mount the secret into the Prometheus container via a volume, as follows:
  11. $ vim prometheus-deploy.yaml
  12. # Add the following under volumeMounts:
  13. - name: k8s-certs
  14. mountPath: /var/run/secrets/kubernetes.io/k8s-certs/etcd/
  15. # Add the following under volumes:
  16. - name: k8s-certs
  17. secret:
  18. secretName: etcd-certs
  19. # Edit prometheus-cfg.yaml and add the following job:
  20. $ vim prometheus-cfg.yaml
  21. ........... # some content omitted
  22. - job_name: 'kubernetes-etcd'
  23. scheme: https
  24. tls_config:
  25. ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.crt
  26. cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.crt
  27. key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.key
  28. scrape_interval: 5s
  29. static_configs:
  30. - targets: ['192.168.20.2:2379']
  31. # Restart Prometheus
  32. $ kubectl delete -f prometheus-deploy.yaml
  33. $ kubectl apply -f prometheus-deploy.yaml
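The three certificate paths in tls_config must match the mount path chosen in prometheus-deploy.yaml, so it helps to derive them from one place. A tiny sketch (hypothetical helper, not part of Prometheus) that makes the coupling explicit:

```python
import os

# Mount path chosen in prometheus-deploy.yaml above.
CERT_DIR = "/var/run/secrets/kubernetes.io/k8s-certs/etcd"

def etcd_tls_config(cert_dir=CERT_DIR):
    """Build the tls_config mapping for the kubernetes-etcd job (sketch).

    The file names match the keys created by `kubectl create secret generic
    etcd-certs --from-file=...` earlier in this section.
    """
    return {
        "ca_file": os.path.join(cert_dir, "ca.crt"),
        "cert_file": os.path.join(cert_dir, "server.crt"),
        "key_file": os.path.join(cert_dir, "server.key"),
    }

print(etcd_tls_config())
```

If the mount path and the tls_config paths disagree, Prometheus starts fine but the etcd target fails with a certificate-file-not-found error.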

Note that you may hit the following error:

  1. # Startup error:
  2. level=info ts=2020-09-21T15:39:18.938Z caller=main.go:755 msg="Notifier manager stopped"
  3. level=error ts=2020-09-21T15:39:18.938Z caller=main.go:764 err="opening storage failed: lock DB directory: resource temporarily unavailable"
  4. # Fix: delete the lock file
  5. $ rm -f /opt/prometheus/data/lock

Finally, check the target list; the etcd job appearing there means it worked, as below:

Deploying Prometheus to monitor a k8s cluster - Figure 12

Deploying kube-state-metrics

kube-state-metrics listens to the API Server and generates state metrics for resource objects such as Deployments, Nodes, and Pods. Note that it only exposes these metrics; it does not store them, so we have Prometheus scrape and store the data. Its focus is business-level metadata: how many replicas a Deployment scheduled and how many are available, how many Pods are running/stopped/terminated, how many times a Pod has restarted, how many Jobs are in flight, and so on.

Reference: official documentation

  1. # Create the deployment manifest
  2. $ cat > kube-state-metrics-deploy.yaml <<EOF
  3. apiVersion: apps/v1
  4. kind: Deployment
  5. metadata:
  6. name: kube-state-metrics
  7. namespace: kube-system
  8. spec:
  9. replicas: 1
  10. selector:
  11. matchLabels:
  12. app: kube-state-metrics
  13. template:
  14. metadata:
  15. labels:
  16. app: kube-state-metrics
  17. spec:
  18. serviceAccountName: kube-state-metrics
  19. containers:
  20. - name: kube-state-metrics
  21. # image: gcr.io/google_containers/kube-state-metrics-amd64:v1.3.1
  22. image: quay.io/coreos/kube-state-metrics:v1.9.0
  23. ports:
  24. - containerPort: 8080
  25. EOF
  26. # Create the RBAC objects
  27. $ cat > kube-state-metrics-rbac.yaml <<EOF
  28. ---
  29. apiVersion: v1
  30. kind: ServiceAccount
  31. metadata:
  32. name: kube-state-metrics
  33. namespace: kube-system
  34. ---
  35. apiVersion: rbac.authorization.k8s.io/v1
  36. kind: ClusterRole
  37. metadata:
  38. name: kube-state-metrics
  39. rules:
  40. - apiGroups: [""]
  41. resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  42. verbs: ["list", "watch"]
  43. - apiGroups: ["extensions"]
  44. resources: ["daemonsets", "deployments", "replicasets"]
  45. verbs: ["list", "watch"]
  46. - apiGroups: ["apps"]
  47. resources: ["statefulsets"]
  48. verbs: ["list", "watch"]
  49. - apiGroups: ["batch"]
  50. resources: ["cronjobs", "jobs"]
  51. verbs: ["list", "watch"]
  52. - apiGroups: ["autoscaling"]
  53. resources: ["horizontalpodautoscalers"]
  54. verbs: ["list", "watch"]
  55. ---
  56. apiVersion: rbac.authorization.k8s.io/v1
  57. kind: ClusterRoleBinding
  58. metadata:
  59. name: kube-state-metrics
  60. roleRef:
  61. apiGroup: rbac.authorization.k8s.io
  62. kind: ClusterRole
  63. name: kube-state-metrics
  64. subjects:
  65. - kind: ServiceAccount
  66. name: kube-state-metrics
  67. namespace: kube-system
  68. EOF
  69. # Create the service
  70. $ cat > kube-state-metrics-svc.yaml <<EOF
  71. apiVersion: v1
  72. kind: Service
  73. metadata:
  74. annotations:
  75. prometheus.io/scrape: 'true'
  76. name: kube-state-metrics
  77. namespace: kube-system
  78. labels:
  79. app: kube-state-metrics
  80. spec:
  81. ports:
  82. - name: kube-state-metrics
  83. port: 8080
  84. protocol: TCP
  85. selector:
  86. app: kube-state-metrics
  87. EOF
  88. # Create the resources
  89. $ kubectl apply -f kube-state-metrics-rbac.yaml
  90. $ kubectl apply -f kube-state-metrics-deploy.yaml
  91. $ kubectl apply -f kube-state-metrics-svc.yaml

Seeing the following in the Prometheus target list means data collection is working:

Deploying Prometheus to monitor a k8s cluster - Figure 13

All metrics prefixed with kube_ are collected by this container, for example:

Deploying Prometheus to monitor a k8s cluster - Figure 14
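To see only what kube-state-metrics contributes, you can filter its exposition-format output for the kube_ prefix. A short sketch with made-up sample lines (the metric names are real kube-state-metrics names; the label values are invented):

```python
# Hypothetical sample of scraped exposition text.
sample = """\
# HELP kube_pod_status_phase The pods current phase.
# TYPE kube_pod_status_phase gauge
kube_pod_status_phase{namespace="prom",pod="prometheus-0",phase="Running"} 1
kube_deployment_status_replicas{namespace="kube-system",deployment="coredns"} 2
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

def kube_metrics(text):
    """Return only the kube_* samples from Prometheus exposition text (sketch)."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.startswith("kube_"):
            continue  # skip comments and non-kube_ series
        name_labels, value = line.rsplit(" ", 1)
        out[name_labels] = float(value)
    return out

print(len(kube_metrics(sample)))  # 2
```

The node_cpu_seconds_total line is ignored: it comes from node-exporter, not from kube-state-metrics.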

Pod-based k8s service discovery

When we deployed Prometheus earlier, we defined the following fields:

Deploying Prometheus to monitor a k8s cluster - Figure 15

That configuration enables automatic pod discovery. Now edit the Prometheus pod itself so that Prometheus can discover itself:

  1. $ kubectl edit pod prometheus-server-fc59797f6-296ld -n prom
  2. apiVersion: v1
  3. kind: Pod
  4. metadata:
  5. annotations:
  6. cni.projectcalico.org/podIP: 10.100.140.88/32
  7. cni.projectcalico.org/podIPs: 10.100.140.88/32
  8. prometheus.io/scrape: "true" # change this from false to true
  9. ........ # 省略部分内容
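Pod discovery works the same way as the service discovery shown earlier: only pods whose annotation prometheus.io/scrape is the string "true" are kept. A tiny sketch of the effect of the edit above (hypothetical data, not the actual discovery code):

```python
def pod_kept(pod):
    """Pod-level discovery keeps pods annotated prometheus.io/scrape=true (sketch)."""
    return pod.get("annotations", {}).get("prometheus.io/scrape") == "true"

pod = {"name": "prometheus-server-fc59797f6-296ld",
       "annotations": {"prometheus.io/scrape": "false"}}
print(pod_kept(pod))  # False: not scraped before the edit

pod["annotations"]["prometheus.io/scrape"] = "true"
print(pod_kept(pod))  # True: flipping the annotation is all that is needed
```

This also explains why the change takes effect without restarting Prometheus: discovery re-evaluates pod metadata continuously.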

Then check Prometheus's target list; seeing the following means the pod was discovered:

Deploying Prometheus to monitor a k8s cluster - Figure 16