k8s-centos8u2 Cluster - Kubernetes Cluster Monitoring and Alerting


Monitoring a Kubernetes Cluster with Prometheus and Grafana

1 Prometheus Concepts

Because of the dynamic, short-lived nature of Docker containers, traditional Zabbix cannot effectively monitor container state inside a k8s cluster, so Prometheus is used instead.
Prometheus official site: https://prometheus.io/

1.1 Features of Prometheus

  • A multi-dimensional data model, backed by a time-series database (TSDB) rather than MySQL.
  • A flexible query language, PromQL.
  • No reliance on distributed storage; each server node is autonomous.
  • Time-series data is collected mainly through an HTTP-based pull model.
  • Data actively pushed to a gateway can also be collected via pushgateway.
  • Targets are found through service discovery or static configuration.
  • A wide range of charting and dashboard options, e.g. Grafana.

1.2 How It Works

1.2.1 Overview

  • Prometheus works by periodically scraping the state of monitored components over HTTP interfaces exposed by various exporters; any component that exposes a suitable HTTP endpoint can be brought under monitoring.
  • No SDK or other integration work is required, which makes it a very good fit for monitoring virtualized environments such as VMs, Docker, and Kubernetes.
  • Most components commonly used in internet companies already have ready-made exporters, e.g. Nginx, MySQL, and Linux system metrics.
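As a quick illustration of the pull model: an exporter simply serves plain-text metrics over HTTP, and Prometheus scrapes that endpoint on a schedule. A sketch, assuming a node-exporter (deployed later in this article) is listening on port 9100 of the local host:

  # The same URL Prometheus scrapes can be fetched by hand.
  curl -s http://localhost:9100/metrics | head -n 5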

1.2.2 Architecture Diagram

image.png

1.2.3 The Three Core Components

  • Server: responsible for data collection and storage, and provides the PromQL query language.
  • Alertmanager: the alert manager, responsible for sending alerts.
  • Push Gateway: an intermediate gateway that short-lived jobs can push their metrics to.

1.2.4 How the Pieces Work Together

  1. The Prometheus daemon periodically scrapes metrics from its targets.
    Each target must expose an HTTP endpoint for Prometheus to scrape on schedule.
    Targets can be specified via configuration files, text files, Zookeeper, DNS SRV lookup, and other mechanisms.
  2. PushGateway lets clients push metrics to it actively,
    while Prometheus simply scrapes the gateway on its regular schedule.
    This suits one-off, short-lived jobs.
  3. Prometheus stores all scraped data in its TSDB,
    cleans and aggregates it according to configured rules, and writes the results into new time series.
  4. Prometheus exposes the collected data through PromQL and other APIs for visualization.
    Charting is supported through Grafana, Promdash, and similar tools.
    Prometheus also offers an HTTP query API for custom output (see the example after this list).
  5. Alertmanager is an alerting component independent of Prometheus;
    it works with Prometheus query expressions and provides very flexible alerting options.
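A minimal sketch of that HTTP query API, assuming the prometheus.op.com ingress set up later in this article (any reachable Prometheus address works the same way):

  # Instant query: the 'up' series shows which scrape targets are currently reachable.
  curl -s 'http://prometheus.op.com/api/v1/query?query=up'
  # Range queries over a time window use /api/v1/query_range with start, end and step parameters.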

1.2.5 Commonly Used Exporters

Unlike Zabbix, Prometheus has no agent; it relies on exporters tailored to each service. For monitoring a k8s cluster together with its nodes and pods, four exporters are commonly used:

  • kube-state-metrics
    collects basic state information about the k8s cluster, master, etcd, and other core objects
  • node-exporter
    collects information about the k8s cluster nodes
  • cadvisor
    collects resource usage inside the Docker containers of the cluster
  • blackbox-exporter
    checks whether containerized services in the cluster are alive

1.3 Software to Be Deployed

| Image | Official address | GitHub address | Deployment |
| :-- | :-- | :-- | :-- |
| quay.io/coreos/kube-state-metrics:v1.5.0 | https://quay.io/repository/coreos/kube-state-metrics?tab=info | https://github.com/kubernetes/kube-state-metrics | Deployment |
| prom/node-exporter:v0.15.0 | https://hub.docker.com/r/prom/node-exporter | https://github.com/prometheus/node_exporter | DaemonSet |
| google/cadvisor:v0.28.3 | https://hub.docker.com/r/google/cadvisor | https://github.com/google/cadvisor | DaemonSet |
| prom/blackbox-exporter:v0.15.1 | https://hub.docker.com/r/prom/blackbox-exporter | https://github.com/prometheus/blackbox_exporter | Deployment |
| prom/prometheus:v2.14.0 | https://hub.docker.com/r/prom/prometheus | https://github.com/prometheus/prometheus | nodeName: vms21.cos.com |
| grafana/grafana:5.4.2 | https://grafana.com/ https://hub.docker.com/r/grafana/grafana | https://github.com/grafana/grafana | nodeName: vms22.cos.com |
| docker.io/prom/alertmanager:v0.14.0 | https://hub.docker.com/r/prom/alertmanager | https://github.com/prometheus/alertmanager | Deployment |

If problems arise with newer versions, fall back to the older versions listed in the table above.

2 Deploying kube-state-metrics

On the operations host vms200:

Prepare the kube-state-metrics image

Official quay.io address: https://quay.io/repository/coreos/kube-state-metrics?tab=info GitHub address: https://github.com/kubernetes/kube-state-metrics

  1. [root@vms200 ~]# docker pull quay.io/coreos/kube-state-metrics:v1.5.0
  2. v1.5.0: Pulling from coreos/kube-state-metrics
  3. cd784148e348: Pull complete
  4. f622528a393e: Pull complete
  5. Digest: sha256:b7a3143bd1eb7130759c9259073b9f239d0eeda09f5210f1cd31f1a530599ea1
  6. Status: Downloaded newer image for quay.io/coreos/kube-state-metrics:v1.5.0
  7. quay.io/coreos/kube-state-metrics:v1.5.0
  8. [root@vms200 ~]# docker pull quay.io/coreos/kube-state-metrics:v1.9.7
  9. ...
  10. [root@vms200 ~]# docker images | grep kube-state-metrics
  11. quay.io/coreos/kube-state-metrics v1.9.7 6497f02dbdad 3 months ago 32.8MB
  12. quay.io/coreos/kube-state-metrics v1.5.0 91599517197a 20 months ago 31.8MB
  13. [root@vms200 ~]# docker tag 6497f02dbdad harbor.op.com/public/kube-state-metrics:v1.9.7
  14. [root@vms200 ~]# docker tag quay.io/coreos/kube-state-metrics:v1.5.0 harbor.op.com/public/kube-state-metrics:v1.5.0
  15. [root@vms200 ~]# docker push harbor.op.com/public/kube-state-metrics:v1.9.7
  16. The push refers to repository [harbor.op.com/public/kube-state-metrics]
  17. d1ce60962f06: Pushed
  18. 0d1435bd79e4: Mounted from public/metrics-server
  19. v1.9.7: digest: sha256:2f82f0da199c60a7699c43c63a295c44e673242de0b7ee1b17c2d5a23bec34cb size: 738
  20. [root@vms200 ~]# docker push harbor.op.com/public/kube-state-metrics:v1.5.0
  21. The push refers to repository [harbor.op.com/public/kube-state-metrics]
  22. 5b3c36501a0a: Pushed
  23. 7bff100f35cb: Pushed
  24. v1.5.0: digest: sha256:16e9a1d63e80c19859fc1e2727ab7819f89aeae5f8ab5c3380860c2f88fe0a58 size: 739

Prepare the resource manifests

Download the YAML files using the raw-format URLs.

Version v1.9.7

v1.9.7:https://github.com/kubernetes/kube-state-metrics/tree/v1.9.7/examples/standard

image.png

  1. [root@vms200 ~]# cd /data/k8s-yaml/
  2. [root@vms200 k8s-yaml]# mkdir kube-state-metrics
  3. [root@vms200 k8s-yaml]# cd kube-state-metrics
  4. [root@vms200 kube-state-metrics]# mkdir v1.9.7
  5. [root@vms200 v1.9.7]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/v1.9.7/examples/standard/service-account.yaml
  6. [root@vms200 v1.9.7]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/v1.9.7/examples/standard/service.yaml
  7. [root@vms200 v1.9.7]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/v1.9.7/examples/standard/deployment.yaml
  8. [root@vms200 v1.9.7]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/v1.9.7/examples/standard/cluster-role.yaml
  9. [root@vms200 v1.9.7]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/v1.9.7/examples/standard/cluster-role-binding.yaml

v1.9.7 manifest files, in the /data/k8s-yaml/kube-state-metrics directory:

  • rbac-v1.9.7.yaml (merged from service-account.yaml, cluster-role.yaml, and cluster-role-binding.yaml)
  1. apiVersion: v1
  2. kind: ServiceAccount
  3. metadata:
  4. labels:
  5. app.kubernetes.io/name: kube-state-metrics
  6. app.kubernetes.io/version: v1.9.7
  7. name: kube-state-metrics
  8. namespace: kube-system
  9. ---
  10. apiVersion: rbac.authorization.k8s.io/v1
  11. kind: ClusterRole
  12. metadata:
  13. labels:
  14. app.kubernetes.io/name: kube-state-metrics
  15. app.kubernetes.io/version: v1.9.7
  16. name: kube-state-metrics
  17. rules:
  18. - apiGroups:
  19. - ""
  20. resources:
  21. - configmaps
  22. - secrets
  23. - nodes
  24. - pods
  25. - services
  26. - resourcequotas
  27. - replicationcontrollers
  28. - limitranges
  29. - persistentvolumeclaims
  30. - persistentvolumes
  31. - namespaces
  32. - endpoints
  33. verbs:
  34. - list
  35. - watch
  36. - apiGroups:
  37. - extensions
  38. resources:
  39. - daemonsets
  40. - deployments
  41. - replicasets
  42. - ingresses
  43. verbs:
  44. - list
  45. - watch
  46. - apiGroups:
  47. - apps
  48. resources:
  49. - statefulsets
  50. - daemonsets
  51. - deployments
  52. - replicasets
  53. verbs:
  54. - list
  55. - watch
  56. - apiGroups:
  57. - batch
  58. resources:
  59. - cronjobs
  60. - jobs
  61. verbs:
  62. - list
  63. - watch
  64. - apiGroups:
  65. - autoscaling
  66. resources:
  67. - horizontalpodautoscalers
  68. verbs:
  69. - list
  70. - watch
  71. - apiGroups:
  72. - authentication.k8s.io
  73. resources:
  74. - tokenreviews
  75. verbs:
  76. - create
  77. - apiGroups:
  78. - authorization.k8s.io
  79. resources:
  80. - subjectaccessreviews
  81. verbs:
  82. - create
  83. - apiGroups:
  84. - policy
  85. resources:
  86. - poddisruptionbudgets
  87. verbs:
  88. - list
  89. - watch
  90. - apiGroups:
  91. - certificates.k8s.io
  92. resources:
  93. - certificatesigningrequests
  94. verbs:
  95. - list
  96. - watch
  97. - apiGroups:
  98. - storage.k8s.io
  99. resources:
  100. - storageclasses
  101. - volumeattachments
  102. verbs:
  103. - list
  104. - watch
  105. - apiGroups:
  106. - admissionregistration.k8s.io
  107. resources:
  108. - mutatingwebhookconfigurations
  109. - validatingwebhookconfigurations
  110. verbs:
  111. - list
  112. - watch
  113. - apiGroups:
  114. - networking.k8s.io
  115. resources:
  116. - networkpolicies
  117. verbs:
  118. - list
  119. - watch
  120. ---
  121. apiVersion: rbac.authorization.k8s.io/v1
  122. kind: ClusterRoleBinding
  123. metadata:
  124. labels:
  125. app.kubernetes.io/name: kube-state-metrics
  126. app.kubernetes.io/version: v1.9.7
  127. name: kube-state-metrics
  128. roleRef:
  129. apiGroup: rbac.authorization.k8s.io
  130. kind: ClusterRole
  131. name: kube-state-metrics
  132. subjects:
  133. - kind: ServiceAccount
  134. name: kube-state-metrics
  135. namespace: kube-system
  • deployment-v1.9.7.yaml (modified from deployment.yaml)
  1. apiVersion: apps/v1
  2. kind: Deployment
  3. metadata:
  4. labels:
  5. grafanak8sapp: "true"
  6. app.kubernetes.io/name: kube-state-metrics
  7. app.kubernetes.io/version: v1.9.7
  8. name: kube-state-metrics
  9. namespace: kube-system
  10. spec:
  11. replicas: 1
  12. selector:
  13. matchLabels:
  14. grafanak8sapp: "true"
  15. app.kubernetes.io/name: kube-state-metrics
  16. template:
  17. metadata:
  18. labels:
  19. grafanak8sapp: "true"
  20. app.kubernetes.io/name: kube-state-metrics
  21. app.kubernetes.io/version: v1.9.7
  22. spec:
  23. containers:
  24. - image: harbor.op.com/public/kube-state-metrics:v1.9.7
  25. imagePullPolicy: IfNotPresent
  26. livenessProbe:
  27. httpGet:
  28. path: /healthz
  29. port: 8080
  30. initialDelaySeconds: 5
  31. timeoutSeconds: 5
  32. name: kube-state-metrics
  33. ports:
  34. - containerPort: 8080
  35. name: http-metrics
  36. - containerPort: 8081
  37. name: telemetry
  38. readinessProbe:
  39. httpGet:
  40. path: /
  41. port: 8081
  42. initialDelaySeconds: 5
  43. timeoutSeconds: 5
  44. imagePullSecrets:
  45. - name: harbor
  46. nodeSelector:
  47. kubernetes.io/os: linux
  48. serviceAccountName: kube-state-metrics

Notes: the image is changed to the Harbor copy, and imagePullPolicy and imagePullSecrets are added.

Version v1.5.0

v1.5.0:https://github.com/kubernetes/kube-state-metrics/tree/release-1.5/kubernetes

image.png

  1. [root@vms200 kube-state-metrics]# mkdir v1.5.0
  2. [root@vms200 kube-state-metrics]# cd v1.5.0/
  3. [root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-cluster-role-binding.yaml
  4. [root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-cluster-role.yaml
  5. [root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-deployment.yaml
  6. [root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-role-binding.yaml
  7. [root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-role.yaml
  8. [root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-service-account.yaml
  9. [root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-service.yaml

v1.5.0 manifest files, in the /data/k8s-yaml/kube-state-metrics directory:

  • rbac.yaml
  1. [root@vms200 ~]# cd /data/k8s-yaml/
  2. [root@vms200 k8s-yaml]# mkdir kube-state-metrics
  3. [root@vms200 k8s-yaml]# cd kube-state-metrics
  4. [root@vms200 kube-state-metrics]# vi rbac.yaml
  1. apiVersion: v1
  2. kind: ServiceAccount
  3. metadata:
  4. labels:
  5. addonmanager.kubernetes.io/mode: Reconcile
  6. kubernetes.io/cluster-service: "true"
  7. name: kube-state-metrics
  8. namespace: kube-system
  9. ---
  10. apiVersion: rbac.authorization.k8s.io/v1
  11. kind: ClusterRole
  12. metadata:
  13. labels:
  14. addonmanager.kubernetes.io/mode: Reconcile
  15. kubernetes.io/cluster-service: "true"
  16. name: kube-state-metrics
  17. rules:
  18. - apiGroups:
  19. - ""
  20. resources:
  21. - configmaps
  22. - secrets
  23. - nodes
  24. - pods
  25. - services
  26. - resourcequotas
  27. - replicationcontrollers
  28. - limitranges
  29. - persistentvolumeclaims
  30. - persistentvolumes
  31. - namespaces
  32. - endpoints
  33. verbs:
  34. - list
  35. - watch
  36. - apiGroups:
  37. - extensions
  38. resources:
  39. - daemonsets
  40. - deployments
  41. - replicasets
  42. verbs:
  43. - list
  44. - watch
  45. - apiGroups:
  46. - apps
  47. resources:
  48. - statefulsets
  49. verbs:
  50. - list
  51. - watch
  52. - apiGroups:
  53. - batch
  54. resources:
  55. - cronjobs
  56. - jobs
  57. verbs:
  58. - list
  59. - watch
  60. - apiGroups:
  61. - autoscaling
  62. resources:
  63. - horizontalpodautoscalers
  64. verbs:
  65. - list
  66. - watch
  67. ---
  68. apiVersion: rbac.authorization.k8s.io/v1
  69. kind: ClusterRoleBinding
  70. metadata:
  71. labels:
  72. addonmanager.kubernetes.io/mode: Reconcile
  73. kubernetes.io/cluster-service: "true"
  74. name: kube-state-metrics
  75. roleRef:
  76. apiGroup: rbac.authorization.k8s.io
  77. kind: ClusterRole
  78. name: kube-state-metrics
  79. subjects:
  80. - kind: ServiceAccount
  81. name: kube-state-metrics
  82. namespace: kube-system
  • deployment.yaml
  1. apiVersion: apps/v1
  2. kind: Deployment
  3. metadata:
  4. annotations:
  5. deployment.kubernetes.io/revision: "2"
  6. labels:
  7. grafanak8sapp: "true"
  8. app: kube-state-metrics
  9. name: kube-state-metrics
  10. namespace: kube-system
  11. spec:
  12. selector:
  13. matchLabels:
  14. grafanak8sapp: "true"
  15. app: kube-state-metrics
  16. strategy:
  17. rollingUpdate:
  18. maxSurge: 25%
  19. maxUnavailable: 25%
  20. type: RollingUpdate
  21. template:
  22. metadata:
  23. creationTimestamp: null
  24. labels:
  25. grafanak8sapp: "true"
  26. app: kube-state-metrics
  27. spec:
  28. containers:
  29. - image: harbor.op.com/public/kube-state-metrics:v1.5.0
  30. name: kube-state-metrics
  31. ports:
  32. - containerPort: 8080
  33. name: http-metrics
  34. protocol: TCP
  35. readinessProbe:
  36. failureThreshold: 3
  37. httpGet:
  38. path: /healthz
  39. port: 8080
  40. scheme: HTTP
  41. initialDelaySeconds: 5
  42. periodSeconds: 10
  43. successThreshold: 1
  44. timeoutSeconds: 5
  45. imagePullPolicy: IfNotPresent
  46. imagePullSecrets:
  47. - name: harbor
  48. restartPolicy: Always
  49. serviceAccount: kube-state-metrics
  50. serviceAccountName: kube-state-metrics

Apply the resource manifests

On any compute node:

v1.5.0

  1. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/kube-state-metrics/rbac.yaml
  2. serviceaccount/kube-state-metrics created
  3. clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
  4. clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
  5. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/kube-state-metrics/deployment.yaml
  6. deployment.apps/kube-state-metrics created

v1.9.7

  1. [root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/kube-state-metrics/rbac-v1.9.7.yaml
  2. serviceaccount/kube-state-metrics created
  3. clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
  4. clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
  5. [root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/kube-state-metrics/deployment-v1.9.7.yaml
  6. deployment.apps/kube-state-metrics created

Check that it started

v1.5.0

  1. [root@vms22 ~]# kubectl get pods -n kube-system -o wide |grep kube-state-metrics
  2. kube-state-metrics-5ff77848c6-9grj9 1/1 Running 0 101s 172.26.22.3 vms22.cos.com <none> <none>
  3. [root@vms22 ~]# curl http://172.26.22.3:8080/healthz
  4. ok

v1.9.7

  1. [root@vms21 ~]# kubectl get pods -n kube-system -o wide |grep kube-state-metrics
  2. kube-state-metrics-5776ff76f-4f6dk 1/1 Running 0 20s 172.26.22.3 vms22.cos.com <none> <none>
  3. [root@vms21 ~]# curl http://172.26.22.3:8080/healthz
  4. OK[root@vms21 ~]#
  5. [root@vms21 ~]# curl http://172.26.22.3:8081
  6. <html>
  7. <head><title>Kube-State-Metrics Metrics Server</title></head>
  8. <body>
  9. <h1>Kube-State-Metrics Metrics</h1>
  10. <ul>
  11. <li><a href='/metrics'>metrics</a></li>
  12. </ul>
  13. </body>
  14. </html>
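Beyond the health endpoint, the metrics themselves can be spot-checked. A sketch, assuming the pod IP 172.26.22.3 shown above:

  # kube-state-metrics serves its metrics on port 8080; filter one well-known series.
  curl -s http://172.26.22.3:8080/metrics | grep '^kube_pod_status_phase' | head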

3 Deploying node-exporter

On the operations host vms200:

Prepare the node-exporter image

Official Docker Hub address: https://hub.docker.com/r/prom/node-exporter GitHub address: https://github.com/prometheus/node_exporter

  1. [root@vms200 ~]# docker pull prom/node-exporter:v1.0.1
  2. v1.0.1: Pulling from prom/node-exporter
  3. 86fa074c6765: Pull complete
  4. ed1cd1c6cd7a: Pull complete
  5. ff1bb132ce7b: Pull complete
  6. Digest: sha256:cf66a6bbd573fd819ea09c72e21b528e9252d58d01ae13564a29749de1e48e0f
  7. Status: Downloaded newer image for prom/node-exporter:v1.0.1
  8. docker.io/prom/node-exporter:v1.0.1
  9. [root@vms200 ~]# docker tag docker.io/prom/node-exporter:v1.0.1 harbor.op.com/public/node-exporter:v1.0.1
  10. [root@vms200 ~]# docker push harbor.op.com/public/node-exporter:v1.0.1

Prepare the resource manifest

  1. [root@vms200 ~]# mkdir /data/k8s-yaml/node-exporter && cd /data/k8s-yaml/node-exporter
  • Since node-exporter monitors the nodes themselves and one instance is needed per node, a DaemonSet controller is used.
  • Its main job is to mount the host's /proc and /sys directories into the container so the container can read host-level node information.

/data/k8s-yaml/node-exporter/node-exporter-ds.yaml

  1. kind: DaemonSet
  2. apiVersion: apps/v1
  3. metadata:
  4. name: node-exporter
  5. namespace: kube-system
  6. labels:
  7. daemon: "node-exporter"
  8. grafanak8sapp: "true"
  9. spec:
  10. selector:
  11. matchLabels:
  12. daemon: "node-exporter"
  13. grafanak8sapp: "true"
  14. template:
  15. metadata:
  16. name: node-exporter
  17. labels:
  18. daemon: "node-exporter"
  19. grafanak8sapp: "true"
  20. spec:
  21. containers:
  22. - name: node-exporter
  23. image: harbor.op.com/public/node-exporter:v1.0.1
  24. imagePullPolicy: IfNotPresent
  25. args:
  26. - --path.procfs=/host_proc
  27. - --path.sysfs=/host_sys
  28. ports:
  29. - name: node-exporter
  30. hostPort: 9100
  31. containerPort: 9100
  32. protocol: TCP
  33. volumeMounts:
  34. - name: sys
  35. readOnly: true
  36. mountPath: /host_sys
  37. - name: proc
  38. readOnly: true
  39. mountPath: /host_proc
  40. imagePullSecrets:
  41. - name: harbor
  42. restartPolicy: Always
  43. hostNetwork: true
  44. volumes:
  45. - name: proc
  46. hostPath:
  47. path: /proc
  48. type: ""
  49. - name: sys
  50. hostPath:
  51. path: /sys
  52. type: ""

Apply the resource manifest

On any compute node:

  1. [root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/node-exporter/node-exporter-ds.yaml
  2. daemonset.apps/node-exporter created
  • Check
  1. [root@vms21 ~]# netstat -luntp | grep 9100
  2. tcp6 0 0 :::9100 :::* LISTEN 3711/node_exporter
  3. [root@vms21 ~]# kubectl get pod -n kube-system -o wide|grep node-exporter
  4. node-exporter-vrpfn 1/1 Running 0 2m8s 192.168.26.21 vms21.cos.com <none> <none>
  5. node-exporter-xw9k6 1/1 Running 0 2m8s 192.168.26.22 vms22.cos.com <none> <none>
  1. [root@vms21 ~]# curl -s http://192.168.26.21:9100/metrics | more
  2. # HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
  3. # TYPE go_gc_duration_seconds summary
  4. go_gc_duration_seconds{quantile="0"} 0
  5. go_gc_duration_seconds{quantile="0.25"} 0
  6. go_gc_duration_seconds{quantile="0.5"} 0
  7. go_gc_duration_seconds{quantile="0.75"} 0
  8. go_gc_duration_seconds{quantile="1"} 0
  9. go_gc_duration_seconds_sum 0
  10. go_gc_duration_seconds_count 0
  11. ...
  12. [root@vms21 ~]# curl -s http://192.168.26.22:9100/metrics | more
  13. # HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
  14. # TYPE go_gc_duration_seconds summary
  15. go_gc_duration_seconds{quantile="0"} 0
  16. go_gc_duration_seconds{quantile="0.25"} 0
  17. go_gc_duration_seconds{quantile="0.5"} 0
  18. go_gc_duration_seconds{quantile="0.75"} 0
  19. go_gc_duration_seconds{quantile="1"} 0
  20. go_gc_duration_seconds_sum 0
  21. go_gc_duration_seconds_count 0
  22. # HELP go_goroutines Number of goroutines that currently exist.
  23. ...
  • View in the dashboard

image.png
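The raw node metrics can likewise be filtered for a specific series — a sketch using the node IP from the check above:

  # Per-CPU, per-mode counters exposed by node-exporter v1.x.
  curl -s http://192.168.26.21:9100/metrics | grep '^node_cpu_seconds_total' | head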

4 Deploying cadvisor

On the operations host vms200:

Prepare the cadvisor image

Official Docker Hub address: https://hub.docker.com/r/google/cadvisor
Official GitHub address: https://github.com/google/cadvisor

  1. [root@vms200 ~]# docker pull google/cadvisor:v0.33.0
  2. v0.33.0: Pulling from google/cadvisor
  3. 169185f82c45: Pull complete
  4. bd29476a29dd: Pull complete
  5. a2eb18ca776e: Pull complete
  6. Digest: sha256:47f1f8c02a3acfab77e74e2ec7acc0d475adc180ddff428503a4ce63f3d6061b
  7. Status: Downloaded newer image for google/cadvisor:v0.33.0
  8. docker.io/google/cadvisor:v0.33.0
  9. [root@vms200 ~]# docker pull google/cadvisor
  10. Using default tag: latest
  11. latest: Pulling from google/cadvisor
  12. ff3a5c916c92: Pull complete
  13. 44a45bb65cdf: Pull complete
  14. 0bbe1a2fe2a6: Pull complete
  15. Digest: sha256:815386ebbe9a3490f38785ab11bda34ec8dacf4634af77b8912832d4f85dca04
  16. Status: Downloaded newer image for google/cadvisor:latest
  17. docker.io/google/cadvisor:latest
  18. [root@vms200 ~]# docker images | grep cadvisor
  19. google/cadvisor v0.33.0 752d61707eac 18 months ago 68.6MB
  20. google/cadvisor latest eb1210707573 22 months ago 69.6MB
  21. [root@vms200 ~]# docker tag 752d61707eac harbor.op.com/public/cadvisor:v0.33.0
  22. [root@vms200 ~]# docker tag eb1210707573 harbor.op.com/public/cadvisor:v200912
  23. [root@vms200 ~]# docker images | grep cadvisor
  24. google/cadvisor v0.33.0 752d61707eac 18 months ago 68.6MB
  25. harbor.op.com/public/cadvisor v0.33.0 752d61707eac 18 months ago 68.6MB
  26. google/cadvisor latest eb1210707573 22 months ago 69.6MB
  27. harbor.op.com/public/cadvisor v200912 eb1210707573 22 months ago 69.6MB
  28. [root@vms200 ~]# docker push harbor.op.com/public/cadvisor:v0.33.0
  29. The push refers to repository [harbor.op.com/public/cadvisor]
  30. 09c656718504: Pushed
  31. 6a395a55089d: Pushed
  32. 767f936afb51: Pushed
  33. v0.33.0: digest: sha256:47f1f8c02a3acfab77e74e2ec7acc0d475adc180ddff428503a4ce63f3d6061b size: 952
  34. [root@vms200 ~]# docker push harbor.op.com/public/cadvisor:v200912
  35. The push refers to repository [harbor.op.com/public/cadvisor]
  36. 66b3c2e84199: Pushed
  37. 9ea477e6d99e: Pushed
  38. cd7100a72410: Pushed
  39. v200912: digest: sha256:815386ebbe9a3490f38785ab11bda34ec8dacf4634af77b8912832d4f85dca04 size: 952

Prepare the resource manifest

  1. [root@vms200 ~]# mkdir /data/k8s-yaml/cadvisor && cd /data/k8s-yaml/cadvisor

This exporter talks to the kubelet to obtain the runtime resource consumption of Pods and exposes that data to Prometheus.

  • Because cadvisor needs pod information from every node, it is also deployed as a DaemonSet.
  • It runs as a DaemonSet on the nodes, with master nodes handled via taints/tolerations.
  • Several host directories are mounted into the container as well, such as Docker's data directory.

daemonset.yaml download: https://github.com/google/cadvisor/tree/release-v0.33/deploy/kubernetes/base

image.png

  1. [root@vms200 cadvisor]# vi /data/k8s-yaml/cadvisor/daemonset.yaml
  1. apiVersion: apps/v1
  2. kind: DaemonSet
  3. metadata:
  4. name: cadvisor
  5. namespace: kube-system
  6. labels:
  7. app: cadvisor
  8. spec:
  9. selector:
  10. matchLabels:
  11. name: cadvisor
  12. template:
  13. metadata:
  14. labels:
  15. name: cadvisor
  16. spec:
  17. hostNetwork: true
  18. tolerations:
  19. - key: node-role.kubernetes.io/master
  20. effect: NoSchedule
  21. containers:
  22. - name: cadvisor
  23. image: harbor.op.com/public/cadvisor:v200912
  24. imagePullPolicy: IfNotPresent
  25. volumeMounts:
  26. - name: rootfs
  27. mountPath: /rootfs
  28. readOnly: true
  29. - name: var-run
  30. mountPath: /var/run
  31. - name: sys
  32. mountPath: /sys
  33. readOnly: true
  34. - name: docker
  35. mountPath: /var/lib/docker
  36. readOnly: true
  37. - name: disk
  38. mountPath: /dev/disk
  39. readOnly: true
  40. ports:
  41. - name: http
  42. containerPort: 4194
  43. protocol: TCP
  44. readinessProbe:
  45. tcpSocket:
  46. port: 4194
  47. initialDelaySeconds: 5
  48. periodSeconds: 10
  49. args:
  50. - --housekeeping_interval=10s
  51. - --port=4194
  52. terminationGracePeriodSeconds: 30
  53. volumes:
  54. - name: rootfs
  55. hostPath:
  56. path: /
  57. - name: var-run
  58. hostPath:
  59. path: /var/run
  60. - name: sys
  61. hostPath:
  62. path: /sys
  63. - name: docker
  64. hostPath:
  65. path: /data/docker
  66. - name: disk
  67. hostPath:
  68. path: /dev/disk

Adjust the symlinks on the compute nodes

On all compute nodes (vms21, vms22):

  1. [root@vms21 ~]# mount -o remount,rw /sys/fs/cgroup/
  2. [root@vms21 ~]# ln -s /sys/fs/cgroup/cpu,cpuacct/ /sys/fs/cgroup/cpuacct,cpu
  3. [root@vms21 ~]# ll /sys/fs/cgroup/ | grep cpu
  4. lrwxrwxrwx 1 root root 11 Sep 11 19:21 cpu -> cpu,cpuacct
  5. lrwxrwxrwx 1 root root 11 Sep 11 19:21 cpuacct -> cpu,cpuacct
  6. lrwxrwxrwx 1 root root 27 Sep 12 10:25 cpuacct,cpu -> /sys/fs/cgroup/cpu,cpuacct/
  7. dr-xr-xr-x 6 root root 0 Sep 11 19:21 cpu,cpuacct
  8. dr-xr-xr-x 4 root root 0 Sep 11 19:21 cpuset
  1. [root@vms22 ~]# mount -o remount,rw /sys/fs/cgroup/
  2. [root@vms22 ~]# ln -s /sys/fs/cgroup/cpu,cpuacct/ /sys/fs/cgroup/cpuacct,cpu
  3. [root@vms22 ~]# ll /sys/fs/cgroup/ | grep cpu
  4. lrwxrwxrwx 1 root root 11 Sep 11 19:22 cpu -> cpu,cpuacct
  5. lrwxrwxrwx 1 root root 11 Sep 11 19:22 cpuacct -> cpu,cpuacct
  6. lrwxrwxrwx 1 root root 27 Sep 12 10:25 cpuacct,cpu -> /sys/fs/cgroup/cpu,cpuacct/
  7. dr-xr-xr-x 6 root root 0 Sep 11 19:22 cpu,cpuacct
  8. dr-xr-xr-x 4 root root 0 Sep 11 19:22 cpuset
  • The cgroup mount was originally read-only and is remounted read-write here; create the symlink on every node before applying the manifest, otherwise the pod may fail.
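With more nodes, the remount and symlink can be pushed out in one pass — a sketch assuming password-less SSH from the operations host to every compute node:

  # Apply the cgroup remount and the cpuacct,cpu symlink on each compute node.
  for node in vms21.cos.com vms22.cos.com; do
    ssh root@"$node" 'mount -o remount,rw /sys/fs/cgroup/ && ln -sfn /sys/fs/cgroup/cpu,cpuacct/ /sys/fs/cgroup/cpuacct,cpu'
  done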

Apply the resource manifest

On any compute node:

  1. [root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/cadvisor/daemonset.yaml
  2. daemonset.apps/cadvisor created
  3. [root@vms21 ~]# kubectl -n kube-system get pod -o wide|grep cadvisor
  4. cadvisor-q2z2g 0/1 Running 0 3s 192.168.26.22 vms22.cos.com <none> <none>
  5. cadvisor-xqg6k 0/1 Running 0 3s 192.168.26.21 vms21.cos.com <none> <none>
  6. [root@vms21 ~]# netstat -luntp|grep 4194
  7. tcp6 0 0 :::4194 :::* LISTEN 301579/cadvisor
  8. [root@vms21 ~]# kubectl get pod -n kube-system -l name=cadvisor -o wide
  9. NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
  10. cadvisor-q2z2g 1/1 Running 0 2m38s 192.168.26.22 vms22.cos.com <none> <none>
  11. cadvisor-xqg6k 1/1 Running 0 2m38s 192.168.26.21 vms21.cos.com <none> <none>
  12. [root@vms21 ~]# curl -s http://192.168.26.22:4194/metrics | more
  13. # HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor re
  14. vision.
  15. # TYPE cadvisor_version_info gauge
  16. cadvisor_version_info{cadvisorRevision="8949c822",cadvisorVersion="v0.32.0",dockerVersion="19.03.12",kernelVersion="4.18.0-193.el8.x86_64",osVersion=
  17. "Alpine Linux v3.7"} 1
  18. # HELP container_cpu_cfs_periods_total Number of elapsed enforcement period intervals.
  19. # TYPE container_cpu_cfs_periods_total counter

image.png

5 Deploying blackbox-exporter

On the operations host vms200:

Prepare the blackbox-exporter image

Official Docker Hub address: https://hub.docker.com/r/prom/blackbox-exporter
Official GitHub address: https://github.com/prometheus/blackbox_exporter

  1. [root@vms200 ~]# docker pull prom/blackbox-exporter:v0.17.0
  2. v0.17.0: Pulling from prom/blackbox-exporter
  3. 0f8c40e1270f: Pull complete
  4. 626a2a3fee8c: Pull complete
  5. d018b30262bb: Pull complete
  6. 2b24e2b7f642: Pull complete
  7. Digest: sha256:1d8a5c9ff17e2493a39e4aea706b4ea0c8302ae0dc2aa8b0e9188c5919c9bd9c
  8. Status: Downloaded newer image for prom/blackbox-exporter:v0.17.0
  9. docker.io/prom/blackbox-exporter:v0.17.0
  10. [root@vms200 ~]# docker tag docker.io/prom/blackbox-exporter:v0.17.0 harbor.op.com/public/blackbox-exporter:v0.17.0
  11. [root@vms200 ~]# docker push harbor.op.com/public/blackbox-exporter:v0.17.0
  12. The push refers to repository [harbor.op.com/public/blackbox-exporter]
  13. d072d0db0848: Pushed
  14. 42430a6dfa0e: Pushed
  15. 7a151fe67625: Pushed
  16. 1da8e4c8d307: Pushed
  17. v0.17.0: digest: sha256:d3e823580333ceedceadaa2bfea10c8efd4700c8ec0415df72f83c34e1f93314 size: 1155

Prepare the resource manifests

  1. [root@vms200 ~]# mkdir /data/k8s-yaml/blackbox-exporter && cd /data/k8s-yaml/blackbox-exporter
  • ConfigMap
  1. [root@vms200 blackbox-exporter]# vi /data/k8s-yaml/blackbox-exporter/configmap.yaml
  1. apiVersion: v1
  2. kind: ConfigMap
  3. metadata:
  4. labels:
  5. app: blackbox-exporter
  6. name: blackbox-exporter
  7. namespace: kube-system
  8. data:
  9. blackbox.yml: |-
  10. modules:
  11. http_2xx:
  12. prober: http
  13. timeout: 2s
  14. http:
  15. valid_http_versions: ["HTTP/1.1", "HTTP/2"]
  16. valid_status_codes: [200,301,302]
  17. method: GET
  18. preferred_ip_protocol: "ip4"
  19. tcp_connect:
  20. prober: tcp
  21. timeout: 2s
  • Deployment
  1. [root@vms200 blackbox-exporter]# vi /data/k8s-yaml/blackbox-exporter/deployment.yaml
  1. kind: Deployment
  2. apiVersion: apps/v1
  3. metadata:
  4. name: blackbox-exporter
  5. namespace: kube-system
  6. labels:
  7. app: blackbox-exporter
  8. annotations:
  9. deployment.kubernetes.io/revision: "1"
  10. spec:
  11. replicas: 1
  12. selector:
  13. matchLabels:
  14. app: blackbox-exporter
  15. template:
  16. metadata:
  17. labels:
  18. app: blackbox-exporter
  19. spec:
  20. volumes:
  21. - name: config
  22. configMap:
  23. name: blackbox-exporter
  24. defaultMode: 420
  25. containers:
  26. - name: blackbox-exporter
  27. image: harbor.op.com/public/blackbox-exporter:v0.17.0
  28. args:
  29. - --config.file=/etc/blackbox_exporter/blackbox.yml
  30. - --log.level=debug
  31. - --web.listen-address=:9115
  32. ports:
  33. - name: blackbox-port
  34. containerPort: 9115
  35. protocol: TCP
  36. resources:
  37. limits:
  38. cpu: 200m
  39. memory: 256Mi
  40. requests:
  41. cpu: 100m
  42. memory: 50Mi
  43. volumeMounts:
  44. - name: config
  45. mountPath: /etc/blackbox_exporter
  46. readinessProbe:
  47. tcpSocket:
  48. port: 9115
  49. initialDelaySeconds: 5
  50. timeoutSeconds: 5
  51. periodSeconds: 10
  52. successThreshold: 1
  53. failureThreshold: 3
  54. imagePullPolicy: IfNotPresent
  55. imagePullSecrets:
  56. - name: harbor
  57. restartPolicy: Always
  • Service
  1. [root@vms200 blackbox-exporter]# vi /data/k8s-yaml/blackbox-exporter/service.yaml
  1. kind: Service
  2. apiVersion: v1
  3. metadata:
  4. name: blackbox-exporter
  5. namespace: kube-system
  6. spec:
  7. selector:
  8. app: blackbox-exporter
  9. ports:
  10. - protocol: TCP
  11. port: 9115
  12. name: http
  • Ingress
  1. [root@vms200 blackbox-exporter]# vi /data/k8s-yaml/blackbox-exporter/ingress.yaml
  1. apiVersion: extensions/v1beta1
  2. kind: Ingress
  3. metadata:
  4. name: blackbox-exporter
  5. namespace: kube-system
  6. spec:
  7. rules:
  8. - host: blackbox.op.com
  9. http:
  10. paths:
  11. - backend:
  12. serviceName: blackbox-exporter
  13. servicePort: 9115

Configure DNS resolution

vms11

  1. [root@vms11 ~]# vi /var/named/op.com.zone
  1. ...
  2. blackbox A 192.168.26.10

Remember to bump the zone's serial number.

  1. [root@vms11 ~]# systemctl restart named

Check on vms21:

  1. [root@vms21 ~]# dig -t A blackbox.op.com @172.26.0.2 +short
  2. 192.168.26.10

Apply the resource manifests

On any compute node:

  1. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/blackbox-exporter/configmap.yaml
  2. configmap/blackbox-exporter created
  3. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/blackbox-exporter/deployment.yaml
  4. deployment.apps/blackbox-exporter created
  5. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/blackbox-exporter/service.yaml
  6. service/blackbox-exporter created
  7. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/blackbox-exporter/ingress.yaml
  8. ingress.extensions/blackbox-exporter created

image.png

Access via browser

Open http://blackbox.op.com/ — if the page below is shown, blackbox-exporter is up and running.
image.png
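The exporter can also be exercised directly through its /probe endpoint — a sketch assuming the blackbox.op.com ingress above and the http_2xx module defined in the ConfigMap (the target URL is only a placeholder):

  # Ask blackbox-exporter to probe an HTTP target and return the probe_* metrics.
  curl -s 'http://blackbox.op.com/probe?module=http_2xx&target=http://example.com' | grep '^probe_success'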

6 Deploying Prometheus

On the operations host vms200:

Prepare the prometheus image

Official Docker Hub address: https://hub.docker.com/r/prom/prometheus
Official GitHub address: https://github.com/prometheus/prometheus

  1. [root@vms200 ~]# docker pull prom/prometheus:v2.21.0
  2. v2.21.0: Pulling from prom/prometheus
  3. ...
  4. Digest: sha256:d43417c260e516508eed1f1d59c10c49d96bbea93eafb4955b0df3aea5908971
  5. Status: Downloaded newer image for prom/prometheus:v2.21.0
  6. docker.io/prom/prometheus:v2.21.0
  7. [root@vms200 ~]# docker tag prom/prometheus:v2.21.0 harbor.op.com/infra/prometheus:v2.21.0
  8. [root@vms200 ~]# docker push harbor.op.com/infra/prometheus:v2.21.0
  9. The push refers to repository [harbor.op.com/infra/prometheus]
  10. ...
  11. v2.21.0: digest: sha256:f3ada803723ccbc443ebea19f7ab24d3323def496e222134bf9ed54ae5b787bd size: 2824

Prepare the resource manifests

On the operations host vms200:

  1. [root@vms200 ~]# mkdir -p /data/k8s-yaml/prometheus && cd /data/k8s-yaml/prometheus
  • RBAC
  1. [root@vms200 prometheus]# vi /data/k8s-yaml/prometheus/rbac.yaml
  1. apiVersion: v1
  2. kind: ServiceAccount
  3. metadata:
  4. labels:
  5. addonmanager.kubernetes.io/mode: Reconcile
  6. kubernetes.io/cluster-service: "true"
  7. name: prometheus
  8. namespace: infra
  9. ---
  10. apiVersion: rbac.authorization.k8s.io/v1
  11. kind: ClusterRole
  12. metadata:
  13. labels:
  14. addonmanager.kubernetes.io/mode: Reconcile
  15. kubernetes.io/cluster-service: "true"
  16. name: prometheus
  17. rules:
  18. - apiGroups:
  19. - ""
  20. resources:
  21. - nodes
  22. - nodes/metrics
  23. - services
  24. - endpoints
  25. - pods
  26. verbs:
  27. - get
  28. - list
  29. - watch
  30. - apiGroups:
  31. - ""
  32. resources:
  33. - configmaps
  34. verbs:
  35. - get
  36. - nonResourceURLs:
  37. - /metrics
  38. verbs:
  39. - get
  40. ---
  41. apiVersion: rbac.authorization.k8s.io/v1
  42. kind: ClusterRoleBinding
  43. metadata:
  44. labels:
  45. addonmanager.kubernetes.io/mode: Reconcile
  46. kubernetes.io/cluster-service: "true"
  47. name: prometheus
  48. roleRef:
  49. apiGroup: rbac.authorization.k8s.io
  50. kind: ClusterRole
  51. name: prometheus
  52. subjects:
  53. - kind: ServiceAccount
  54. name: prometheus
  55. namespace: infra
  • Deployment
  1. [root@vms200 prometheus]# vi /data/k8s-yaml/prometheus/deployment.yaml

In production, Prometheus is usually deployed on a dedicated node with plenty of memory, with taints keeping other pods from being scheduled onto it (a sketch follows below). In this lab, nodeName: vms21.cos.com pins it to 192.168.26.21.

  • --storage.tsdb.min-block-duration controls how many minutes of the newest TSDB data are held in memory; production setups cache more. Here min-block-duration=10m keeps only 10 minutes in memory.
  • --storage.tsdb.retention controls how long TSDB data is kept; production keeps more. retention=72h retains 72 hours of data.
  • --web.enable-lifecycle enables remote hot-reloading of the configuration file, so Prometheus does not need a restart after a config change.

The reload call is curl -X POST http://localhost:9090/-/reload
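A hedged sketch of that production pattern (the taint key and value here are arbitrary examples, not part of this lab):

  # Reserve a dedicated monitoring node: taint it so ordinary pods are not scheduled there.
  kubectl taint nodes vms21.cos.com dedicated=prometheus:NoSchedule
  # The Prometheus pod spec would then carry a matching toleration alongside its node pinning.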

  1. apiVersion: apps/v1
  2. kind: Deployment
  3. metadata:
  4. annotations:
  5. deployment.kubernetes.io/revision: "5"
  6. labels:
  7. name: prometheus
  8. name: prometheus
  9. namespace: infra
  10. spec:
  11. progressDeadlineSeconds: 600
  12. replicas: 1
  13. revisionHistoryLimit: 7
  14. selector:
  15. matchLabels:
  16. app: prometheus
  17. strategy:
  18. rollingUpdate:
  19. maxSurge: 1
  20. maxUnavailable: 1
  21. type: RollingUpdate
  22. template:
  23. metadata:
  24. labels:
  25. app: prometheus
  26. spec:
  27. nodeName: vms21.cos.com
  28. containers:
  29. - image: harbor.op.com/infra/prometheus:v2.21.0
  30. args:
  31. - --config.file=/data/etc/prometheus.yml
  32. - --storage.tsdb.path=/data/prom-db
  33. - --storage.tsdb.retention=72h
  34. - --storage.tsdb.min-block-duration=10m
  35. - --web.enable-lifecycle
  36. command:
  37. - /bin/prometheus
  38. name: prometheus
  39. ports:
  40. - containerPort: 9090
  41. protocol: TCP
  42. resources:
  43. limits:
  44. cpu: 500m
  45. memory: 2500Mi
  46. requests:
  47. cpu: 100m
  48. memory: 100Mi
  49. volumeMounts:
  50. - mountPath: /data
  51. name: data
  52. imagePullPolicy: IfNotPresent
  53. imagePullSecrets:
  54. - name: harbor
  55. securityContext:
  56. runAsUser: 0
  57. dnsPolicy: ClusterFirst
  58. restartPolicy: Always
  59. serviceAccount: prometheus
  60. serviceAccountName: prometheus
  61. volumes:
  62. - name: data
  63. nfs:
  64. server: vms200
  65. path: /data/nfs-volume/prometheus
  • Service
  1. [root@vms200 prometheus]# vi /data/k8s-yaml/prometheus/service.yaml
  1. apiVersion: v1
  2. kind: Service
  3. metadata:
  4. name: prometheus
  5. namespace: infra
  6. spec:
  7. ports:
  8. - port: 9090
  9. protocol: TCP
  10. name: prometheus
  11. selector:
  12. app: prometheus
  13. type: ClusterIP
  • Ingress
  1. [root@vms200 prometheus]# vi /data/k8s-yaml/prometheus/ingress.yaml
  1. apiVersion: extensions/v1beta1
  2. kind: Ingress
  3. metadata:
  4. annotations:
  5. kubernetes.io/ingress.class: traefik
  6. name: prometheus
  7. namespace: infra
  8. spec:
  9. rules:
  10. - host: prometheus.op.com
  11. http:
  12. paths:
  13. - backend:
  14. serviceName: prometheus
  15. servicePort: 9090

Prepare the Prometheus configuration file

On the operations host vms200:

  • Create the directories and copy the certificates
  1. [root@vms200 ~]# mkdir -pv /data/nfs-volume/prometheus/{etc,prom-db}
  2. ...
  3. [root@vms200 ~]# cd /data/nfs-volume/prometheus/etc
  4. [root@vms200 etc]# cp /opt/certs/{ca.pem,client.pem,client-key.pem} /data/nfs-volume/prometheus/etc/
  5. [root@vms200 etc]# ll
  6. total 12
  7. -rw-r--r-- 1 root root 1338 Sep 12 16:22 ca.pem
  8. -rw------- 1 root root 1675 Sep 12 16:22 client-key.pem
  9. -rw-r--r-- 1 root root 1363 Sep 12 16:22 client.pem
  • Prepare the configuration

About this configuration: it is a general-purpose config. Apart from the first job, etcd, which uses static targets, the other 8 jobs all rely on service discovery, so after adjusting the etcd targets it can be used in production as-is (a validation sketch follows).
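Once the pod is running (it is applied in the next subsection), a config change can be syntax-checked in place before reloading — a minimal sketch; the pod name is a placeholder to substitute with the real one:

  # Validate the mounted config with promtool (shipped in the prometheus image) before hot-reloading.
  kubectl -n infra exec prometheus-xxxxxxxxxx-xxxxx -- promtool check config /data/etc/prometheus.yml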

  1. [root@vms200 etc]# vi /data/nfs-volume/prometheus/etc/prometheus.yml
  1. global:
  2. scrape_interval: 15s
  3. evaluation_interval: 15s
  4. scrape_configs:
  5. - job_name: 'etcd'
  6. tls_config:
  7. ca_file: /data/etc/ca.pem
  8. cert_file: /data/etc/client.pem
  9. key_file: /data/etc/client-key.pem
  10. scheme: https
  11. static_configs:
  12. - targets:
  13. - '192.168.26.12:2379'
  14. - '192.168.26.21:2379'
  15. - '192.168.26.22:2379'
  16. - job_name: 'kubernetes-apiservers'
  17. kubernetes_sd_configs:
  18. - role: endpoints
  19. scheme: https
  20. tls_config:
  21. ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  22. bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  23. relabel_configs:
  24. - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
  25. action: keep
  26. regex: default;kubernetes;https
  27. - job_name: 'kubernetes-pods'
  28. kubernetes_sd_configs:
  29. - role: pod
  30. relabel_configs:
  31. - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
  32. action: keep
  33. regex: true
  34. - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
  35. action: replace
  36. target_label: __metrics_path__
  37. regex: (.+)
  38. - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  39. action: replace
  40. regex: ([^:]+)(?::\d+)?;(\d+)
  41. replacement: $1:$2
  42. target_label: __address__
  43. - action: labelmap
  44. regex: __meta_kubernetes_pod_label_(.+)
  45. - source_labels: [__meta_kubernetes_namespace]
  46. action: replace
  47. target_label: kubernetes_namespace
  48. - source_labels: [__meta_kubernetes_pod_name]
  49. action: replace
  50. target_label: kubernetes_pod_name
  51. - job_name: 'kubernetes-kubelet'
  52. kubernetes_sd_configs:
  53. - role: node
  54. relabel_configs:
  55. - action: labelmap
  56. regex: __meta_kubernetes_node_label_(.+)
  57. - source_labels: [__meta_kubernetes_node_name]
  58. regex: (.+)
  59. target_label: __address__
  60. replacement: ${1}:10255
  61. - job_name: 'kubernetes-cadvisor'
  62. kubernetes_sd_configs:
  63. - role: node
  64. relabel_configs:
  65. - action: labelmap
  66. regex: __meta_kubernetes_node_label_(.+)
  67. - source_labels: [__meta_kubernetes_node_name]
  68. regex: (.+)
  69. target_label: __address__
  70. replacement: ${1}:4194
  71. - job_name: 'kubernetes-kube-state'
  72. kubernetes_sd_configs:
  73. - role: pod
  74. relabel_configs:
  75. - action: labelmap
  76. regex: __meta_kubernetes_pod_label_(.+)
  77. - source_labels: [__meta_kubernetes_namespace]
  78. action: replace
  79. target_label: kubernetes_namespace
  80. - source_labels: [__meta_kubernetes_pod_name]
  81. action: replace
  82. target_label: kubernetes_pod_name
  83. - source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
  84. regex: .*true.*
  85. action: keep
  86. - source_labels: ['__meta_kubernetes_pod_label_daemon', '__meta_kubernetes_pod_node_name']
  87. regex: 'node-exporter;(.*)'
  88. action: replace
  89. target_label: nodename
  90. - job_name: 'blackbox_http_pod_probe'
  91. metrics_path: /probe
  92. kubernetes_sd_configs:
  93. - role: pod
  94. params:
  95. module: [http_2xx]
  96. relabel_configs:
  97. - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
  98. action: keep
  99. regex: http
  100. - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port, __meta_kubernetes_pod_annotation_blackbox_path]
  101. action: replace
  102. regex: ([^:]+)(?::\d+)?;(\d+);(.+)
  103. replacement: $1:$2$3
  104. target_label: __param_target
  105. - action: replace
  106. target_label: __address__
  107. replacement: blackbox-exporter.kube-system:9115
  108. - source_labels: [__param_target]
  109. target_label: instance
  110. - action: labelmap
  111. regex: __meta_kubernetes_pod_label_(.+)
  112. - source_labels: [__meta_kubernetes_namespace]
  113. action: replace
  114. target_label: kubernetes_namespace
  115. - source_labels: [__meta_kubernetes_pod_name]
  116. action: replace
  117. target_label: kubernetes_pod_name
  118. - job_name: 'blackbox_tcp_pod_probe'
  119. metrics_path: /probe
  120. kubernetes_sd_configs:
  121. - role: pod
  122. params:
  123. module: [tcp_connect]
  124. relabel_configs:
  125. - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
  126. action: keep
  127. regex: tcp
  128. - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port]
  129. action: replace
  130. regex: ([^:]+)(?::\d+)?;(\d+)
  131. replacement: $1:$2
  132. target_label: __param_target
  133. - action: replace
  134. target_label: __address__
  135. replacement: blackbox-exporter.kube-system:9115
  136. - source_labels: [__param_target]
  137. target_label: instance
  138. - action: labelmap
  139. regex: __meta_kubernetes_pod_label_(.+)
  140. - source_labels: [__meta_kubernetes_namespace]
  141. action: replace
  142. target_label: kubernetes_namespace
  143. - source_labels: [__meta_kubernetes_pod_name]
  144. action: replace
  145. target_label: kubernetes_pod_name
  146. - job_name: 'traefik'
  147. kubernetes_sd_configs:
  148. - role: pod
  149. relabel_configs:
  150. - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
  151. action: keep
  152. regex: traefik
  153. - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
  154. action: replace
  155. target_label: __metrics_path__
  156. regex: (.+)
  157. - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  158. action: replace
  159. regex: ([^:]+)(?::\d+)?;(\d+)
  160. replacement: $1:$2
  161. target_label: __address__
  162. - action: labelmap
  163. regex: __meta_kubernetes_pod_label_(.+)
  164. - source_labels: [__meta_kubernetes_namespace]
  165. action: replace
  166. target_label: kubernetes_namespace
  167. - source_labels: [__meta_kubernetes_pod_name]
  168. action: replace
  169. target_label: kubernetes_pod_name

Apply the resource manifests

On any compute node:

  1. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/prometheus/rbac.yaml
  2. serviceaccount/prometheus created
  3. clusterrole.rbac.authorization.k8s.io/prometheus created
  4. clusterrolebinding.rbac.authorization.k8s.io/prometheus created
  5. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/prometheus/deployment.yaml
  6. deployment.apps/prometheus created
  7. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/prometheus/service.yaml
  8. service/prometheus created
  9. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/prometheus/ingress.yaml
  10. ingress.extensions/prometheus created

Configure DNS resolution

vms11

  1. [root@vms11 ~]# vi /var/named/op.com.zone
  1. ...
  2. prometheus A 192.168.26.10

Remember to bump the zone's serial number.

  1. [root@vms11 ~]# systemctl restart named
  2. [root@vms11 ~]# dig -t A prometheus.op.com @192.168.26.11 +short
  3. 192.168.26.10

Access via browser

  • First check the pod in the dashboard

image.png

image.png

  • Click Status > Configuration to see the loaded configuration file

image.png

Prometheus configuration file explained

Official documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/

  • vms200:/data/nfs-volume/prometheus/etc/prometheus.yml
  1. global:
  2. scrape_interval: 15s # scrape interval; the default is 1m
  3. evaluation_interval: 15s # rule evaluation interval; the default is 1m
  4. scrape_configs: # scrape configurations; each job defines how one class of metrics is collected
  5. - job_name: 'etcd' # how etcd metrics are scraped; jobs without their own scrape_interval use the global value
  6. tls_config:
  7. ca_file: /data/etc/ca.pem
  8. cert_file: /data/etc/client.pem
  9. key_file: /data/etc/client-key.pem
  10. scheme: https # the default scheme is http
  11. static_configs:
  12. - targets:
  13. - '192.168.26.12:2379'
  14. - '192.168.26.21:2379'
  15. - '192.168.26.22:2379'
  16. - job_name: 'kubernetes-apiservers'
  17. kubernetes_sd_configs:
  18. - role: endpoints # target resource type; node, endpoints, pod, service, ingress, etc. are supported
  19. scheme: https # tls_config and bearer_token_file are used when talking to the apiserver
  20. tls_config:
  21. ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  22. bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  23. relabel_configs: # used to rewrite target labels
  24. - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
  25. action: keep # supported actions:
  26. # keep, drop, replace, labelmap, labelkeep, labeldrop, hashmod
  27. regex: default;kubernetes;https
  28. - job_name: 'kubernetes-pods'
  29. kubernetes_sd_configs:
  30. - role: pod
  31. relabel_configs:
  32. - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
  33. action: keep
  34. regex: true
  35. - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
  36. action: replace
  37. target_label: __metrics_path__
  38. regex: (.+)
  39. - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  40. action: replace
  41. regex: ([^:]+)(?::\d+)?;(\d+)
  42. replacement: $1:$2
  43. target_label: __address__
  44. - action: labelmap
  45. regex: __meta_kubernetes_pod_label_(.+)
  46. - source_labels: [__meta_kubernetes_namespace]
  47. action: replace
  48. target_label: kubernetes_namespace
  49. - source_labels: [__meta_kubernetes_pod_name]
  50. action: replace
  51. target_label: kubernetes_pod_name
  52. - job_name: 'kubernetes-kubelet'
  53. kubernetes_sd_configs:
  54. - role: node
  55. relabel_configs:
  56. - action: labelmap
  57. regex: __meta_kubernetes_node_label_(.+)
  58. - source_labels: [__meta_kubernetes_node_name]
  59. regex: (.+)
  60. target_label: __address__
  61. replacement: ${1}:10255
  62. - job_name: 'kubernetes-cadvisor'
  63. kubernetes_sd_configs:
  64. - role: node
  65. relabel_configs:
  66. - action: labelmap
  67. regex: __meta_kubernetes_node_label_(.+)
  68. - source_labels: [__meta_kubernetes_node_name]
  69. regex: (.+)
  70. target_label: __address__
  71. replacement: ${1}:4194
  72. - job_name: 'kubernetes-kube-state'
  73. kubernetes_sd_configs:
  74. - role: pod
  75. relabel_configs:
  76. - action: labelmap
  77. regex: __meta_kubernetes_pod_label_(.+)
  78. - source_labels: [__meta_kubernetes_namespace]
  79. action: replace
  80. target_label: kubernetes_namespace
  81. - source_labels: [__meta_kubernetes_pod_name]
  82. action: replace
  83. target_label: kubernetes_pod_name
  84. - source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
  85. regex: .*true.*
  86. action: keep
  87. - source_labels: ['__meta_kubernetes_pod_label_daemon', '__meta_kubernetes_pod_node_name']
  88. regex: 'node-exporter;(.*)'
  89. action: replace
  90. target_label: nodename
  91. - job_name: 'blackbox_http_pod_probe'
  92. metrics_path: /probe
  93. kubernetes_sd_configs:
  94. - role: pod
  95. params:
  96. module: [http_2xx]
  97. relabel_configs:
  98. - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
  99. action: keep
  100. regex: http
  101. - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port, __meta_kubernetes_pod_annotation_blackbox_path]
  102. action: replace
  103. regex: ([^:]+)(?::\d+)?;(\d+);(.+)
  104. replacement: $1:$2$3
  105. target_label: __param_target
  106. - action: replace
  107. target_label: __address__
  108. replacement: blackbox-exporter.kube-system:9115
  109. - source_labels: [__param_target]
  110. target_label: instance
  111. - action: labelmap
  112. regex: __meta_kubernetes_pod_label_(.+)
  113. - source_labels: [__meta_kubernetes_namespace]
  114. action: replace
  115. target_label: kubernetes_namespace
  116. - source_labels: [__meta_kubernetes_pod_name]
  117. action: replace
  118. target_label: kubernetes_pod_name
  119. - job_name: 'blackbox_tcp_pod_probe'
  120. metrics_path: /probe
  121. kubernetes_sd_configs:
  122. - role: pod
  123. params:
  124. module: [tcp_connect]
  125. relabel_configs:
  126. - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
  127. action: keep
  128. regex: tcp
  129. - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port]
  130. action: replace
  131. regex: ([^:]+)(?::\d+)?;(\d+)
  132. replacement: $1:$2
  133. target_label: __param_target
  134. - action: replace
  135. target_label: __address__
  136. replacement: blackbox-exporter.kube-system:9115
  137. - source_labels: [__param_target]
  138. target_label: instance
  139. - action: labelmap
  140. regex: __meta_kubernetes_pod_label_(.+)
  141. - source_labels: [__meta_kubernetes_namespace]
  142. action: replace
  143. target_label: kubernetes_namespace
  144. - source_labels: [__meta_kubernetes_pod_name]
  145. action: replace
  146. target_label: kubernetes_pod_name
  147. - job_name: 'traefik'
  148. kubernetes_sd_configs:
  149. - role: pod
  150. relabel_configs:
  151. - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
  152. action: keep
  153. regex: traefik
  154. - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
  155. action: replace
  156. target_label: __metrics_path__
  157. regex: (.+)
  158. - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  159. action: replace
  160. regex: ([^:]+)(?::\d+)?;(\d+)
  161. replacement: $1:$2
  162. target_label: __address__
  163. - action: labelmap
  164. regex: __meta_kubernetes_pod_label_(.+)
  165. - source_labels: [__meta_kubernetes_namespace]
  166. action: replace
  167. target_label: kubernetes_namespace
  168. - source_labels: [__meta_kubernetes_pod_name]
  169. action: replace
  170. target_label: kubernetes_pod_name
  171. alerting: # Alertmanager configuration
  172. alertmanagers:
  173. - static_configs:
  174. - targets: ["alertmanager"]
  175. rule_files: # include external alerting/recording rule files, similar to an include directive
  176. - "/data/etc/rules.yml"

What Prometheus Monitors and How

Hooking Pods up to the Exporters

  • The exporters deployed in this lab are general-purpose ones. kube-state-metrics collects its data through the Kubernetes API and node-exporter collects host information; neither is tied to any particular Pod, so once deployed they work as-is.
  • As the Prometheus configuration shows, Pod metrics are picked up through label/annotation selectors: adding the right labels or annotations to a resource brings it under monitoring.

Targets - job-name

  • Click Status > Targets to see the job names configured in prometheus.yml; these targets cover most monitoring and data-collection needs.

image.png

Because the vms12 server is not running, one etcd target is reported as DOWN.

  • Targets(jobs):show less

image.png

There are 9 job_names in total, of which 5 have already been discovered and are returning data. The remaining 4 job_names now need their services brought under monitoring, which is done by adding annotations to the services whose data should be collected (see the kubectl sketch below).
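Besides editing through the dashboard or the YAML files, the annotations can be added from the command line — a hedged sketch using kubectl patch on the Traefik DaemonSet (the values are the ones used later in this article; the patch triggers a rolling restart):

  # Add scrape annotations to the pod template so the 'traefik' job discovers the pods.
  kubectl -n kube-system patch daemonset traefik-ingress-controller -p \
    '{"spec":{"template":{"metadata":{"annotations":{"prometheus_io_scheme":"traefik","prometheus_io_path":"/metrics","prometheus_io_port":"8080"}}}}}'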

etcd

Monitors the etcd service

| key | value |
| :-- | :-- |
| etcd_server_has_leader | 1 |
| etcd_http_failed_total | 1 |

kubernetes-apiserver

Monitors the apiserver service

kubernetes-kubelet

Monitors the kubelet service

kubernetes-kube-state

Monitors basic cluster information

  • node-exporter > monitors node-level information
  • kube-state-metrics > monitors pod information

traefik

  • Monitors the traefik-ingress-controller

| key | value |
| :-- | :-- |
| traefik_entrypoint_requests_total{code="200",entrypoint="http",method="PUT",protocol="http"} | 138 |
| traefik_entrypoint_requests_total{code="200",entrypoint="http",method="GET",protocol="http"} | 285 |
| traefik_entrypoint_open_connections{entrypoint="http",method="PUT",protocol="http"} | 1 |
| … | … |
  • Hooking Traefik in:

Add the annotations to Traefik's pod controller and restart the pods; monitoring then takes effect. (JSON format)

  1. "annotations": {
  2. "prometheus_io_scheme": "traefik",
  3. "prometheus_io_path": "/metrics",
  4. "prometheus_io_port": "8080"
  5. }

Or add the annotations under spec.template.metadata in the Traefik deployment YAML and restart the Pods. (YAML format)

  1. annotations:
  2. prometheus_io_scheme: traefik
  3. prometheus_io_path: /metrics
  4. prometheus_io_port: "8080"

blackbox*

Blackbox checks whether the services inside containers are alive, i.e. port health checks, using either TCP or HTTP. Use HTTP wherever the service exposes an HTTP endpoint; fall back to TCP only for services that do not.

Monitors service liveness by probing TCP/HTTP status.

  • blackbox_tcp_pod_probe: monitors whether a TCP service is alive

| key | value |
| :-- | :-- |
| probe_success | 1 |
| probe_ip_protocol | 4 |
| probe_failed_due_to_regex | 0 |
| probe_duration_seconds | 0.000597546 |
| probe_dns_lookup_time_seconds | 0.00010898 |

Hooking into Blackbox monitoring: add the annotations to the pod controller and restart the pods; monitoring then takes effect. (JSON format)

  1. "annotations": {
  2. "blackbox_port": "20880",
  3. "blackbox_scheme": "tcp"
  4. }
  • blackbox_http_pod_probe: monitors whether an HTTP service is alive

| key | value |
| :-- | :-- |
| probe_success | 1 |
| probe_ip_protocol | 4 |
| probe_http_version | 1.1 |
| probe_http_status_code | 200 |
| probe_http_ssl | 0 |
| probe_http_redirects | 1 |
| probe_http_last_modified_timestamp_seconds | 1.553861888e+09 |
| probe_http_duration_seconds{phase="transfer"} | 0.000238343 |
| probe_http_duration_seconds{phase="tls"} | 0 |
| probe_http_duration_seconds{phase="resolve"} | 5.4095e-05 |
| probe_http_duration_seconds{phase="processing"} | 0.000966104 |
| probe_http_duration_seconds{phase="connect"} | 0.000520821 |
| probe_http_content_length | 716 |
| probe_failed_due_to_regex | 0 |
| probe_duration_seconds | 0.00272609 |
| probe_dns_lookup_time_seconds | 5.4095e-05 |

Hooking into Blackbox monitoring: add the annotations to the pod controller and restart the pods; monitoring then takes effect. (JSON format)

  1. "annotations": {
  2. "blackbox_path": "/",
  3. "blackbox_port": "8080",
  4. "blackbox_scheme": "http"
  5. }

Hooking into Blackbox monitoring (YAML format): add the annotations to the corresponding pods; below are the TCP probe and HTTP probe variants respectively.

  1. annotations:
  2. blackbox_port: "20880"
  3. blackbox_scheme: tcp
  4. annotations:
  5. blackbox_port: "8080"
  6. blackbox_scheme: http
  7. blackbox_path: /hello?name=health

kubernetes-pods*

  • Monitors JVM information

| key | value |
| :-- | :-- |
| jvm_info{version="1.7.0_80-b15",vendor="Oracle Corporation",runtime="Java(TM) SE Runtime Environment",} | 1.0 |
| jmx_config_reload_success_total | 0.0 |
| process_resident_memory_bytes | 4.693897216E9 |
| process_virtual_memory_bytes | 1.2138840064E10 |
| process_max_fds | 65536.0 |
| process_open_fds | 123.0 |
| process_start_time_seconds | 1.54331073249E9 |
| process_cpu_seconds_total | 196465.74 |
| jvm_buffer_pool_used_buffers{pool="mapped",} | 0.0 |
| jvm_buffer_pool_used_buffers{pool="direct",} | 150.0 |
| jvm_buffer_pool_capacity_bytes{pool="mapped",} | 0.0 |
| jvm_buffer_pool_capacity_bytes{pool="direct",} | 6216688.0 |
| jvm_buffer_pool_used_bytes{pool="mapped",} | 0.0 |
| jvm_buffer_pool_used_bytes{pool="direct",} | 6216688.0 |
| jvm_gc_collection_seconds_sum{gc="PS MarkSweep",} | 1.867 |
| … | … |

Hooking a Pod into monitoring (JSON format): add the annotations to the pod controller and restart the pod; monitoring then takes effect.

  1. "annotations": {
  2. "prometheus_io_scrape": "true",
  3. "prometheus_io_port": "12346",
  4. "prometheus_io_path": "/"
  5. }

Hooking a Pod into monitoring (YAML format): add the annotations to the pod and restart it. This data is collected by jmx_javaagent-0.3.1.jar on port 12346; note that "true" must be quoted as a string! (A sketch of the agent's start command follows the annotations below.)

  1. annotations:
  2. prometheus_io_scrape: "true"
  3. prometheus_io_port: "12346"
  4. prometheus_io_path: /
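For context, port 12346 comes from the JMX java agent attached in the Dubbo image's start command. A hypothetical sketch of such a start command (the jar path, config file, and application jar are assumptions, not taken from this article):

  # Expose JVM metrics over HTTP on :12346 via the JMX exporter java agent, then start the app.
  java -javaagent:/opt/prom/jmx_javaagent-0.3.1.jar=12346:/opt/prom/config.yml \
       -jar dubbo-demo-service.jar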

Hooking the traefik service into Prometheus monitoring

In the dashboard:
In the kube-system namespace > daemonset > traefik-ingress-controller, under spec > template > metadata, add (JSON format):

  1. "annotations": {
  2. "prometheus_io_scheme": "traefik",
  3. "prometheus_io_path": "/metrics",
  4. "prometheus_io_port": "8080"
  5. }

Delete the pods so Traefik restarts, then watch the monitoring.

Then add the blackbox monitoring entries as well (JSON format):

  1. "annotations": {
  2. "prometheus_io_scheme": "traefik",
  3. "prometheus_io_path": "/metrics",
  4. "prometheus_io_port": "8080",
  5. "blackbox_path": "/",
  6. "blackbox_port": "8080",
  7. "blackbox_scheme": "http"
  8. }

Alternatively, edit the Traefik YAML file on vms200:

  1. [root@vms200 ~]# vi /data/k8s-yaml/traefik/traefik-deploy.yaml

Add the annotations block at the same level as labels:

  1. apiVersion: apps/v1
  2. kind: DaemonSet
  3. metadata:
  4. name: traefik-ingress-controller
  5. labels:
  6. app: traefik
  7. spec:
  8. selector:
  9. matchLabels:
  10. app: traefik
  11. template:
  12. metadata:
  13. name: traefik
  14. labels:
  15. app: traefik
  16. annotations:
  17. prometheus_io_scheme: "traefik"
  18. prometheus_io_path: "/metrics"
  19. prometheus_io_port: "8080"
  20. blackbox_path: "/"
  21. blackbox_port: "8080"
  22. blackbox_scheme: "http"
  23. ...

Re-apply the configuration from any node:

  1. [root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/traefik/traefik-deploy.yaml -n kube-system
  2. service/traefik unchanged
  3. daemonset.apps/traefik-ingress-controller created

After the pods have restarted, check in Prometheus whether the traefik and blackbox jobs are now collecting data normally.
image.png

Hooking the Dubbo services into Prometheus monitoring

The Dubbo services in the FAT test environment are used for this demo; other environments work the same way (remember to start the ZooKeeper on vms11).

  1. In the dashboard, start apollo-portal (infra namespace) and the Apollo instances in the test namespace
  2. dubbo-demo-service uses the TCP annotation
  3. dubbo-demo-consumer uses the HTTP annotation

This setup is fairly resource-hungry; apollo can be left out:

  1. Point dubbo-monitor (infra namespace) at zk_test (the zk on vms11)
  2. Switch dubbo-demo-service in the app namespace to the master image (no apollo config) and add the TCP annotation
  3. Switch dubbo-demo-consumer in the app namespace to the master image (no apollo config) and add the HTTP annotation

Hooking dubbo-demo-service into Prometheus monitoring

On the dashboard:

  • First add a TCP annotation to the dubbo-demo-service resource
  • Also add the JVM annotations so the JVM inside the pod is monitored (12346 is the port that jmx_javaagent uses in the dubbo pod's start command, so it can be used to collect JVM metrics)

test namespace > deployment > dubbo-demo-service, Edit: under spec > template > metadata, add (JSON format)

  1. "annotations": {
  2. "prometheus_io_scrape": "true",
  3. "prometheus_io_path": "/",
  4. "prometheus_io_port": "12346",
  5. "blackbox_port": "20880",
  6. "blackbox_scheme": "tcp"
  7. }

Delete the pod to restart the application, then watch the monitoring (see "Checking the monitoring" below).

Alternatively, edit the deployment manifest (adding the annotations) and re-apply it from any node; a restart sketch follows the manifest.

  1. spec:
  2. replicas: 0
  3. selector:
  4. matchLabels:
  5. name: dubbo-demo-service
  6. template:
  7. metadata:
  8. creationTimestamp: null
  9. labels:
  10. app: dubbo-demo-service
  11. name: dubbo-demo-service
  12. annotations:
  13. blackbox_port: '20880'
  14. blackbox_scheme: tcp
  15. prometheus_io_path: /
  16. prometheus_io_port: '12346'
  17. prometheus_io_scrape: 'true'
  18. ...
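
After re-applying the manifest, the old pod keeps running with its old annotations until it is recreated. A sketch of forcing the restart, assuming the test namespace and the name=dubbo-demo-service label used above:

  1. [root@vms21 ~]# kubectl -n test delete pod -l name=dubbo-demo-service
  2. # the Deployment controller recreates the pod, which then carries the new annotations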

Hooking dubbo-demo-consumer into Prometheus monitoring

On the dashboard:

  • Add an HTTP annotation to the dubbo-demo-consumer resource
  • Also add the JVM annotations so the JVM inside the pod is monitored (12346 is the port that jmx_javaagent uses in the dubbo pod's start command, so it can be used to collect JVM metrics)

test namespace > deployment > dubbo-demo-consumer, Edit: under spec > template > metadata, add (JSON format)

  1. "annotations": {
  2. "prometheus_io_scrape": "true",
  3. "prometheus_io_path": "/",
  4. "prometheus_io_port": "12346",
  5. "blackbox_path": "/hello",
  6. "blackbox_port": "8080",
  7. "blackbox_scheme": "http"
  8. }

Delete the pod to restart the application, then watch the monitoring (see "Checking the monitoring" below).

Alternatively, edit the deployment manifest (adding the annotations) and re-apply it from any node.

  1. spec:
  2. replicas: 1
  3. selector:
  4. matchLabels:
  5. name: dubbo-demo-consumer
  6. template:
  7. metadata:
  8. creationTimestamp: null
  9. labels:
  10. app: dubbo-demo-consumer
  11. name: dubbo-demo-consumer
  12. annotations:
  13. blackbox_path: /hello
  14. blackbox_port: '8080'
  15. blackbox_scheme: http
  16. prometheus_io_path: /
  17. prometheus_io_port: '12346'
  18. prometheus_io_scrape: 'true'
  19. ...

Checking the monitoring

In a browser, check http://blackbox.op.com and http://prometheus.op.com/targets: the running dubbo-demo-service has been discovered and its TCP port 20880 is being monitored.
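
The same check can be scripted against the Prometheus HTTP API instead of the web UI. A sketch (the grep is only there to summarise the JSON):

  1. [root@vms21 ~]# curl -s http://prometheus.op.com/api/v1/targets | grep -o '"job":"[^"]*"' | sort | uniq -c
  2. # every job_name from prometheus.yml should show up here with at least one target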

image.png

image.png

At this point, all 9 job_names are collecting monitoring data successfully.

image.png

7 Deploying Grafana

On the ops host vms200:

Prepare the grafana image

Official Docker Hub: https://hub.docker.com/r/grafana/grafana
Official GitHub: https://github.com/grafana/grafana
Grafana website: https://grafana.com/

  1. [root@vms200 ~]# docker pull grafana/grafana:7.1.5
  2. 7.1.5: Pulling from grafana/grafana
  3. df20fa9351a1: Pull complete
  4. 9942118288f3: Pull complete
  5. 1fb6e3df6e68: Pull complete
  6. 7e3d0d675cf3: Pull complete
  7. 4c1eb3303598: Pull complete
  8. a5ec11eae53c: Pull complete
  9. Digest: sha256:579044d31fad95f015c78dff8db25c85e2e0f5fdf37f414ce850eb045dd47265
  10. Status: Downloaded newer image for grafana/grafana:7.1.5
  11. docker.io/grafana/grafana:7.1.5
  12. [root@vms200 ~]# docker tag docker.io/grafana/grafana:7.1.5 harbor.op.com/infra/grafana:7.1.5
  13. [root@vms200 ~]# docker push harbor.op.com/infra/grafana:7.1.5
  14. The push refers to repository [harbor.op.com/infra/grafana]
  15. 9c957ea29f01: Pushed
  16. 7fcdc437fb25: Pushed
  17. 5c98ed105d7e: Pushed
  18. 43376507b219: Pushed
  19. cb596e3b6acf: Pushed
  20. 50644c29ef5a: Pushed
  21. 7.1.5: digest: sha256:dfd940ed4dd82a6369cb057fe5ab4cc8c774c1c5b943b2f4b618302a7979de61 size: 1579

Prepare the resource manifests

  1. [root@vms200 ~]# mkdir /data/k8s-yaml/grafana && cd /data/k8s-yaml/grafana
  • RBAC
  1. [root@vms200 grafana]# vi /data/k8s-yaml/grafana/rbac.yaml
  1. apiVersion: rbac.authorization.k8s.io/v1
  2. kind: ClusterRole
  3. metadata:
  4. labels:
  5. addonmanager.kubernetes.io/mode: Reconcile
  6. kubernetes.io/cluster-service: "true"
  7. name: grafana
  8. rules:
  9. - apiGroups:
  10. - "*"
  11. resources:
  12. - namespaces
  13. - deployments
  14. - pods
  15. verbs:
  16. - get
  17. - list
  18. - watch
  19. ---
  20. apiVersion: rbac.authorization.k8s.io/v1
  21. kind: ClusterRoleBinding
  22. metadata:
  23. labels:
  24. addonmanager.kubernetes.io/mode: Reconcile
  25. kubernetes.io/cluster-service: "true"
  26. name: grafana
  27. roleRef:
  28. apiGroup: rbac.authorization.k8s.io
  29. kind: ClusterRole
  30. name: grafana
  31. subjects:
  32. - kind: User
  33. name: k8s-node
  • Deployment
  1. [root@vms200 grafana]# vi /data/k8s-yaml/grafana/deployment.yaml
  1. apiVersion: apps/v1
  2. kind: Deployment
  3. metadata:
  4. labels:
  5. app: grafana
  6. name: grafana
  7. name: grafana
  8. namespace: infra
  9. spec:
  10. progressDeadlineSeconds: 600
  11. replicas: 1
  12. revisionHistoryLimit: 7
  13. selector:
  14. matchLabels:
  15. name: grafana
  16. strategy:
  17. rollingUpdate:
  18. maxSurge: 1
  19. maxUnavailable: 1
  20. type: RollingUpdate
  21. template:
  22. metadata:
  23. labels:
  24. app: grafana
  25. name: grafana
  26. spec:
  27. containers:
  28. - image: harbor.op.com/infra/grafana:7.1.5
  29. imagePullPolicy: IfNotPresent
  30. name: grafana
  31. ports:
  32. - containerPort: 3000
  33. protocol: TCP
  34. volumeMounts:
  35. - mountPath: /var/lib/grafana
  36. name: data
  37. imagePullSecrets:
  38. - name: harbor
  39. nodeName: vms22.cos.com
  40. restartPolicy: Always
  41. securityContext:
  42. runAsUser: 0
  43. volumes:
  44. - nfs:
  45. server: vms200
  46. path: /data/nfs-volume/grafana
  47. name: data

Create the grafana data directory

  1. [root@vms200 grafana]# mkdir /data/nfs-volume/grafana
  • Service
  1. [root@vms200 grafana]# vi /data/k8s-yaml/grafana/service.yaml
  1. apiVersion: v1
  2. kind: Service
  3. metadata:
  4. name: grafana
  5. namespace: infra
  6. spec:
  7. ports:
  8. - port: 3000
  9. protocol: TCP
  10. selector:
  11. app: grafana
  12. type: ClusterIP
  • Ingress
  1. [root@vms200 grafana]# vi /data/k8s-yaml/grafana/ingress.yaml
  1. apiVersion: extensions/v1beta1
  2. kind: Ingress
  3. metadata:
  4. name: grafana
  5. namespace: infra
  6. spec:
  7. rules:
  8. - host: grafana.op.com
  9. http:
  10. paths:
  11. - path: /
  12. backend:
  13. serviceName: grafana
  14. servicePort: 3000

Apply the resource manifests

On any worker node:

  1. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/grafana/rbac.yaml
  2. clusterrole.rbac.authorization.k8s.io/grafana created
  3. clusterrolebinding.rbac.authorization.k8s.io/grafana created
  4. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/grafana/deployment.yaml
  5. deployment.apps/grafana created
  6. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/grafana/service.yaml
  7. service/grafana created
  8. [root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/grafana/ingress.yaml
  9. ingress.extensions/grafana created
  1. [root@vms22 ~]# kubectl get pod -l name=grafana -o wide -n infra
  2. NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
  3. grafana-7677b5db6b-hrf87 1/1 Running 0 38m 172.26.22.8 vms22.cos.com <none> <none>
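
If the pod is Running, Grafana should already be listening on port 3000; its startup log is a quick sanity check (pod name taken from the output above):

  1. [root@vms22 ~]# kubectl -n infra logs grafana-7677b5db6b-hrf87 --tail=5
  2. # look for the line reporting the HTTP server listening on port 3000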

Add the DNS record

vms11

  1. [root@vms11 ~]# vi /var/named/op.com.zone
  1. ...
  2. grafana A 192.168.26.10

Remember to bump the zone's serial number

  1. [root@vms11 ~]# systemctl restart named
  2. [root@vms11 ~]# dig -t A grafana.op.com @192.168.26.11 +short
  3. 192.168.26.10
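
Before opening a browser, a quick curl from any host that resolves the new record confirms traefik is routing to the service (an unauthenticated request to Grafana is normally redirected to /login):

  1. [root@vms200 ~]# curl -sI http://grafana.op.com | head -3
  2. # expect an HTTP 302 with Location: /login from Grafana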

Access from a browser

http://grafana.op.com (username: admin, password: admin)
image.png

After logging in you are asked to change the admin password (set to admin123 here), which takes you to:
image.png

Configure Grafana

Appearance

Configuration > Preferences

  • UI Theme: Light
  • Home Dashboard: Default
  • Timezone: Browser time

Click Save.
image.png

Plugins

Configuration > Plugins lists the installed plugins; the following 5 plugins need to be added:

  • grafana-kubernetes-app
  • grafana-clock-panel
  • grafana-piechart-panel
  • briangann-gauge-panel
  • natel-discrete-panel

Plugins can be installed in two ways (packaged plugins are also available at: https://github.com/swbook/k8s-grafana-plugins)

  • Method one: exec into the container and run grafana-cli plugins install $plugin_name
  • Method two: download and unpack the plugin zip manually
    1. Look up the plugin version $version at https://grafana.com/api/plugins/repo/$plugin_name (see the sketch after this list)
    2. Download the zip: wget https://grafana.com/api/plugins/$plugin_name/versions/$version/download
    3. Unzip it into /var/lib/grafana/plugins
  • Whichever method is used, restart the Grafana pod after installing the plugins.
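
For step 1 of method two, the version number can be pulled from the same grafana.com API that grafana-cli uses. A sketch with the clock panel (the grep only extracts the version fields; adjust it if the JSON layout differs):

  1. [root@vms200 ~]# curl -s https://grafana.com/api/plugins/repo/grafana-clock-panel | grep -o '"version":"[^"]*"' | head -3
  2. # pick one of the listed versions as $version, then wget .../versions/$version/download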

Once Grafana is confirmed up, exec into the Grafana container and install the plugins (method one).

On any worker node, here vms22

  1. [root@vms22 ~]# kubectl -n infra exec -it grafana-7677b5db6b-hrf87 -- /bin/bash
  2. bash-5.0# grafana-cli plugins install grafana-kubernetes-app
  3. installing grafana-kubernetes-app @ 1.0.1
  4. from: https://grafana.com/api/plugins/grafana-kubernetes-app/versions/1.0.1/download
  5. into: /var/lib/grafana/plugins
  6. Installed grafana-kubernetes-app successfully
  7. Restart grafana after installing plugins . <service grafana-server restart>
  8. bash-5.0# grafana-cli plugins install grafana-clock-panel
  9. installing grafana-clock-panel @ 1.1.1
  10. from: https://grafana.com/api/plugins/grafana-clock-panel/versions/1.1.1/download
  11. into: /var/lib/grafana/plugins
  12. Installed grafana-clock-panel successfully
  13. Restart grafana after installing plugins . <service grafana-server restart>
  14. bash-5.0# grafana-cli plugins install grafana-piechart-panel
  15. installing grafana-piechart-panel @ 1.6.0
  16. from: https://grafana.com/api/plugins/grafana-piechart-panel/versions/1.6.0/download
  17. into: /var/lib/grafana/plugins
  18. Installed grafana-piechart-panel successfully
  19. Restart grafana after installing plugins . <service grafana-server restart>
  20. bash-5.0# grafana-cli plugins install briangann-gauge-panel
  21. installing briangann-gauge-panel @ 0.0.6
  22. from: https://grafana.com/api/plugins/briangann-gauge-panel/versions/0.0.6/download
  23. into: /var/lib/grafana/plugins
  24. Installed briangann-gauge-panel successfully
  25. Restart grafana after installing plugins . <service grafana-server restart>
  26. bash-5.0# grafana-cli plugins install natel-discrete-panel
  27. installing natel-discrete-panel @ 0.1.0
  28. from: https://grafana.com/api/plugins/natel-discrete-panel/versions/0.1.0/download
  29. into: /var/lib/grafana/plugins
  30. Installed natel-discrete-panel successfully
  31. Restart grafana after installing plugins . <service grafana-server restart>

After installation, check on vms200:

  1. [root@vms200 grafana]# cd /data/nfs-volume/grafana/plugins
  2. [root@vms200 plugins]# ll
  3. total 0
  4. drwxr-xr-x 4 root root 253 Sep 12 20:39 briangann-gauge-panel
  5. drwxr-xr-x 5 root root 253 Sep 12 20:34 grafana-clock-panel
  6. drwxr-xr-x 4 root root 198 Sep 12 20:31 grafana-kubernetes-app
  7. drwxr-xr-x 4 root root 233 Sep 12 20:35 grafana-piechart-panel
  8. drwxr-xr-x 5 root root 216 Sep 12 20:41 natel-discrete-panel

Installation method two (the plugins were already installed with method one; this is only shown as an example)

vms200

  1. [root@vms200 grafana]# cd /data/nfs-volume/grafana/plugins
  • Kubernetes App

Download URL: https://grafana.com/api/plugins/grafana-kubernetes-app/versions/1.0.1/download

  1. [root@vms200 plugins]# wget https://grafana.com/api/plugins/grafana-kubernetes-app/versions/1.0.1/download -O grafana-kubernetes-app.zip
  2. ...
  3. [root@vms200 plugins]# unzip grafana-kubernetes-app.zip
  • Clock Panel

Download URL: https://grafana.com/api/plugins/grafana-clock-panel/versions/1.1.1/download

  • Pie Chart

Download URL: https://grafana.com/api/plugins/grafana-piechart-panel/versions/1.6.0/download

  • D3 Gauge

Download URL: https://grafana.com/api/plugins/briangann-gauge-panel/versions/0.0.6/download

  • Discrete

Download URL: https://grafana.com/api/plugins/natel-discrete-panel/versions/0.1.0/download

After the plugins are installed, restart the grafana pod.

Configuration > Plugins: select Kubernetes from the plugin list

image.png

Click Enable when it appears

image.png

A Kubernetes icon appears in the left-hand menu

image.png

Once the Kubernetes plugin is installed, it provides 4 dashboards

image.png

Configure the Grafana data source

Configuration > Data Sources, choose Prometheus

  • HTTP

| key | value |
| :--- | :--- |
| URL | http://prometheus.op.com |
| Access | Server(Default) |
| HTTP Method | GET |
| TLS Client Auth | checked |
| With CA Cert | checked |
| CA Cert | paste in the contents of vms200:/opt/certs/ca.pem |
| Client Cert | paste in the contents of vms200:/opt/certs/client.pem |
| Client Key | paste in the contents of vms200:/opt/certs/client-key.pem |

  • Save & Test (click it a few times if needed)

image.png

Configure the Kubernetes cluster dashboards

kubernetes > +New Cluster

  • Add a new cluster

| key | value |
| :--- | :--- |
| Name | myk8s |

  • Prometheus Read

| key | value |
| :--- | :--- |
| Datasource | Prometheus |

After choosing the Datasource, fill in the remaining options.

  • HTTP

| key | value |
| :--- | :--- |
| URL | https://192.168.26.10:8443 (the api-server VIP) |
| Access | Server(Default) (make sure to select this) |

  • Auth

| key | value |
| :--- | :--- |
| TLS Client Auth | checked |
| With CA Cert | checked |

Paste the contents of ca.pem, client.pem and client-key.pem into the CA Cert, Client Cert and Client Key boxes respectively.

  • Save

image.png

Once added, go to Configuration > Data Sources
image.png

Click on myk8s, and at the bottom of the page click Save & Test. Ignore any HTTP Error Forbidden (likely caused by the HTTPS request); click Save & Test a few more times and Grafana starts pulling data.

Click Kubernetes and the following appears (if it does not, fall back to the lower Grafana version 5.4.2):
image.png

Note:

  • In K8S Container, replace pod_name with container_label_io_kubernetes_pod_name in every panel

image.png

Open the Total Memory Usage panel's dropdown and choose Edit
image.png

Replace pod_name with container_label_io_kubernetes_pod_name and the graph renders:
image.png

Do the same in all panels of the other dashboards, replacing pod_name with container_label_io_kubernetes_pod_name:

  • K8S-Cluster
  • K8S-Node
  • K8S-Deployments

Tune these 4 dashboards to your needs.
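
To see why the rename is needed, you can confirm that the cadvisor series carry the container_label_io_kubernetes_pod_name label rather than pod_name. A sketch using container_memory_usage_bytes as an example metric:

  1. [root@vms200 ~]# curl -s -G http://prometheus.op.com/api/v1/query \
  2. --data-urlencode 'query=topk(5, sum(container_memory_usage_bytes{container_label_io_kubernetes_pod_name!=""}) by (container_label_io_kubernetes_pod_name))'
  3. # the result labels show container_label_io_kubernetes_pod_name, which is what the panels must group by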

Configure custom dashboards

Based on the data available in the Prometheus data source, set up the following dashboards:

  • etcd dashboard
  • traefik dashboard
  • generic dashboard
  • JMX dashboard
  • blackbox dashboard

The JSON definitions for these dashboards can be downloaded from: https://github.com/swbook/k8s-GrafanaDashboard

After downloading the JSON files, import them (Import dashboard from file or Grafana.com).
image.png

Examples:

  • JMX dashboard

image.png

  • blackbox dashboard

image.png

8 Deploying the Alertmanager alerting component

On the ops host vms200:

Prepare the image

  1. [root@vms200 ~]# docker pull docker.io/prom/alertmanager:v0.14.0
  2. [root@vms200 ~]# docker tag prom/alertmanager:v0.14.0 harbor.op.com/infra/alertmanager:v0.14.0
  3. [root@vms200 ~]# docker push harbor.op.com/infra/alertmanager:v0.14.0
  4. [root@vms200 ~]# docker pull prom/alertmanager:v0.21.0
  5. [root@vms200 ~]# docker tag prom/alertmanager:v0.21.0 harbor.op.com/infra/alertmanager:v0.21.0
  6. [root@vms200 ~]# docker push harbor.op.com/infra/alertmanager:v0.21.0

Prepare the resource manifests

  1. [root@vms200 ~]# mkdir /data/k8s-yaml/alertmanager && cd /data/k8s-yaml/alertmanager
  • configmap.yaml
  1. apiVersion: v1
  2. kind: ConfigMap
  3. metadata:
  4. name: alertmanager-config
  5. namespace: infra
  6. data:
  7. config.yml: |-
  8. global:
  9. # how long to wait with no further firing before declaring an alert resolved
  10. resolve_timeout: 5m
  11. # email (SMTP) delivery settings
  12. smtp_smarthost: 'smtp.qq.com:25'
  13. smtp_from: '385314590@qq.com'
  14. smtp_auth_username: '385314590@qq.com'
  15. smtp_auth_password: 'XXXX'
  16. smtp_require_tls: false
  17. templates:
  18. - '/etc/alertmanager/*.tmpl'
  19. # root route that every incoming alert enters; defines how alerts are dispatched
  20. route:
  21. # labels used to regroup incoming alerts; e.g. alerts that share cluster=A and alertname=LatencyHigh are batched into one group
  22. group_by: ['alertname', 'cluster']
  23. # after a new alert group is created, wait at least group_wait before the first notification, so multiple alerts for the same group can be sent together
  24. group_wait: 30s
  25. # after the first notification, wait group_interval before notifying about new alerts added to the group
  26. group_interval: 5m
  27. # if an alert has already been sent successfully, wait repeat_interval before sending it again
  28. repeat_interval: 5m
  29. # default receiver: alerts not matched by any route are sent here
  30. receiver: default
  31. receivers:
  32. - name: 'default'
  33. email_configs:
  34. - to: 'k8s_cloud@126.com'
  35. send_resolved: true

Be sure to change this to your own mailbox! There are also configurations available online for sending Chinese-language alert emails:

  1. ...
  2. receivers:
  3. - name: 'default'
  4. email_configs:
  5. - to: 'xxxx@qq.com'
  6. send_resolved: true
  7. html: '{{ template "email.to.html" . }}'
  8. headers: { Subject: " {{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }
  9. email.tmpl: |
  10. {{ define "email.to.html" }}
  11. {{- if gt (len .Alerts.Firing) 0 -}}
  12. {{ range .Alerts }}
  13. 告警程序: prometheus_alert <br>
  14. 告警级别: {{ .Labels.severity }} <br>
  15. 告警类型: {{ .Labels.alertname }} <br>
  16. 故障主机: {{ .Labels.instance }} <br>
  17. 告警主题: {{ .Annotations.summary }} <br>
  18. 触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
  19. {{ end }}{{ end -}}
  20. {{- if gt (len .Alerts.Resolved) 0 -}}
  21. {{ range .Alerts }}
  22. 告警程序: prometheus_alert <br>
  23. 告警级别: {{ .Labels.severity }} <br>
  24. 告警类型: {{ .Labels.alertname }} <br>
  25. 故障主机: {{ .Labels.instance }} <br>
  26. 告警主题: {{ .Annotations.summary }} <br>
  27. 触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
  28. 恢复时间: {{ .EndsAt.Format "2006-01-02 15:04:05" }} <br>
  29. {{ end }}{{ end -}}
  30. {{- end }}
  • deployment.yaml
  1. apiVersion: apps/v1
  2. kind: Deployment
  3. metadata:
  4. name: alertmanager
  5. namespace: infra
  6. spec:
  7. replicas: 1
  8. selector:
  9. matchLabels:
  10. app: alertmanager
  11. template:
  12. metadata:
  13. labels:
  14. app: alertmanager
  15. spec:
  16. containers:
  17. - name: alertmanager
  18. image: harbor.op.com/infra/alertmanager:v0.21.0
  19. args:
  20. - "--config.file=/etc/alertmanager/config.yml"
  21. - "--storage.path=/alertmanager"
  22. ports:
  23. - name: alertmanager
  24. containerPort: 9093
  25. volumeMounts:
  26. - name: alertmanager-cm
  27. mountPath: /etc/alertmanager
  28. volumes:
  29. - name: alertmanager-cm
  30. configMap:
  31. name: alertmanager-config
  32. imagePullSecrets:
  33. - name: harbor
  • service.yaml
  1. apiVersion: v1
  2. kind: Service
  3. metadata:
  4. name: alertmanager
  5. namespace: infra
  6. spec:
  7. selector:
  8. app: alertmanager
  9. ports:
  10. - port: 80
  11. targetPort: 9093

Prometheus reaches Alertmanager by service name; it does not go through the ingress.

Apply the resource manifests

On vms21 or vms22:

  1. [root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/alertmanager/configmap.yaml
  2. configmap/alertmanager-config created
  3. [root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/alertmanager/deployment.yaml
  4. deployment.apps/alertmanager created
  5. [root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/alertmanager/service.yaml
  6. service/alertmanager created
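
Before wiring Alertmanager into Prometheus, the mounted configuration can be syntax-checked with amtool, which ships inside the alertmanager image. A sketch (use whatever pod name kubectl shows in your cluster):

  1. [root@vms21 ~]# kubectl -n infra get pod -l app=alertmanager
  2. [root@vms21 ~]# kubectl -n infra exec -it <alertmanager-pod> -- amtool check-config /etc/alertmanager/config.yml
  3. # reports whether config.yml parses cleanly and lists the receivers and templates it found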

Add alerting rules

  • rules.yml
  1. [root@vms200 ~]# vi /data/nfs-volume/prometheus/etc/rules.yml
  1. groups:
  2. - name: hostStatsAlert
  3. rules:
  4. - alert: hostCpuUsageAlert
  5. expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
  6. for: 5m
  7. labels:
  8. severity: warning
  9. annotations:
  10. summary: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }}%)"
  11. - alert: hostMemUsageAlert
  12. expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
  13. for: 5m
  14. labels:
  15. severity: warning
  16. annotations:
  17. summary: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }}%)"
  18. - alert: OutOfInodes
  19. expr: node_filesystem_free{fstype="overlay",mountpoint ="/"} / node_filesystem_size{fstype="overlay",mountpoint ="/"} * 100 < 10
  20. for: 5m
  21. labels:
  22. severity: warning
  23. annotations:
  24. summary: "Out of inodes (instance {{ $labels.instance }})"
  25. description: "Disk is almost running out of available inodes (< 10% left) (current value: {{ $value }})"
  26. - alert: OutOfDiskSpace
  27. expr: node_filesystem_free{fstype="overlay",mountpoint ="/rootfs"} / node_filesystem_size{fstype="overlay",mountpoint ="/rootfs"} * 100 < 10
  28. for: 5m
  29. labels:
  30. severity: warning
  31. annotations:
  32. summary: "Out of disk space (instance {{ $labels.instance }})"
  33. description: "Disk is almost full (< 10% left) (current value: {{ $value }})"
  34. - alert: UnusualNetworkThroughputIn
  35. expr: sum by (instance) (irate(node_network_receive_bytes[2m])) / 1024 / 1024 > 100
  36. for: 5m
  37. labels:
  38. severity: warning
  39. annotations:
  40. summary: "Unusual network throughput in (instance {{ $labels.instance }})"
  41. description: "Host network interfaces are probably receiving too much data (> 100 MB/s) (current value: {{ $value }})"
  42. - alert: UnusualNetworkThroughputOut
  43. expr: sum by (instance) (irate(node_network_transmit_bytes[2m])) / 1024 / 1024 > 100
  44. for: 5m
  45. labels:
  46. severity: warning
  47. annotations:
  48. summary: "Unusual network throughput out (instance {{ $labels.instance }})"
  49. description: "Host network interfaces are probably sending too much data (> 100 MB/s) (current value: {{ $value }})"
  50. - alert: UnusualDiskReadRate
  51. expr: sum by (instance) (irate(node_disk_bytes_read[2m])) / 1024 / 1024 > 50
  52. for: 5m
  53. labels:
  54. severity: warning
  55. annotations:
  56. summary: "Unusual disk read rate (instance {{ $labels.instance }})"
  57. description: "Disk is probably reading too much data (> 50 MB/s) (current value: {{ $value }})"
  58. - alert: UnusualDiskWriteRate
  59. expr: sum by (instance) (irate(node_disk_bytes_written[2m])) / 1024 / 1024 > 50
  60. for: 5m
  61. labels:
  62. severity: warning
  63. annotations:
  64. summary: "Unusual disk write rate (instance {{ $labels.instance }})"
  65. description: "Disk is probably writing too much data (> 50 MB/s) (current value: {{ $value }})"
  66. - alert: UnusualDiskReadLatency
  67. expr: rate(node_disk_read_time_ms[1m]) / rate(node_disk_reads_completed[1m]) > 100
  68. for: 5m
  69. labels:
  70. severity: warning
  71. annotations:
  72. summary: "Unusual disk read latency (instance {{ $labels.instance }})"
  73. description: "Disk latency is growing (read operations > 100ms) (current value: {{ $value }})"
  74. - alert: UnusualDiskWriteLatency
  75. expr: rate(node_disk_write_time_ms[1m]) / rate(node_disk_writes_completed[1m]) > 100
  76. for: 5m
  77. labels:
  78. severity: warning
  79. annotations:
  80. summary: "Unusual disk write latency (instance {{ $labels.instance }})"
  81. description: "Disk latency is growing (write operations > 100ms) (current value: {{ $value }})"
  82. - name: http_status
  83. rules:
  84. - alert: ProbeFailed
  85. expr: probe_success == 0
  86. for: 1m
  87. labels:
  88. severity: error
  89. annotations:
  90. summary: "Probe failed (instance {{ $labels.instance }})"
  91. description: "Probe failed (current value: {{ $value }})"
  92. - alert: StatusCode
  93. expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
  94. for: 1m
  95. labels:
  96. severity: error
  97. annotations:
  98. summary: "Status Code (instance {{ $labels.instance }})"
  99. description: "HTTP status code is not 200-399 (current value: {{ $value }})"
  100. - alert: SslCertificateWillExpireSoon
  101. expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
  102. for: 5m
  103. labels:
  104. severity: warning
  105. annotations:
  106. summary: "SSL certificate will expire soon (instance {{ $labels.instance }})"
  107. description: "SSL certificate expires in 30 days (current value: {{ $value }})"
  108. - alert: SslCertificateHasExpired
  109. expr: probe_ssl_earliest_cert_expiry - time() <= 0
  110. for: 5m
  111. labels:
  112. severity: error
  113. annotations:
  114. summary: "SSL certificate has expired (instance {{ $labels.instance }})"
  115. description: "SSL certificate has expired already (current value: {{ $value }})"
  116. - alert: BlackboxSlowPing
  117. expr: probe_icmp_duration_seconds > 2
  118. for: 5m
  119. labels:
  120. severity: warning
  121. annotations:
  122. summary: "Blackbox slow ping (instance {{ $labels.instance }})"
  123. description: "Blackbox ping took more than 2s (current value: {{ $value }})"
  124. - alert: BlackboxSlowRequests
  125. expr: probe_http_duration_seconds > 2
  126. for: 5m
  127. labels:
  128. severity: warning
  129. annotations:
  130. summary: "Blackbox slow requests (instance {{ $labels.instance }})"
  131. description: "Blackbox request took more than 2s (current value: {{ $value }})"
  132. - alert: PodCpuUsagePercent
  133. expr: sum(sum(label_replace(irate(container_cpu_usage_seconds_total[1m]),"pod","$1","container_label_io_kubernetes_pod_name", "(.*)"))by(pod) / on(pod) group_right kube_pod_container_resource_limits_cpu_cores *100 )by(container,namespace,node,pod,severity) > 80
  134. for: 5m
  135. labels:
  136. severity: warning
  137. annotations:
  138. summary: "Pod cpu usage percent has exceeded 80% (current value: {{ $value }}%)"
  • Append the following to the end of the Prometheus config file to hook in Alertmanager and the rule file
  1. [root@vms200 ~]# vi /data/nfs-volume/prometheus/etc/prometheus.yml
  1. ...
  2. alerting:
  3. alertmanagers:
  4. - static_configs:
  5. - targets: ["alertmanager"]
  6. rule_files:
  7. - "/data/etc/rules.yml"
  • Reload the configuration: you could simply restart the Prometheus pod, but in production Prometheus is so heavyweight that deleting it can easily drag down the cluster, so use a graceful reload instead (which Prometheus supports). There are three ways:

From any host:

  1. [root@vms200 ~]# curl -X POST http://prometheus.op.com/-/reload

Or (from any worker node):

  1. [root@vms21 ~]# kubectl get pod -n infra | grep prom
  2. prometheus-76fc88fbcc-bqznx 1/1 Running 0 5h59m
  3. [root@vms21 ~]# kubectl exec -n infra prometheus-76fc88fbcc-bqznx -it -- kill -HUP 1

Or (Prometheus runs on vms21):

  1. [root@vms21 ~]# ps aux|grep prometheus | grep -v grep
  2. root 192560 26.2 10.0 1801172 401368 ? Ssl 06:19 0:47 /bin/prometheus --config.file=/data/etc/prometheus.yml --storage.tsdb.path=/data/prom-db --storage.tsdb.retention=72h --storage.tsdb.min-block-duration=10m --web.enable-lifecycle
  3. [root@vms21 ~]# kill -SIGHUP 192560
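
Whichever reload method is used, it is worth validating the rule file first with promtool, which is bundled in the prometheus image (pod name as shown above):

  1. [root@vms21 ~]# kubectl -n infra exec -it prometheus-76fc88fbcc-bqznx -- promtool check rules /data/etc/rules.yml
  2. # prints the number of rules found per group, or the exact parse error and line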

image.png

  1. /data/etc/rules.yml > hostStatsAlert
  2. OutOfDiskSpace (0 active)
  3. OutOfInodes (0 active)
  4. UnusualDiskReadLatency (0 active)
  5. UnusualDiskReadRate (0 active)
  6. UnusualDiskWriteLatency (0 active)
  7. UnusualDiskWriteRate (0 active)
  8. UnusualNetworkThroughputIn (0 active)
  9. UnusualNetworkThroughputOut (0 active)
  10. hostCpuUsageAlert (0 active)
  11. hostMemUsageAlert (0 active)
  12. /data/etc/rules.yml > http_status
  13. BlackboxSlowPing (0 active)
  14. BlackboxSlowRequests (0 active)
  15. PodCpuUsagePercent (0 active)
  16. ProbeFailed (0 active)
  17. SslCertificateHasExpired (0 active)
  18. SslCertificateWillExpireSoon (0 active)
  19. StatusCode (0 active)

Testing the alerts

After the dubbo-demo-service pod is stopped, blackbox's HTTP probe fails and the alert is triggered:
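
A simple way to trigger the test is to scale the provider Deployment down and back up (a sketch, assuming the test namespace used earlier); the blackbox probes start failing within one scrape interval:

  1. [root@vms21 ~]# kubectl -n test scale deployment dubbo-demo-service --replicas=0
  2. # restore it afterwards with --replicas=1 and the alert resolves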

image.png

image.png

Watch closely: the alerts first go Pending (2) and turn yellow in the alert list, then go Firing (2)
image.png

Once the entries in the alert list turn red, the alert email is sent.
image.png

  • Check the mailbox

image.png

To customize alert rules and alert content, study PromQL and adjust the configuration files accordingly.

With that, Prometheus monitoring and alerting have been successfully delivered!