k8s Monitoring in Practice: Deploying Prometheus

Contents

  • k8s Monitoring in Practice: Deploying Prometheus

    • 1 Prometheus background
      • 1.1 Features of Prometheus
      • 1.2 Basic principles
        • 1.2.1 How it works
        • 1.2.2 Architecture diagram
        • 1.2.3 The three main components
        • 1.2.4 How the architecture operates
        • 1.2.5 Commonly used exporters
    • 2 Deploy the four exporters
      • 2.1 Deploy kube-state-metrics
        • 2.1.1 Prepare the docker image
        • 2.1.2 Prepare the rbac manifest
        • 2.1.3 Prepare the dp manifest
        • 2.1.4 Apply the manifests
      • 2.2 Deploy node-exporter
        • 2.2.1 Prepare the docker image
        • 2.2.2 Prepare the ds manifest
        • 2.2.3 Apply the manifest
      • 2.3 Deploy cadvisor
        • 2.3.1 Prepare the docker image
        • 2.3.2 Prepare the ds manifest
        • 2.3.3 Apply the manifest
      • 2.4 Deploy blackbox-exporter
        • 2.4.1 Prepare the docker image
        • 2.4.2 Prepare the cm manifest
        • 2.4.3 Prepare the dp manifest
        • 2.4.4 Prepare the svc manifest
        • 2.4.5 Prepare the ingress manifest
        • 2.4.6 Add DNS resolution
        • 2.4.7 Apply the manifests
        • 2.4.8 Test access via the domain
    • 3 Deploy prometheus server
      • 3.1 Prepare the prometheus server environment
        • 3.1.1 Prepare the docker image
        • 3.1.2 Prepare the rbac manifest
        • 3.1.3 Prepare the dp manifest
        • 3.1.4 Prepare the svc manifest
        • 3.1.5 Prepare the ingress manifest
        • 3.1.6 Add DNS resolution
      • 3.2 Deploy prometheus server
        • 3.2.1 Prepare the directory and certificates
        • 3.2.2 Create the prometheus configuration file
        • 3.2.3 Apply the manifests
        • 3.2.4 Verify in the browser
    • 4 Make services automatically monitored by prometheus
      • 4.1 Make traefik automatically monitored
        • 4.1.1 Modify traefik's yaml
        • 4.1.2 Apply the config and check
      • 4.2 Use blackbox to check TCP/HTTP service status
        • 4.2.1 Prepare the services to be checked
        • 4.2.2 Add the tcp annotation
        • 4.2.3 Add the http annotation
      • 4.3 Add JVM metrics monitoring

1 Prometheus background

Because of the nature of docker containers, traditional Zabbix cannot monitor the state of docker workloads inside a k8s cluster, so Prometheus is used for this monitoring instead.
Prometheus official site: https://prometheus.io/

1.1 Features of Prometheus

  • A multi-dimensional data model, backed by a time-series database (TSDB) rather than MySQL.
  • A flexible query language, PromQL.
  • No reliance on distributed storage; each server node is autonomous.
  • Time-series data is collected mainly via an HTTP-based pull model.
  • Data can also be pushed to a pushgateway, which Prometheus then scrapes.
  • Targets are discovered via service discovery or static configuration.
  • Many kinds of dashboards and UIs are supported, such as Grafana.

1.2 Basic principles

1.2.1 How it works

Prometheus works by periodically scraping the state of monitored components over the HTTP interfaces exposed by various exporters.
Any component can be brought under monitoring simply by exposing a compatible HTTP endpoint.
No SDK or other integration work is required, which makes Prometheus well suited to monitoring virtualized environments such as VMs, Docker, and Kubernetes.
Most components commonly used at internet companies already have ready-made exporters, e.g. Nginx, MySQL, and Linux system metrics.
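
To make the pull model concrete, a "scrape" is nothing more than an HTTP GET of a /metrics endpoint. The sketch below assumes an exporter (for example the node-exporter deployed in section 2.2) is reachable on a node at port 9100; the IP is only an illustration.

```bash
# A scrape is just an HTTP GET of /metrics.
# Assumes some exporter is listening on 10.4.7.21:9100 (illustrative address).
curl -s http://10.4.7.21:9100/metrics | head -n 5
# Output is plain text: # HELP / # TYPE comments followed by metric samples.
```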

1.2.2 Architecture diagram

[Figure 1: Prometheus architecture diagram]

1.2.3 The three main components

  • Server: responsible for collecting and storing data, and provides the PromQL query language.
  • Alertmanager: the alert manager, used to send alerts.
  • Push Gateway: an intermediate gateway that short-lived jobs can push their metrics to.

1.2.4 How the architecture operates

  1. The Prometheus daemon periodically scrapes metrics from its targets.
     Each scrape target must expose an HTTP endpoint for Prometheus to pull from.
     Targets can be specified via configuration files, text files, Zookeeper, DNS SRV lookup, and other mechanisms.
  2. The PushGateway lets clients push metrics to it actively,
     while Prometheus simply scrapes the gateway on its regular schedule.
     This suits one-off, short-lived jobs (see the sketch after this list).
  3. Prometheus stores all scraped data in its TSDB,
     cleans and aggregates it according to configured rules, and stores the results as new time series.
  4. Prometheus visualizes the collected data through PromQL and other APIs.
     Grafana, Promdash, and similar tools can be used for dashboards.
     Prometheus also exposes an HTTP API for custom queries and output.
  5. Alertmanager is an alerting component independent of Prometheus;
     it consumes Prometheus query expressions and provides very flexible alerting options.
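
As a sketch of point 2: a short-lived job pushes its metrics to a Pushgateway with a plain HTTP POST, and Prometheus later scrapes the gateway. No Pushgateway is deployed in this article, so the host and port below are placeholders.

```bash
# Push a single metric for job "demo_job" to a Pushgateway (placeholder address).
echo "demo_job_last_success_timestamp $(date +%s)" | \
  curl --data-binary @- http://pushgateway.zq.com:9091/metrics/job/demo_job
```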

1.2.5 Commonly used exporters

Unlike Zabbix, Prometheus has no agent; it relies on per-service exporters instead.
To monitor a k8s cluster along with its nodes and pods, four exporters are commonly used:
  • kube-state-metrics
    collects basic state information about the k8s cluster, such as master & etcd status
  • node-exporter
    collects k8s node information
  • cadvisor
    collects resource usage of the docker containers in the k8s cluster
  • blackbox-exporter
    checks whether containerized services in the k8s cluster are alive

2 Deploy the four exporters

The usual routine: pull the docker image, prepare the resource manifests, then apply them.

2.1 Deploy kube-state-metrics

2.1.1 Prepare the docker image

```bash
docker pull quay.io/coreos/kube-state-metrics:v1.5.0
docker tag 91599517197a harbor.zq.com/public/kube-state-metrics:v1.5.0
docker push harbor.zq.com/public/kube-state-metrics:v1.5.0
```

Prepare the directory:

```bash
mkdir /data/k8s-yaml/kube-state-metrics
cd /data/k8s-yaml/kube-state-metrics
```

2.1.2 Prepare the rbac manifest

```bash
cat >rbac.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
EOF
```
2.1.3 Prepare the dp manifest

```bash
cat >dp.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
  labels:
    grafanak8sapp: "true"
    app: kube-state-metrics
  name: kube-state-metrics
  namespace: kube-system
spec:
  selector:
    matchLabels:
      grafanak8sapp: "true"
      app: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        grafanak8sapp: "true"
        app: kube-state-metrics
    spec:
      containers:
      - name: kube-state-metrics
        image: harbor.zq.com/public/kube-state-metrics:v1.5.0
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
      serviceAccountName: kube-state-metrics
EOF
```
2.1.4 Apply the manifests

Run on any node:

```bash
kubectl apply -f http://k8s-yaml.zq.com/kube-state-metrics/rbac.yaml
kubectl apply -f http://k8s-yaml.zq.com/kube-state-metrics/dp.yaml
```

Possible error: no matches for kind "Deployment" in version "extensions/v1beta1"

```bash
kubectl create -f rbac.yaml
kubectl create -f dp.yaml
error: unable to recognize "dp.yaml": no matches for kind "Deployment" in version "extensions/v1beta1"
```

Fix: edit the yaml file and change

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
```

to

```yaml
apiVersion: apps/v1
kind: Deployment
```

This is caused by the API version change. My k8s version is 1.18.5, and in that version Deployment is no longer served from extensions/v1beta1:

DaemonSet, Deployment, StatefulSet, and ReplicaSet resources will no longer be served from extensions/v1beta1, apps/v1beta1, or apps/v1beta2 by default in v1.16.
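
If several manifests carry the retired apiVersion, a rough batch fix could look like the sketch below (GNU sed assumed); review the result before applying, since apps/v1 Deployments also require an explicit spec.selector.

```bash
# Sketch: rewrite the retired apiVersion in place, keeping a .bak backup.
sed -i.bak 's#^apiVersion: extensions/v1beta1$#apiVersion: apps/v1#' dp.yaml
grep -n 'apiVersion' dp.yaml
```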

Reference notes: https://www.yuque.com/r/note/8407f01c-eaa3-40e6-a1f0-0272130aded6

Verify:

```bash
kubectl get pod -n kube-system -o wide | grep kube-state-metrics
curl http://172.7.21.4:8080/healthz
ok
```

A response of "ok" means kube-state-metrics is running successfully.

2.2 Deploy node-exporter

Since node-exporter monitors nodes and one instance must run on every node, a ds (DaemonSet) controller is used.

2.2.1 Prepare the docker image

```bash
docker pull prom/node-exporter:v0.15.0
docker tag 12d51ffa2b22 harbor.zq.com/public/node-exporter:v0.15.0
docker push harbor.zq.com/public/node-exporter:v0.15.0
```

Prepare the directory:

```bash
mkdir /data/k8s-yaml/node-exporter
cd /data/k8s-yaml/node-exporter
```

2.2.2 Prepare the ds manifest

```bash
cat >ds.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    daemon: "node-exporter"
    grafanak8sapp: "true"
spec:
  selector:
    matchLabels:
      daemon: "node-exporter"
      grafanak8sapp: "true"
  template:
    metadata:
      name: node-exporter
      labels:
        daemon: "node-exporter"
        grafanak8sapp: "true"
    spec:
      volumes:
      - name: proc
        hostPath:
          path: /proc
          type: ""
      - name: sys
        hostPath:
          path: /sys
          type: ""
      containers:
      - name: node-exporter
        image: harbor.zq.com/public/node-exporter:v0.15.0
        imagePullPolicy: IfNotPresent
        args:
        - --path.procfs=/host_proc
        - --path.sysfs=/host_sys
        ports:
        - name: node-exporter
          hostPort: 9100
          containerPort: 9100
          protocol: TCP
        volumeMounts:
        - name: sys
          readOnly: true
          mountPath: /host_sys
        - name: proc
          readOnly: true
          mountPath: /host_proc
      hostNetwork: true
EOF
```

> The main purpose is to mount the host's /proc and /sys directories into the container so that the container can read the node's host-level information.

2.2.3 Apply the manifest

On any node:

```bash
kubectl apply -f http://k8s-yaml.zq.com/node-exporter/ds.yaml
kubectl get pod -n kube-system -o wide | grep node-exporter
```

Possible error: no matches for kind "Deployment" in version "extensions/v1beta1"

```bash
kubectl create -f ds.yaml
error: unable to recognize "ds.yaml": no matches for kind "Deployment" in version "extensions/v1beta1"
```

Fix: edit the yaml file and change

```yaml
---
apiVersion: extensions/v1beta1
kind: Deployment
```

to

```yaml
---
apiVersion: apps/v1
kind: Deployment
```

As before, this is caused by the API version change; my k8s version is 1.18.5, where Deployment is no longer served from extensions/v1beta1 (DaemonSet, Deployment, StatefulSet, and ReplicaSet resources will no longer be served from extensions/v1beta1, apps/v1beta1, or apps/v1beta2 by default in v1.16).

Reference notes: https://www.yuque.com/r/note/8407f01c-eaa3-40e6-a1f0-0272130aded6
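
A quick spot check that node-exporter is answering (illustrative; it runs with hostNetwork, so substitute the IP of one of your nodes):

```bash
# Illustrative check against one node; node-exporter listens on the node IP at port 9100.
curl -s http://10.4.7.21:9100/metrics | grep -m 3 '^node_'
```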

2.3 Deploy cadvisor

2.3.1 Prepare the docker image

```bash
docker pull google/cadvisor:v0.28.3
docker tag 75f88e3ec333 harbor.zq.com/public/cadvisor:v0.28.3
docker push harbor.zq.com/public/cadvisor:v0.28.3
```

Prepare the directory:

```bash
mkdir /data/k8s-yaml/cadvisor
cd /data/k8s-yaml/cadvisor
```

2.3.2 Prepare the ds manifest

Because cadvisor needs to collect pod information from every node, it is also run as a daemonset.

```bash
cat >ds.yaml <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: kube-system
  labels:
    app: cadvisor
spec:
  selector:
    matchLabels:
      name: cadvisor
  template:
    metadata:
      labels:
        name: cadvisor
    spec:
      hostNetwork: true
      #------ pod tolerations work together with node taints to control scheduling ----
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      #-------------------------------------
      containers:
      - name: cadvisor
        image: harbor.zq.com/public/cadvisor:v0.28.3
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
        - name: var-run
          mountPath: /var/run
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: docker
          mountPath: /var/lib/docker
          readOnly: true
        ports:
        - name: http
          containerPort: 4194
          protocol: TCP
        readinessProbe:
          tcpSocket:
            port: 4194
          initialDelaySeconds: 5
          periodSeconds: 10
        args:
        - --housekeeping_interval=10s
        - --port=4194
      terminationGracePeriodSeconds: 30
      volumes:
      - name: rootfs
        hostPath:
          path: /
      - name: var-run
        hostPath:
          path: /var/run
      - name: sys
        hostPath:
          path: /sys
      - name: docker
        hostPath:
          path: /data/docker
EOF
```

2.3.3 Apply the manifest

Before applying, create the following symlink on every node, otherwise the service may report errors:

```bash
mount -o remount,rw /sys/fs/cgroup/
ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu
```

Apply the manifest:

```bash
kubectl apply -f http://k8s-yaml.zq.com/cadvisor/ds.yaml
```

Check:

```bash
kubectl -n kube-system get pod -o wide | grep cadvisor
```
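
As with node-exporter, a quick illustrative check against one node (IP assumed) confirms cadvisor is serving container metrics on port 4194:

```bash
# Illustrative check, assuming 10.4.7.21 runs the cadvisor daemonset.
curl -s http://10.4.7.21:4194/metrics | grep -m 3 '^container_'
```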

2.4 Deploy blackbox-exporter

2.4.1 Prepare the docker image

```bash
docker pull prom/blackbox-exporter:v0.15.1
docker tag 81b70b6158be harbor.zq.com/public/blackbox-exporter:v0.15.1
docker push harbor.zq.com/public/blackbox-exporter:v0.15.1
```

Prepare the directory:

```bash
mkdir /data/k8s-yaml/blackbox-exporter
cd /data/k8s-yaml/blackbox-exporter
```

2.4.2 Prepare the cm manifest

```bash
cat >cm.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
  namespace: kube-system
data:
  blackbox.yml: |-
    modules:
      http_2xx:
        prober: http
        timeout: 2s
        http:
          valid_http_versions: ["HTTP/1.1", "HTTP/2"]
          valid_status_codes: [200,301,302]
          method: GET
          preferred_ip_protocol: "ip4"
      tcp_connect:
        prober: tcp
        timeout: 2s
EOF
```

2.4.3 Prepare the dp manifest

```bash
cat >dp.yaml <<'EOF'
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: blackbox-exporter
  namespace: kube-system
  labels:
    app: blackbox-exporter
  annotations:
    deployment.kubernetes.io/revision: 1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      volumes:
      - name: config
        configMap:
          name: blackbox-exporter
          defaultMode: 420
      containers:
      - name: blackbox-exporter
        image: harbor.zq.com/public/blackbox-exporter:v0.15.1
        imagePullPolicy: IfNotPresent
        args:
        - --config.file=/etc/blackbox_exporter/blackbox.yml
        - --log.level=info
        - --web.listen-address=:9115
        ports:
        - name: blackbox-port
          containerPort: 9115
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 50Mi
        volumeMounts:
        - name: config
          mountPath: /etc/blackbox_exporter
        readinessProbe:
          tcpSocket:
            port: 9115
          initialDelaySeconds: 5
          timeoutSeconds: 5
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 3
EOF
```

If creating dp.yaml fails with an error like:

error: unable to decode "dp.yaml": resource.metadataOnlyObject.ObjectMeta: v1.ObjectMeta.Labels: Annotations: ReadString: expects " or n, but found 1, error found in #10 byte of ...|evision":1},"labels"|..., bigger context ...|nnotations":{"deployment.kubernetes.io/revision":1},"labels":{"app":"blackbox-exporter"},"name":"bla|...

the cause is the unquoted annotation value: annotation values must be strings. Change

```yaml
annotations:
  deployment.kubernetes.io/revision: 1
```

to a quoted string, for example:

```yaml
annotations:
  deployment.kubernetes.io/revision: "1"
```

Reference notes: https://www.yuque.com/r/note/27d36e88-0a01-4177-aae5-f2a478777ca3

2.4.4 Prepare the svc manifest

```bash
cat >svc.yaml <<'EOF'
kind: Service
apiVersion: v1
metadata:
  name: blackbox-exporter
  namespace: kube-system
spec:
  selector:
    app: blackbox-exporter
  ports:
  - name: blackbox-port
    protocol: TCP
    port: 9115
EOF
```

If you have no domain name available, you can expose the port with a NodePort Service instead:

```yaml
kind: Service
apiVersion: v1
metadata:
  name: blackbox-exporter
  namespace: kube-system
spec:
  selector:
    app: blackbox-exporter
  type: NodePort
  ports:
  - name: blackbox-port
    protocol: TCP
    port: 9115
    targetPort: 9115
```

2.4.5 Prepare the ingress manifest

```bash
cat >ingress.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: blackbox-exporter
  namespace: kube-system
spec:
  rules:
  - host: blackbox.zq.com
    http:
      paths:
      - path: /
        backend:
          serviceName: blackbox-exporter
          servicePort: blackbox-port
EOF
```

2.4.6 Add DNS resolution

This uses a new domain name, so add a record for it:

```bash
vi /var/named/zq.com.zone
blackbox           A    10.4.7.10
systemctl restart named
```

2.4.7 Apply the manifests

```bash
kubectl apply -f http://k8s-yaml.zq.com/blackbox-exporter/cm.yaml
kubectl apply -f http://k8s-yaml.zq.com/blackbox-exporter/dp.yaml
kubectl apply -f http://k8s-yaml.zq.com/blackbox-exporter/svc.yaml
kubectl apply -f http://k8s-yaml.zq.com/blackbox-exporter/ingress.yaml
```

2.4.8 Test access via the domain

Visit http://blackbox.zq.com; if the blackbox-exporter page is displayed, blackbox is running successfully.
[Figure 2: blackbox-exporter web UI]
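
Beyond the web UI, a probe module can be exercised directly; the target below is only an example, but the /probe endpoint with module and target parameters is how Prometheus will call blackbox-exporter later.

```bash
# Ask blackbox-exporter to run the http_2xx module against an example target;
# probe_success 1 means the target is up, 0 means it is down.
curl -s 'http://blackbox.zq.com/probe?module=http_2xx&target=http://www.baidu.com' | grep probe_success
```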

3 Deploy prometheus server

3.1 Prepare the prometheus server environment

3.1.1 Prepare the docker image

```bash
docker pull prom/prometheus:v2.14.0
docker tag 7317640d555e harbor.zq.com/infra/prometheus:v2.14.0
docker push harbor.zq.com/infra/prometheus:v2.14.0
```

Prepare the directory:

```bash
mkdir /data/k8s-yaml/prometheus-server
cd /data/k8s-yaml/prometheus-server
```

3.1.2 Prepare the rbac manifest

```bash
cat >rbac.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
  namespace: infra
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: infra
EOF
```

3.1.3 Prepare the dp manifest

Add --web.enable-lifecycle to enable hot-reloading of the configuration file over HTTP, so prometheus does not need to be restarted after a config change; the reload is triggered with curl -X POST http://localhost:9090/-/reload. --storage.tsdb.min-block-duration=10m keeps only 10 minutes of data in the in-memory block, and --storage.tsdb.retention=72h retains 72 hours of data.
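For example, after editing prometheus.yml later on, a reload could be triggered roughly like this (both variants assume the svc and ingress defined further below):

```bash
# Hot reload via the ingress configured in 3.1.5:
curl -X POST http://prometheus.zq.com/-/reload
# Or without an ingress, through a temporary port-forward:
kubectl -n infra port-forward svc/prometheus 9090:9090 &
curl -X POST http://localhost:9090/-/reload
```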

```bash
cat >dp.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "5"
  labels:
    name: prometheus
  name: prometheus
  namespace: infra
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 7
  selector:
    matchLabels:
      app: prometheus
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: harbor.zq.com/infra/prometheus:v2.14.0
        imagePullPolicy: IfNotPresent
        command:
        - /bin/prometheus
        args:
        - --config.file=/data/etc/prometheus.yml
        - --storage.tsdb.path=/data/prom-db
        - --storage.tsdb.min-block-duration=10m
        - --storage.tsdb.retention=72h
        - --web.enable-lifecycle
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /data
          name: data
        resources:
          requests:
            cpu: "1000m"
            memory: "1.5Gi"
          limits:
            cpu: "2000m"
            memory: "3Gi"
      imagePullSecrets:
      - name: harbor
      securityContext:
        runAsUser: 0
      serviceAccountName: prometheus
      volumes:
      - name: data
        nfs:
          server: hdss7-200
          path: /data/nfs-volume/prometheus
EOF
```

If you have no NFS server, you can use a hostPath volume instead; change the volume definition to:

```yaml
- name: data
  hostPath:
    path: /data/nfs-volume/prometheus
    type: Directory
```

3.1.4 Prepare the svc manifest

```bash
cat >svc.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: infra
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus
EOF
```

As with blackbox-exporter, the service can be exposed with a NodePort instead if no domain name is available:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: infra
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
  - port: 9090
    protocol: TCP
    targetPort: 9090
    nodePort: 32133
```

3.1.5 Prepare the ingress manifest

```bash
cat >ingress.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: traefik
  name: prometheus
  namespace: infra
spec:
  rules:
  - host: prometheus.zq.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus
          servicePort: 9090
EOF
```

3.1.6 Add DNS resolution

This uses the domain prometheus.zq.com, so add a record for it:

```bash
vi /var/named/zq.com.zone
prometheus         A    10.4.7.10
systemctl restart named
```

3.2 Deploy prometheus server

3.2.1 Prepare the directory and certificates

```bash
mkdir -p /data/nfs-volume/prometheus/
mkdir -p /data/nfs-volume/prometheus/prom-db
cd /data/nfs-volume/prometheus/
# copy the certificates referenced in the configuration file:
cp /opt/certs/ca.pem ./
cp /opt/certs/client.pem ./
cp /opt/certs/client-key.pem ./
```

3.2.2 Create the prometheus configuration file

Notes on this configuration: it is a general-purpose config. Apart from the first job, etcd, which uses static configuration, the other 8 jobs all use automatic service discovery, so after adjusting the etcd targets it can be used directly in production.

```bash
cat >/data/nfs-volume/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
- job_name: 'etcd'
  tls_config:
    ca_file: /data/etc/ca.pem
    cert_file: /data/etc/client.pem
    key_file: /data/etc/client-key.pem
  scheme: https
  static_configs:
  - targets:
    - '10.4.7.12:2379'
    - '10.4.7.21:2379'
    - '10.4.7.22:2379'
- job_name: 'kubernetes-apiservers'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'kubernetes-kubelet'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:10255
- job_name: 'kubernetes-cadvisor'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:4194
- job_name: 'kubernetes-kube-state'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
  - source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
    regex: .*true.*
    action: keep
  - source_labels: ['__meta_kubernetes_pod_label_daemon', '__meta_kubernetes_pod_node_name']
    regex: 'node-exporter;(.*)'
    action: replace
    target_label: nodename
- job_name: 'blackbox_http_pod_probe'
  metrics_path: /probe
  kubernetes_sd_configs:
  - role: pod
  params:
    module: [http_2xx]
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
    action: keep
    regex: http
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port, __meta_kubernetes_pod_annotation_blackbox_path]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+);(.+)
    replacement: $1:$2$3
    target_label: __param_target
  - action: replace
    target_label: __address__
    replacement: blackbox-exporter.kube-system:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'blackbox_tcp_pod_probe'
  metrics_path: /probe
  kubernetes_sd_configs:
  - role: pod
  params:
    module: [tcp_connect]
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
    action: keep
    regex: tcp
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __param_target
  - action: replace
    target_label: __address__
    replacement: blackbox-exporter.kube-system:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'traefik'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    action: keep
    regex: traefik
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
EOF
```
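
Optionally, the file can be syntax-checked before Prometheus loads it. One way (a sketch, assuming the image tagged above is available locally and that the mount path lines up with the /data/etc paths in the config) is to run promtool from the prometheus image:

```bash
# Syntax-check prometheus.yml with promtool, which is bundled in the prom/prometheus image.
docker run --rm -v /data/nfs-volume/prometheus:/data/etc \
  --entrypoint /bin/promtool harbor.zq.com/infra/prometheus:v2.14.0 \
  check config /data/etc/prometheus.yml
```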

3.2.3 Apply the manifests

```bash
kubectl apply -f http://k8s-yaml.zq.com/prometheus-server/rbac.yaml
kubectl apply -f http://k8s-yaml.zq.com/prometheus-server/dp.yaml
kubectl apply -f http://k8s-yaml.zq.com/prometheus-server/svc.yaml
kubectl apply -f http://k8s-yaml.zq.com/prometheus-server/ingress.yaml
```

3.2.4 Verify in the browser

Visit http://prometheus.zq.com; if the page loads, the server started successfully.
Click Status -> Configuration to see the configuration file we just created.
[Figure 3: Prometheus Status -> Configuration page]

4 Make services automatically monitored by prometheus

Click Status -> Targets; this page lists the job_names configured in prometheus.yml, and these targets basically cover our data-collection needs.
[Figure 4: Prometheus Status -> Targets page]

Five of the job_names have already been discovered and are returning data. Next, the services behind the remaining four job_names need to be brought under monitoring. This is done by adding annotations to the services whose data we want to collect.

4.1 Make traefik automatically monitored

4.1.1 Modify traefik's yaml

Edit traefik's yaml file and add the annotations block at the same level as labels:

```bash
vim /data/k8s-yaml/traefik/ds.yaml
........
spec:
  template:
    metadata:
      labels:
        k8s-app: traefik-ingress
        name: traefik-ingress
      #-------- added content --------
      annotations:
        prometheus_io_scheme: "traefik"
        prometheus_io_path: "/metrics"
        prometheus_io_port: "8080"
      #-------- end of addition --------
    spec:
      serviceAccountName: traefik-ingress-controller
........
```

Reapply the configuration from any node:

```bash
kubectl delete -f http://k8s-yaml.zq.com/traefik/ds.yaml
kubectl apply -f http://k8s-yaml.zq.com/traefik/ds.yaml
```

4.1.2 Apply the config and check

After the pods have restarted, check in prometheus whether the traefik targets are now returning data.
[Figure 5: traefik targets up in Prometheus]
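
The same check can be done from the command line via the Prometheus HTTP API (jq is assumed to be installed):

```bash
# List the traefik targets and their health via the Prometheus targets API.
curl -s http://prometheus.zq.com/api/v1/targets | \
  jq -r '.data.activeTargets[] | select(.labels.job=="traefik") | "\(.scrapeUrl) \(.health)"'
```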

4.2 Use blackbox to check TCP/HTTP service status

blackbox checks whether services inside containers are alive, i.e. it performs port health checks, using two methods: tcp and http.
Use http whenever possible; fall back to tcp only for services that expose no http interface.

4.2.1 Prepare the services to be checked

The dubbo services from the test environment are used for this demonstration; other environments work the same way.

  1. In the dashboard, start apollo-portal and the apollo instance in the test namespace.
  2. dubbo-demo-service will use the tcp annotation.
  3. dubbo-demo-consumer will use the HTTP annotation.

4.2.2 Add the tcp annotation

Once the two services are up, first add a TCP annotation to the dubbo-demo-service resource:

```bash
vim /data/k8s-yaml/test/dubbo-demo-server/dp.yaml
......
spec:
......
  template:
    metadata:
      labels:
        app: dubbo-demo-service
        name: dubbo-demo-service
      #-------- added content --------
      annotations:
        blackbox_port: "20880"
        blackbox_scheme: "tcp"
      #-------- end of addition --------
    spec:
      containers:
        image: harbor.zq.com/app/dubbo-demo-service:apollo_200512_0746
```

Reapply the configuration from any node:

```bash
kubectl delete -f http://k8s-yaml.zq.com/test/dubbo-demo-server/dp.yaml
kubectl apply -f http://k8s-yaml.zq.com/test/dubbo-demo-server/dp.yaml
```

Check http://blackbox.zq.com and http://prometheus.zq.com/targets in the browser:
our dubbo-demo-server service, with tcp port 20880, has been discovered and is being monitored.
[Figure 6: dubbo-demo-service tcp probe in blackbox]

4.2.3 Add the http annotation

Next, add an HTTP annotation to the dubbo-demo-consumer resource:

```bash
vim /data/k8s-yaml/test/dubbo-demo-consumer/dp.yaml
spec:
......
  template:
    metadata:
      labels:
        app: dubbo-demo-consumer
        name: dubbo-demo-consumer
      #-------- added content --------
      annotations:
        blackbox_path: "/hello?name=health"
        blackbox_port: "8080"
        blackbox_scheme: "http"
      #-------- end of addition --------
    spec:
      containers:
      - name: dubbo-demo-consumer
......
```

Reapply the configuration from any node:

```bash
kubectl delete -f http://k8s-yaml.zq.com/test/dubbo-demo-consumer/dp.yaml
kubectl apply -f http://k8s-yaml.zq.com/test/dubbo-demo-consumer/dp.yaml
```

[Figure 7: dubbo-demo-consumer http probe in blackbox]

4.3 Add JVM metrics monitoring

Add the following annotations to both dubbo-demo-service and dubbo-demo-consumer so that the JVM metrics of the pods are also collected:

```bash
vim /data/k8s-yaml/test/dubbo-demo-server/dp.yaml
vim /data/k8s-yaml/test/dubbo-demo-consumer/dp.yaml
      annotations:
        #.... existing annotations omitted ....
        prometheus_io_scrape: "true"
        prometheus_io_port: "12346"
        prometheus_io_path: "/"
```

12346 is the port used by the jmx_javaagent in the dubbo pods' startup command, so it can be used to collect JVM metrics.
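
A quick spot check that the agent is exposing metrics on that port (the pod IP below is illustrative; take a real one from kubectl -n test get pod -o wide):

```bash
# Illustrative check of the jmx_javaagent metrics endpoint on a dubbo pod.
curl -s http://172.7.21.5:12346/ | grep -m 3 '^jvm_'
```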

Reapply the configuration from any node:

```bash
kubectl apply -f http://k8s-yaml.zq.com/test/dubbo-demo-server/dp.yaml
kubectl apply -f http://k8s-yaml.zq.com/test/dubbo-demo-consumer/dp.yaml
```

[Figure 8: all targets up in Prometheus]
At this point all 9 jobs are collecting data.