What is an Operator

An Operator, a pattern introduced by CoreOS, extends the Kubernetes API with an application-specific controller used to create, configure, and manage complex stateful applications such as databases, caches, and monitoring systems. Operators build on the core Kubernetes concepts of resources and controllers, but add application-specific domain knowledge. The key step in building an Operator is designing its CRD (Custom Resource Definition).
An Operator encodes an operations team's knowledge of running a piece of software, and uses Kubernetes' powerful abstractions to manage that software at scale. CoreOS officially provides several Operator implementations, including today's protagonist: the Prometheus Operator. At its core, an Operator is built on the following two Kubernetes concepts:

  - Resource: the declared desired state of an object
  - Controller: observes, analyzes, and acts to reconcile the actual state of resources with the declared state
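
As a minimal sketch of this pattern, declaring a `Prometheus` custom resource (shown in full later in this article) is all it takes; the operator's controller reconciles the cluster toward it:

```yaml
# Sketch: the custom resource declares desired state; the Prometheus
# Operator's controller watches it and creates/updates the underlying
# StatefulSet, Secrets, and related objects to match.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  replicas: 2   # the controller keeps two Prometheus pods running
```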

CoreOS currently provides the following four Operators:

  - etcd: creates etcd clusters
  - Rook: file, block, and object storage for cloud-native environments
  - Prometheus: creates Prometheus monitoring instances
  - Tectonic: deploys Kubernetes clusters

Next we will use the Operator to create Prometheus.

Installation

We will install directly from the Prometheus Operator sources. A one-step Helm install is also possible, but installing from source lets us see more of the implementation details. First, fetch the sources:
GitHub repo: https://github.com/prometheus-operator/kube-prometheus
Pay attention to the version when downloading; the project documents which release matches which Kubernetes version:
Kubernetes compatibility matrix
The following versions are supported and work as we test against these versions in their respective branches. But note that other versions might work!

| kube-prometheus stack | Kubernetes 1.14 | Kubernetes 1.15 | Kubernetes 1.16 | Kubernetes 1.17 | Kubernetes 1.18 | Kubernetes 1.19 |
| --------------------- | --------------- | --------------- | --------------- | --------------- | --------------- | --------------- |
| release-0.3           | ✔               | ✔               | ✔               | ✔               | ✗               | ✗               |
| release-0.4           | ✗               | ✗               | ✔ (v1.16.5+)    | ✔               | ✗               | ✗               |
| release-0.5           | ✗               | ✗               | ✗               | ✗               | ✔               | ✗               |
| release-0.6           | ✗               | ✗               | ✗               | ✗               | ✔               | ✔               |
| HEAD                  | ✗               | ✗               | ✗               | ✗               | ✗               | ✔               |

Note: Due to two bugs in Kubernetes v1.16.1, and prior to Kubernetes v1.16.5, the kube-prometheus release-0.4 branch only supports v1.16.5 and higher. The extension-apiserver-authentication-reader role in the kube-system namespace can be manually edited to include list and watch permissions in order to work around the second issue with Kubernetes v1.16.2 through v1.16.4.

My Kubernetes cluster is v1.15.9, so I downloaded release-0.3:

Releases are listed at https://github.com/prometheus-operator/kube-prometheus/tags; download and unpack:

```bash
wget https://github.com/prometheus-operator/kube-prometheus/archive/v0.3.0.tar.gz
tar zxf v0.3.0.tar.gz
cd kube-prometheus-0.3.0/manifests
```

The manifests directory contains all the resource manifests we need; we can create everything straight from this folder:

```bash
kubectl apply -f setup/
kubectl apply -f .
```
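
If the second apply complains that the ServiceMonitor (or other) CRDs do not exist yet, it simply raced ahead of the CRD registration done by setup/; the upstream README suggests polling until the CRDs are queryable before applying the rest, along these lines:

```bash
# wait until the ServiceMonitor CRD registered by setup/ is queryable
until kubectl get servicemonitors --all-namespaces; do date; sleep 1; done
kubectl apply -f .
```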

Once deployment finishes, a namespace named monitoring is created and all the resource objects are deployed under it. The Operator also automatically creates the following CRD resource objects:

```bash
[admin@ch-k8s1 manifests]$ kubectl get crd |grep coreos
alertmanagers.monitoring.coreos.com     2020-10-10T07:49:14Z
podmonitors.monitoring.coreos.com       2020-12-01T05:36:48Z
prometheuses.monitoring.coreos.com      2020-10-10T07:49:14Z
prometheusrules.monitoring.coreos.com   2020-10-10T07:49:14Z
servicemonitors.monitoring.coreos.com   2020-10-10T07:49:14Z
[admin@ch-k8s1 manifests]$
```

We can list all the Pods in the monitoring namespace. The alertmanager and prometheus Pods are managed by StatefulSet controllers, and there is also the central prometheus-operator Pod, which controls the other resource objects and watches them for changes:

```bash
[admin@ch-k8s1 manifests]$ kubectl get pods -n monitoring
NAME                                  READY   STATUS    RESTARTS   AGE
alertmanager-main-0                   2/2     Running   0          24h
alertmanager-main-1                   2/2     Running   0          24h
alertmanager-main-2                   2/2     Running   0          24h
grafana-65446cdfd4-z8vgh              1/1     Running   0          21h
kube-state-metrics-7f6d7b46b4-dhg8q   3/3     Running   0          24h
node-exporter-9fhkt                   2/2     Running   0          24h
node-exporter-b8gcm                   2/2     Running   0          24h
node-exporter-fdfxg                   2/2     Running   0          24h
node-exporter-pxz6f                   2/2     Running   0          24h
node-exporter-rvrtq                   2/2     Running   0          24h
node-exporter-s5pxn                   2/2     Running   0          24h
prometheus-adapter-68698bc948-5bz6c   1/1     Running   0          24h
prometheus-k8s-0                      3/3     Running   1          24h
prometheus-k8s-1                      3/3     Running   1          24h
prometheus-operator-6685db5c6-b4s6l   1/1     Running   0          24h
```

Check the Services that were created:

```bash
[admin@ch-k8s1 manifests]$ kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-main       ClusterIP   10.43.160.126   <none>        9093/TCP                     24h
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   24h
grafana                 NodePort    10.43.69.32     <none>        3000/TCP                     24h
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP            24h
node-exporter           ClusterIP   None            <none>        9100/TCP                     24h
prometheus-adapter      ClusterIP   10.43.52.173    <none>        443/TCP                      24h
prometheus-k8s          NodePort    10.43.79.240    <none>        9090/TCP                     24h
prometheus-operated     ClusterIP   None            <none>        9090/TCP                     24h
prometheus-operator     ClusterIP   None            <none>        8080/TCP                     24h
```

Note that ClusterIP Services were created for grafana and prometheus. To reach these two services from outside the cluster we can either create matching Ingress objects or use NodePort-type Services; for simplicity we will just use NodePort. Edit the grafana and prometheus-k8s Services and change their type to NodePort:

```bash
[admin@ch-k8s1 manifests]$ kubectl edit svc grafana -n monitoring
```

```yaml
spec:
  clusterIP: 10.43.69.32
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    nodePort: 30001
    port: 3000
    protocol: TCP
    targetPort: http
  selector:
    app: grafana
  sessionAffinity: None
  type: NodePort
```

```bash
[admin@ch-k8s1 manifests]$ kubectl edit svc prometheus-k8s -n monitoring
```

```yaml
spec:
  clusterIP: 10.43.79.240
  externalTrafficPolicy: Cluster
  ports:
  - name: web
    nodePort: 30002
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    app: prometheus
    prometheus: k8s
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: NodePort
```

```bash
[admin@ch-k8s1 manifests]$ kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-main       ClusterIP   10.43.160.126   <none>        9093/TCP                     24h
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   24h
grafana                 NodePort    10.43.69.32     <none>        3000:30001/TCP               24h
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP            24h
node-exporter           ClusterIP   None            <none>        9100/TCP                     24h
prometheus-adapter      ClusterIP   10.43.52.173    <none>        443/TCP                      24h
prometheus-k8s          NodePort    10.43.79.240    <none>        9090:30002/TCP               24h
prometheus-operated     ClusterIP   None            <none>        9090/TCP                     24h
prometheus-operator     ClusterIP   None            <none>        8080/TCP                     24h
```
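
The same change can also be made non-interactively with kubectl patch; a sketch, using the nodePort values chosen above (the default strategic merge patch merges each ports entry by its port key, so the other port fields are preserved):

```bash
kubectl -n monitoring patch svc grafana \
  -p '{"spec":{"type":"NodePort","ports":[{"port":3000,"nodePort":30001}]}}'
kubectl -n monitoring patch svc prometheus-k8s \
  -p '{"spec":{"type":"NodePort","ports":[{"port":9090,"nodePort":30002}]}}'
```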
After the change, we can access both services; for example, the Prometheus targets page:

(screenshot: Prometheus targets page)

Most of the targets are healthy, but two or three monitoring targets are not picked up, for example the kube-controller-manager and kube-scheduler system components. This comes down to how their ServiceMonitors are defined. Let's look at the ServiceMonitor resource for the kube-scheduler component: (prometheus-serviceMonitorKubeScheduler.yaml)

Configuring kube-scheduler

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s      # scrape every 30s
    port: http-metrics # must match the port name on the Service
  jobLabel: k8s-app
  namespaceSelector: # which namespaces to match Services in; use any: true to match all namespaces
    matchNames:
    - kube-system
  selector: # label selector for the target Service; with matchLabels every listed label must match, with matchExpressions a Service matching at least one expression is selected
    matchLabels:
      k8s-app: kube-scheduler
```

The above is a typical ServiceMonitor declaration. Through `selector.matchLabels` it matches Services in the kube-system namespace carrying the label `k8s-app=kube-scheduler`. But no such Service exists in our cluster, so we need to create one by hand: (prometheus-kubeSchedulerService.yaml)

```yaml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    k8s-app: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
```

The crucial parts are the labels and selector sections above. The labels section must match the selector in our ServiceMonitor object, while `selector` is set to `component=kube-scheduler`. Why that label? We can describe the kube-scheduler Pod to find out:
```bash
$ kubectl describe pod kube-scheduler-k8s-master -n kube-system
Name:               kube-scheduler-k8s-master
Namespace:          kube-system
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               k8s-master/172.16.138.40
Start Time:         Tue, 19 Feb 2019 21:15:05 -0500
Labels:             component=kube-scheduler
                    tier=control-plane
......
```
The Pod carries the labels component=kube-scheduler and tier=control-plane. The former identifies the component more uniquely, so it is the better choice; with it, the Service above will select our Pod. Now just create the Service:
```bash
$ kubectl create -f prometheus-kubeSchedulerService.yaml
$ kubectl get svc -n kube-system -l k8s-app=kube-scheduler
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
kube-scheduler   ClusterIP   10.103.165.58   <none>        10251/TCP   4m
```
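
It is also worth confirming that the Service actually selected the scheduler Pod, i.e. that its Endpoints object contains the Pod IP (a quick sanity check, not part of the original walkthrough):

```bash
kubectl get endpoints kube-scheduler -n kube-system
```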
A little while after creation, check the status of the kube-scheduler target on the Prometheus targets page:
(screenshot: kube-scheduler target discovered but scrape failing)

The target is now discovered, but scraping it fails. The error occurs because our cluster was built with kubeadm, whose kube-scheduler binds to 127.0.0.1 by default, while Prometheus tries to reach it via the node IP, so the connection is refused. The fix is to make kube-scheduler bind to 0.0.0.0. Since kube-scheduler runs as a static Pod, we only need to edit the corresponding YAML file (kube-scheduler.yaml) in the static Pod directory:

```bash
$ cd /etc/kubernetes/manifests
# change the --address flag under command in kube-scheduler.yaml to 0.0.0.0
$ vim kube-scheduler.yaml
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --address=0.0.0.0
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    ....
```

After the edit, move the file out of the directory and move it back a moment later; the kubelet will then recreate the Pod with the new flags automatically. Then check whether the kube-scheduler target in Prometheus has become healthy:

(screenshot: kube-scheduler target now up)

Configuring kube-controller-manager

Let's look at the ServiceMonitor definition for kube-controller-manager:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    metricRelabelings:
    - action: drop
      regex: etcd_(debugging|disk|request|server).*
      sourceLabels:
      - __name__
    port: http-metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-controller-manager
```
It selects a Service via the k8s-app: kube-controller-manager label, but no such Service exists in the cluster, so again we create one by hand. First confirm the Pod's labels:
```bash
$ kubectl describe pod kube-controller-manager-k8s-master -n kube-system
Name:               kube-controller-manager-k8s-master
Namespace:          kube-system
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               k8s-master/172.16.138.40
Start Time:         Tue, 19 Feb 2019 21:15:16 -0500
Labels:             component=kube-controller-manager
                    tier=control-plane
....
```
Create the Service:

```yaml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    k8s-app: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
```

After creating it, check the target:

(screenshot: kube-controller-manager target discovered but scrape failing)

This is the same problem as before, with the same fix: edit the static Pod manifest kube-controller-manager.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --node-monitor-grace-period=10s
    - --pod-eviction-timeout=10s
    - --address=0.0.0.0   # modified
    ......
```

After the edit, move the file out of the directory and move it back a moment later so that it gets reloaded automatically, then check whether the kube-controller-manager target in Prometheus has become healthy:

(screenshot: kube-controller-manager target now up)
With this monitoring data in place, we can now look at the dashboards in Grafana, reached through its NodePort as above. Log in the first time with admin:admin. On the home page you can see that the Prometheus data source is already wired up, and the monitoring dashboards should be showing data:

(screenshot: Grafana dashboards)

Configuring PrometheusRule

We now know how to define a custom ServiceMonitor object, but what about custom alerting rules? If we open the Alerts page of the Prometheus dashboard, there are already a number of alerting rules, some of them firing:

But where do these alerts come from, and how do they get delivered to us? Previously, with a hand-rolled setup, we pointed the Prometheus configuration file at AlertManager instances and rule files; how does that work when deployed via the Operator? We can inspect the AlertManager-related configuration on the Config page of the Prometheus dashboard:

(screenshot: Prometheus alerting configuration)

```yaml
alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
  alertmanagers:
  - kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - monitoring
    scheme: http
    path_prefix: /
    timeout: 10s
    api_version: v1
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: alertmanager-main
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: web
      replacement: $1
      action: keep
rule_files:
- /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml
```

From the alertmanagers section we can see that the AlertManager instances are discovered through Kubernetes service discovery with role endpoints, keeping only endpoints whose Service is named alertmanager-main and whose port is named web. Let's inspect the alertmanager-main Service:

```yaml
[admin@ch-k8s1 ~]$ kubectl get svc alertmanager-main -n monitoring -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"alertmanager":"main"},"name":"alertmanager-main","namespace":"monitoring"},"spec":{"ports":[{"name":"web","port":9093,"targetPort":"web"}],"selector":{"alertmanager":"main","app":"alertmanager"},"sessionAffinity":"ClientIP"}}
  creationTimestamp: "2020-12-02T09:15:28Z"
  labels:
    alertmanager: main
  name: alertmanager-main
  namespace: monitoring
  resourceVersion: "11875829"
  selfLink: /api/v1/namespaces/monitoring/services/alertmanager-main
  uid: 7e191c6c-0add-4931-b0fc-b12645ad6bb7
spec:
  clusterIP: 10.43.56.30
  ports:
  - name: web
    port: 9093
    protocol: TCP
    targetPort: web
  selector:
    alertmanager: main
    app: alertmanager
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: ClusterIP
status:
  loadBalancer: {}
```

The Service is indeed named alertmanager-main and its port is named web, matching the rules above, so the Prometheus and AlertManager components are correctly wired together. The alerting rule files are all the YAML files under the /etc/prometheus/rules/prometheus-k8s-rulefiles-0/ directory. We can exec into the Prometheus Pod to verify that the directory contains YAML files:

```bash
[admin@ch-k8s1 ~]$ kubectl exec -it prometheus-k8s-0 /bin/sh -n monitoring
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ cat /etc/prometheus/rules/prometheus-k8s-rulefiles-0/monitoring-prometheus-k8s-rules.yaml
groups:
- name: node-exporter.rules
  rules:
  - expr: |
      count without (cpu) (
        count without (mode) (
          node_cpu_seconds_total{job="node-exporter"}
        )
      )
    record: instance:node_num_cpu:sum
  - expr: |
      1 - avg without (cpu, mode) (
        rate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[1m])
      )
    record: instance:node_cpu_utilisation:rate1m
  - expr: |
      (
........
```

This YAML file is in fact generated from a PrometheusRule object we created earlier:

```yaml
[admin@ch-k8s1 manifests]$ cat prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-k8s-rules
  namespace: monitoring
spec:
  groups:
  - name: node-exporter.rules
    rules:
    - expr: |
        count without (cpu) (
          count without (mode) (
            node_cpu_seconds_total{job="node-exporter"}
          )
        )
      record: instance:node_num_cpu:sum
    - expr: |
        1 - avg without (cpu, mode) (
          rate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[1m])
        )
      record: instance:node_cpu_utilisation:rate1m
    - expr: |
        (
          node_load1{job="node-exporter"}
        /
          instance:node_num_cpu:sum{job="node-exporter"}
        )
      record: instance:node_load1_per_cpu:ratio
.........
```

Our PrometheusRule here is named prometheus-k8s-rules in the monitoring namespace, so we can infer that creating a PrometheusRule resource object makes the Operator generate a matching <namespace>-<name>.yaml file under the prometheus-k8s-rulefiles-0 directory above. Defining a custom alert therefore only requires defining a PrometheusRule resource object. But why does Prometheus recognize these PrometheusRule objects at all? Look at the prometheus resource object we created: it has a crucial ruleSelector property, a filter that selects rule objects, requiring PrometheusRule objects labeled with prometheus=k8s and role=alert-rules. Now it makes sense:

```yaml
ruleSelector:
  matchLabels:
    prometheus: k8s
    role: alert-rules
```

So to define a custom alerting rule, we just create a PrometheusRule object carrying the prometheus=k8s and role=alert-rules labels. For example, let's add some custom alerts in a file prometheus-blackboxRules.yaml:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: blackbox-rules
  namespace: monitoring
spec:
  groups:
  - name: blackbox_network_stats
    rules:
    - alert: blackbox_network_stats
      expr: probe_success == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Instance {{ $labels.instance }} is down"
        description: "This requires immediate action!"
  - name: ssl_expiry.rules
    rules:
    - alert: SSLCertExpiringSoon
      expr: (probe_ssl_earliest_cert_expiry{job="blackbox_ssl_expiry"} - time())/86400 < 30
      for: 10m
      labels:
        severity: warn
      annotations:
        summary: "ssl证书过期警告"
        description: '域名{{$labels.instance}}的证书还有{{ printf "%.1f" $value }}天就过期了,请尽快更新证书'
  - name: 主机状态-监控告警
    rules:
    - alert: 主机状态
      expr: up == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.instance}}:服务器宕机"
        description: "{{$labels.instance}}:服务器延时超过5分钟"
    - alert: CPU使用情况
      expr: 100-(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 60
      for: 1m
      labels:
        severity: warn
      annotations:
        summary: "{{$labels.mountpoint}} CPU使用率过高!"
        description: "{{$labels.mountpoint }} CPU使用大于60%(目前使用:{{$value}}%)"
    - alert: 内存使用
      expr: 100 -(node_memory_MemTotal_bytes -node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes ) / node_memory_MemTotal_bytes * 100> 80
      for: 1m
      labels:
        severity: warn
      annotations:
        summary: "{{$labels.mountpoint}} 内存使用率过高!"
        description: "{{$labels.mountpoint }} 内存使用大于80%(目前使用:{{$value}}%)"
    - alert: IO性能
      expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 60
      for: 1m
      labels:
        severity: warn
      annotations:
        summary: "{{$labels.mountpoint}} 流入磁盘IO使用率过高!"
        description: "{{$labels.mountpoint }} 流入磁盘IO大于60%(目前使用:{{$value}})"
    - alert: 网络
      expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
      for: 1m
      labels:
        severity: warn
      annotations:
        summary: "{{$labels.mountpoint}} 流出网络带宽过高!"
        description: "{{$labels.mountpoint }}流出网络带宽持续2分钟高于100M. RX带宽使用率{{$value}}"
    - alert: TCP会话
      expr: node_netstat_Tcp_CurrEstab > 1000
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "{{$labels.mountpoint}} TCP_ESTABLISHED过高!"
        description: "{{$labels.mountpoint }} TCP_ESTABLISHED大于1000%(目前使用:{{$value}}%)"
    - alert: 磁盘容量
      expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 80
      for: 1m
      labels:
        severity: warn
      annotations:
        summary: "{{$labels.mountpoint}} 磁盘分区使用率过高!"
        description: "{{$labels.mountpoint }} 磁盘分区使用大于80%(目前使用:{{$value}}%)"
```
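
Create it like any other manifest (using the file name chosen above):

```bash
kubectl apply -f prometheus-blackboxRules.yaml
```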

Note that the labels must include at least prometheus=k8s and role=alert-rules. After creating the object, wait a moment and then look inside the rules directory in the container again:

```bash
[admin@ch-k8s1 new]$ kubectl exec -it prometheus-k8s-0 -n monitoring -- /bin/sh
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ cd /etc/prometheus/rules/prometheus-k8s-rulefiles-0/
/etc/prometheus/rules/prometheus-k8s-rulefiles-0 $ ls -l
total 0
lrwxrwxrwx    1 root     2000          37 Dec  2 09:52 monitoring-blackbox-rules.yaml -> ..data/monitoring-blackbox-rules.yaml
lrwxrwxrwx    1 root     root          43 Dec  2 09:15 monitoring-prometheus-k8s-rules.yaml -> ..data/monitoring-prometheus-k8s-rules.yaml
/etc/prometheus/rules/prometheus-k8s-rulefiles-0 $ ls
monitoring-blackbox-rules.yaml        monitoring-prometheus-k8s-rules.yaml
```

Our rule file has been injected into the rulefiles directory, confirming the inference above. The new alerting rules then show up on the Alerts page of the Prometheus dashboard:

(screenshot: custom rules on the Prometheus Alerts page)

Configuring Alerting

We now know how to add an alerting rule, but how do the resulting alerts get sent anywhere? Earlier we learned to configure the various receivers in the AlertManager configuration file directly; with the component created from the Operator's alertmanager resource object, how do we change its configuration?
First, change the alertmanager-main Service to a NodePort-type Service. Afterwards we can view the AlertManager configuration on its web UI under the status path:

```yaml
.....
selector:
  alertmanager: main
  app: alertmanager
sessionAffinity: ClientIP
sessionAffinityConfig:
  clientIP:
    timeoutSeconds: 10800
.....
```

(screenshot: AlertManager status page)

This configuration actually comes from the alertmanager-secret.yaml file we created earlier under kube-prometheus-0.3.0/manifests/:

```yaml
[admin@ch-k8s1 manifests]$ cat alertmanager-secret.yaml
apiVersion: v1
data:
  alertmanager.yaml: Imdsb2JhbCI6CiAgInJlc29sdmVfdGltZW91dCI6ICI1bSIKInJlY2VpdmVycyI6Ci0gIm5hbWUiOiAibnVsbCIKInJvdXRlIjoKICAiZ3JvdXBfYnkiOgogIC0gImpvYiIKICAiZ3JvdXBfaW50ZXJ2YWwiOiAiNW0iCiAgImdyb3VwX3dhaXQiOiAiMzBzIgogICJyZWNlaXZlciI6ICJudWxsIgogICJyZXBlYXRfaW50ZXJ2YWwiOiAiMTJoIgogICJyb3V0ZXMiOgogIC0gIm1hdGNoIjoKICAgICAgImFsZXJ0bmFtZSI6ICJXYXRjaGRvZyIKICAgICJyZWNlaXZlciI6ICJudWxsIg==
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
type: Opaque
```

The value of the alertmanager.yaml key can be base64-decoded:

```bash
[admin@ch-k8s1 manifests]$ echo Imdsb2JhbCI6CiAgInJlc29sdmVfdGltZW91dCI6ICI1bSIKInJlY2VpdmVycyI6Ci0gIm5hbWUiOiAibnVsbCIKInJvdXRlIjoKICAiZ3JvdXBfYnkiOgogIC0gImpvYiIKICAiZ3JvdXBfaW50ZXJ2YWwiOiAiNW0iCiAgImdyb3VwX3dhaXQiOiAiMzBzIgogICJyZWNlaXZlciI6ICJudWxsIgogICJyZXBlYXRfaW50ZXJ2YWwiOiAiMTJoIgogICJyb3V0ZXMiOgogIC0gIm1hdGNoIjoKICAgICAgImFsZXJ0bmFtZSI6ICJXYXRjaGRvZyIKICAgICJyZWNlaXZlciI6ICJudWxsIg== | base64 -d
```

Decoded, it reads:

  1. "global":
  2. "resolve_timeout": "5m"
  3. "receivers":
  4. - "name": "null"
  5. "route":
  6. "group_by":
  7. - "job"
  8. "group_interval": "5m"
  9. "group_wait": "30s"
  10. "receiver": "null"
  11. "repeat_interval": "12h"
  12. "routes":
  13. - "match":
  14. "alertname": "Watchdog"
  15. "receiver": "null"

The content matches the configuration shown above. So to add our own receivers or message templates, we edit this file:

```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'            # use the 163 mail SMTP server
  smtp_from: 'monitoring2020@163.com'          # sender address
  smtp_auth_username: 'monitoring2020@163.com' # login mailbox name
  smtp_auth_password: 'NHGJPAVAGO00000000'     # authorization code, not the login password
  smtp_require_tls: false
  #wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/' # WeChat Work API URL
# template definitions
templates:
- 'template/*.tmpl' # template path
route: # default route
  group_by: ['instance','job'] # group by the instance and job labels; alerts with the same labels are batched into one mail
  group_wait: 30s      # how long to wait before sending the first notification for a new group
  group_interval: 5m   # how long to wait before sending notifications about new alerts in a group
  repeat_interval: 10h # interval for repeating an alert
  receiver: email      # name of the default receiver, matching a receivers name below
  routes: # sub-routes; anything not matching a sub-route falls through to the default route
  - receiver: leader
    match: # exact match
      severity: critical # alert level defined in the alerting rules
  - receiver: support_team
    match_re: # regex match
      severity: ^(warn|critical)$
receivers: # the receivers corresponding to the routes above
- name: 'email'
  email_configs:
  - to: 'jordan@163.com'
- name: 'leader'
  email_configs:
  - to: 'jordan@wicre.com'
- name: 'support_team'
  email_configs:
  - to: 'mlkdesti@163.com'
    html: '{{ template "test.html" . }}'    # mail body template
    headers: { Subject: "[WARN] 报警邮件"}   # subject of the mail
    send_resolved: true # also send a notification when the alert resolves
  webhook_configs: # webhook configuration
  - url: 'http://127.0.0.1:5001'
    send_resolved: true
  wechat_configs: # WeChat Work alerting configuration
  - send_resolved: true
    to_party: '1'        # id of the receiving group
    agent_id: '1000002'  # WeChat Work -> custom app -> AgentId
    corp_id: '******'    # company info: My Company -> CorpId (at the bottom)
    api_secret: '******' # WeChat Work -> custom app -> Secret
    message: '{{ template "test.html" . }}' # message template
# An inhibition rule mutes alerts matching one set of matchers while an alert
# matching another set is firing; both alerts must share the same values for
# the labels listed under equal.
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
```
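
Before loading the new configuration, its syntax can be checked locally with amtool, the CLI that ships with AlertManager (assuming it is installed):

```bash
amtool check-config alertmanager.yaml
```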

Save the file above as alertmanager.yaml, then use it to create a Secret object:

```bash
# delete the original secret object
kubectl delete secret alertmanager-main -n monitoring
secret "alertmanager-main" deleted
# import our own configuration file into a new secret
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
```
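
A delete-free alternative is to regenerate the Secret manifest client-side and replace it in place (a common idiom, not part of the original walkthrough; newer kubectl versions spell the flag --dry-run=client):

```bash
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml \
  -n monitoring --dry-run -o yaml | kubectl replace -f -
```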

We added our receivers: by default alerts are delivered by email, and the CoreDNSDown alert is delivered via WeChat. Once the steps above are complete, we receive the notification:

(screenshot: alert notification received)

Checking the status page of the AlertManager UI again, the configuration has indeed changed to ours:

(screenshot: updated AlertManager configuration)

The AlertManager configuration can also use templates (.tmpl files), which are added to the Secret object alongside the alertmanager.yaml configuration file, for example:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-example
data:
  alertmanager.yaml: {BASE64_CONFIG}
  template_1.tmpl: {BASE64_TEMPLATE_1}
  template_2.tmpl: {BASE64_TEMPLATE_2}
...
```

The templates are placed in the same path as the configuration file. To use them, they must of course also be referenced from alertmanager.yaml:

```yaml
templates:
- '*.tmpl'
```

Once created, the Secret object is mounted into the AlertManager Pods created by the AlertManager object.
Example: create a test.tmpl file with the following content:

```bash
[admin@ch-k8s1 new]$ cat test.tmpl
{{ define "test.html" }}
{{ if gt (len .Alerts.Firing) 0 }}{{ range .Alerts }}
@故障告警:<br>
告警程序: prometheus_alert <br>
告警级别: {{ .Labels.severity }} <br>
告警类型: {{ .Labels.alertname }} <br>
故障主机: {{ .Labels.instance }} <br>
告警主题: {{ .Annotations.summary }} <br>
告警详情: {{ .Annotations.description }} <br>
触发时间: {{ .StartsAt }} <br>
{{ end }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}{{ range .Alerts }}
@故障恢复:<br>
告警主机:{{ .Labels.instance }} <br>
告警主题:{{ .Annotations.summary }} <br>
恢复时间: {{ .EndsAt }} <br>
{{ end }}
{{ end }}
{{ end }}
```

Delete the old Secret object:

```bash
$ kubectl delete secret alertmanager-main -n monitoring
secret "alertmanager-main" deleted
```

Create the new Secret object:

```bash
$ kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=test.tmpl -n monitoring
secret/alertmanager-main created
```

A little later the alert arrives on WeChat. Given how the labels are defined here, not every field of the template gets populated; customize it to match your own setup.

Auto-discovery Configuration

Consider this: if the Kubernetes cluster contains a large number of Services and Pods, do we really have to create a ServiceMonitor object for every single one of them? That would quickly become tedious again.
To solve this, Prometheus Operator lets us supply additional scrape configuration, through which we can use plain service discovery to monitor things automatically. As in a hand-rolled setup, we want Prometheus to automatically discover and monitor every Service carrying the prometheus.io/scrape=true annotation, using the Prometheus configuration we defined before:

```yaml
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
```
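
For reference, a Service like the following (a hypothetical example; the name and port are made up) is the kind of thing that job would pick up automatically:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                    # hypothetical application Service
  namespace: default
  annotations:
    prometheus.io/scrape: "true"  # opt in to automatic scraping
    prometheus.io/port: "8080"    # tell Prometheus which port to scrape
spec:
  selector:
    app: my-app
  ports:
  - name: http
    port: 8080
```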

So for a Service in the cluster to be discovered automatically, it must declare the prometheus.io/scrape=true annotation, as in the example above. Save the scrape configuration as prometheus-additional.yaml, then create a Secret object from that file:

```bash
[admin@ch-k8s1 new]$ kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret/additional-configs created
```

After creation, the configuration is stored base64-encoded as the value of the prometheus-additional.yaml key:

```yaml
[admin@ch-k8s1 new]$ kubectl get secret additional-configs -n monitoring -o yaml
apiVersion: v1
data:
  prometheus-additional.yaml: LSBqb2JfbmFtZTogJ2t1YmVybmV0ZXMtc2VydmljZS1lbmRwb2ludHMnCiAga3ViZXJuZXRlc19zZF9jb25maWdzOgogIC0gcm9sZTogZW5kcG9pbnRzCiAgcmVsYWJlbF9jb25maWdzOgogIC0gc291cmNlX2xhYmVsczogW19fbWV0YV9rdWJlcm5ldGVzX3NlcnZpY2VfYW5ub3RhdGlvbl9wcm9tZXRoZXVzX2lvX3NjcmFwZV0KICAgIGFjdGlvbjoga2VlcAogICAgcmVnZXg6IHRydWUKICAtIHNvdXJjZV9sYWJlbHM6IFtfX21ldGFfa3ViZXJuZXRlc19zZXJ2aWNlX2Fubm90YXRpb25fcHJvbWV0aGV1c19pb19zY2hlbWVdCiAgICBhY3Rpb246IHJlcGxhY2UKICAgIHRhcmdldF9sYWJlbDogX19zY2hlbWVfXwogICAgcmVnZXg6IChodHRwcz8pCiAgLSBzb3VyY2VfbGFiZWxzOiBbX19tZXRhX2t1YmVybmV0ZXNfc2VydmljZV9hbm5vdGF0aW9uX3Byb21ldGhldXNfaW9fcGF0aF0KICAgIGFjdGlvbjogcmVwbGFjZQogICAgdGFyZ2V0X2xhYmVsOiBfX21ldHJpY3NfcGF0aF9fCiAgICByZWdleDogKC4rKQogIC0gc291cmNlX2xhYmVsczogW19fYWRkcmVzc19fLCBfX21ldGFfa3ViZXJuZXRlc19zZXJ2aWNlX2Fubm90YXRpb25fcHJvbWV0aGV1c19pb19wb3J0XQogICAgYWN0aW9uOiByZXBsYWNlCiAgICB0YXJnZXRfbGFiZWw6IF9fYWRkcmVzc19fCiAgICByZWdleDogKFteOl0rKSg/OjpcZCspPzsoXGQrKQogICAgcmVwbGFjZW1lbnQ6ICQxOiQyCiAgLSBhY3Rpb246IGxhYmVsbWFwCiAgICByZWdleDogX19tZXRhX2t1YmVybmV0ZXNfc2VydmljZV9sYWJlbF8oLispCiAgLSBzb3VyY2VfbGFiZWxzOiBbX19tZXRhX2t1YmVybmV0ZXNfbmFtZXNwYWNlXQogICAgYWN0aW9uOiByZXBsYWNlCiAgICB0YXJnZXRfbGFiZWw6IGt1YmVybmV0ZXNfbmFtZXNwYWNlCiAgLSBzb3VyY2VfbGFiZWxzOiBbX19tZXRhX2t1YmVybmV0ZXNfc2VydmljZV9uYW1lXQogICAgYWN0aW9uOiByZXBsYWNlCiAgICB0YXJnZXRfbGFiZWw6IGt1YmVybmV0ZXNfbmFtZQo=
kind: Secret
metadata:
  creationTimestamp: "2020-12-03T03:49:35Z"
  name: additional-configs
  namespace: monitoring
  resourceVersion: "12056954"
  selfLink: /api/v1/namespaces/monitoring/secrets/additional-configs
  uid: b065947c-4117-4ccc-b32f-0b59f09b7ea4
type: Opaque
```

Then we only need to reference this additional configuration in the declaration of the prometheus resource object: (prometheus-prometheus.yaml)

```yaml
[admin@ch-k8s1 manifests]$ cat prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  baseImage: quay.io/prometheus/prometheus
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.11.0
```

Once added, update the prometheus custom resource directly:

```bash
[admin@ch-k8s1 manifests]$ kubectl apply -f prometheus-prometheus.yaml
prometheus.monitoring.coreos.com/k8s configured
```

After a short while, check in the Prometheus dashboard whether the configuration took effect:

(screenshot: additional scrape config visible on the Prometheus configuration page)

The corresponding configuration appears on the configuration page, but switching to the targets page we find no matching scrape targets. Check the Prometheus Pod logs:

```bash
[admin@ch-k8s1 ~]$ kubectl logs -f prometheus-k8s-0 prometheus -n monitoring
....
level=error ts=2020-12-03T06:03:49.780Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" at the cluster scope"
level=error ts=2020-12-03T06:03:49.781Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:265: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" at the cluster scope"
level=error ts=2020-12-03T06:03:50.775Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:264: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" at the cluster scope"
....  (the same three errors repeat)
```

There are many errors of the form xxx is forbidden, which points to an RBAC permission problem. From the prometheus resource object's configuration we know Prometheus runs under a ServiceAccount named prometheus-k8s, which is bound to a ClusterRole also named prometheus-k8s: (prometheus-clusterRole.yaml)

```yaml
[admin@ch-k8s1 manifests]$ cat prometheus-clusterRole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
```

Clearly these rules grant no list permission on Services or Pods, hence the errors. To fix it, we only need to add the required permissions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
```
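
Apply the updated ClusterRole (same file as above) and, after a short while, the errors should stop and the targets discovered by the additional configuration should appear:

```bash
kubectl apply -f prometheus-clusterRole.yaml
```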