Managing Prometheus with the Operator

Creating a Prometheus Instance

Once the Prometheus Operator is installed in the cluster, deploying a Prometheus server instance becomes a matter of declaring a Prometheus resource. As shown below, we create a Prometheus instance in the monitoring namespace:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: inst
  namespace: monitoring
spec:
  resources:
    requests:
      memory: 400Mi
```

Save the above content to a file named prometheus-inst.yaml and create it with kubectl:

```shell
$ kubectl create -f prometheus-inst.yaml
prometheus.monitoring.coreos.com/inst created
```

Now, listing the statefulsets in the monitoring namespace, we can see the Prometheus instance that the Prometheus Operator automatically created via a StatefulSet:

```shell
$ kubectl -n monitoring get statefulsets
NAME              DESIRED   CURRENT   AGE
prometheus-inst   1         1         1m
```

Check the Pod instances:

```shell
$ kubectl -n monitoring get pods
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-inst-0                      3/3     Running   1          1m
prometheus-operator-6db8dbb7dd-2hz55   1/1     Running   0          45m
```

Access the Prometheus instance via port-forward:

```shell
$ kubectl -n monitoring port-forward statefulsets/prometheus-inst 9090:9090
```

You can now open the Prometheus instance created by the Prometheus Operator locally at http://localhost:9090. Looking at its configuration page, you can see that so far the Operator has created a Prometheus instance with only a basic configuration:

Managing Prometheus with the Operator - Figure 1

Managing Monitoring Configuration with ServiceMonitor

Modifying the monitoring configuration is one of the most common operational tasks with Prometheus. To manage Prometheus configuration automatically, the Prometheus Operator uses the custom resource type ServiceMonitor to describe the targets to be monitored.

First, deploy a sample application in the cluster. Save the following content to example-app.yaml and create it with the kubectl command-line tool:

```yaml
kind: Service
apiVersion: v1
metadata:
  name: example-app
  labels:
    app: example-app
spec:
  selector:
    app: example-app
  ports:
  - name: web
    port: 8080
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-app
        image: fabxc/instrumented_app
        ports:
        - name: web
          containerPort: 8080
```

The sample application creates 3 Pod instances through the Deployment and exposes access to them through the Service.

```shell
$ kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
example-app-94c8bc8-l27vx   2/2     Running   0          1m
example-app-94c8bc8-lcsrm   2/2     Running   0          1m
example-app-94c8bc8-n6wp5   2/2     Running   0          1m
```

Again, use port-forward to access any of the Pod instances locally:

```shell
$ kubectl port-forward deployments/example-app 8080:8080
```

Visiting http://localhost:8080/metrics locally, the sample application returns sample data like the following:

```
# TYPE codelab_api_http_requests_in_progress gauge
codelab_api_http_requests_in_progress 3
# HELP codelab_api_request_duration_seconds A histogram of the API HTTP request durations in seconds.
# TYPE codelab_api_request_duration_seconds histogram
codelab_api_request_duration_seconds_bucket{method="GET",path="/api/bar",status="200",le="0.0001"} 0
```
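This text-based exposition format is line-oriented and straightforward to process. As an illustrative sketch (not the official Prometheus client library), a few lines of Python can parse samples like those above:

```python
def parse_exposition(text):
    """Parse simple lines of the Prometheus text exposition format into
    (metric, labels, value) tuples; HELP/TYPE comment lines are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, value = line.rsplit(" ", 1)
        if "{" in name_part:
            metric, labels = name_part.split("{", 1)
            labels = labels.rstrip("}")
        else:
            metric, labels = name_part, ""
        samples.append((metric, labels, float(value)))
    return samples

text = """# TYPE codelab_api_http_requests_in_progress gauge
codelab_api_http_requests_in_progress 3
codelab_api_request_duration_seconds_bucket{method="GET",path="/api/bar",status="200",le="0.0001"} 0
"""
print(parse_exposition(text))
```

Note this sketch handles only the subset of the format shown here (no escaped characters inside label values).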

With native Prometheus configuration, to have Prometheus scrape metrics from applications deployed on Kubernetes, we would define a dedicated job in the Prometheus configuration file and use kubernetes_sd to drive the service discovery process. With the Prometheus Operator, we can instead simply declare a ServiceMonitor object, as shown below:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    team: frontend
spec:
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - port: web
```

The labels in selector determine which Service objects to monitor, and endpoints specifies the port named web. By default, a ServiceMonitor can only select targets in its own namespace. In this example, since Prometheus is deployed in the monitoring namespace, the ServiceMonitor needs a namespaceSelector to select the example-app Service in the default namespace across namespace boundaries. Save the above content to example-app-service-monitor.yaml and create it with kubectl:

```shell
$ kubectl create -f example-app-service-monitor.yaml
servicemonitor.monitoring.coreos.com/example-app created
```
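The matchLabels selection used above follows standard Kubernetes label-selector semantics: an object is selected only if it carries every listed key/value pair. A minimal sketch of that matching logic, with hypothetical service data:

```python
def match_labels(selector, labels):
    """True if every key/value pair in the selector appears in labels
    (Kubernetes matchLabels semantics)."""
    return all(labels.get(k) == v for k, v in selector.items())

selector = {"app": "example-app"}
services = [
    {"name": "example-app", "labels": {"app": "example-app"}},
    {"name": "other", "labels": {"app": "other"}},
]
selected = [s["name"] for s in services if match_labels(selector, s["labels"])]
print(selected)  # ['example-app']
```

Extra labels on an object do not prevent selection; only the listed pairs must be present.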

If you want the ServiceMonitor to select targets in any namespace, define it as follows:

```yaml
spec:
  namespaceSelector:
    any: true
```

If the monitored target requires BasicAuth authentication, you can define basicAuth in the endpoints configuration of the ServiceMonitor, as shown below:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    team: frontend
spec:
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - basicAuth:
      password:
        name: basic-auth
        key: password
      username:
        name: basic-auth
        key: user
    port: web
```

Here basicAuth references a Secret named basic-auth; the user needs to store the credentials in that Secret manually:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: basic-auth
data:
  password: dG9vcg== # base64-encoded password
  user: YWRtaW4= # base64-encoded username
type: Opaque
```
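As a quick sanity check, the data values in a Secret are plain base64 encodings; the strings above correspond to the credentials admin/toor and can be reproduced as follows:

```python
import base64

# Kubernetes stores Secret data base64-encoded; encoding the raw
# credentials yields exactly the values used in the manifest above.
password = base64.b64encode(b"toor").decode()
user = base64.b64encode(b"admin").decode()
print(password)  # dG9vcg==
print(user)      # YWRtaW4=
```

Equivalently, `kubectl create secret generic` performs this encoding for you from literal or file inputs.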

Associating Prometheus with ServiceMonitors

The association between Prometheus and ServiceMonitors is defined with serviceMonitorSelector, which selects by label the ServiceMonitor objects that this Prometheus instance should use. Modify the Prometheus definition in prometheus-inst.yaml as follows:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: inst
  namespace: monitoring
spec:
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  resources:
    requests:
      memory: 400Mi
```

Apply the Prometheus change to the cluster:

```shell
$ kubectl -n monitoring apply -f prometheus-inst.yaml
```

At this point, checking the Prometheus configuration again, we will be pleased to find that the configuration file now automatically includes a job named monitoring/example-app/0:

```yaml
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    prometheus: monitoring/inst
    prometheus_replica: prometheus-inst-0
alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
rule_files:
- /etc/prometheus/rules/prometheus-inst-rulefiles-0/*.yaml
scrape_configs:
- job_name: monitoring/example-app/0
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app]
    separator: ;
    regex: example-app
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: web
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: web
    action: replace
```
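The first two relabel rules use the keep action to filter discovered targets down to the endpoints selected by the ServiceMonitor. A rough simulation of that action (Prometheus fully anchors relabel regexes; the target label sets below are hypothetical):

```python
import re

def keep(targets, source_labels, regex, separator=";"):
    """Simulate Prometheus' 'keep' relabel action: a target survives only
    if the joined source-label values fully match the regex."""
    pattern = re.compile(f"^(?:{regex})$")
    return [t for t in targets
            if pattern.match(separator.join(t.get(l, "") for l in source_labels))]

targets = [
    {"__meta_kubernetes_service_label_app": "example-app"},
    {"__meta_kubernetes_service_label_app": "other-app"},
]
kept = keep(targets, ["__meta_kubernetes_service_label_app"], "example-app")
print(kept)  # only the example-app target survives
```

The subsequent replace rules then map discovery metadata onto the familiar namespace, service, pod, and job labels.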

However, a careful reader may notice that although the job configuration is there, the Prometheus targets page does not contain any monitored targets. Looking at the logs of the Prometheus Pod instance, we see the following:

```
level=error ts=2018-12-15T12:52:48.452108433Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:300: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:default\" cannot list endpoints in the namespace \"default\""
```

Using a Custom ServiceAccount

By default, the Prometheus instance runs under the default service account of the monitoring namespace, which has no permission to read any resources in the default namespace.

To fix this, we need to create a ServiceAccount named prometheus in the monitoring namespace and grant it the appropriate cluster access permissions.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
```

Save the above content to prometheus-rbac.yaml and create the resources with kubectl:

```shell
$ kubectl -n monitoring create -f prometheus-rbac.yaml
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
```

After the ServiceAccount is created, modify prometheus-inst.yaml to add the serviceAccountName, as shown below:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: inst
  namespace: monitoring
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  resources:
    requests:
      memory: 400Mi
```

Apply the Prometheus change to the cluster:

```shell
$ kubectl -n monitoring apply -f prometheus-inst.yaml
prometheus.monitoring.coreos.com/inst configured
```

After the Prometheus Operator finishes applying the configuration change, checking Prometheus again shows that it can now scrape the sample application's metrics normally.
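Once scraping works, target health can also be checked programmatically through Prometheus' /api/v1/targets HTTP API. Below is a minimal sketch that parses a response of that shape; the sample payload here is abbreviated and hypothetical, whereas against the live instance you would fetch http://localhost:9090/api/v1/targets via the port-forward:

```python
import json

def target_health(payload):
    """Return (job, health) pairs from a /api/v1/targets response body."""
    data = json.loads(payload)
    return [(t["labels"].get("job", "?"), t["health"])
            for t in data["data"]["activeTargets"]]

# Abbreviated, hypothetical response body for illustration only.
sample = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "example-app"}, "health": "up"},
    ]},
})
print(target_health(sample))  # [('example-app', 'up')]
```

A target whose health is anything other than "up" is a good starting point for debugging scrape or RBAC issues like the one above.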