1. Operator Introduction

An operator is a controller developed by CoreOS that extends the Kubernetes API for a specific application; it is used to create, configure, and manage complex stateful applications.
Prometheus Operator is an open-source controller from CoreOS for managing Prometheus on a Kubernetes cluster; it simplifies deploying, managing, and running Prometheus and Alertmanager clusters on Kubernetes.
The architecture is shown in the diagram below (image.png):

Operator: the Operator deploys and manages Prometheus Server based on custom resources (CustomResourceDefinitions, CRDs), and watches for changes to those custom resources and reacts to them; it is the control center of the whole system.
Prometheus: the Prometheus resource declaratively describes the desired state of a Prometheus deployment.
Prometheus Server: the Prometheus Server cluster deployed by the Operator according to the Prometheus custom resource; the custom resource can be thought of as the way to manage the StatefulSets backing the Prometheus Server cluster.
ServiceMonitor: ServiceMonitor is also a custom resource; it describes a list of targets to be monitored by Prometheus. It selects the corresponding Service endpoints via labels, and Prometheus Server scrapes metrics from the selected Services (a minimal example follows this list).
Service: the Service resource fronts the metrics-exposing Pods in the Kubernetes cluster and is what the ServiceMonitor selects so that Prometheus Server can scrape them. Simply put, it is the object Prometheus monitors, for example the Node Exporter Service or MySQL Exporter Service covered earlier.
Alertmanager: Alertmanager is also a custom resource type; the Operator deploys an Alertmanager cluster according to the resource definition.
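
To make the label-selection chain concrete, here is a minimal, hypothetical ServiceMonitor that selects Services labeled k8s-app: node-exporter and scrapes their port named "metrics"; the names and labels here are illustrative assumptions, not part of this deployment:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter          # illustrative name
  namespace: monitoring
  labels:
    monitor: k8s               # must match the serviceMonitorSelector in the Prometheus custom resource (see section 2.4.3)
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter   # selects Services carrying this label
  endpoints:
  - port: metrics              # name of the Service port to scrape
    interval: 30s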

2. Operator Deployment

2.1 Questions

1. For a stateful application, how do we pin it to specific machines?
2. How is a local PV configured?
3. How is the global Alertmanager configuration file set up?
4. How do we scrape metrics from the relevant services?
5. How do we build high availability and expose the service IP and port?
6. How do we call the API to add, delete, or update alert rules?

2.3 Choosing a Version

Check the official repository at https://github.com/prometheus-operator/kube-prometheus and find the release that matches your Kubernetes cluster version.
(compatibility matrix: image.png)
In our case that is release-0.2:
wget https://github.com/coreos/prometheus-operator/archive/v0.2.0.tar.gz

2.4 Deployment

2.4.1 Deploying the Operator & Adapter

Prerequisites
Check and, if necessary, remove taints on the target nodes:

# Check taints
kubectl describe node master-01.aipaas.japan|grep Taints
# Remove the taint
kubectl taint node master-1 node-role.kubernetes.io/master-
Sort the extracted manifests into directories by component:

mkdir -pv operator node-exporter alertmanager grafana kube-state-metrics prometheus-k8s serviceMonitor adapter ingress local-pv
mv *-serviceMonitor* serviceMonitor/
mv 0prometheus-operator* operator/
mv grafana-* grafana/
mv kube-state-metrics-* kube-state-metrics/
mv alertmanager-* alertmanager/
mv node-exporter-* node-exporter/
mv prometheus-adapter* adapter/
mv prometheus-* prometheus-k8s/

Install the Operator & the Adapter
Adapter
Here we use Prometheus-Adapter, a Kubernetes API extension that serves the Kubernetes resource metrics and custom metrics APIs from user-defined Prometheus queries.
Next we install Prometheus-Adapter into the cluster and add a rule to track requests per Pod. Rule definitions are described in the official documentation; each rule has roughly four parts (a sketch follows the list):
Discovery: specifies how the Adapter should find all Prometheus metrics for this rule
Association: specifies how the Adapter should determine which Kubernetes resource a given metric is associated with
Naming: specifies how the Adapter should expose the metric in the custom metrics API
Querying: specifies how a request for a given metric on one or more Kubernetes objects should be turned into a Prometheus query
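
As a rough sketch of an adapter rule combining the four parts (the metric name http_requests_total and its labels are assumptions for illustration, not taken from this deployment):

rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'    # Discovery: which Prometheus series to consider
  resources:                                                    # Association: map series labels to Kubernetes resources
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:                                                         # Naming: how the metric appears in the custom metrics API
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'   # Querying: PromQL template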

cd operator/
kubectl apply -f ./
[root@hf-aipaas-172-31-243-137 operator]# kubectl get crd -n monitoring
NAME                                     CREATED AT
alertmanagers.monitoring.coreos.com      2020-08-12T13:47:04Z
podmonitors.monitoring.coreos.com        2020-08-12T13:47:04Z
prometheuses.monitoring.coreos.com       2020-08-12T13:47:04Z
prometheusrules.monitoring.coreos.com    2020-08-12T13:47:05Z
servicemonitors.monitoring.coreos.com    2020-08-12T13:47:05Z
cd ../adapter
kubectl apply -f ./

2.4.2 Deploying the Local PV

Create a local-storage PV for Prometheus to use for its data.
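The local-storage StorageClass referenced below must already exist in the cluster. If it does not, a minimal definition (my assumption, not part of the original manifests) would be:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # local volumes are provisioned statically
volumeBindingMode: WaitForFirstConsumer     # delay binding until a consuming pod is scheduled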

cd local-pv

# PV configuration:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-pv-243-137-k8s
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /prometheus-k8s
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - hf-aipaas-172-31-243-137

kubectl apply -f ./

2.4.3 Deploying the Prometheus Service

Prometheus configuration, prometheus-prometheus.yaml:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-k8s
      namespace: monitoring
      port: alert
  retention: 30d                # retention period
  storage:                      # use the locally provisioned local-storage
    volumeClaimTemplate:
      spec:
        storageClassName: local-storage   # name of the StorageClass used by the local PV
        resources:
          requests:
            storage: 10Gi       # storage size
  baseImage: quay.io/prometheus/prometheus
  nodeSelector:
    prometheus: deployed        # schedule onto nodes with this label
  podMonitorSelector: {}
  replicas: 2                   # run two replicas
  evaluationInterval: 15s
  #secrets:                     # used for etcd monitoring
  #- etcd-certs
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  ruleNamespaceSelector: {}
  enableAdminAPI: true
  query:
    maxSamples: 5000000000
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  #additionalScrapeConfigs:     # additional federation scrape config
  #  name: metrics-prometheus-additional-configs
  #  key: prometheus-janus.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      monitor: k8s              # ServiceMonitors must carry this label to be scraped
  version: v2.14.0

Label the nodes

# Add labels
kubectl label nodes master-02.aipaas.japan prometheus=deployed
kubectl label nodes master-03.aipaas.japan prometheus=deployed
kubectl label nodes master-03.aipaas.japan monitor=k8s
kubectl label nodes master-02.aipaas.japan monitor=k8s
kubectl label nodes master-01.aipaas.japan monitor=k8s
kubectl label nodes master-02.aipaas.japan monitoring=operator
kubectl label nodes master-03.aipaas.japan monitoring=operator
# Check the labels
kubectl describe node hf-aipaas-172-31-243-137


Install Prometheus

cd prometheus-k8s
kubectl apply -f ./
# Check status
[root@hf-aipaas-172-31-243-137 prometheus-k8s]# kubectl get pod -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-k8s-0                       3/3     Running   0          28s
prometheus-k8s-1                       3/3     Running   0          46s
prometheus-operator-789d9c99c7-mtt78   1/1     Running   0          26m
[root@hf-aipaas-172-31-243-137 prometheus-k8s]# kubectl get svc -n monitoring
NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
prometheus-k8s        ClusterIP   10.99.144.165   <none>        9090/TCP   3m34s
prometheus-operated   ClusterIP   None            <none>        9090/TCP   3m34s
prometheus-operator   ClusterIP   None            <none>        8080/TCP   26m

2.4.4 Deploying the Alertmanager Service
Edit alertmanager-config-HA1.yaml

ConfigMap
A Pod can consume a ConfigMap in three ways:
Set ConfigMap data as environment variables
Pass ConfigMap data as command-line arguments
Mount the ConfigMap as files or a directory via a Volume (the approach used by the Deployment below)

apiVersion: v1
data:
  alertmanager.yml: |
    global:                      # global configuration
      smtp_smarthost: 'mail.iflytek.com:25'
      smtp_from: 'ifly_sre_monitor@iflytek.com'
      smtp_auth_username: 'ifly_sre_monitor@iflytek.com'
      smtp_auth_password: 'Z81c94c79#2018'
      smtp_require_tls: false
    templates:
    - '/etc/alertmanager/template/*.tmpl'   # template configuration
    route:
      group_by: ['alertname']
      group_wait: 60s
      group_interval: 5m
      repeat_interval: 3h
      receiver: ycli15
    receivers:
    - name: 'ycli15'
      email_configs:
      - to: 'ycli15@iflytek.com'
        send_resolved: true
      webhook_configs:           # alert webhook configuration
      - url: 'http://172.21.210.98:80822/alarm'
      #- url: 'http://prom-alarm-dx.xfyun.cn/alarm'
        send_resolved: true
kind: ConfigMap
metadata:
  name: alertmanager-k8s-ha1-config
  namespace: monitoring
Edit alertmanager-deploy-HA1.yaml

---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  labels:
    name: alertmanager-k8s-ha1-deployment
  name: alertmanager-k8s-ha1
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager-k8s-ha1
  template:
    metadata:
      labels:
        app: alertmanager-k8s-ha1
    spec:
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Equal"
        value: ""
        effect: "NoSchedule"
      nodeSelector:
        kubernetes.io/hostname: hf-aipaas-172-31-243-137
      hostNetwork: true
      dnsPolicy: Default
      containers:
      - image: quay.io/prometheus/alertmanager:v0.20.0
        name: alertmanager-k8s-ha1
        imagePullPolicy: IfNotPresent
        command:
        - "/bin/alertmanager"
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--storage.path=/data"
        - "--web.listen-address=:9095"
        - "--cluster.listen-address=0.0.0.0:8003"
        - "--cluster.peer=172.31.243.137:8003"
        - "--cluster.peer-timeout=30s"
        - "--cluster.gossip-interval=50ms"
        - "--cluster.pushpull-interval=2s"
        - "--log.level=debug"
        ports:
        - containerPort: 9095
          name: web
          protocol: TCP
        - containerPort: 8003
          name: mesh
          protocol: TCP
        volumeMounts:
        - mountPath: "/etc/alertmanager"
          name: config-alert
        - mountPath: "/data"
          name: storage
      volumes:
      - name: config-alert
        configMap:               # mounts the ConfigMap defined in alertmanager-config-HA1.yaml
          name: alertmanager-k8s-ha1-config
      - name: storage
        emptyDir: {}
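
The Prometheus custom resource above points its alerting section at a Service named alertmanager-k8s with a port named alert, and the service listing later shows it as a headless Service on 9095. A minimal sketch of such a Service, assuming both HA deployments add a shared alertmanager: k8s label to their pod templates (that selector label is an assumption, not shown in the original manifests):

apiVersion: v1
kind: Service
metadata:
  name: alertmanager-k8s
  namespace: monitoring
spec:
  clusterIP: None            # headless, matching the later kubectl get svc output
  selector:
    alertmanager: k8s        # assumed shared label on both HA pods
  ports:
  - name: alert              # port name referenced by the Prometheus CR (alerting.alertmanagers.port)
    port: 9095
    targetPort: 9095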

Deploy Alertmanager

cd alertmanager-HA
kubectl apply -f ./
# Check the pods
[root@hf-aipaas-172-31-243-137 manifests]# kubectl get pod -n monitoring
NAME                                    READY   STATUS    RESTARTS   AGE
alertmanager-k8s-ha1-6445cd7944-tnrt9   1/1     Running   0          26s
alertmanager-k8s-ha2-f98cbd597-hjrq7    1/1     Running   0          26s
prometheus-k8s-0                        3/3     Running   0          9m21s
prometheus-k8s-1                        3/3     Running   0          9m39s
prometheus-operator-789d9c99c7-mtt78    1/1     Running   0          35m

2.4.5 kube-state-metrics

kube-state-metrics collects data about the resource objects in the Kubernetes cluster.
Deploy the kube-state-metrics service.

Note the following excerpt from kube-state-metrics-deployment.yaml:

name: kube-state-metrics
resources:
  limits:
    cpu: 100m
    memory: 150Mi
  requests:
    cpu: 100m
    memory: 150Mi
nodeSelector:                # node label to schedule onto
  monitoring: operator
securityContext:
  runAsNonRoot: true
  runAsUser: 65534
serviceAccountName: kube-state-metrics

cd kube-state-metrics
kubectl apply -f ./
# Check
[root@hf-aipaas-172-31-243-137 kube-state-metrics]# kubectl get svc -n monitoring
NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
alertmanager-k8s      ClusterIP   None            <none>        9095/TCP            10m
kube-state-metrics    ClusterIP   None            <none>        8443/TCP,9443/TCP   2m43s
prometheus-k8s        ClusterIP   10.99.144.165   <none>        9090/TCP            22m
prometheus-operated   ClusterIP   None            <none>        9090/TCP            22m
prometheus-operator   ClusterIP   None            <none>        8080/TCP            44m

2.4.6 Deploying the ServiceMonitor Resources

A ServiceMonitor plays the same role as a scrape configuration entry in a plain Prometheus config.

cd serviceMonitor
kubectl apply -f ./

2.4.7 Deploying etcd Monitoring

https://www.yuque.com/nicechuan/pc096b/xsnlug

[root@hf-aipaas-172-31-243-137 serviceMonitor]# cd etcd/
[root@hf-aipaas-172-31-243-137 etcd]# kubectl apply -f ./

2.4.8 Deploying the Ingress Service

cd ingress
kubectl apply -f ./
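
No Ingress manifest is shown above, so the following is only a hedged sketch: hostnames are taken from the hosts entries in the next step and the backend Services and ports from the earlier kubectl get svc output, while the resource name and the apiVersion (which depends on your cluster version) are assumptions:

apiVersion: networking.k8s.io/v1beta1   # use networking.k8s.io/v1 on newer clusters
kind: Ingress
metadata:
  name: prometheus-k8s
  namespace: monitoring
spec:
  rules:
  - host: prometheus.minikube.local.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-k8s
          servicePort: 9090
  - host: alertmanager.minikube.local.com
    http:
      paths:
      - path: /
        backend:
          serviceName: alertmanager-k8s
          servicePort: 9095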

2.4.9 Binding hosts Entries to Access the Services

Add the following to /etc/hosts on the client machine:

172.16.59.204 prometheus.minikube.local.com
172.16.59.204 alertmanager.minikube.local.com

3. Calling the Operator API

3.1 Create the ServiceAccount, roleServiceaccount.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: prometheus-k8s
  name: zprometheus

3.2 Bind the account to the namespace, roleBinding.yaml

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: zprometheus
  namespace: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: zprometheus
  namespace: prometheus-k8s
roleRef:
  kind: Role
  name: zprometheus
  apiGroup: rbac.authorization.k8s.io

3.3 Specify the scope and permissions of the account, role.yaml

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: prometheus-k8s   # the namespace in which this Role takes effect
  name: zprometheus
rules:                        # permission rules
- apiGroups: [""]
  resources: ["services","endpoints"]   # core resources in this namespace
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["monitoring.coreos.com"]
  resources: ["prometheusrules","servicemonitors","alertmanagers","prometheuses","prometheuses/finalizers","alertmanagers/finalizers"]   # Prometheus Operator CRDs in this namespace
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

3.4 Get the Secret token

[root@hf-aipaas-172-31-243-137 role-zprometheus]# kubectl get secrets -n prometheus-k8s
NAME                      TYPE                                  DATA   AGE
default-token-bk8sn       kubernetes.io/service-account-token   3      13h
zprometheus-token-sfnmn   kubernetes.io/service-account-token   3      7m53s
[root@hf-aipaas-172-31-243-137 role-zprometheus]# kubectl get secret zprometheus-token-sfnmn -n prometheus-k8s -o jsonpath={.data.token} | base64 -d

3.5 Creating alert rules through the API

Python example

import json

import requests


def action_api():
    # Kubernetes API endpoint for PrometheusRule objects in the prometheus-k8s namespace
    prometheusurl = 'https://172.31.243.137:6443/apis/monitoring.coreos.com/v1/namespaces/prometheus-k8s/prometheusrules/'
    # ServiceAccount token obtained in section 3.4
    prometheustoken = 'eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJ6cHJvbWV0aGV1cy10b2tlbi1zZm5tbiIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJ6cHJvbWV0aGV1cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjVkM2NmMGRiLWRkMGYtMTFlYS1iMTZiLWZhMTYzZThiYzRmMSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpwcm9tZXRoZXVzLWs4czp6cHJvbWV0aGV1cyJ9.n3L9oJpTDmrNGr2Gnj2Tpsgyi51ZUzRh2qe4NinOs12jDLRmInsd1HPz7pynfwQEAN2qHsqPs29Ole_aSlegw1Q-0MUYMgwTxOR2GzEnaRMTELe7OglIaUYnZ9oztzq1jhouAitGK2UD_FpUVg94jxTCeFUFjfq7RP7YAqdLP6oyR_eepQM2tF0g4u6H5wN-nZYSorHF0HUn0VUAXc7uPQyZXSeSOhBAy5qu5xKyKp4VJ9veaGSfM6hgW0bz_Gahlk8IOyiMGPo2-S7MwRlxHezWv9KsDad7LGxCK2JvLetch5ekVlJgFgGknzWMFsrojWt4dc-4H1sQoC2SicFG9g'
    create_header = {
        "Authorization": "Bearer " + prometheustoken,
        "Content-Type": "application/json"
    }
    # PrometheusRule object to create; the labels must match the ruleSelector of the Prometheus CR
    rule_json = {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "labels": {
                "prometheus": "k8s",
                "role": "alert-rules"
            },
            "name": "ocr-pod-restart2",
            "namespace": "prometheus-k8s",
        },
        "spec": {
            "groups": [
                {
                    "name": "ocr-pod-restart",
                    "rules": [
                        {
                            "alert": "ocr-pod-restart",
                            "annotations": {
                                "summary": "pod {{$labels.pod}} is crashing: {{ $value }} restarts within three minutes! Node: {{$labels.node}}"
                            },
                            "expr": "changes(kube_pod_container_status_restarts_total{namespace=\"default\"}[3m]) * on (pod ) group_left(node) kube_pod_info > 0",
                            "for": "3s",
                            "labels": {
                                "alarm_group": "新架构-OCR-短信-全天",
                                "product_line": "新架构OCR",
                                "severity": "warning"
                            }
                        }
                    ]
                }
            ]
        }
    }
    try:
        rule_data = json.dumps(rule_json)
        res = requests.post(url=prometheusurl, data=rule_data, headers=create_header, verify=False)
        res_code = res.status_code
        if res_code != 201:
            return res_code, "create prometheusRule failed! the errBody is: " + str(res.json())
        else:
            return 0, "created"
    except Exception as e:
        return -2, str(e)

Check the newly added rule:

[root@hf-aipaas-172-31-243-137 role-zprometheus]# kubectl get prometheusrule -n prometheus-k8s
NAME               AGE
ocr-pod-restart2   3s
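
For reference, the JSON body posted above corresponds roughly to the following PrometheusRule manifest (approximately what kubectl get prometheusrule ocr-pod-restart2 -o yaml would return, minus server-added fields). Note the prometheus: k8s and role: alert-rules labels, which must match the ruleSelector in the Prometheus custom resource from section 2.4.3:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ocr-pod-restart2
  namespace: prometheus-k8s
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: ocr-pod-restart
    rules:
    - alert: ocr-pod-restart
      expr: 'changes(kube_pod_container_status_restarts_total{namespace="default"}[3m]) * on (pod) group_left(node) kube_pod_info > 0'
      for: 3s
      labels:
        severity: warning
      annotations:
        summary: "pod {{$labels.pod}} is crashing: {{ $value }} restarts within three minutes! Node: {{$labels.node}}"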

4. Calling the API from AIPaaS

Go example

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	Query := `kube_replicaset_status_ready_replicas{replicaset="calico-kube-controllers-54cd585c9c"}`
	QueryUrl := "http://172.16.59.204:80/api/v1/query"
	req, err := http.NewRequest(http.MethodGet, QueryUrl, nil)
	if err != nil {
		panic(err)
	}
	url_q := req.URL.Query()
	url_q.Add("query", Query)
	req.URL.RawQuery = url_q.Encode()
	// Route through the ingress by setting the Host header
	req.Host = "prometheus.minikube.local.com"
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}

To view a single rule as YAML:

kubectl get prometheusrule -n prometheus-k8s test-nginx -o yaml