Background

The production Kubernetes cluster runs version 1.16 with Prometheus Operator on the release-0.4 branch. The compatibility matrix between Prometheus Operator (kube-prometheus) and Kubernetes versions is shown in the figure below; see https://github.com/prometheus-operator/kube-prometheus.
Note: Prometheus Operator or kube-prometheus? kube-prometheus is essentially a complete deployment of Prometheus built around the Prometheus Operator, and the kube-prometheus repository now lives under the prometheus-operator organization.
image.png
For the deployment process, see the notes by 超级小豆丁: http://www.mydlq.club/article/10/. The usual Prometheus architecture diagram can be found in plenty of other write-ups.

Kubernetes 1.20.5 installing Prometheus Operator - Figure 2 (Prometheus architecture diagram)
First, do a basic deployment of Prometheus Operator (that is, kube-prometheus) and get WeChat alerting integrated; the rest can be studied gradually in the production environment.
The overall process: deploy kube-prometheus, add persistent storage, add WeChat alerting, and expose the web UIs through the external traefik proxy.

1. Setting up the Prometheus environment

1. Clone the kube-prometheus repository

git clone https://github.com/prometheus-operator/kube-prometheus.git

image.png
Because of network issues the clone frequently fails, so just download the zip archive instead. Per the compatibility matrix, Kubernetes 1.20 can use kube-prometheus release-0.6, release-0.7, or HEAD; to save effort I went straight with HEAD.
Record the tag or commit you deploy from, so that later modifications and version upgrades are easy to keep track of. A pinned-branch alternative is sketched below.
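If you would rather pin to a release than track HEAD, a shallow clone of the matching branch is one option (a sketch; release-0.7 is taken from the compatibility matrix):

# clone only the release branch that matches the cluster version
git clone -b release-0.7 --depth 1 https://github.com/prometheus-operator/kube-prometheus.git
# record the exact commit for future upgrades
cd kube-prometheus && git rev-parse --short HEAD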
image.png
Upload the zip archive and unpack it:

unzip kube-prometheus-main.zip

image.png

2. Run through the quickstart

cd kube-prometheus-main/
kubectl create -f manifests/setup
until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
kubectl create -f manifests/
kubectl get pods -n monitoring

image.png
image.png

3. ImagePullBackOff

Some images cannot be pulled because of network restrictions. The usual fix is to pull them on a server outside the firewall, retag them, push them to Harbor (or another private registry), and change the image references in the YAML files to the corresponding private-registry tags. (My private registry is Tencent Cloud's, and cross-region pushes no longer seem to be allowed on the personal tier, so I exported the image with docker save instead.)

image.png

kubectl describe pods kube-state-metrics-56f988c7b6-qxqjn -n monitoring

image.png

1. Pull the image on an overseas server, save it as a tar archive, and download it locally.

docker pull k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-rc.0
docker save k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-rc.0 -o kube-state-metrics.tar

image.png

2. Import the image with ctr

ctr -n k8s.io i import kube-state-metrics.tar

image.png
This only imports the image onto a single worker node, but Kubernetes is all about high availability: what if the pod gets rescheduled onto another node? Add node affinity? What if that node crashes? Import the image on every node? What about nodes joined later? Better to push it to an image registry properly!
The proper workflow should look something like this:

crictl images
ctr image tag k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-rc.0 ccr.ccs.tencentyun.com/k8s_containers/kube-state-metrics:v2.0.0-rc.0

image.png
But why "not found"? It might be a tag-format problem; more likely, since the image was imported into the k8s.io containerd namespace, the tag command probably needs -n k8s.io as well. In any case, tag the image and push it to the registry; for the exact commands see https://blog.csdn.net/tongzidane/article/details/114587138 and https://blog.csdn.net/liumiaocn/article/details/103320426/
(Pushing to my registry still hits a permission problem (pulling from it works, but pushing has me confused), so I gave up on that for now and just imported the image directly.)
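For reference, the retag-and-push flow would look roughly like this once the containerd namespace is specified (a sketch; the credentials are placeholders):

# tag inside the k8s.io namespace, where the image was imported
ctr -n k8s.io image tag k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.0.0-rc.0 ccr.ccs.tencentyun.com/k8s_containers/kube-state-metrics:v2.0.0-rc.0
# push to the private registry (needs an account with push permission)
ctr -n k8s.io image push --user <user>:<password> ccr.ccs.tencentyun.com/k8s_containers/kube-state-metrics:v2.0.0-rc.0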
image.png
Anyway, what matters is getting kube-state-metrics-XXXX running. When there is time I should study the ctr and crictl commands properly; they are still a bit confusing.

3. Verify that all the services started normally

kubectl get pod -n monitoring
kubectl get svc -n monitoring

image.png

4. Proxy the applications with traefik

Note: see the earlier article "Kubernetes 1.20.5 installing traefik on Tencent Cloud in practice": https://www.yuque.com/duiniwukenaihe/ehb02i/odflm7#WT4ab. I am used to the IngressRoute style, so I kept it rather than using Ingress or the API-based approach.

cat monitoring.com.yaml

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  namespace: monitoring
  name: alertmanager-main-http
spec:
  entryPoints:
  - web
  routes:
  - match: Host(`alertmanager.saynaihe.com`)
    kind: Rule
    services:
    - name: alertmanager-main
      port: 9093
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  namespace: monitoring
  name: grafana-http
spec:
  entryPoints:
  - web
  routes:
  - match: Host(`monitoring.saynaihe.com`)
    kind: Rule
    services:
    - name: grafana
      port: 3000
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  namespace: monitoring
  name: prometheus
spec:
  entryPoints:
  - web
  routes:
  - match: Host(`prometheus.saynaihe.com`)
    kind: Rule
    services:
    - name: prometheus-k8s
      port: 9090
---

kubectl apply -f monitoring.com.yaml

Verify that the traefik proxying works:
image.png
image.png
Change the password:
image.png
Just poking around for the demo; this will be revisited later.
image.png
image.png
This is for demonstration only; later, at the very least the Alertmanager and Prometheus web UIs should be put behind basic auth (a sketch follows below).
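One way to do that with traefik is a basicAuth middleware referenced from the IngressRoutes above; a minimal sketch (the secret and middleware names are hypothetical, and the users file is a standard htpasswd entry):

# create an htpasswd users file and store it as a secret (htpasswd comes from httpd-tools)
htpasswd -bc users admin 'CHANGE_ME'
kubectl -n monitoring create secret generic monitoring-basic-auth --from-file=users
# declare the middleware
cat <<EOF | kubectl apply -f -
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: monitoring-basic-auth
  namespace: monitoring
spec:
  basicAuth:
    secret: monitoring-basic-auth
EOF
# then add to each protected IngressRoute route:
#   middlewares:
#   - name: monitoring-basic-auth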

5. Add monitoring for kube-controller-manager and kube-scheduler

The https://prometheus.saynaihe.com/targets page shows that, just like in previous versions, there is still no monitoring for kube-scheduler and kube-controller-manager.
image.png

Edit kube-controller-manager.yaml and kube-scheduler.yaml in /etc/kubernetes/manifests/, changing - --bind-address=127.0.0.1 to - --bind-address=0.0.0.0.

image.png
image.png

These are static pod manifests, so the kube-controller-manager and kube-scheduler pods restart automatically once the files are changed. Wait for the restart and verify.
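To confirm the new bind address took effect, something along these lines on each master node works (assuming ss is available; 10257 and 10259 are the secure metrics ports):

# the metrics ports should now listen on 0.0.0.0 rather than 127.0.0.1
ss -tlnp | grep -E '10257|10259'
kubectl -n kube-system get pods -o wide | grep -E 'kube-(controller-manager|scheduler)'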

image.png
In the manifests directory (be careful with this step: the matchLabels in the new version have changed):

grep -A2 -B2 selector kubernetes-serviceMonitor*

image.png

cat <<EOF > kube-controller-manager-scheduler.yml
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    app.kubernetes.io/name: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10257
    targetPort: 10257
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10259
    targetPort: 10259
    protocol: TCP
EOF
kubectl apply -f kube-controller-manager-scheduler.yml

image.png
image.png

cat <<EOF > kube-ep.yml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.3.2.5
  - ip: 10.3.2.13
  - ip: 10.3.2.16
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.3.2.5
  - ip: 10.3.2.13
  - ip: 10.3.2.16
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
EOF
kubectl apply -f kube-ep.yml
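A quick sanity check that the Services and Endpoints line up, before looking at the targets page:

kubectl -n kube-system get svc kube-controller-manager kube-scheduler
kubectl -n kube-system get endpoints kube-controller-manager kube-scheduler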

image.png
Open https://prometheus.saynaihe.com/targets to verify:
image.png

6. Monitoring etcd

kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key --from-file=/etc/kubernetes/pki/etcd/ca.crt
kubectl edit prometheus k8s -n monitoring
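In the edit session, add the etcd-certs secret under spec.secrets (the same field shows up again in the full prometheus-prometheus.yaml in section 7). An equivalent non-interactive patch, as a sketch:

# mount the etcd client certificates into the Prometheus pods via spec.secrets
kubectl -n monitoring patch prometheus k8s --type merge -p '{"spec":{"secrets":["etcd-certs"]}}'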

image.png

Verify that Prometheus has the certificates mounted:

[root@sh-master-02 yaml]# kubectl exec -it prometheus-k8s-0 /bin/sh -n monitoring
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/secrets/etcd-certs/
ca.crt healthcheck-client.crt healthcheck-client.key

cat <<EOF > kube-ep-etcd.yml
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: etcd
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: etcd
  name: etcd-k8s
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.3.2.5
  - ip: 10.3.2.13
  - ip: 10.3.2.16
  ports:
  - name: etcd
    port: 2379
    protocol: TCP
---
EOF
kubectl apply -f kube-ep-etcd.yml

cat <<EOF > prometheus-serviceMonitorEtcd.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
spec:
  jobLabel: k8s-app
  endpoints:
  - port: etcd
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - kube-system
EOF
kubectl apply -f prometheus-serviceMonitorEtcd.yaml

image.png

7. Moving the Prometheus configuration to production

1. Add the auto-discovery (additional scrape) configuration

This one was copied more or less verbatim from an example online:

# the quoted 'EOF' keeps the shell from expanding $1:$2 inside the relabel config
cat <<'EOF' > prometheus-additional.yaml
- job_name: 'kubernetes-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
EOF
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
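For a Service to be picked up by this job it needs the matching annotations; a minimal illustration (the service name and port are hypothetical):

cat <<EOF > example-annotated-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                       # hypothetical application service
  namespace: default
  annotations:
    prometheus.io/scrape: "true"     # matched by the 'keep' rule above
    prometheus.io/port: "8080"       # rewritten into __address__
    prometheus.io/path: "/metrics"   # rewritten into __metrics_path__
spec:
  selector:
    app: my-app
  ports:
  - name: http
    port: 8080
EOF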

2. Add storage, retention time, and the etcd secret

cat <<EOF > prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.25.0
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: alertmanager-main
      namespace: monitoring
      port: web
  externalLabels: {}
  image: quay.io/prometheus/prometheus:v2.25.0
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 2.25.0
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  secrets:
  - etcd-certs
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  retention: 60d
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.25.0
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: cbs-csi
        resources:
          requests:
            storage: 50Gi
EOF
kubectl apply -f prometheus-prometheus.yaml
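After applying, a quick check that both replicas came up and the volume claims were bound (the pod label comes from podMetadata above):

kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus
kubectl -n monitoring get pvc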

image.png

8. Add persistent storage for Grafana

1. Create a PVC for Grafana:

cat <<EOF > grafana-pv.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: monitoring
spec:
  storageClassName: cbs-csi
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF
kubectl apply -f grafana-pv.yaml
2. Modify the storage in grafana-deployment.yaml in the manifests directory (a fragment is sketched below the screenshot):
    image.png
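The change boils down to replacing the emptyDir grafana-storage volume with the new claim; roughly (assuming the volume in grafana-deployment.yaml is still named grafana-storage):

# fragment of manifests/grafana-deployment.yaml
      volumes:
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana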

9. Add Grafana dashboard templates

Add the etcd and traefik dashboards by importing dashboard IDs 10906 and 3070. The traefik dashboard will complain: Panel plugin not found: grafana-piechart-panel.
Fix: rebuild the Grafana image and install the missing plugin with /usr/share/grafana/bin/grafana-cli plugins install grafana-piechart-panel (a sketch follows).
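A minimal sketch of such a rebuild (the base tag and target repository are assumptions; match whatever manifests/grafana-deployment.yaml currently uses, then point its image field at the new tag):

cat <<EOF > Dockerfile
# base tag is an assumption; keep it in line with grafana-deployment.yaml
FROM grafana/grafana:7.5.4
RUN /usr/share/grafana/bin/grafana-cli plugins install grafana-piechart-panel
EOF
docker build -t ccr.ccs.tencentyun.com/XXXXX/grafana:7.5.4-piechart .
docker push ccr.ccs.tencentyun.com/XXXXX/grafana:7.5.4-piechart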

image.png
image.png

10. WeChat alerting

image.png
image.png
Fill the corresponding secret, agent ID, and corp ID into alertmanager.yaml.

1. Configure alertmanager.yaml

cat <<EOF > alertmanager.yaml
global:
  resolve_timeout: 2m
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
route:
  group_by: ['alert']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 1h
  receiver: wechat
receivers:
- name: 'wechat'
  wechat_configs:
  - api_secret: 'XXXXXXXXXX'
    send_resolved: true
    to_user: '@all'
    to_party: 'XXXXXX'
    agent_id: 'XXXXXXXX'
    corp_id: 'XXXXXXXX'
templates:
# note: kube-prometheus mounts the alertmanager-main secret at /etc/alertmanager/config/,
# so this path may need to be /etc/alertmanager/config/wechat.tmpl instead
- '/etc/config/alert/wechat.tmpl'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'production', 'instance']
EOF

2. Customize the alert template; this is up to you, and there are plenty of examples online. (In the block below, the heredoc delimiter is quoted so the shell does not expand the $alert template variables, and the file is named wechat.tmpl to match the reference in alertmanager.yaml and the secret created in the next step.)

cat <<'EOF' > wechat.tmpl
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
==========异常告警==========
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
命名空间: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
节点信息: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
实例名称: {{ $alert.Labels.pod }}
{{- end }}
============END============
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
==========异常恢复==========
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
恢复时间: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
命名空间: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
节点信息: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
实例名称: {{ $alert.Labels.pod }}
{{- end }}
============END============
{{- end }}
{{- end }}
{{- end }}
{{- end }}
EOF

3. Deploy the secret

kubectl delete secret alertmanager-main -n monitoring
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=wechat.tmpl -n monitoring

image.png

4. Verify

image.png

11. Bonus

This was a good excuse to try out the Kubernetes HPA:

[root@sh-master-02 yaml]# kubectl top pods -n qa
W0330 16:00:54.657335 2622645 top_pod.go:265] Metrics not available for pod qa/dataloader-comment-5d975d9d57-p22w9, age: 2h3m13.657327145s
error: Metrics not available for pod qa/dataloader-comment-5d975d9d57-p22w9, age: 2h3m13.657327145s

Wait, doesn't kube-prometheus already ship a metrics adapter? What is going on?

kubectl logs -f prometheus-adapter-c96488cdd-vfm7h -n monitoring

As shown below: I changed the cluster's dnsDomain when installing Kubernetes but never updated this config file, which is the problem.
image.png
In the manifests directory, edit the prometheus-url in prometheus-adapter-deployment.yaml:
image.png
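The argument in question looks roughly like this (the value is an assumption; substitute the dnsDomain actually configured for your cluster):

# in manifests/prometheus-adapter-deployment.yaml, point the URL at the real cluster domain
- --prometheus-url=http://prometheus-k8s.monitoring.svc.<your-dnsDomain>:9090/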
After that, kubectl top nodes works:
image.png

12. A quick look at HPA

Following https://blog.csdn.net/weixin_38320674/article/details/105460033. The environment already has metrics, so start from step 7 of that article.
image.png
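For reference, the image built below is essentially the standard php-apache example from the Kubernetes HPA walkthrough; a sketch of the Dockerfile and index.php it is assumed to contain:

cat <<'EOF' > index.php
<?php
  $x = 0.0001;
  for ($i = 0; $i <= 1000000; $i++) {
    $x += sqrt($x);
  }
  echo "OK!";
?>
EOF
cat <<EOF > Dockerfile
FROM php:5-apache
COPY index.php /var/www/html/index.php
RUN chmod a+rx index.php
EOF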

1. Build the image and push it to the registry

docker build -t ccr.ccs.tencentyun.com/XXXXX/test1:0.1 .
docker push ccr.ccs.tencentyun.com/XXXXX/test1:0.1

2. Deploy a php-apache service with a Deployment

cat php-apache.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  replicas: 1
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: ccr.ccs.tencentyun.com/XXXXX/test1:0.1
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 200m
          requests:
            cpu: 100m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  labels:
    run: php-apache
spec:
  ports:
  - port: 80
  selector:
    run: php-apache

kubectl apply -f php-apache.yaml

3. Create the HPA

kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10

What the flags mean:

kubectl autoscale deployment php-apache (php-apache is the Deployment name) --cpu-percent=50 (target CPU utilization of 50%) --min=1 (at least 1 pod) --max=10 (at most 10 pods)

4. Load-test the php-apache service (CPU only)

Start a container and send an infinite query loop to the php-apache service (open a new terminal window on a k8s master node):

kubectl run v1 -it --image=busybox /bin/sh

Once inside the container, run:

while true; do wget -q -O- http://php-apache.default; done

image.png
Only CPU is tested here, a simple demo; the rest will be covered separately.
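While the load is running, you can watch the autoscaler and the replica count react:

kubectl get hpa php-apache -w         # TARGETS should climb past 50% and REPLICAS grow
kubectl get deployment php-apache     # confirm the scaled replica count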

13. Other pitfalls

I accidentally deleted the PV and PVC... and assumed my StorageClass was broken, so: redeploy? I figured re-applying prometheus-prometheus.yaml would be enough, but the Prometheus pods never showed up. After going through the logs it turned out the secret from step 7.1 was missing; re-running the command below fixed it. I do not remember where exactly I deleted that secret... noting it here:

kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring