Prerequisites:

  • A working Kubernetes cluster
  • Helm 3 installed

(1) Add the Helm repository

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```

(2) Download the kube-prometheus-stack chart

```shell
helm pull prometheus-community/kube-prometheus-stack
# helm pull saves the chart as kube-prometheus-stack-<version>.tgz
tar xf kube-prometheus-stack-*.tgz
```

(3) Create the namespace

```shell
kubectl create ns monitoring
```

(4) Create the StorageClasses

Note: local StorageClasses are used here, backed by SSD disks.

altermanager-storage.yaml

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage-alertmanager
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  name: local-pv-alertmanager
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  local:
    path: /opt/hipay/lib/altermanager
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: category
          operator: In
          values:
          - monitoring
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage-alertmanager
  volumeMode: Filesystem
```

grafana-storage.yaml

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage-grafana
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  name: local-grafana-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 20Gi
  local:
    path: /opt/hipay/lib/grafana
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: category
          operator: In
          values:
          - monitoring
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage-grafana
  volumeMode: Filesystem
```

prometheus-storage.yaml

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage-prometheus
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  name: local-pv-prometheus
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 400Gi
  local:
    path: /opt/hipay/lib/prometheus
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: category
          operator: In
          values:
          - monitoring
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage-prometheus
  volumeMode: Filesystem
```

Strictly speaking, a single StorageClass would be enough; I created one per service simply to keep them easy to tell apart.
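One detail worth noting: a `kubernetes.io/no-provisioner` local PV does not create its backing directory, so the `local.path` directories referenced above must already exist on the node before the pods are scheduled. A minimal sketch, run on the node that will host the monitoring pods (paths taken from the manifests above):

```shell
# Create the backing directories for the three local PVs.
# Run on the node labeled category=monitoring; each path must
# match the corresponding local.path field in the PV manifests.
mkdir -p /opt/hipay/lib/altermanager \
         /opt/hipay/lib/grafana \
         /opt/hipay/lib/prometheus
```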

(5) Create the custom values file

my-values.yaml

```yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    templates:
    - '/etc/alertmanager/config/*.tmpl'
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: webhook
      routes:
      - match:
          alertname: Watchdog
        receiver: 'webhook'
    receivers:
    - name: webhook
      webhook_configs:
      - url: "http://prometheus-alter-webhook.monitoring.svc:9000"
        send_resolved: false
  ingress:
    enabled: true
    hosts:
    - altermanager.coolops.cn
  alertmanagerSpec:
    image:
      repository: quay.io/prometheus/alertmanager
      tag: v0.21.0
      sha: ""
    replicas: 1
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: local-storage-alertmanager
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: category
              operator: In
              values:
              - monitoring
grafana:
  adminPassword: dmCE9M$PQt@Q%eht
  ingress:
    enabled: true
    hosts:
    - grafana.coolops.cn
  persistence:
    type: pvc
    enabled: true
    storageClassName: local-storage-grafana
    accessModes:
    - ReadWriteOnce
    size: 20Gi
prometheusOperator:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: category
            operator: In
            values:
            - monitoring
prometheus:
  ingress:
    enabled: true
    hosts:
    - prometheus.coolops.cn
  prometheusSpec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: category
              operator: In
              values:
              - monitoring
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: local-storage-prometheus
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 400Gi
```

Note: nodeAffinity is used here for scheduling, so the target nodes must be labeled first: `kubectl label node <node-name> category=monitoring`

(6) Install

```shell
helm install prometheus -n monitoring kube-prometheus-stack -f kube-prometheus-stack/my-values.yaml
```

The following resources should then be visible in the monitoring namespace:

```shell
# kubectl get all -n monitoring
NAME                                                          READY   STATUS    RESTARTS   AGE
pod/alertmanager-prometheus-kube-prometheus-alertmanager-0    2/2     Running   0          4d20h
pod/prometheus-grafana-6d644ff97b-hxdxq                       2/2     Running   0          4d20h
pod/prometheus-kube-prometheus-operator-db6d5c564-nwm82       1/1     Running   0          4d20h
pod/prometheus-kube-state-metrics-c65b87574-4qx95             1/1     Running   0          4d20h
pod/prometheus-prometheus-kube-prometheus-prometheus-0        2/2     Running   1          4d20h
pod/prometheus-prometheus-node-exporter-2t8pt                 1/1     Running   0          4d20h
pod/prometheus-prometheus-node-exporter-52bj4                 1/1     Running   0          4d20h
pod/prometheus-prometheus-node-exporter-xwsbw                 1/1     Running   0          4d20h

NAME                                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   4d20h
service/prometheus-alter-webhook                  ClusterIP   192.168.12.211   <none>        9000/TCP                     2d18h
service/prometheus-grafana                        ClusterIP   192.168.11.93    <none>        80/TCP                       4d20h
service/prometheus-kube-prometheus-alertmanager   ClusterIP   192.168.11.95    <none>        9093/TCP                     4d20h
service/prometheus-kube-prometheus-operator       ClusterIP   192.168.10.102   <none>        443/TCP                      4d20h
service/prometheus-kube-prometheus-prometheus     ClusterIP   192.168.7.216    <none>        9090/TCP                     4d20h
service/prometheus-kube-state-metrics             ClusterIP   192.168.13.249   <none>        8080/TCP                     4d20h
service/prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     4d20h
service/prometheus-prometheus-node-exporter       ClusterIP   192.168.8.215    <none>        9100/TCP                     4d20h

NAME                                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/prometheus-prometheus-node-exporter   15        15        15      15           15          <none>          4d20h

NAME                                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prometheus-alert-webhook              1/1     1            1           2d18h
deployment.apps/prometheus-grafana                    1/1     1            1           4d20h
deployment.apps/prometheus-kube-prometheus-operator   1/1     1            1           4d20h
deployment.apps/prometheus-kube-state-metrics         1/1     1            1           4d20h

NAME                                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/prometheus-alert-webhook-7bd7766977             1         1         1       2d18h
replicaset.apps/prometheus-grafana-6d644ff97b                   1         1         1       4d20h
replicaset.apps/prometheus-kube-prometheus-operator-db6d5c564   1         1         1       4d20h
replicaset.apps/prometheus-kube-state-metrics-c65b87574         1         1         1       4d20h

NAME                                                                    READY   AGE
statefulset.apps/alertmanager-prometheus-kube-prometheus-alertmanager   1/1     4d20h
statefulset.apps/prometheus-prometheus-kube-prometheus-prometheus       1/1     4d20h
```

The web UIs can then be reached through the Ingress hostnames.
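For those hostnames to resolve, they must point at the cluster's ingress controller, either through DNS or, for a quick test, through /etc/hosts entries; the IP below is a placeholder for your ingress controller's address:

```
203.0.113.10  prometheus.coolops.cn  grafana.coolops.cn  altermanager.coolops.cn
```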

(7) Other configuration

Custom alerting rules

To define custom alerting rules, add an additionalPrometheusRules field directly in my-values.yaml, as follows:

```yaml
## Custom alerting rules
additionalPrometheusRules:
- name: blackbox-monitoring-rule
  groups:
  - name: blackbox_check
    rules:
    - alert: "SSL certificate expiring soon"
      expr: (probe_ssl_earliest_cert_expiry - time())/86400 < 30
      for: 1h
      labels:
        severity: warn
      annotations:
        description: 'The certificate for {{$labels.instance}} expires in {{ printf "%.1f" $value }} days; renew it soon'
        summary: "SSL certificate expiring soon"
    - alert: "Endpoint/host/port availability"
      expr: probe_success == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Endpoint/host/port availability check"
        description: "Endpoint/host/port {{ $labels.instance }} is unreachable"
```
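The expiry expression is plain arithmetic: probe_ssl_earliest_cert_expiry is a Unix timestamp, so dividing the remaining seconds by 86400 yields days, and the alert fires once fewer than 30 remain. The same calculation sketched in shell:

```shell
# Mirror of (probe_ssl_earliest_cert_expiry - time()) / 86400
now=$(date +%s)
expiry=$((now + 10 * 86400))        # pretend the cert expires in 10 days
days=$(( (expiry - now) / 86400 ))
echo "$days"                        # prints 10; 10 < 30, so the alert would fire
```

Before deploying, the rendered rule file can also be validated offline with `promtool check rules`.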

Custom scrape jobs

To define custom scrape jobs, add additionalScrapeConfigs under the prometheus.prometheusSpec field, as follows:

```yaml
prometheus:
  ......
  prometheusSpec:
    ......
    additionalScrapeConfigs:
    - job_name: "ingress-endpoint-status"
      metrics_path: /probe
      params:
        module: [http_2xx] # Look for a HTTP 200 response.
      static_configs:
      - targets:
        - http://172.16.51.23/healthz
        - http://172.16.51.24/healthz
        - http://172.16.51.25/healthz
        labels:
          group: nginx-ingress
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox.monitoring:9115
```
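The three relabel rules turn each listed URL into a blackbox-exporter probe: the original address becomes the `target` parameter and the `instance` label, while the scrape address itself is rewritten to the exporter. Assuming a blackbox exporter is running as the `blackbox` service in the `monitoring` namespace (as the replacement implies), Prometheus ends up requesting something like:

```
http://blackbox.monitoring:9115/probe?module=http_2xx&target=http://172.16.51.23/healthz
```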

Custom ServiceMonitors

To add a custom ServiceMonitor, add additionalServiceMonitors directly under prometheus, as follows:

```yaml
prometheus:
  ......
  ## Custom monitoring
  additionalServiceMonitors:
  - name: ingress-nginx-controller
    selector:
      matchLabels:
        app: ingress-nginx-metrics
    namespaceSelector:
      matchNames:
      - ingress-nginx
    endpoints:
    - port: metrics
      interval: 30s
```
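A ServiceMonitor only discovers Services whose labels match its selector within the selected namespaces, and the endpoint `port` refers to the Service's port name, not its number. A hypothetical Service this monitor would pick up (the name and pod selector are illustrative; 10254 is the conventional ingress-nginx metrics port):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-metrics        # hypothetical
  namespace: ingress-nginx           # matches namespaceSelector.matchNames
  labels:
    app: ingress-nginx-metrics      # matches selector.matchLabels
spec:
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
  - name: metrics                   # matches the ServiceMonitor endpoint port name
    port: 10254
```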