Prerequisites:
- A working Kubernetes cluster
- Helm 3 installed
(1) Add the Helm repository

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```
(2) Download the kube-prometheus-stack chart

```shell
helm pull prometheus-community/kube-prometheus-stack
tar xf kube-prometheus-stack-*.tgz
```
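Alternatively, `helm pull` can fetch and unpack the chart in one step with its `--untar` flag; a sketch (the `--version` value is a placeholder for whichever chart version you want to pin):

```shell
# Fetch the chart and unpack it in one step; creates ./kube-prometheus-stack/
helm pull prometheus-community/kube-prometheus-stack --untar

# Optionally pin a specific chart version for reproducible installs
helm pull prometheus-community/kube-prometheus-stack --version <chart-version> --untar
```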
(3) Create the namespace

```shell
kubectl create ns monitoring
```
(4) Create the StorageClasses
Note: local StorageClasses are used here, backed by SSD disks.
altermanager-storage.yaml
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage-alertmanager
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  name: local-pv-alertmanager
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  local:
    path: /opt/hipay/lib/altermanager
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: category
          operator: In
          values:
          - monitoring
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage-alertmanager
  volumeMode: Filesystem
```
grafana-storage.yaml
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage-grafana
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  name: local-grafana-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 20Gi
  local:
    path: /opt/hipay/lib/grafana
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: category
          operator: In
          values:
          - monitoring
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage-grafana
  volumeMode: Filesystem
```
prometheus-storage.yaml
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage-prometheus
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  name: local-pv-prometheus
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 400Gi
  local:
    path: /opt/hipay/lib/prometheus
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: category
          operator: In
          values:
          - monitoring
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage-prometheus
  volumeMode: Filesystem
```
A single StorageClass would actually suffice; I created one per service only to keep them easy to tell apart.
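These local PVs require their backing directories to exist on the labelled node before the pods are scheduled. A minimal sketch, run on the monitoring node (the paths and filenames mirror the manifests above):

```shell
# Create the backing directories referenced by the local PVs
# (the "altermanager" spelling matches the PV spec above)
mkdir -p /opt/hipay/lib/altermanager \
         /opt/hipay/lib/grafana \
         /opt/hipay/lib/prometheus

# Apply the StorageClass/PV manifests
kubectl apply -f altermanager-storage.yaml \
              -f grafana-storage.yaml \
              -f prometheus-storage.yaml
```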
(5) Create a custom values file
my-values.yaml
```yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    templates:
    - '/etc/alertmanager/config/*.tmpl'
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: webhook
      routes:
      - match:
          alertname: Watchdog
        receiver: 'webhook'
    receivers:
    - name: webhook
      webhook_configs:
      - url: "http://prometheus-alter-webhook.monitoring.svc:9000"
        send_resolved: false
  ingress:
    enabled: true
    hosts:
    - altermanager.coolops.cn
  alertmanagerSpec:
    image:
      repository: quay.io/prometheus/alertmanager
      tag: v0.21.0
      sha: ""
    replicas: 1
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: local-storage-alertmanager
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: category
              operator: In
              values:
              - monitoring
grafana:
  adminPassword: "dmCE9M$PQt@Q%eht"
  ingress:
    enabled: true
    hosts:
    - grafana.coolops.cn
  persistence:
    type: pvc
    enabled: true
    storageClassName: local-storage-grafana
    accessModes:
    - ReadWriteOnce
    size: 20Gi
prometheusOperator:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: category
            operator: In
            values:
            - monitoring
prometheus:
  ingress:
    enabled: true
    hosts:
    - prometheus.coolops.cn
  prometheusSpec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: category
              operator: In
              values:
              - monitoring
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: local-storage-prometheus
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 400Gi
```
Note: nodeAffinity is used here for advanced scheduling, so the target nodes must be labelled first: kubectl label node <node-name> category=monitoring
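The label referenced by the nodeAffinity rules can be applied and then verified like so (the node name is a hypothetical placeholder):

```shell
# Label the node that should host the monitoring stack
# ("k8s-node-01" is a hypothetical node name)
kubectl label node k8s-node-01 category=monitoring

# Confirm which nodes carry the label
kubectl get nodes -l category=monitoring
```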
(6) Install

```shell
helm install prometheus ./kube-prometheus-stack -n monitoring -f kube-prometheus-stack/my-values.yaml
```
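For repeatable deployments, `helm upgrade --install` is idempotent (it installs on the first run and upgrades afterwards), and `helm status` confirms the release; a sketch using the same release name and values file:

```shell
# Install on first run, upgrade on subsequent runs
helm upgrade --install prometheus ./kube-prometheus-stack \
  -n monitoring -f kube-prometheus-stack/my-values.yaml

# Inspect the deployed release
helm status prometheus -n monitoring
```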
The following resources should then appear in the monitoring namespace:
```
# kubectl get all -n monitoring
NAME                                                         READY   STATUS    RESTARTS   AGE
pod/alertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running   0          4d20h
pod/prometheus-grafana-6d644ff97b-hxdxq                      2/2     Running   0          4d20h
pod/prometheus-kube-prometheus-operator-db6d5c564-nwm82      1/1     Running   0          4d20h
pod/prometheus-kube-state-metrics-c65b87574-4qx95            1/1     Running   0          4d20h
pod/prometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running   1          4d20h
pod/prometheus-prometheus-node-exporter-2t8pt                1/1     Running   0          4d20h
pod/prometheus-prometheus-node-exporter-52bj4                1/1     Running   0          4d20h
pod/prometheus-prometheus-node-exporter-xwsbw                1/1     Running   0          4d20h

NAME                                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   4d20h
service/prometheus-alter-webhook                  ClusterIP   192.168.12.211   <none>        9000/TCP                     2d18h
service/prometheus-grafana                        ClusterIP   192.168.11.93    <none>        80/TCP                       4d20h
service/prometheus-kube-prometheus-alertmanager   ClusterIP   192.168.11.95    <none>        9093/TCP                     4d20h
service/prometheus-kube-prometheus-operator       ClusterIP   192.168.10.102   <none>        443/TCP                      4d20h
service/prometheus-kube-prometheus-prometheus     ClusterIP   192.168.7.216    <none>        9090/TCP                     4d20h
service/prometheus-kube-state-metrics             ClusterIP   192.168.13.249   <none>        8080/TCP                     4d20h
service/prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     4d20h
service/prometheus-prometheus-node-exporter       ClusterIP   192.168.8.215    <none>        9100/TCP                     4d20h

NAME                                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/prometheus-prometheus-node-exporter   15        15        15      15           15          <none>          4d20h

NAME                                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prometheus-alert-webhook              1/1     1            1           2d18h
deployment.apps/prometheus-grafana                    1/1     1            1           4d20h
deployment.apps/prometheus-kube-prometheus-operator   1/1     1            1           4d20h
deployment.apps/prometheus-kube-state-metrics         1/1     1            1           4d20h

NAME                                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/prometheus-alert-webhook-7bd7766977             1         1         1       2d18h
replicaset.apps/prometheus-grafana-6d644ff97b                   1         1         1       4d20h
replicaset.apps/prometheus-kube-prometheus-operator-db6d5c564   1         1         1       4d20h
replicaset.apps/prometheus-kube-state-metrics-c65b87574         1         1         1       4d20h

NAME                                                                    READY   AGE
statefulset.apps/alertmanager-prometheus-kube-prometheus-alertmanager   1/1     4d20h
statefulset.apps/prometheus-prometheus-kube-prometheus-prometheus       1/1     4d20h
```
The components can then be reached via their ingress hostnames.
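If ingress or DNS is not yet in place, the UIs can also be reached through a port-forward; a sketch using the service names shown in the output above:

```shell
# Prometheus UI on http://localhost:9090
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &

# Grafana on http://localhost:3000 (the service exposes port 80)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &

# Alertmanager on http://localhost:9093
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093 &
```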
(7) Additional configuration
Custom alerting rules
To define custom alerting rules, add an additionalPrometheusRules section to my-values.yaml, for example:
```yaml
## Custom alerting rules
additionalPrometheusRules:
- name: blackbox-monitoring-rule
  groups:
  - name: blackbox_check
    rules:
    - alert: "SSL certificate expiry warning"
      expr: (probe_ssl_earliest_cert_expiry - time())/86400 < 30
      for: 1h
      labels:
        severity: warn
      annotations:
        description: 'The certificate for {{ $labels.instance }} expires in {{ printf "%.1f" $value }} days; please renew it soon'
        summary: "SSL certificate expiry warning"
    - alert: "Endpoint/host/port availability issue"
      expr: probe_success == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Endpoint/host/port availability check"
        description: "Endpoint/host/port {{ $labels.instance }} is unreachable"
```
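Once the updated values file is applied, the chart renders these entries into PrometheusRule objects that the operator picks up; they can be checked like so:

```shell
# Re-apply the release with the updated values file
helm upgrade prometheus ./kube-prometheus-stack -n monitoring \
  -f kube-prometheus-stack/my-values.yaml

# List the generated PrometheusRule objects
kubectl get prometheusrules -n monitoring
```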
Custom scrape jobs
To define custom scrape jobs, add additionalScrapeConfigs under prometheus.prometheusSpec, for example:
```yaml
prometheus:
  ......
  prometheusSpec:
    ......
    additionalScrapeConfigs:
    - job_name: "ingress-endpoint-status"
      metrics_path: /probe
      params:
        module: [http_2xx]  # Look for a HTTP 200 response.
      static_configs:
      - targets:
        - http://172.16.51.23/healthz
        - http://172.16.51.24/healthz
        - http://172.16.51.25/healthz
        labels:
          group: nginx-ingress
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox.monitoring:9115
```
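This scrape config assumes a blackbox exporter is reachable at blackbox.monitoring:9115, which the chart does not deploy by itself. A sketch of probing one target manually from inside the cluster (the curl image and pod name are just examples):

```shell
# Probe one target through the blackbox exporter, assuming it is
# deployed as service "blackbox" in the "monitoring" namespace
kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -s "http://blackbox.monitoring:9115/probe?module=http_2xx&target=http://172.16.51.23/healthz"
```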
Custom ServiceMonitors
To define custom ServiceMonitor objects, add additionalServiceMonitors directly under prometheus, for example:
```yaml
prometheus:
  ......
  ## Custom ServiceMonitors
  additionalServiceMonitors:
  - name: ingress-nginx-controller
    selector:
      matchLabels:
        app: ingress-nginx-metrics
    namespaceSelector:
      matchNames:
      - ingress-nginx
    endpoints:
    - port: metrics
      interval: 30s
```
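After a helm upgrade, the chart turns this entry into a ServiceMonitor object; it can be verified, and the matching service checked, like so (the selector only finds targets if a service carrying the `app: ingress-nginx-metrics` label actually exists):

```shell
# Confirm the ServiceMonitor was created
kubectl get servicemonitors -n monitoring

# Check that a service with the selected label exists in ingress-nginx
kubectl get svc -n ingress-nginx -l app=ingress-nginx-metrics
```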
