Monitoring a Kubernetes Cluster with Prometheus and Grafana
1 Prometheus concepts
Because of the dynamic nature of Docker containers, traditional Zabbix cannot monitor the state of the Docker workloads inside a k8s cluster, so Prometheus is used instead.
Prometheus official site: https://prometheus.io/
1.1 Features of Prometheus
- A multi-dimensional data model, backed by a time-series database (TSDB) rather than MySQL.
- PromQL, a flexible query language.
- No reliance on distributed storage; each server node is autonomous.
- Time-series data is collected mainly through an HTTP-based pull model.
- Data pushed to a Pushgateway can also be picked up from there.
- Targets are found through service discovery or static configuration.
- Many kinds of charts and UIs are supported, e.g. Grafana.
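As a small taste of PromQL (a hedged sketch; node_cpu_seconds_total is a standard node-exporter metric, and the 5-minute window is an arbitrary choice):
# Average per-node CPU busy percentage over the last 5 minutes
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))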
1.2 How it works
1.2.1 Overview
- Prometheus works by periodically scraping, over HTTP, the state exposed by the various exporters of the monitored components; any component can be brought under monitoring simply by exposing a suitable HTTP endpoint.
- No SDK or other integration work is required, which makes Prometheus a good fit for monitoring virtualized environments such as VMs, Docker, and Kubernetes.
- Most components commonly used at internet companies already have ready-made exporters, e.g. for Nginx, MySQL, and Linux system metrics.
1.2.2 Architecture diagram
1.2.3 The three core components
- Server: handles data collection and storage, and provides PromQL query support.
- Alertmanager: the alert manager, responsible for sending alerts.
- Push Gateway: an intermediary gateway that accepts metrics actively pushed by short-lived jobs.
1.2.4 Architecture workflow
- The Prometheus daemon periodically scrapes metrics from its targets. Each target must expose an HTTP endpoint for it to scrape; targets can be specified through config files, text files, Zookeeper, DNS SRV lookup, and other mechanisms.
- The Pushgateway lets clients push metrics to it actively, while Prometheus simply scrapes the gateway on schedule; this suits one-shot, short-lived jobs (see the push example after this list).
- Prometheus stores everything it scrapes in its TSDB, cleans and aggregates the data according to rules, and writes the results into new time series.
- Prometheus exposes the collected data through PromQL and other APIs for visualization; Grafana, Promdash, and similar tools can chart it, and the HTTP API supports custom queries.
- Alertmanager is an alerting component independent of Prometheus; it understands Prometheus query expressions and offers very flexible alerting.
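As an example of the push path described above, a one-off job can POST samples to a Pushgateway over its standard text API, and Prometheus then scrapes the gateway on its normal schedule (a sketch; pushgateway:9091 is an assumed address, not part of this deployment):
# Push one sample under job "backup_job"; Prometheus later scrapes it from the gateway
echo "backup_last_success_timestamp $(date +%s)" | \
  curl --data-binary @- http://pushgateway:9091/metrics/job/backup_job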
1.2.5 Commonly used exporters
Unlike Zabbix, Prometheus has no agent; it uses per-service exporters instead. For monitoring a k8s cluster and its nodes and pods, four exporters are typically used:
- kube-state-metrics: collects basic state of the cluster (master, etcd, and so on)
- node-exporter: collects node information
- cadvisor: collects resource usage of the Docker containers in the cluster
- blackbox-exporter: checks whether containerized services in the cluster are alive
1.3 Software deployed
Image | Official address | GitHub address | Deployed as |
---|---|---|---|
quay.io/coreos/kube-state-metrics:v1.5.0 | https://quay.io/repository/coreos/kube-state-metrics?tab=info | https://github.com/kubernetes/kube-state-metrics | Deployment |
prom/node-exporter:v0.15.0 | https://hub.docker.com/r/prom/node-exporter | https://github.com/prometheus/node_exporter | DaemonSet |
google/cadvisor:v0.28.3 | https://hub.docker.com/r/google/cadvisor | https://github.com/google/cadvisor | DaemonSet |
prom/blackbox-exporter:v0.15.1 | https://hub.docker.com/r/prom/blackbox-exporter | https://github.com/prometheus/blackbox_exporter | Deployment |
prom/prometheus:v2.14.0 | https://hub.docker.com/r/prom/prometheus | https://github.com/prometheus/prometheus | nodeName: vms21.cos.com |
grafana/grafana:5.4.2 | https://grafana.com/ https://hub.docker.com/r/grafana/grafana | https://github.com/grafana/grafana | nodeName: vms22.cos.com |
docker.io/prom/alertmanager:v0.14.0 | https://hub.docker.com/r/prom/alertmanager | https://github.com/prometheus/alertmanager | Deployment |
If a newer version misbehaves, fall back to a lower one; the image versions above are known-good references.
2 Deploy kube-state-metrics
On ops host vms200:
Prepare the kube-state-metrics image
Official quay.io page: https://quay.io/repository/coreos/kube-state-metrics?tab=info GitHub: https://github.com/kubernetes/kube-state-metrics
[root@vms200 ~]# docker pull quay.io/coreos/kube-state-metrics:v1.5.0
v1.5.0: Pulling from coreos/kube-state-metrics
cd784148e348: Pull complete
f622528a393e: Pull complete
Digest: sha256:b7a3143bd1eb7130759c9259073b9f239d0eeda09f5210f1cd31f1a530599ea1
Status: Downloaded newer image for quay.io/coreos/kube-state-metrics:v1.5.0
quay.io/coreos/kube-state-metrics:v1.5.0
[root@vms200 ~]# docker pull quay.io/coreos/kube-state-metrics:v1.9.7
...
[root@vms200 ~]# docker images | grep kube-state-metrics
quay.io/coreos/kube-state-metrics v1.9.7 6497f02dbdad 3 months ago 32.8MB
quay.io/coreos/kube-state-metrics v1.5.0 91599517197a 20 months ago 31.8MB
[root@vms200 ~]# docker tag 6497f02dbdad harbor.op.com/public/kube-state-metrics:v1.9.7
[root@vms200 ~]# docker tag quay.io/coreos/kube-state-metrics:v1.5.0 harbor.op.com/public/kube-state-metrics:v1.5.0
[root@vms200 ~]# docker push harbor.op.com/public/kube-state-metrics:v1.9.7
The push refers to repository [harbor.op.com/public/kube-state-metrics]
d1ce60962f06: Pushed
0d1435bd79e4: Mounted from public/metrics-server
v1.9.7: digest: sha256:2f82f0da199c60a7699c43c63a295c44e673242de0b7ee1b17c2d5a23bec34cb size: 738
[root@vms200 ~]# docker push harbor.op.com/public/kube-state-metrics:v1.5.0
The push refers to repository [harbor.op.com/public/kube-state-metrics]
5b3c36501a0a: Pushed
7bff100f35cb: Pushed
v1.5.0: digest: sha256:16e9a1d63e80c19859fc1e2727ab7819f89aeae5f8ab5c3380860c2f88fe0a58 size: 739
Prepare the manifests
Download the YAML files using the raw-format URLs.
v1.9.7
v1.9.7: https://github.com/kubernetes/kube-state-metrics/tree/v1.9.7/examples/standard
[root@vms200 ~]# cd /data/k8s-yaml/
[root@vms200 k8s-yaml]# mkdir kube-state-metrics
[root@vms200 k8s-yaml]# cd kube-state-metrics
[root@vms200 kube-state-metrics]# mkdir v1.9.7
[root@vms200 kube-state-metrics]# cd v1.9.7
[root@vms200 v1.9.7]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/v1.9.7/examples/standard/service-account.yaml
[root@vms200 v1.9.7]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/v1.9.7/examples/standard/service.yaml
[root@vms200 v1.9.7]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/v1.9.7/examples/standard/deployment.yaml
[root@vms200 v1.9.7]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/v1.9.7/examples/standard/cluster-role.yaml
[root@vms200 v1.9.7]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/v1.9.7/examples/standard/cluster-role-binding.yaml
v1.9.7 manifest files, under /data/k8s-yaml/kube-state-metrics:
rbac-v1.9.7.yaml (service-account.yaml, cluster-role.yaml, and cluster-role-binding.yaml merged into one file)
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
app.kubernetes.io/name: kube-state-metrics
app.kubernetes.io/version: v1.9.7
name: kube-state-metrics
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/name: kube-state-metrics
app.kubernetes.io/version: v1.9.7
name: kube-state-metrics
rules:
- apiGroups:
- ""
resources:
- configmaps
- secrets
- nodes
- pods
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs:
- list
- watch
- apiGroups:
- extensions
resources:
- daemonsets
- deployments
- replicasets
- ingresses
verbs:
- list
- watch
- apiGroups:
- apps
resources:
- statefulsets
- daemonsets
- deployments
- replicasets
verbs:
- list
- watch
- apiGroups:
- batch
resources:
- cronjobs
- jobs
verbs:
- list
- watch
- apiGroups:
- autoscaling
resources:
- horizontalpodautoscalers
verbs:
- list
- watch
- apiGroups:
- authentication.k8s.io
resources:
- tokenreviews
verbs:
- create
- apiGroups:
- authorization.k8s.io
resources:
- subjectaccessreviews
verbs:
- create
- apiGroups:
- policy
resources:
- poddisruptionbudgets
verbs:
- list
- watch
- apiGroups:
- certificates.k8s.io
resources:
- certificatesigningrequests
verbs:
- list
- watch
- apiGroups:
- storage.k8s.io
resources:
- storageclasses
- volumeattachments
verbs:
- list
- watch
- apiGroups:
- admissionregistration.k8s.io
resources:
- mutatingwebhookconfigurations
- validatingwebhookconfigurations
verbs:
- list
- watch
- apiGroups:
- networking.k8s.io
resources:
- networkpolicies
verbs:
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
app.kubernetes.io/name: kube-state-metrics
app.kubernetes.io/version: v1.9.7
name: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: kube-system
deployment-v1.9.7.yaml (modified from deployment.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
grafanak8sapp: "true"
app.kubernetes.io/name: kube-state-metrics
app.kubernetes.io/version: v1.9.7
name: kube-state-metrics
namespace: kube-system
spec:
replicas: 1
selector:
matchLabels:
grafanak8sapp: "true"
app.kubernetes.io/name: kube-state-metrics
template:
metadata:
labels:
grafanak8sapp: "true"
app.kubernetes.io/name: kube-state-metrics
app.kubernetes.io/version: v1.9.7
spec:
containers:
- image: harbor.op.com/public/kube-state-metrics:v1.9.7
imagePullPolicy: IfNotPresent
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
name: kube-state-metrics
ports:
- containerPort: 8080
name: http-metrics
- containerPort: 8081
name: telemetry
readinessProbe:
httpGet:
path: /
port: 8081
initialDelaySeconds: 5
timeoutSeconds: 5
imagePullSecrets:
- name: harbor
nodeSelector:
kubernetes.io/os: linux
serviceAccountName: kube-state-metrics
Notes: the image was changed, and imagePullPolicy and imagePullSecrets were added.
v1.5.0
v1.5.0: https://github.com/kubernetes/kube-state-metrics/tree/release-1.5/kubernetes
[root@vms200 kube-state-metrics]# mkdir v1.5.0
[root@vms200 kube-state-metrics]# cd v1.5.0/
[root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-cluster-role-binding.yaml
[root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-cluster-role.yaml
[root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-deployment.yaml
[root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-role-binding.yaml
[root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-role.yaml
[root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-service-account.yaml
[root@vms200 v1.5.0]# wget https://raw.githubusercontent.com/kubernetes/kube-state-metrics/release-1.5/kubernetes/kube-state-metrics-service.yaml
v1.5.0 manifest files, under /data/k8s-yaml/kube-state-metrics:
rbac.yaml
[root@vms200 ~]# cd /data/k8s-yaml/
[root@vms200 k8s-yaml]# mkdir kube-state-metrics
[root@vms200 k8s-yaml]# cd kube-state-metrics
[root@vms200 kube-state-metrics]# vi rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: kube-state-metrics
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: kube-state-metrics
rules:
- apiGroups:
- ""
resources:
- configmaps
- secrets
- nodes
- pods
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs:
- list
- watch
- apiGroups:
- extensions
resources:
- daemonsets
- deployments
- replicasets
verbs:
- list
- watch
- apiGroups:
- apps
resources:
- statefulsets
verbs:
- list
- watch
- apiGroups:
- batch
resources:
- cronjobs
- jobs
verbs:
- list
- watch
- apiGroups:
- autoscaling
resources:
- horizontalpodautoscalers
verbs:
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: kube-system
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "2"
labels:
grafanak8sapp: "true"
app: kube-state-metrics
name: kube-state-metrics
namespace: kube-system
spec:
selector:
matchLabels:
grafanak8sapp: "true"
app: kube-state-metrics
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
creationTimestamp: null
labels:
grafanak8sapp: "true"
app: kube-state-metrics
spec:
containers:
- image: harbor.op.com/public/kube-state-metrics:v1.5.0
name: kube-state-metrics
ports:
- containerPort: 8080
name: http-metrics
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
imagePullPolicy: IfNotPresent
imagePullSecrets:
- name: harbor
restartPolicy: Always
serviceAccount: kube-state-metrics
serviceAccountName: kube-state-metrics
Apply the manifests
On any compute node:
v1.5.0
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/kube-state-metrics/rbac.yaml
serviceaccount/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/kube-state-metrics/deployment.yaml
deployment.apps/kube-state-metrics created
v1.9.7
[root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/kube-state-metrics/rbac-v1.9.7.yaml
serviceaccount/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
[root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/kube-state-metrics/deployment-v1.9.7.yaml
deployment.apps/kube-state-metrics created
Check the startup
v1.5.0
[root@vms22 ~]# kubectl get pods -n kube-system -o wide |grep kube-state-metrics
kube-state-metrics-5ff77848c6-9grj9 1/1 Running 0 101s 172.26.22.3 vms22.cos.com <none> <none>
[root@vms22 ~]# curl http://172.26.22.3:8080/healthz
ok
v1.9.7
[root@vms21 ~]# kubectl get pods -n kube-system -o wide |grep kube-state-metrics
kube-state-metrics-5776ff76f-4f6dk 1/1 Running 0 20s 172.26.22.3 vms22.cos.com <none> <none>
[root@vms21 ~]# curl http://172.26.22.3:8080/healthz
OK[root@vms21 ~]#
[root@vms21 ~]# curl http://172.26.22.3:8081
<html>
<head><title>Kube-State-Metrics Metrics Server</title></head>
<body>
<h1>Kube-State-Metrics Metrics</h1>
<ul>
<li><a href='/metrics'>metrics</a></li>
</ul>
</body>
</html>
3 Deploy node-exporter
On ops host vms200:
Prepare the node-exporter image
Official Docker Hub: https://hub.docker.com/r/prom/node-exporter GitHub: https://github.com/prometheus/node_exporter
[root@vms200 ~]# docker pull prom/node-exporter:v1.0.1
v1.0.1: Pulling from prom/node-exporter
86fa074c6765: Pull complete
ed1cd1c6cd7a: Pull complete
ff1bb132ce7b: Pull complete
Digest: sha256:cf66a6bbd573fd819ea09c72e21b528e9252d58d01ae13564a29749de1e48e0f
Status: Downloaded newer image for prom/node-exporter:v1.0.1
docker.io/prom/node-exporter:v1.0.1
[root@vms200 ~]# docker tag docker.io/prom/node-exporter:v1.0.1 harbor.op.com/public/node-exporter:v1.0.1
[root@vms200 ~]# docker push harbor.op.com/public/node-exporter:v1.0.1
Prepare the manifests
[root@vms200 ~]# mkdir /data/k8s-yaml/node-exporter && cd /data/k8s-yaml/node-exporter
- node-exporter monitors nodes, so one instance is needed per node; hence a DaemonSet controller is used.
- Its main job is to mount the host's /proc and /sys directories into the container so the container can read host-level information.
/data/k8s-yaml/node-exporter/node-exporter-ds.yaml
kind: DaemonSet
apiVersion: apps/v1
metadata:
name: node-exporter
namespace: kube-system
labels:
daemon: "node-exporter"
grafanak8sapp: "true"
spec:
selector:
matchLabels:
daemon: "node-exporter"
grafanak8sapp: "true"
template:
metadata:
name: node-exporter
labels:
daemon: "node-exporter"
grafanak8sapp: "true"
spec:
containers:
- name: node-exporter
image: harbor.op.com/public/node-exporter:v1.0.1
imagePullPolicy: IfNotPresent
args:
- --path.procfs=/host_proc
- --path.sysfs=/host_sys
ports:
- name: node-exporter
hostPort: 9100
containerPort: 9100
protocol: TCP
volumeMounts:
- name: sys
readOnly: true
mountPath: /host_sys
- name: proc
readOnly: true
mountPath: /host_proc
imagePullSecrets:
- name: harbor
restartPolicy: Always
hostNetwork: true
volumes:
- name: proc
hostPath:
path: /proc
type: ""
- name: sys
hostPath:
path: /sys
type: ""
Apply the manifest
On any compute node:
[root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/node-exporter/node-exporter-ds.yaml
daemonset.apps/node-exporter created
- Check:
[root@vms21 ~]# netstat -luntp | grep 9100
tcp6 0 0 :::9100 :::* LISTEN 3711/node_exporter
[root@vms21 ~]# kubectl get pod -n kube-system -o wide|grep node-exporter
node-exporter-vrpfn 1/1 Running 0 2m8s 192.168.26.21 vms21.cos.com <none> <none>
node-exporter-xw9k6 1/1 Running 0 2m8s 192.168.26.22 vms22.cos.com <none> <none>
[root@vms21 ~]# curl -s http://192.168.26.21:9100/metrics | more
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
...
[root@vms21 ~]# curl -s http://192.168.26.22:9100/metrics | more
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
...
- Check in the dashboard
4 Deploy cAdvisor
On ops host vms200:
Prepare the cadvisor image
Official Docker Hub: https://hub.docker.com/r/google/cadvisor
Official GitHub: https://github.com/google/cadvisor
[root@vms200 ~]# docker pull google/cadvisor:v0.33.0
v0.33.0: Pulling from google/cadvisor
169185f82c45: Pull complete
bd29476a29dd: Pull complete
a2eb18ca776e: Pull complete
Digest: sha256:47f1f8c02a3acfab77e74e2ec7acc0d475adc180ddff428503a4ce63f3d6061b
Status: Downloaded newer image for google/cadvisor:v0.33.0
docker.io/google/cadvisor:v0.33.0
[root@vms200 ~]# docker pull google/cadvisor
Using default tag: latest
latest: Pulling from google/cadvisor
ff3a5c916c92: Pull complete
44a45bb65cdf: Pull complete
0bbe1a2fe2a6: Pull complete
Digest: sha256:815386ebbe9a3490f38785ab11bda34ec8dacf4634af77b8912832d4f85dca04
Status: Downloaded newer image for google/cadvisor:latest
docker.io/google/cadvisor:latest
[root@vms200 ~]# docker images | grep cadvisor
google/cadvisor v0.33.0 752d61707eac 18 months ago 68.6MB
google/cadvisor latest eb1210707573 22 months ago 69.6MB
[root@vms200 ~]# docker tag 752d61707eac harbor.op.com/public/cadvisor:v0.33.0
[root@vms200 ~]# docker tag eb1210707573 harbor.op.com/public/cadvisor:v200912
[root@vms200 ~]# docker images | grep cadvisor
google/cadvisor v0.33.0 752d61707eac 18 months ago 68.6MB
harbor.op.com/public/cadvisor v0.33.0 752d61707eac 18 months ago 68.6MB
google/cadvisor latest eb1210707573 22 months ago 69.6MB
harbor.op.com/public/cadvisor v200912 eb1210707573 22 months ago 69.6MB
[root@vms200 ~]# docker push harbor.op.com/public/cadvisor:v0.33.0
The push refers to repository [harbor.op.com/public/cadvisor]
09c656718504: Pushed
6a395a55089d: Pushed
767f936afb51: Pushed
v0.33.0: digest: sha256:47f1f8c02a3acfab77e74e2ec7acc0d475adc180ddff428503a4ce63f3d6061b size: 952
[root@vms200 ~]# docker push harbor.op.com/public/cadvisor:v200912
The push refers to repository [harbor.op.com/public/cadvisor]
66b3c2e84199: Pushed
9ea477e6d99e: Pushed
cd7100a72410: Pushed
v200912: digest: sha256:815386ebbe9a3490f38785ab11bda34ec8dacf4634af77b8912832d4f85dca04 size: 952
Prepare the manifests
[root@vms200 ~]# mkdir /data/k8s-yaml/cadvisor && cd /data/k8s-yaml/cadvisor
This exporter talks to the kubelet to obtain the runtime resource consumption of pods and exposes the data to Prometheus.
- Since cAdvisor has to collect pod information on every node, it also runs as a DaemonSet.
- The DaemonSet runs on the worker nodes; the toleration in the manifest additionally lets it schedule onto master nodes despite their NoSchedule taint.
- Some host directories are mounted into the container, such as Docker's data directory.
daemonset.yaml source: https://github.com/google/cadvisor/tree/release-v0.33/deploy/kubernetes/base
[root@vms200 cadvisor]# vi /data/k8s-yaml/cadvisor/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: cadvisor
namespace: kube-system
labels:
app: cadvisor
spec:
selector:
matchLabels:
name: cadvisor
template:
metadata:
labels:
name: cadvisor
spec:
hostNetwork: true
tolerations:
- key: node-role.kubernetes.io/master
effect: NoSchedule
containers:
- name: cadvisor
image: harbor.op.com/public/cadvisor:v200912
imagePullPolicy: IfNotPresent
volumeMounts:
- name: rootfs
mountPath: /rootfs
readOnly: true
- name: var-run
mountPath: /var/run
- name: sys
mountPath: /sys
readOnly: true
- name: docker
mountPath: /var/lib/docker
readOnly: true
- name: disk
mountPath: /dev/disk
readOnly: true
ports:
- name: http
containerPort: 4194
protocol: TCP
readinessProbe:
tcpSocket:
port: 4194
initialDelaySeconds: 5
periodSeconds: 10
args:
- --housekeeping_interval=10s
- --port=4194
terminationGracePeriodSeconds: 30
volumes:
- name: rootfs
hostPath:
path: /
- name: var-run
hostPath:
path: /var/run
- name: sys
hostPath:
path: /sys
- name: docker
hostPath:
path: /data/docker
- name: disk
hostPath:
path: /dev/disk
Adjust cgroup symlinks on the compute nodes
On all compute nodes (vms21, vms22):
[root@vms21 ~]# mount -o remount,rw /sys/fs/cgroup/
[root@vms21 ~]# ln -s /sys/fs/cgroup/cpu,cpuacct/ /sys/fs/cgroup/cpuacct,cpu
[root@vms21 ~]# ll /sys/fs/cgroup/ | grep cpu
lrwxrwxrwx 1 root root 11 Sep 11 19:21 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Sep 11 19:21 cpuacct -> cpu,cpuacct
lrwxrwxrwx 1 root root 27 Sep 12 10:25 cpuacct,cpu -> /sys/fs/cgroup/cpu,cpuacct/
dr-xr-xr-x 6 root root 0 Sep 11 19:21 cpu,cpuacct
dr-xr-xr-x 4 root root 0 Sep 11 19:21 cpuset
[root@vms22 ~]# mount -o remount,rw /sys/fs/cgroup/
[root@vms22 ~]# ln -s /sys/fs/cgroup/cpu,cpuacct/ /sys/fs/cgroup/cpuacct,cpu
[root@vms22 ~]# ll /sys/fs/cgroup/ | grep cpu
lrwxrwxrwx 1 root root 11 Sep 11 19:22 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Sep 11 19:22 cpuacct -> cpu,cpuacct
lrwxrwxrwx 1 root root 27 Sep 12 10:25 cpuacct,cpu -> /sys/fs/cgroup/cpu,cpuacct/
dr-xr-xr-x 6 root root 0 Sep 11 19:22 cpu,cpuacct
dr-xr-xr-x 4 root root 0 Sep 11 19:22 cpuset
- /sys/fs/cgroup is normally mounted read-only; remount it read-write and create the symlink on every node before applying the manifest, otherwise the pods may report errors.
Apply the manifest
On any compute node:
[root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/cadvisor/daemonset.yaml
daemonset.apps/cadvisor created
[root@vms21 ~]# kubectl -n kube-system get pod -o wide|grep cadvisor
cadvisor-q2z2g 0/1 Running 0 3s 192.168.26.22 vms22.cos.com <none> <none>
cadvisor-xqg6k 0/1 Running 0 3s 192.168.26.21 vms21.cos.com <none> <none>
[root@vms21 ~]# netstat -luntp|grep 4194
tcp6 0 0 :::4194 :::* LISTEN 301579/cadvisor
[root@vms21 ~]# kubectl get pod -n kube-system -l name=cadvisor -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cadvisor-q2z2g 1/1 Running 0 2m38s 192.168.26.22 vms22.cos.com <none> <none>
cadvisor-xqg6k 1/1 Running 0 2m38s 192.168.26.21 vms21.cos.com <none> <none>
[root@vms21 ~]# curl -s http://192.168.26.22:4194/metrics | more
# HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
# TYPE cadvisor_version_info gauge
cadvisor_version_info{cadvisorRevision="8949c822",cadvisorVersion="v0.32.0",dockerVersion="19.03.12",kernelVersion="4.18.0-193.el8.x86_64",osVersion="Alpine Linux v3.7"} 1
# HELP container_cpu_cfs_periods_total Number of elapsed enforcement period intervals.
# TYPE container_cpu_cfs_periods_total counter
5 Deploy blackbox-exporter
On ops host vms200:
Prepare the blackbox-exporter image
Official Docker Hub: https://hub.docker.com/r/prom/blackbox-exporter
Official GitHub: https://github.com/prometheus/blackbox_exporter
[root@vms200 ~]# docker pull prom/blackbox-exporter:v0.17.0
v0.17.0: Pulling from prom/blackbox-exporter
0f8c40e1270f: Pull complete
626a2a3fee8c: Pull complete
d018b30262bb: Pull complete
2b24e2b7f642: Pull complete
Digest: sha256:1d8a5c9ff17e2493a39e4aea706b4ea0c8302ae0dc2aa8b0e9188c5919c9bd9c
Status: Downloaded newer image for prom/blackbox-exporter:v0.17.0
docker.io/prom/blackbox-exporter:v0.17.0
[root@vms200 ~]# docker tag docker.io/prom/blackbox-exporter:v0.17.0 harbor.op.com/public/blackbox-exporter:v0.17.0
[root@vms200 ~]# docker push harbor.op.com/public/blackbox-exporter:v0.17.0
The push refers to repository [harbor.op.com/public/blackbox-exporter]
d072d0db0848: Pushed
42430a6dfa0e: Pushed
7a151fe67625: Pushed
1da8e4c8d307: Pushed
v0.17.0: digest: sha256:d3e823580333ceedceadaa2bfea10c8efd4700c8ec0415df72f83c34e1f93314 size: 1155
Prepare the manifests
[root@vms200 ~]# mkdir /data/k8s-yaml/blackbox-exporter && cd /data/k8s-yaml/blackbox-exporter
ConfigMap
[root@vms200 blackbox-exporter]# vi /data/k8s-yaml/blackbox-exporter/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
labels:
app: blackbox-exporter
name: blackbox-exporter
namespace: kube-system
data:
blackbox.yml: |-
modules:
http_2xx:
prober: http
timeout: 2s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2"]
valid_status_codes: [200,301,302]
method: GET
preferred_ip_protocol: "ip4"
tcp_connect:
prober: tcp
timeout: 2s
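Once the exporter is running, a module can be exercised by hand through its /probe endpoint (a hedged example; blackbox-exporter.kube-system is the in-cluster service name created below, and the target URL is arbitrary):
# Ask the http_2xx module to probe a target; the response is a set of probe_* metrics
curl "http://blackbox-exporter.kube-system:9115/probe?module=http_2xx&target=http://example.com"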
Deployment
[root@vms200 blackbox-exporter]# vi /data/k8s-yaml/blackbox-exporter/deployment.yaml
kind: Deployment
apiVersion: apps/v1
metadata:
name: blackbox-exporter
namespace: kube-system
labels:
app: blackbox-exporter
annotations:
deployment.kubernetes.io/revision: "1"
spec:
replicas: 1
selector:
matchLabels:
app: blackbox-exporter
template:
metadata:
labels:
app: blackbox-exporter
spec:
volumes:
- name: config
configMap:
name: blackbox-exporter
defaultMode: 420
containers:
- name: blackbox-exporter
image: harbor.op.com/public/blackbox-exporter:v0.17.0
args:
- --config.file=/etc/blackbox_exporter/blackbox.yml
- --log.level=debug
- --web.listen-address=:9115
ports:
- name: blackbox-port
containerPort: 9115
protocol: TCP
resources:
limits:
cpu: 200m
memory: 256Mi
requests:
cpu: 100m
memory: 50Mi
volumeMounts:
- name: config
mountPath: /etc/blackbox_exporter
readinessProbe:
tcpSocket:
port: 9115
initialDelaySeconds: 5
timeoutSeconds: 5
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
imagePullPolicy: IfNotPresent
imagePullSecrets:
- name: harbor
restartPolicy: Always
Service
[root@vms200 blackbox-exporter]# vi /data/k8s-yaml/blackbox-exporter/service.yaml
kind: Service
apiVersion: v1
metadata:
name: blackbox-exporter
namespace: kube-system
spec:
selector:
app: blackbox-exporter
ports:
- protocol: TCP
port: 9115
name: http
Ingress
[root@vms200 blackbox-exporter]# vi /data/k8s-yaml/blackbox-exporter/ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: blackbox-exporter
namespace: kube-system
spec:
rules:
- host: blackbox.op.com
http:
paths:
- backend:
serviceName: blackbox-exporter
servicePort: 9115
Configure DNS
On vms11:
[root@vms11 ~]# vi /var/named/op.com.zone
...
blackbox A 192.168.26.10
Remember to increment the zone serial number.
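For example (hypothetical serial values; only the serial field of the SOA record changes):
2020091201 ; serial   ->   2020091202 ; serial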
[root@vms11 ~]# systemctl restart named
Check on vms21:
[root@vms21 ~]# dig -t A blackbox.op.com @172.26.0.2 +short
192.168.26.10
Apply the manifests
On any compute node:
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/blackbox-exporter/configmap.yaml
configmap/blackbox-exporter created
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/blackbox-exporter/deployment.yaml
deployment.apps/blackbox-exporter created
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/blackbox-exporter/service.yaml
service/blackbox-exporter created
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/blackbox-exporter/ingress.yaml
ingress.extensions/blackbox-exporter created
Access in a browser
http://blackbox.op.com/
If the following page appears, blackbox is running successfully.
6 Deploy Prometheus
On ops host vms200:
Prepare the prometheus image
Official Docker Hub: https://hub.docker.com/r/prom/prometheus
Official GitHub: https://github.com/prometheus/prometheus
[root@vms200 ~]# docker pull prom/prometheus:v2.21.0
v2.21.0: Pulling from prom/prometheus
...
Digest: sha256:d43417c260e516508eed1f1d59c10c49d96bbea93eafb4955b0df3aea5908971
Status: Downloaded newer image for prom/prometheus:v2.21.0
docker.io/prom/prometheus:v2.21.0
[root@vms200 ~]# docker tag prom/prometheus:v2.21.0 harbor.op.com/infra/prometheus:v2.21.0
[root@vms200 ~]# docker push harbor.op.com/infra/prometheus:v2.21.0
The push refers to repository [harbor.op.com/infra/prometheus]
...
v2.21.0: digest: sha256:f3ada803723ccbc443ebea19f7ab24d3323def496e222134bf9ed54ae5b787bd size: 2824
Prepare the manifests
On ops host vms200:
[root@vms200 ~]# mkdir -p /data/k8s-yaml/prometheus && cd /data/k8s-yaml/prometheus
RBAC
[root@vms200 prometheus]# vi /data/k8s-yaml/prometheus/rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: prometheus
namespace: infra
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: prometheus
rules:
- apiGroups:
- ""
resources:
- nodes
- nodes/metrics
- services
- endpoints
- pods
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- configmaps
verbs:
- get
- nonResourceURLs:
- /metrics
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: infra
Deployment
[root@vms200 prometheus]# vi /data/k8s-yaml/prometheus/deployment.yaml
Prometheus notes:
- In production, Prometheus is usually given a dedicated large-memory node, with taints keeping other pods off it. This lab uses nodeName: vms21.cos.com to pin it to 192.168.26.21.
- --storage.tsdb.min-block-duration controls how many minutes of the freshest TSDB data stay in memory; production setups cache more. min-block-duration=10m keeps only 10 minutes in memory.
- --storage.tsdb.retention controls how long TSDB data is kept; production keeps more. retention=72h keeps 72 hours of data.
- --web.enable-lifecycle enables remote hot reloading of the config file, so Prometheus need not be restarted after a config change; trigger a reload with curl -X POST http://localhost:9090/-/reload.
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "5"
labels:
name: prometheus
name: prometheus
namespace: infra
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 7
selector:
matchLabels:
app: prometheus
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 1
type: RollingUpdate
template:
metadata:
labels:
app: prometheus
spec:
nodeName: vms21.cos.com
containers:
- image: harbor.op.com/infra/prometheus:v2.21.0
args:
- --config.file=/data/etc/prometheus.yml
- --storage.tsdb.path=/data/prom-db
- --storage.tsdb.retention=72h
- --storage.tsdb.min-block-duration=10m
- --web.enable-lifecycle
command:
- /bin/prometheus
name: prometheus
ports:
- containerPort: 9090
protocol: TCP
resources:
limits:
cpu: 500m
memory: 2500Mi
requests:
cpu: 100m
memory: 100Mi
volumeMounts:
- mountPath: /data
name: data
imagePullPolicy: IfNotPresent
imagePullSecrets:
- name: harbor
securityContext:
runAsUser: 0
dnsPolicy: ClusterFirst
restartPolicy: Always
serviceAccount: prometheus
serviceAccountName: prometheus
volumes:
- name: data
nfs:
server: vms200
path: /data/nfs-volume/prometheus
Service
[root@vms200 prometheus]# vi /data/k8s-yaml/prometheus/service.yaml
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: infra
spec:
ports:
- port: 9090
protocol: TCP
name: prometheus
selector:
app: prometheus
type: ClusterIP
Ingress
[root@vms200 prometheus]# vi /data/k8s-yaml/prometheus/ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
annotations:
kubernetes.io/ingress.class: traefik
name: prometheus
namespace: infra
spec:
rules:
- host: prometheus.op.com
http:
paths:
- backend:
serviceName: prometheus
servicePort: 9090
Prepare the Prometheus configuration
On ops host vms200:
- Create the directories and copy the certificates
[root@vms200 ~]# mkdir -pv /data/nfs-volume/prometheus/{etc,prom-db}
...
[root@vms200 ~]# cd /data/nfs-volume/prometheus/etc
[root@vms200 etc]# cp /opt/certs/{ca.pem,client.pem,client-key.pem} /data/nfs-volume/prometheus/etc/
[root@vms200 etc]# ll
total 12
-rw-r--r-- 1 root root 1338 Sep 12 16:22 ca.pem
-rw------- 1 root root 1675 Sep 12 16:22 client-key.pem
-rw-r--r-- 1 root root 1363 Sep 12 16:22 client.pem
- Prepare the config
About this file: it is a general-purpose configuration. Except for the first job, etcd, which is statically configured, the other eight jobs rely on service discovery, so after adjusting the etcd targets the file can be used in production as-is.
[root@vms200 etc]# vi /data/nfs-volume/prometheus/etc/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'etcd'
tls_config:
ca_file: /data/etc/ca.pem
cert_file: /data/etc/client.pem
key_file: /data/etc/client-key.pem
scheme: https
static_configs:
- targets:
- '192.168.26.12:2379'
- '192.168.26.21:2379'
- '192.168.26.22:2379'
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: 'kubernetes-kubelet'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __address__
replacement: ${1}:10255
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __address__
replacement: ${1}:4194
- job_name: 'kubernetes-kube-state'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
regex: .*true.*
action: keep
- source_labels: ['__meta_kubernetes_pod_label_daemon', '__meta_kubernetes_pod_node_name']
regex: 'node-exporter;(.*)'
action: replace
target_label: nodename
- job_name: 'blackbox_http_pod_probe'
metrics_path: /probe
kubernetes_sd_configs:
- role: pod
params:
module: [http_2xx]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
action: keep
regex: http
- source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port, __meta_kubernetes_pod_annotation_blackbox_path]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+);(.+)
replacement: $1:$2$3
target_label: __param_target
- action: replace
target_label: __address__
replacement: blackbox-exporter.kube-system:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: 'blackbox_tcp_pod_probe'
metrics_path: /probe
kubernetes_sd_configs:
- role: pod
params:
module: [tcp_connect]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
action: keep
regex: tcp
- source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __param_target
- action: replace
target_label: __address__
replacement: blackbox-exporter.kube-system:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: 'traefik'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
action: keep
regex: traefik
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
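Whenever this file is edited later, it can be validated before a hot reload with promtool, which ships in the prom/prometheus image; running it inside the pod (a hedged sketch, the pod name is a placeholder) ensures the referenced certificate and token files exist:
# Validate prometheus.yml; promtool also checks that referenced credential files are readable
kubectl -n infra exec -it <prometheus-pod-name> -- promtool check config /data/etc/prometheus.yml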
Apply the manifests
On any compute node:
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/prometheus/rbac.yaml
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/prometheus/deployment.yaml
deployment.apps/prometheus created
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/prometheus/service.yaml
service/prometheus created
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/prometheus/ingress.yaml
ingress.extensions/prometheus created
Configure DNS
On vms11:
[root@vms11 ~]# vi /var/named/op.com.zone
...
prometheus A 192.168.26.10
Remember to increment the zone serial number.
[root@vms11 ~]# systemctl restart named
[root@vms11 ~]# dig -t A prometheus.op.com @192.168.26.11 +short
192.168.26.10
Access in a browser
- First check the pod in the dashboard.
- Visit http://prometheus.op.com/; if the page loads, Prometheus started successfully.
- Status > Configuration shows the loaded configuration file.
Prometheus configuration explained
Official docs: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
- vms200:/data/nfs-volume/prometheus/etc/prometheus.yml
global:
  scrape_interval: 15s # scrape interval (default 1m)
  evaluation_interval: 15s # rule evaluation interval (default 1m)
scrape_configs: # how metrics are scraped; each job defines one way of collecting a class of metrics
- job_name: 'etcd' # scrape config for etcd; without a per-job scrape_interval, the global value applies
tls_config:
ca_file: /data/etc/ca.pem
cert_file: /data/etc/client.pem
key_file: /data/etc/client-key.pem
  scheme: https # the default scheme is http
static_configs:
- targets:
- '192.168.26.12:2379'
- '192.168.26.21:2379'
- '192.168.26.22:2379'
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
  - role: endpoints # target resource type; node, endpoints, pod, service, ingress, etc. are supported
  scheme: https # tls_config and bearer_token_file are used when talking to the apiserver
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs: # used to rewrite target labels
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep # supported actions:
                 # keep, drop, replace, labelmap, labelkeep, labeldrop, hashmod
regex: default;kubernetes;https
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: 'kubernetes-kubelet'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __address__
replacement: ${1}:10255
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __address__
replacement: ${1}:4194
- job_name: 'kubernetes-kube-state'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
regex: .*true.*
action: keep
- source_labels: ['__meta_kubernetes_pod_label_daemon', '__meta_kubernetes_pod_node_name']
regex: 'node-exporter;(.*)'
action: replace
target_label: nodename
- job_name: 'blackbox_http_pod_probe'
metrics_path: /probe
kubernetes_sd_configs:
- role: pod
params:
module: [http_2xx]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
action: keep
regex: http
- source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port, __meta_kubernetes_pod_annotation_blackbox_path]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+);(.+)
replacement: $1:$2$3
target_label: __param_target
- action: replace
target_label: __address__
replacement: blackbox-exporter.kube-system:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: 'blackbox_tcp_pod_probe'
metrics_path: /probe
kubernetes_sd_configs:
- role: pod
params:
module: [tcp_connect]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
action: keep
regex: tcp
- source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __param_target
- action: replace
target_label: __address__
replacement: blackbox-exporter.kube-system:9115
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
- job_name: 'traefik'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
action: keep
regex: traefik
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
alerting: # Alertmanager configuration
alertmanagers:
- static_configs:
- targets: ["alertmanager"]
rule_files: # pull in external alerting/recording rules, similar to an include
- "/data/etc/rules.yml"
What Prometheus monitors, and how
Hooking Pods up to exporters
- The exporters deployed in this lab are generic: kube-state-metrics gathers information through the Kubernetes API, and node-exporter collects host information. Neither is tied to any particular pod, so they work as soon as they are deployed.
- As the Prometheus config shows, pod metrics are discovered through label (annotation) selectors: adding the right labels or annotations to a resource brings it under monitoring (a command-line sketch follows this list).
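Besides editing resources in the dashboard, the same annotations can be applied from the command line; a minimal sketch with kubectl patch (my-app and the test namespace are placeholders; the annotation keys are the ones this configuration discovers):
# Merge scrape annotations into a deployment's pod template; the rollout restarts the pods
kubectl -n test patch deployment my-app --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"prometheus_io_scrape":"true","prometheus_io_port":"12346","prometheus_io_path":"/"}}}}}'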
Targets (job_name)
- Status > Targets lists the job_names configured in prometheus.yml; these targets cover most data-collection needs.
Server vms12 is not running, so one etcd target shows as DOWN.
- Targets (jobs): there are nine job_names in total, of which five have already been discovered and are returning data. The remaining four need their services brought under monitoring, which is done by adding annotations to the services whose data should be collected.
etcd
Monitors the etcd service:
key | value |
---|---|
etcd_server_has_leader | 1 |
etcd_http_failed_total | 1 |
… | … |
kubernetes-apiservers
Monitors the apiserver.
kubernetes-kubelet
Monitors the kubelet.
kubernetes-kube-state
Monitors basic cluster state:
- node-exporter: node-level information
- kube-state-metrics: pod-level information
traefik
- Monitors the traefik-ingress-controller:
key | value
---|---
traefik_entrypoint_requests_total{code="200",entrypoint="http",method="PUT",protocol="http"} | 138
traefik_entrypoint_requests_total{code="200",entrypoint="http",method="GET",protocol="http"} | 285
traefik_entrypoint_open_connections{entrypoint="http",method="PUT",protocol="http"} | 1
… | …
- Hooking up Traefik:
Add annotations on the traefik pod controller and restart the pod; monitoring then takes effect. (JSON format)
"annotations": {
"prometheus_io_scheme": "traefik",
"prometheus_io_path": "/metrics",
"prometheus_io_port": "8080"
}
Or add the annotations under spec.template.metadata in the traefik deployment YAML, then restart the pod. (YAML format)
annotations:
prometheus_io_scheme: traefik
prometheus_io_path: /metrics
prometheus_io_port: "8080"
blackbox*
blackbox checks whether services inside containers are alive, i.e. port health checks, using two methods: tcp and http. Prefer http whenever the service exposes an HTTP interface; fall back to tcp only for services that do not.
It monitors service liveness by probing TCP/HTTP status.
- blackbox_tcp_pod_probe: monitors whether a tcp service is alive
key | value
---|---
probe_success | 1
probe_ip_protocol | 4
probe_failed_due_to_regex | 0
probe_duration_seconds | 0.000597546
probe_dns_lookup_time_seconds | 0.00010898
Hooking up Blackbox monitoring: add annotations on the pod controller and restart the pod; monitoring takes effect. (JSON format)
"annotations": {
"blackbox_port": "20880",
"blackbox_scheme": "tcp"
}
- blackbox_http_pod_probe: monitors whether an http service is alive
key | value
---|---
probe_success | 1
probe_ip_protocol | 4
probe_http_version | 1.1
probe_http_status_code | 200
probe_http_ssl | 0
probe_http_redirects | 1
probe_http_last_modified_timestamp_seconds | 1.553861888e+09
probe_http_duration_seconds{phase="transfer"} | 0.000238343
probe_http_duration_seconds{phase="tls"} | 0
probe_http_duration_seconds{phase="resolve"} | 5.4095e-05
probe_http_duration_seconds{phase="processing"} | 0.000966104
probe_http_duration_seconds{phase="connect"} | 0.000520821
probe_http_content_length | 716
probe_failed_due_to_regex | 0
probe_duration_seconds | 0.00272609
probe_dns_lookup_time_seconds | 5.4095e-05
Hooking up Blackbox monitoring: add annotations on the pod controller and restart the pod; monitoring takes effect. (JSON format)
"annotations": {
"blackbox_path": "/",
"blackbox_port": "8080",
"blackbox_scheme": "http"
}
Hooking up Blackbox monitoring (YAML format): add annotations to the pod; the snippets below are for TCP and HTTP probes respectively.
annotations:
blackbox_port: "20880"
blackbox_scheme: tcp
annotations:
blackbox_port: "8080"
blackbox_scheme: http
blackbox_path: /hello?name=health
kubernetes-pods*
- Monitors JVM information:
key | value
---|---
jvm_info{version="1.7.0_80-b15",vendor="Oracle Corporation",runtime="Java(TM) SE Runtime Environment",} | 1.0
jmx_config_reload_success_total | 0.0
process_resident_memory_bytes | 4.693897216E9
process_virtual_memory_bytes | 1.2138840064E10
process_max_fds | 65536.0
process_open_fds | 123.0
process_start_time_seconds | 1.54331073249E9
process_cpu_seconds_total | 196465.74
jvm_buffer_pool_used_buffers{pool="mapped",} | 0.0
jvm_buffer_pool_used_buffers{pool="direct",} | 150.0
jvm_buffer_pool_capacity_bytes{pool="mapped",} | 0.0
jvm_buffer_pool_capacity_bytes{pool="direct",} | 6216688.0
jvm_buffer_pool_used_bytes{pool="mapped",} | 0.0
jvm_buffer_pool_used_bytes{pool="direct",} | 6216688.0
jvm_gc_collection_seconds_sum{gc="PS MarkSweep",} | 1.867
… | …
Hooking a Pod up (JSON format): add annotations on the pod controller and restart the pod; monitoring takes effect.
"annotations": {
"prometheus_io_scrape": "true",
"prometheus_io_port": "12346",
"prometheus_io_path": "/"
}
Hooking a Pod up (YAML format): add the annotations to the pod and restart it. This data is collected by jmx_javaagent-0.3.1.jar on port 12346. Note that true must be a string!
annotations:
prometheus_io_scrape: "true"
prometheus_io_port: "12346"
prometheus_io_path: /
Bring the traefik service under Prometheus monitoring
In the dashboard: namespace kube-system > daemonset > traefik-ingress-controller: under spec > template > metadata, add (JSON format):
"annotations": {
"prometheus_io_scheme": "traefik",
"prometheus_io_path": "/metrics",
"prometheus_io_port": "8080"
}
Delete the pod so traefik restarts, then watch the monitoring.
Next, add the blackbox entries as well (JSON format):
"annotations": {
"prometheus_io_scheme": "traefik",
"prometheus_io_path": "/metrics",
"prometheus_io_port": "8080",
"blackbox_path": "/",
"blackbox_port": "8080",
"blackbox_scheme": "http"
}
Alternatively, edit the traefik YAML on vms200:
[root@vms200 ~]# vi /data/k8s-yaml/traefik/traefik-deploy.yaml
Add the annotations block at the same level as labels:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: traefik-ingress-controller
labels:
app: traefik
spec:
selector:
matchLabels:
app: traefik
template:
metadata:
name: traefik
labels:
app: traefik
annotations:
prometheus_io_scheme: "traefik"
prometheus_io_path: "/metrics"
prometheus_io_port: "8080"
blackbox_path: "/"
blackbox_port: "8080"
blackbox_scheme: "http"
...
Re-apply the config from any node:
[root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/traefik/traefik-deploy.yaml -n kube-system
service/traefik unchanged
daemonset.apps/traefik-ingress-controller created
Once the pods have restarted, check in Prometheus whether the traefik and blackbox targets are now collecting data.
Bring the dubbo services under Prometheus monitoring
The FAT test-environment dubbo services are used for this demo; other environments are similar (make sure the zk on vms11 is started).
- In the dashboard, start apollo-portal (infra namespace) and the apollo in the test namespace
- dubbo-demo-service uses the tcp annotation
- dubbo-demo-consumer uses the HTTP annotation
This setup is resource-hungry, so Apollo can be skipped:
- Point dubbo-monitor (infra namespace) at zk_test (the zk on vms11)
- Switch dubbo-demo-service in the app namespace to the master image (no Apollo config) and add the tcp annotation
- Switch dubbo-demo-consumer in the app namespace to the master image (no Apollo config) and add the HTTP annotation
Bring dubbo-demo-service under Prometheus monitoring
In the dashboard:
- First add a TCP annotation to the dubbo-demo-service resource
- Also add the JVM annotations so the JVM inside the pod is monitored (12346 is the port opened by jmx_javaagent in the dubbo pod's start command, so it can serve JVM metrics)
Namespace test > deployment > dubbo-demo-service Edit: under spec > template > metadata, add (JSON format):
"annotations": {
"prometheus_io_scrape": "true",
"prometheus_io_path": "/",
"prometheus_io_port": "12346",
"blackbox_port": "20880",
"blackbox_scheme": "tcp"
}
Delete the pod to restart the app and watch the monitoring (see "Watching the monitors" below).
Alternatively, add the annotations to the deployment manifest and re-apply it from any node:
spec:
replicas: 0
selector:
matchLabels:
name: dubbo-demo-service
template:
metadata:
creationTimestamp: null
labels:
app: dubbo-demo-service
name: dubbo-demo-service
annotations:
blackbox_port: '20880'
blackbox_scheme: tcp
prometheus_io_path: /
prometheus_io_port: '12346'
prometheus_io_scrape: 'true'
...
Bring dubbo-demo-consumer under Prometheus monitoring
In the dashboard:
- Add an HTTP annotation to the dubbo-demo-consumer resource
- Also add the JVM annotations so the JVM inside the pod is monitored (12346 is the port opened by jmx_javaagent in the dubbo pod's start command, so it can serve JVM metrics)
Namespace test > deployment > dubbo-demo-consumer Edit: under spec > template > metadata, add (JSON format):
"annotations": {
"prometheus_io_scrape": "true",
"prometheus_io_path": "/",
"prometheus_io_port": "12346",
"blackbox_path": "/hello",
"blackbox_port": "8080",
"blackbox_scheme": "http"
}
Delete the pod to restart the app and watch the monitoring (see "Watching the monitors" below).
Alternatively, add the annotations to the deployment manifest and re-apply it from any node:
spec:
replicas: 1
selector:
matchLabels:
name: dubbo-demo-consumer
template:
metadata:
creationTimestamp: null
labels:
app: dubbo-demo-consumer
name: dubbo-demo-consumer
annotations:
blackbox_path: /hello
blackbox_port: '8080'
blackbox_scheme: http
prometheus_io_path: /
prometheus_io_port: '12346'
prometheus_io_scrape: 'true'
...
Watching the monitors
In a browser, check http://blackbox.op.com and http://prometheus.op.com/targets: the running dubbo-demo-service's TCP port 20880 has been discovered and is under monitoring.
At this point all nine job_names are successfully collecting monitoring data.
7 Deploy Grafana
On ops host vms200:
Prepare the grafana image
Official Docker Hub: https://hub.docker.com/r/grafana/grafana
Official GitHub: https://github.com/grafana/grafana
Grafana site: https://grafana.com/
[root@vms200 ~]# docker pull grafana/grafana:7.1.5
7.1.5: Pulling from grafana/grafana
df20fa9351a1: Pull complete
9942118288f3: Pull complete
1fb6e3df6e68: Pull complete
7e3d0d675cf3: Pull complete
4c1eb3303598: Pull complete
a5ec11eae53c: Pull complete
Digest: sha256:579044d31fad95f015c78dff8db25c85e2e0f5fdf37f414ce850eb045dd47265
Status: Downloaded newer image for grafana/grafana:7.1.5
docker.io/grafana/grafana:7.1.5
[root@vms200 ~]# docker tag docker.io/grafana/grafana:7.1.5 harbor.op.com/infra/grafana:7.1.5
[root@vms200 ~]# docker push harbor.op.com/infra/grafana:7.1.5
The push refers to repository [harbor.op.com/infra/grafana]
9c957ea29f01: Pushed
7fcdc437fb25: Pushed
5c98ed105d7e: Pushed
43376507b219: Pushed
cb596e3b6acf: Pushed
50644c29ef5a: Pushed
7.1.5: digest: sha256:dfd940ed4dd82a6369cb057fe5ab4cc8c774c1c5b943b2f4b618302a7979de61 size: 1579
Prepare the manifests
[root@vms200 ~]# mkdir /data/k8s-yaml/grafana && cd /data/k8s-yaml/grafana
RBAC
[root@vms200 grafana]# vi /data/k8s-yaml/grafana/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: grafana
rules:
- apiGroups:
- "*"
resources:
- namespaces
- deployments
- pods
verbs:
- get
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
kubernetes.io/cluster-service: "true"
name: grafana
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: grafana
subjects:
- kind: User
name: k8s-node
Deployment
[root@vms200 grafana]# vi /data/k8s-yaml/grafana/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: grafana
name: grafana
name: grafana
namespace: infra
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 7
selector:
matchLabels:
name: grafana
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 1
type: RollingUpdate
template:
metadata:
labels:
app: grafana
name: grafana
spec:
containers:
- image: harbor.op.com/infra/grafana:7.1.5
imagePullPolicy: IfNotPresent
name: grafana
ports:
- containerPort: 3000
protocol: TCP
volumeMounts:
- mountPath: /var/lib/grafana
name: data
imagePullSecrets:
- name: harbor
nodeName: vms22.cos.com
restartPolicy: Always
securityContext:
runAsUser: 0
volumes:
- nfs:
server: vms200
path: /data/nfs-volume/grafana
name: data
Create the grafana data directory
[root@vms200 grafana]# mkdir /data/nfs-volume/grafana
Service
[root@vms200 grafana]# vi /data/k8s-yaml/grafana/service.yaml
apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: infra
spec:
ports:
- port: 3000
protocol: TCP
selector:
app: grafana
type: ClusterIP
Ingress
[root@vms200 grafana]# vi /data/k8s-yaml/grafana/ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: grafana
namespace: infra
spec:
rules:
- host: grafana.op.com
http:
paths:
- path: /
backend:
serviceName: grafana
servicePort: 3000
Apply the manifests
On any compute node:
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/grafana/rbac.yaml
clusterrole.rbac.authorization.k8s.io/grafana created
clusterrolebinding.rbac.authorization.k8s.io/grafana created
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/grafana/deployment.yaml
deployment.apps/grafana created
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/grafana/service.yaml
service/grafana created
[root@vms22 ~]# kubectl apply -f http://k8s-yaml.op.com/grafana/ingress.yaml
ingress.extensions/grafana created
[root@vms22 ~]# kubectl get pod -l name=grafana -o wide -n infra
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
grafana-7677b5db6b-hrf87 1/1 Running 0 38m 172.26.22.8 vms22.cos.com <none> <none>
Configure DNS
On vms11:
[root@vms11 ~]# vi /var/named/op.com.zone
...
grafana A 192.168.26.10
Remember to increment the zone serial number.
[root@vms11 ~]# systemctl restart named
[root@vms11 ~]# dig -t A grafana.op.com @192.168.26.11 +short
192.168.26.10
Access in a browser
http://grafana.op.com (username: admin, password: admin)
After logging in, change the admin password (admin123), then continue.
Configure Grafana
Appearance
Configuration > Preferences:
UI Theme: Light
Home Dashboard: Default
Timezone: Browser time
Then Save.
Plugins
Configuration > Plugins shows installed plugins; the following five need to be installed:
- grafana-kubernetes-app
- grafana-clock-panel
- grafana-piechart-panel
- briangann-gauge-panel
- natel-discrete-panel
Plugins can be installed in two ways (plugin bundles: https://github.com/swbook/k8s-grafana-plugins):
- Method 1: enter the container and run grafana-cli plugins install $plugin_name
- Method 2: download the plugin zip manually and unpack it:
1. Look up the plugin version $version at https://grafana.com/api/plugins/repo/$plugin_name
2. Download the zip: wget https://grafana.com/api/plugins/$plugin_name/versions/$version/download
3. Unzip it into /var/lib/grafana/plugins
- Whichever method is used, restart the Grafana pod after installing.
Once Grafana is confirmed up, enter the grafana container and install the plugins (method 1).
On any compute node (vms22):
[root@vms22 ~]# kubectl -n infra exec -it grafana-7677b5db6b-hrf87 -- /bin/bash
bash-5.0# grafana-cli plugins install grafana-kubernetes-app
installing grafana-kubernetes-app @ 1.0.1
from: https://grafana.com/api/plugins/grafana-kubernetes-app/versions/1.0.1/download
into: /var/lib/grafana/plugins
✔ Installed grafana-kubernetes-app successfully
Restart grafana after installing plugins . <service grafana-server restart>
bash-5.0# grafana-cli plugins install grafana-clock-panel
installing grafana-clock-panel @ 1.1.1
from: https://grafana.com/api/plugins/grafana-clock-panel/versions/1.1.1/download
into: /var/lib/grafana/plugins
✔ Installed grafana-clock-panel successfully
Restart grafana after installing plugins . <service grafana-server restart>
bash-5.0# grafana-cli plugins install grafana-piechart-panel
installing grafana-piechart-panel @ 1.6.0
from: https://grafana.com/api/plugins/grafana-piechart-panel/versions/1.6.0/download
into: /var/lib/grafana/plugins
✔ Installed grafana-piechart-panel successfully
Restart grafana after installing plugins . <service grafana-server restart>
bash-5.0# grafana-cli plugins install briangann-gauge-panel
installing briangann-gauge-panel @ 0.0.6
from: https://grafana.com/api/plugins/briangann-gauge-panel/versions/0.0.6/download
into: /var/lib/grafana/plugins
✔ Installed briangann-gauge-panel successfully
Restart grafana after installing plugins . <service grafana-server restart>
bash-5.0# grafana-cli plugins install natel-discrete-panel
installing natel-discrete-panel @ 0.1.0
from: https://grafana.com/api/plugins/natel-discrete-panel/versions/0.1.0/download
into: /var/lib/grafana/plugins
✔ Installed natel-discrete-panel successfully
Restart grafana after installing plugins . <service grafana-server restart>
After installation, verify on vms200:
[root@vms200 grafana]# cd /data/nfs-volume/grafana/plugins
[root@vms200 plugins]# ll
total 0
drwxr-xr-x 4 root root 253 Sep 12 20:39 briangann-gauge-panel
drwxr-xr-x 5 root root 253 Sep 12 20:34 grafana-clock-panel
drwxr-xr-x 4 root root 198 Sep 12 20:31 grafana-kubernetes-app
drwxr-xr-x 4 root root 233 Sep 12 20:35 grafana-piechart-panel
drwxr-xr-x 5 root root 216 Sep 12 20:41 natel-discrete-panel
Installation Method 2 (the plugins were already installed via Method 1; this is shown only as an example)
On vms200:
[root@vms200 grafana]# cd /data/nfs-volume/grafana/plugins
- Kubernetes App
Download URL: https://grafana.com/api/plugins/grafana-kubernetes-app/versions/1.0.1/download
[root@vms200 plugins]# wget https://grafana.com/api/plugins/grafana-kubernetes-app/versions/1.0.1/download -O grafana-kubernetes-app.zip
...
[root@vms200 plugins]# unzip grafana-kubernetes-app.zip
- Clock Panel
Download URL: https://grafana.com/api/plugins/grafana-clock-panel/versions/1.1.1/download
- Pie Chart
Download URL: https://grafana.com/api/plugins/grafana-piechart-panel/versions/1.6.0/download
- D3 Gauge
Download URL: https://grafana.com/api/plugins/briangann-gauge-panel/versions/0.0.6/download
- Discrete
Download URL: https://grafana.com/api/plugins/natel-discrete-panel/versions/0.1.0/download
After the plugins are installed, restart the Grafana pod.
Configuration > Plugins: select Kubernetes from the plugin list; when the Enable button appears, click it, and the Kubernetes icon shows up in the left-hand menu.
Once the Kubernetes plugin is enabled, it ships 4 dashboards.
Configure the Grafana data source
Configuration > Data Sources, select Prometheus.
HTTP

| key | value |
| :---- | :---- |
| URL | http://prometheus.op.com |
| Access | Server(Default) |
| HTTP Method | GET |
| TLS Client Auth | checked |
| With CA Cert | checked |
| CA Cert | paste the contents of vms200:/opt/certs/ca.pem |
| Client Cert | paste the contents of vms200:/opt/certs/client.pem |
| Client Key | paste the contents of vms200:/opt/certs/client-key.pem |
Click Save & Test (you may need to click it several times).
Configure the Kubernetes cluster dashboard
Kubernetes > + New Cluster
- Add a new cluster

| key | value |
| :---- | :---- |
| Name | myk8s |

- Prometheus Read

| key | value |
| :---- | :---- |
| Datasource | Prometheus |

After selecting the datasource, fill in the remaining options.
- HTTP

| key | value |
| :---- | :---- |
| URL | https://192.168.26.10:8443 (the api-server VIP) |
| Access | Server(Default) (must be actively selected here) |

- Auth

| key | value |
| :---- | :---- |
| TLS Client Auth | checked |
| With CA Cert | checked |
Paste the contents of ca.pem, client.pem, and client-key.pem into the CA Cert, Client Cert, and Client Key text boxes respectively.
- Save
After adding the cluster, go to Configuration > Data Sources, click myk8s, and at the bottom of the page click Save & Test. Ignore any HTTP Error Forbidden that appears (probably caused by the HTTPS request); click Save & Test a few more times, and Grafana will start fetching data.
Click Kubernetes and the dashboards appear. (If they do not, switch to the lower Grafana version 5.4.2.)
Note:
- In the K8S Container dashboard, replace pod_name with container_label_io_kubernetes_pod_name in every panel. For example, select Total Memory Usage, choose Edit from its dropdown menu, and replace pod_name with container_label_io_kubernetes_pod_name; the graph then renders. Repeat this for all panels (see the PromQL sketch after the list below).
- K8S-Cluster
- K8S-Node
- K8S-Deployments
Tune these 4 dashboards to your actual environment.
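A hypothetical example of the pod_name edit mentioned above, assuming a typical memory-usage panel query (the exact expressions in the shipped dashboards may differ):
# Before: the panel keys on the old cAdvisor label, which yields no data here
sum(container_memory_usage_bytes{pod_name=~"$pod"}) by (pod_name)
# After: key on the label this cAdvisor deployment actually exposes
sum(container_memory_usage_bytes{container_label_io_kubernetes_pod_name=~"$pod"}) by (container_label_io_kubernetes_pod_name)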
Configure custom dashboards
Based on the data in the Prometheus data source, configure the following dashboards:
- etcd dashboard
- traefik dashboard
- generic dashboard
- JMX dashboard
- blackbox dashboard
The JSON definitions for these dashboards can be downloaded from: https://github.com/swbook/k8s-GrafanaDashboard
After downloading the JSON files, import them (Import dashboard from file or Grafana.com); a scripted alternative is sketched after the examples below.
Examples:
- JMX dashboard
- blackbox dashboard
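If you prefer to script the import instead of clicking through the UI, Grafana's HTTP API accepts dashboard JSON via POST /api/dashboards/db. A minimal sketch, assuming a downloaded file named etcd-dashboard.json that is a plain dashboard export (exports containing __inputs need the import endpoint with input mappings instead) and the admin password set earlier:
[root@vms200 ~]# curl -s -X POST http://grafana.op.com/api/dashboards/db \
    -u admin:admin123 \
    -H 'Content-Type: application/json' \
    -d "{\"dashboard\": $(cat etcd-dashboard.json), \"overwrite\": true}"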
8 Deploy the alertmanager alerting component
On the ops host vms200:
Prepare the images
[root@vms200 ~]# docker pull docker.io/prom/alertmanager:v0.14.0
[root@vms200 ~]# docker tag prom/alertmanager:v0.14.0 harbor.op.com/infra/alertmanager:v0.14.0
[root@vms200 ~]# docker push harbor.op.com/infra/alertmanager:v0.14.0
[root@vms200 ~]# docker pull prom/alertmanager:v0.21.0
[root@vms200 ~]# docker tag prom/alertmanager:v0.21.0 harbor.op.com/infra/alertmanager:v0.21.0
[root@vms200 ~]# docker push harbor.op.com/infra/alertmanager:v0.21.0
Prepare the resource configuration manifests
[root@vms200 ~]# mkdir /data/k8s-yaml/alertmanager && cd /data/k8s-yaml/alertmanager
- configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: infra
data:
  config.yml: |-
    global:
      # Declare an alert resolved after this long without it firing again
      resolve_timeout: 5m
      # Email (SMTP) delivery settings
      smtp_smarthost: 'smtp.qq.com:25'
      smtp_from: '385314590@qq.com'
      smtp_auth_username: '385314590@qq.com'
      smtp_auth_password: 'XXXX'
      smtp_require_tls: false
    templates:
    - '/etc/alertmanager/*.tmpl'
    # Root route for every incoming alert; defines the distribution policy
    route:
      # Labels used to regroup incoming alerts: e.g. many alerts carrying both
      # cluster=A and alertname=LatencyHigh are aggregated into a single group
      group_by: ['alertname', 'cluster']
      # After a new alert group is created, wait at least group_wait before the
      # first notification, so alerts for the same group can be batched together
      group_wait: 30s
      # After the first notification, wait group_interval before notifying
      # about new alerts added to the group
      group_interval: 5m
      # If an alert has already been sent successfully, wait repeat_interval
      # before resending it
      repeat_interval: 5m
      # Default receiver: alerts not matched by any route are sent here
      receiver: default
    receivers:
    - name: 'default'
      email_configs:
      - to: 'k8s_cloud@126.com'
        send_resolved: true
Note: change these to your own email addresses! A template circulating online configures Chinese-language alert emails:
...
receivers:
- name: 'default'
email_configs:
- to: 'xxxx@qq.com'
send_resolved: true
html: '{{ template "email.to.html" . }}'
headers: { Subject: " {{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }
email.tmpl: |
{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts }}
告警程序: prometheus_alert <br>
告警级别: {{ .Labels.severity }} <br>
告警类型: {{ .Labels.alertname }} <br>
故障主机: {{ .Labels.instance }} <br>
告警主题: {{ .Annotations.summary }} <br>
触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range .Alerts }}
告警程序: prometheus_alert <br>
告警级别: {{ .Labels.severity }} <br>
告警类型: {{ .Labels.alertname }} <br>
故障主机: {{ .Labels.instance }} <br>
告警主题: {{ .Annotations.summary }} <br>
触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
恢复时间: {{ .EndsAt.Format "2006-01-02 15:04:05" }} <br>
{{ end }}{{ end -}}
{{- end }}
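Before wrapping the routing config into the ConfigMap, it can be worth validating it with amtool, which ships inside the alertmanager image. A minimal sketch, assuming you keep a standalone copy of the config at /tmp/config.yml (the path is illustrative):
[root@vms200 ~]# docker run --rm -v /tmp/config.yml:/tmp/config.yml \
    --entrypoint /bin/amtool harbor.op.com/infra/alertmanager:v0.21.0 \
    check-config /tmp/config.yml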
- deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: alertmanager
namespace: infra
spec:
replicas: 1
selector:
matchLabels:
app: alertmanager
template:
metadata:
labels:
app: alertmanager
spec:
containers:
- name: alertmanager
image: harbor.op.com/infra/alertmanager:v0.21.0
args:
- "--config.file=/etc/alertmanager/config.yml"
- "--storage.path=/alertmanager"
ports:
- name: alertmanager
containerPort: 9093
volumeMounts:
- name: alertmanager-cm
mountPath: /etc/alertmanager
volumes:
- name: alertmanager-cm
configMap:
name: alertmanager-config
imagePullSecrets:
- name: harbor
- service.yaml
apiVersion: v1
kind: Service
metadata:
name: alertmanager
namespace: infra
spec:
selector:
app: alertmanager
ports:
- port: 80
targetPort: 9093
Prometheus reaches Alertmanager by its Service name rather than through an ingress: since Prometheus and Alertmanager both run in the infra namespace, the bare name alertmanager resolves via cluster DNS to this Service on port 80.
Apply the resource configuration manifests
On vms21 or vms22:
[root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/alertmanager/configmap.yaml
configmap/alertmanager-config created
[root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/alertmanager/deployment.yaml
deployment.apps/alertmanager created
[root@vms21 ~]# kubectl apply -f http://k8s-yaml.op.com/alertmanager/service.yaml
service/alertmanager created
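Optionally, confirm the pod came up (the app=alertmanager label comes from deployment.yaml above):
[root@vms21 ~]# kubectl get pod -n infra -l app=alertmanager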
Add alerting rules
- rules.yml
[root@vms200 ~]# vi /data/nfs-volume/prometheus/etc/rules.yml
groups:
- name: hostStatsAlert
rules:
- alert: hostCpuUsageAlert
expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }}%)"
- alert: hostMemUsageAlert
expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }}%)"
- alert: OutOfInodes
    expr: node_filesystem_files_free{fstype="overlay",mountpoint="/"} / node_filesystem_files{fstype="overlay",mountpoint="/"} * 100 < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Out of inodes (instance {{ $labels.instance }})"
description: "Disk is almost running out of available inodes (< 10% left) (current value: {{ $value }})"
- alert: OutOfDiskSpace
expr: node_filesystem_free{fstype="overlay",mountpoint ="/rootfs"} / node_filesystem_size{fstype="overlay",mountpoint ="/rootfs"} * 100 < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Out of disk space (instance {{ $labels.instance }})"
description: "Disk is almost full (< 10% left) (current value: {{ $value }})"
- alert: UnusualNetworkThroughputIn
expr: sum by (instance) (irate(node_network_receive_bytes[2m])) / 1024 / 1024 > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Unusual network throughput in (instance {{ $labels.instance }})"
description: "Host network interfaces are probably receiving too much data (> 100 MB/s) (current value: {{ $value }})"
- alert: UnusualNetworkThroughputOut
expr: sum by (instance) (irate(node_network_transmit_bytes[2m])) / 1024 / 1024 > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Unusual network throughput out (instance {{ $labels.instance }})"
description: "Host network interfaces are probably sending too much data (> 100 MB/s) (current value: {{ $value }})"
- alert: UnusualDiskReadRate
expr: sum by (instance) (irate(node_disk_bytes_read[2m])) / 1024 / 1024 > 50
for: 5m
labels:
severity: warning
annotations:
summary: "Unusual disk read rate (instance {{ $labels.instance }})"
description: "Disk is probably reading too much data (> 50 MB/s) (current value: {{ $value }})"
- alert: UnusualDiskWriteRate
expr: sum by (instance) (irate(node_disk_bytes_written[2m])) / 1024 / 1024 > 50
for: 5m
labels:
severity: warning
annotations:
summary: "Unusual disk write rate (instance {{ $labels.instance }})"
description: "Disk is probably writing too much data (> 50 MB/s) (current value: {{ $value }})"
- alert: UnusualDiskReadLatency
expr: rate(node_disk_read_time_ms[1m]) / rate(node_disk_reads_completed[1m]) > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Unusual disk read latency (instance {{ $labels.instance }})"
description: "Disk latency is growing (read operations > 100ms) (current value: {{ $value }})"
- alert: UnusualDiskWriteLatency
    expr: rate(node_disk_write_time_ms[1m]) / rate(node_disk_writes_completed[1m]) > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Unusual disk write latency (instance {{ $labels.instance }})"
description: "Disk latency is growing (write operations > 100ms) (current value: {{ $value }})"
- name: http_status
rules:
- alert: ProbeFailed
expr: probe_success == 0
for: 1m
labels:
severity: error
annotations:
summary: "Probe failed (instance {{ $labels.instance }})"
description: "Probe failed (current value: {{ $value }})"
- alert: StatusCode
    expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
for: 1m
labels:
severity: error
annotations:
summary: "Status Code (instance {{ $labels.instance }})"
description: "HTTP status code is not 200-399 (current value: {{ $value }})"
- alert: SslCertificateWillExpireSoon
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
for: 5m
labels:
severity: warning
annotations:
summary: "SSL certificate will expire soon (instance {{ $labels.instance }})"
description: "SSL certificate expires in 30 days (current value: {{ $value }})"
- alert: SslCertificateHasExpired
expr: probe_ssl_earliest_cert_expiry - time() <= 0
for: 5m
labels:
severity: error
annotations:
summary: "SSL certificate has expired (instance {{ $labels.instance }})"
description: "SSL certificate has expired already (current value: {{ $value }})"
- alert: BlackboxSlowPing
expr: probe_icmp_duration_seconds > 2
for: 5m
labels:
severity: warning
annotations:
summary: "Blackbox slow ping (instance {{ $labels.instance }})"
description: "Blackbox ping took more than 2s (current value: {{ $value }})"
- alert: BlackboxSlowRequests
expr: probe_http_duration_seconds > 2
for: 5m
labels:
severity: warning
annotations:
summary: "Blackbox slow requests (instance {{ $labels.instance }})"
description: "Blackbox request took more than 2s (current value: {{ $value }})"
- alert: PodCpuUsagePercent
expr: sum(sum(label_replace(irate(container_cpu_usage_seconds_total[1m]),"pod","$1","container_label_io_kubernetes_pod_name", "(.*)"))by(pod) / on(pod) group_right kube_pod_container_resource_limits_cpu_cores *100 )by(container,namespace,node,pod,severity) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Pod cpu usage percent has exceeded 80% (current value: {{ $value }}%)"
- Append to the Prometheus configuration file: add the following at the end to hook up the alerting rules
[root@vms200 ~]# vi /data/nfs-volume/prometheus/etc/prometheus.yml
...
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager"]
rule_files:
- "/data/etc/rules.yml"
- Reload the configuration: you could simply restart the Prometheus pod, but in production Prometheus is so heavyweight that deleting its pod can easily drag down the cluster, so use the graceful-reload mechanism Prometheus supports instead. There are three methods:
From any host:
[root@vms200 ~]# curl -X POST http://prometheus.op.com/-/reload
Or (on any compute node):
[root@vms21 ~]# kubectl get pod -n infra | grep prom
prometheus-76fc88fbcc-bqznx 1/1 Running 0 5h59m
[root@vms21 ~]# kubectl exec -n infra prometheus-76fc88fbcc-bqznx -it -- kill -HUP 1
Or (Prometheus runs on vms21):
[root@vms21 ~]# ps aux|grep prometheus | grep -v grep
root 192560 26.2 10.0 1801172 401368 ? Ssl 06:19 0:47 /bin/prometheus --config.file=/data/etc/prometheus.yml --storage.tsdb.path=/data/prom-db --storage.tsdb.retention=72h --storage.tsdb.min-block-duration=10m --web.enable-lifecycle
[root@vms21 ~]# kill -SIGHUP 192560
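After reloading, you can confirm that Prometheus discovered Alertmanager through its HTTP API (the /api/v1/alertmanagers endpoint exists in Prometheus 2.x):
[root@vms200 ~]# curl -s http://prometheus.op.com/api/v1/alertmanagers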
On the Prometheus web UI (Alerts page), the new rule groups now show up:
/data/etc/rules.yml > hostStatsAlert
OutOfDiskSpace (0 active)
OutOfInodes (0 active)
UnusualDiskReadLatency (0 active)
UnusualDiskReadRate (0 active)
UnusualDiskWriteLatency (0 active)
UnusualDiskWriteRate (0 active)
UnusualNetworkThroughputIn (0 active)
UnusualNetworkThroughputOut (0 active)
hostCpuUsageAlert (0 active)
hostMemUsageAlert (0 active)
/data/etc/rules.yml > http_status
BlackboxSlowPing (0 active)
BlackboxSlowRequests (0 active)
PodCpuUsagePercent (0 active)
ProbeFailed (0 active)
SslCertificateHasExpired (0 active)
SslCertificateWillExpireSoon (0 active)
StatusCode (0 active)
Test the alerting
After stopping the dubbo-demo-service pods, the blackbox HTTP probe fails and an alert is triggered:
Watch closely: the alert first shows Pending (2) and its entry turns yellow, then it becomes Firing (2).
Once the entry turns red, the alert email is sent.
- Check the mailbox
If you need to customize alerting rules and alert content, study PromQL and edit the configuration files accordingly.
With that, Prometheus monitoring and alerting have been delivered successfully!