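1. Introduction to Prometheus
Prometheus is an open-source monitoring and alerting system with a built-in time-series database. It was originally developed at SoundCloud, is written in Go, and its design was inspired by Google's Borgmon monitoring system.
A cluster can be monitored by installing Prometheus directly, but that approach has drawbacks. High availability for components such as Prometheus and Alertmanager, for example, can be built by hand, but a custom setup of that kind is inflexible and hard to reuse.
Fortunately, there is a higher-level, more cloud-native way to monitor a Kubernetes cluster: Prometheus Operator.
The Operator pattern was introduced by CoreOS. An Operator is an application-specific controller that extends the Kubernetes API in order to create, configure, and manage complex stateful applications such as databases, caches, and monitoring systems. Prometheus Operator is a controller built on this pattern that manages Prometheus clusters.
Prometheus Operator provides the Prometheus, ServiceMonitor, Alertmanager, and PrometheusRule CRDs used in this article. It continuously watches objects of these kinds and keeps the deployed components in line with them.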
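- Prometheus: represents a Prometheus Server deployment
- ServiceMonitor: an abstraction over the exporters that expose metrics endpoints
- Alertmanager: an abstraction over the Alertmanager component
- PrometheusRule: an abstraction over alerting rule files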
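To see which of these kinds are registered on a cluster, the CRDs can be listed by API group. The following is a small sketch, assuming `kubectl` is configured for the target cluster. The helper name `list_monitoring_crds` is made up for this example; `monitoring.coreos.com` is the API group that appears in the manifests in section 4.3.

```shell
# Hypothetical helper: list the CRDs registered under the Prometheus Operator API group.
list_monitoring_crds() {
  # "-o name" prints one "customresourcedefinition.apiextensions.k8s.io/<name>" per line
  kubectl get crd -o name | grep 'monitoring\.coreos\.com'
}
```

Calling `list_monitoring_crds` after the Operator is installed should show, among others, the entries for prometheuses, servicemonitors, alertmanagers, and prometheusrules.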
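2. Quick Deployment of Prometheus Operator
2.1 Goals
- Deploy Prometheus Operator into the monitor namespace to monitor the cluster
- Protect the Prometheus and Alertmanager web UIs with HTTP basic auth
- Fix etcd metrics collection, which does not work with the default installation
- Fix metrics collection for the cluster's kube-proxy component
2.2 Quick Install with Helm 3
Official chart: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack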
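The install command later in this section refers to the chart as `prometheus-community/kube-prometheus-stack` and places everything in the `monitor` namespace, so both the Helm repository and the namespace have to exist first. A minimal sketch of that preparation, assuming neither has been set up yet; the helper name `prepare_monitoring_install` is made up for this example:

```shell
# Hypothetical helper: register the chart repository and create the target namespace.
prepare_monitoring_install() {
  # Chart repository published by the prometheus-community project
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts || return 1
  helm repo update || return 1
  # Namespace used throughout this article; this fails if it already exists
  kubectl create namespace monitor
}
```

Run `prepare_monitoring_install` once before creating the secrets described next, since they are created in the same namespace.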
Set up HTTP basic auth for the Ingress
https://kubernetes.github.io/ingress-nginx/examples/auth/basic/
Install the htpasswd tool:
yum install httpd-tools
Generate the basic auth secret as described in the ingress-nginx documentation:
```shell
htpasswd -c auth foo
kubectl create secret generic basic-auth --from-file=auth -n monitor
```
#### Create the secret for etcd metrics scraping
```shell
cd /etc/kubernetes/pki/etcd/
kubectl create secret generic etcd-client -n monitor --from-file=./
```
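Note that `--from-file=./` stores every file in the directory in the secret, which on a kubeadm control-plane node typically includes private keys such as `ca.key`. The values file later in this section only references `ca.crt`, `healthcheck-client.crt`, and `healthcheck-client.key`, so a narrower secret may be enough. A sketch of that variant, with a made-up helper name and the directory passed as an optional argument:

```shell
# Hypothetical helper: create the etcd-client secret from only the three files
# that the kubeEtcd serviceMonitor settings refer to.
# $1 (optional) is the etcd PKI directory; defaults to the path used above.
create_etcd_client_secret() {
  pki_dir="${1:-/etc/kubernetes/pki/etcd}"
  kubectl create secret generic etcd-client -n monitor \
    --from-file="$pki_dir/ca.crt" \
    --from-file="$pki_dir/healthcheck-client.crt" \
    --from-file="$pki_dir/healthcheck-client.key"
}
```

The files show up under `/etc/prometheus/secrets/etcd-client/` only when the secret is listed in the `secrets` field of the Prometheus object, as in the spec excerpt in section 4.1.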
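Change the kube-proxy metrics bind address
# Edit the kube-proxy configuration and change the metrics bind address from 127.0.0.1 to 0.0.0.0
kubectl -n kube-system edit cm kube-proxy
# Restart kube-proxy after the change (the pods carry the label k8s-app=kube-proxy, the same label the chart values below select on)
kubectl -n kube-system delete po -l k8s-app=kube-proxy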
The relevant part of the ConfigMap after the change:
apiVersion: v1
data:
  config.conf: |-
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    # ...
    # metricsBindAddress: 127.0.0.1:10249
    metricsBindAddress: 0.0.0.0:10249
    # ...
  kubeconfig.conf: |-
    # ...
kind: ConfigMap
metadata:
  labels:
    app: kube-proxy
  name: kube-proxy
  namespace: kube-system
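Prometheus scrapes kube-proxy over the node's address, which is why a listener bound to 127.0.0.1 cannot be reached. Once the kube-proxy pods have restarted, the endpoint can be checked by hand. A small sketch, assuming `curl` is available and the node IP is reachable from where the command runs; the helper name is made up for this example:

```shell
# Hypothetical helper: fetch the first lines of kube-proxy's metrics from a node.
# $1 is a node IP address reachable from the machine running the command.
check_kube_proxy_metrics() {
  curl -s --max-time 5 "http://$1:10249/metrics" | head -n 5
}
```

For example, `check_kube_proxy_metrics <node-ip>` should print a few metric lines if the new bind address is in effect, and nothing if the port is still bound to the loopback interface.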
Run the installation
helm install g-kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitor -f value.yaml
Example value.yaml:
alertmanager:
  ingress:
    enabled: true
    annotations:
      nginx.ingress.kubernetes.io/auth-type: basic
      nginx.ingress.kubernetes.io/auth-secret: basic-auth
      nginx.ingress.kubernetes.io/auth-realm: 'Authentication Required - Account Please!'
    labels: {}
    hosts:
      - alertmanager.3incloud.cn
    paths:
      - /
    tls:
      - secretName: tls-3incloudcn
        hosts:
          - alertmanager.3incloud.cn
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: glusterfs-storage
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
grafana:
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: "true"
    hosts:
      - grafana.3incloud.cn
    path: /
    tls:
      - secretName: tls-3incloudcn
        hosts:
          - grafana.3incloud.cn
kubeProxy:
  service:
    selector:
      k8s-app: kube-proxy
kubeEtcd:
  serviceMonitor:
    scheme: https
    insecureSkipVerify: true
    caFile: /etc/prometheus/secrets/etcd-client/ca.crt
    certFile: /etc/prometheus/secrets/etcd-client/healthcheck-client.crt
    keyFile: /etc/prometheus/secrets/etcd-client/healthcheck-client.key
prometheus:
  ingress:
    enabled: true
    annotations:
      nginx.ingress.kubernetes.io/auth-type: basic
      nginx.ingress.kubernetes.io/auth-secret: basic-auth
      nginx.ingress.kubernetes.io/auth-realm: 'Authentication Required - Account Please!'
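After the install command returns, it is worth confirming that the release and the Operator-managed objects actually came up. A minimal sketch, assuming the release name `g-kube-prometheus-stack` and the `monitor` namespace from the install command; the helper name is made up for this example:

```shell
# Hypothetical helper: check the Helm release and the objects it created.
check_monitoring_release() {
  # Release status as recorded by Helm
  helm status g-kube-prometheus-stack -n monitor || return 1
  # Pods of the Operator, Prometheus, Alertmanager, Grafana and the exporters
  kubectl get pods -n monitor || return 1
  # Custom resources managed by Prometheus Operator
  kubectl get prometheus,alertmanager -n monitor
}
```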
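3. Grafana
For the full set of Grafana configuration options, see https://github.com/grafana/helm-charts/blob/main/charts/grafana/values.yaml
More Grafana tips are still being explored.
4. Monitoring Additional Applications, with Ingress-Nginx as the Example
The four CRDs mentioned earlier are Prometheus, ServiceMonitor, Alertmanager, and PrometheusRule.
Monitoring for a new application is set up through ServiceMonitor and PrometheusRule objects: changes to these objects are discovered automatically and applied to Prometheus.
4.1 Discovery Rules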
# kubectl get Prometheus -A
NAMESPACE   NAME                                 VERSION   REPLICAS   AGE
monitor     g-kube-prometheus-stack-prometheus   v2.21.0   1          11h
# kubectl edit Prometheus -n monitor g-kube-prometheus-stack-prometheus
podMonitorNamespaceSelector: {}
podMonitorSelector:
  matchLabels:
    release: g-kube-prometheus-stack
portName: web
probeNamespaceSelector: {}
probeSelector:
  matchLabels:
    release: g-kube-prometheus-stack
replicas: 1
retention: 10d
routePrefix: /
# NOTE: PrometheusRule discovery matches objects in every namespace that carry the labels app=kube-prometheus-stack and release=g-kube-prometheus-stack
ruleNamespaceSelector: {}
ruleSelector:
  matchLabels:
    app: kube-prometheus-stack
    release: g-kube-prometheus-stack
secrets:
- etcd-client
securityContext:
  fsGroup: 2000
  runAsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000
serviceAccountName: g-kube-prometheus-stack-prometheus
# NOTE: ServiceMonitor discovery matches ServiceMonitor objects in every namespace that carry the label release=g-kube-prometheus-stack
serviceMonitorNamespaceSelector: {}
serviceMonitorSelector:
  matchLabels:
    release: g-kube-prometheus-stack
Note: ServiceMonitor and PrometheusRule objects are picked up by Prometheus only if they carry the labels listed above.
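The same label selectors can be passed to `kubectl` to see which objects Prometheus would pick up, which helps when a new ServiceMonitor does not appear among the targets. A sketch, assuming the selectors shown in the spec above; the helper name is made up for this example:

```shell
# Hypothetical helper: list the objects that the selectors above would match.
# $1 is the Helm release name used as the value of the "release" label.
list_discovered_objects() {
  # ServiceMonitors are selected on the release label alone
  kubectl get servicemonitor --all-namespaces -l "release=$1" || return 1
  # PrometheusRules must carry both the app and the release label
  kubectl get prometheusrule --all-namespaces -l "app=kube-prometheus-stack,release=$1"
}
```

For this installation the call would be `list_discovered_objects g-kube-prometheus-stack`. An object missing from the output is missing one of the required labels.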
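4.2 Enable Metrics for Ingress-Nginx
Our Ingress-Nginx was also installed from a Helm chart, so enabling metrics only requires turning on the relevant switches in its values.
Example values with metrics enabled: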
controller:
  image:
    repository: registry.cn-hangzhou.aliyuncs.com/vcors/ingress-nginx_controller
    digest: ""
  hostNetwork: true
  kind: DaemonSet
  hostPort:
    enabled: true
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: g-kube-prometheus-stack
    prometheusRule:
      enabled: true
      additionalLabels:
        app: kube-prometheus-stack
        release: g-kube-prometheus-stack
      rules:
        - alert: NGINXConfigFailed
          expr: count(nginx_ingress_controller_config_last_reload_successful == 0) > 0
          for: 1s
          labels:
            severity: critical
          annotations:
            description: bad ingress config - nginx config test failed
            summary: uninstall the latest ingress changes to allow config reloads to resume
        - alert: NGINXCertificateExpiry
          expr: (avg(nginx_ingress_controller_ssl_expire_time_seconds) by (host) - time()) < 604800
          for: 1s
          labels:
            severity: critical
          annotations:
            description: ssl certificate(s) will expire in less then a week
            summary: renew expiring certificates to avoid downtime
        - alert: NGINXTooMany500s
          expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"5.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
          for: 1m
          labels:
            severity: warning
          annotations:
            description: Too many 5XXs
            summary: More than 5% of all requests returned 5XX, this requires your attention
        - alert: NGINXTooMany400s
          expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"4.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
          for: 1m
          labels:
            severity: warning
          annotations:
            description: Too many 4XXs
            summary: More than 5% of all requests returned 4XX, this requires your attention
defaultBackend:
  enabled: true
  image:
    repository: registry.cn-hangzhou.aliyuncs.com/vcors/defaultbackend-amd64
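In this example, pay attention to the values under additionalLabels: they must match the labels that Prometheus selects on.
After the ingress-nginx configuration is updated, the new scrape targets appear on the Targets page in Prometheus, and the alerting rules appear on the Rules page.
4.3 A Quick Look at the ServiceMonitor and PrometheusRule Objects
The manifests below are largely self-explanatory. Monitoring any other application comes down to creating the same two kinds of objects.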
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: g-nginx-ingress
    meta.helm.sh/release-namespace: ingress-nginx
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: g-nginx-ingress
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/version: 0.40.2
    helm.sh/chart: ingress-nginx-3.7.1
    release: g-kube-prometheus-stack
  name: g-nginx-ingress-ingress-nginx-controller
  namespace: ingress-nginx
spec:
  endpoints:
  - interval: 30s
    port: metrics
  namespaceSelector:
    matchNames:
    - ingress-nginx
  selector:
    matchLabels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: g-nginx-ingress
      app.kubernetes.io/name: ingress-nginx
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-name: g-nginx-ingress
    meta.helm.sh/release-namespace: ingress-nginx
    prometheus-operator-validated: "true"
  creationTimestamp: "2020-10-23T13:23:55Z"
  generation: 1
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: g-nginx-ingress
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/version: 0.40.2
    helm.sh/chart: ingress-nginx-3.7.1
    release: g-kube-prometheus-stack
  name: g-nginx-ingress-ingress-nginx-controller
  namespace: ingress-nginx
spec:
  groups:
  - name: ingress-nginx
    rules:
    - alert: NGINXConfigFailed
      annotations:
        description: bad ingress config - nginx config test failed
        summary: uninstall the latest ingress changes to allow config reloads to resume
      expr: count(nginx_ingress_controller_config_last_reload_successful == 0) > 0
      for: 1s
      labels:
        severity: critical
    - alert: NGINXCertificateExpiry
      annotations:
        description: ssl certificate(s) will expire in less then a week
        summary: renew expiring certificates to avoid downtime
      expr: (avg(nginx_ingress_controller_ssl_expire_time_seconds) by (host) - time())
        < 604800
      for: 1s
      labels:
        severity: critical
    - alert: NGINXTooMany500s
      annotations:
        description: Too many 5XXs
        summary: More than 5% of all requests returned 5XX, this requires your attention
      expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"5.+"} ) / sum(nginx_ingress_controller_requests)
        ) > 5
      for: 1m
      labels:
        severity: warning
    - alert: NGINXTooMany400s
      annotations:
        description: Too many 4XXs
        summary: More than 5% of all requests returned 4XX, this requires your attention
      expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"4.+"} ) / sum(nginx_ingress_controller_requests)
        ) > 5
      for: 1m
      labels:
        severity: warning
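The thresholds in these rules are plain arithmetic, and working through them once makes the expressions easier to read. The sketch below reproduces the two calculations in shell with made-up request counts; `error_percent` is a hypothetical helper and is not part of any tool used in this article.

```shell
# One week in seconds: the threshold used by NGINXCertificateExpiry.
week_seconds=$((7 * 24 * 60 * 60))
echo "$week_seconds"

# Hypothetical helper mirroring the ratio in NGINXTooMany500s and NGINXTooMany400s:
# $1 = number of matching (4xx or 5xx) requests, $2 = total number of requests.
error_percent() {
  awk -v e="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * e / t }'
}

# 12 failing requests out of 200 is 6%, above the 5% threshold in the rules.
error_percent 12 200
```

As written, the two ratio expressions divide sums of raw counters, so the percentage covers all requests counted since the counters started rather than a recent time window.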
4.4 Add a Grafana Dashboard for Ingress Nginx
A widely downloaded nginx-ingress dashboard template:
https://grafana.com/grafana/dashboards/9614