1 Grafana

Grafana website: https://grafana.com/
Grafana Dashboards: https://grafana.com/grafana/dashboards
Grafana Plugins: https://grafana.com/grafana/plugins
Docker Hub: https://registry.hub.docker.com/r/grafana/grafana

    # Pull the image
    docker pull grafana/grafana:latest
    # Start the container
    docker run -d \
      -p 3000:3000 \
      --name=grafana \
      --restart=always \
      grafana/grafana:latest
    # Verify: open http://localhost:3000/
    # Default username/password: admin/admin
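
Once Grafana is running, Prometheus still has to be registered as a data source. This can be done in the web UI; the sketch below shows one possible way to do it through Grafana's HTTP API (`POST /api/datasources`) with the default admin/admin credentials. The Prometheus address `http://192.168.100.15:9090` is an assumption taken from the scrape targets used later in these notes, and the function name is hypothetical. The request is only built here, not sent.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, prometheus_url,
                             user="admin", password="admin"):
    """Build (but do not send) a POST /api/datasources request."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",      # Grafana's backend fetches the data
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


# Assumed address of Prometheus as seen from the Grafana container;
# "localhost" would point at the Grafana container itself.
request = build_datasource_request("http://localhost:3000",
                                   "http://192.168.100.15:9090")
# To actually create the data source: urllib.request.urlopen(request)
```

If the admin password has already been changed, pass the new one through the `password` argument.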

2 Prometheus

Prometheus website: https://prometheus.io/
Prometheus documentation: https://prometheus.io/docs/introduction/overview/
Docker Hub: https://registry.hub.docker.com/r/prom/prometheus

    # Pull the image
    docker pull prom/prometheus:latest
    # Start the container with the configuration file at /tmp/prometheus.yml
    docker run -d \
      -p 9090:9090 \
      -v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml \
      --name=prometheus \
      --restart=always \
      prom/prometheus:latest
    # Alternatively, mount the configuration file from ./prom/ in the current directory.
    # Run only one of the two commands: both use the container name "prometheus".
    docker run -d \
      -p 9090:9090 \
      -v "$(pwd)/prom/prometheus.yml:/etc/prometheus/prometheus.yml" \
      --name=prometheus \
      --restart=always \
      prom/prometheus:latest
    # Verify: open http://localhost:9090/
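
Opening the page in a browser works for a manual check. For scripts, a minimal sketch of an automated check is given below: it polls the `/-/ready` management endpoint of Prometheus until it answers 200. The helper names, retry count, and delay are illustrative assumptions, not part of the original setup.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code           # e.g. 503 while Prometheus is still starting
    except OSError:
        return None               # connection refused, timeout, DNS failure


def wait_until_ready(base_url, fetch=http_status, attempts=10, delay=1.0):
    """Poll <base_url>/-/ready until it answers 200 or the attempts run out."""
    url = base_url.rstrip("/") + "/-/ready"
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_ready("http://localhost:9090")` returns `True` once the container started above is serving requests. The `fetch` parameter exists only so the polling logic can be exercised without a running server.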

Configuration file

Configuration reference: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
Sample configuration: https://github.com/prometheus/prometheus/blob/release-2.21/config/testdata/conf.good.yml

    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['192.168.100.15:9090']
            labels:
              instance: prometheus
      - job_name: 'node'
        static_configs:
          - targets: ['192.168.100.15:9100']
            labels:
              instance: node
      - job_name: 'dcgm'
        static_configs:
          - targets: ['192.168.100.15:9400']
            labels:
              instance: dcgm
      - job_name: 'cadvisor'
        static_configs:
          - targets: ['192.168.100.15:8085']
            labels:
              instance: cadvisor
      - job_name: 'spring'
        # Spring Boot Actuator serves Prometheus metrics at /actuator/prometheus
        # by default; adjust this path to whatever the application exposes.
        metrics_path: '/actuator/prometheus/metrics'
        static_configs:
          - targets: ['192.168.100.7:8080']
            labels:
              instance: spring
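
After Prometheus loads this configuration, the built-in `up` metric shows whether each job's last scrape succeeded. The following is a rough sketch of checking that through the Prometheus HTTP API (`/api/v1/query`). The function names are hypothetical, and `fetch_up` assumes Prometheus is reachable at `http://localhost:9090`.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """URL for an instant query against the Prometheus HTTP API."""
    return (base_url.rstrip("/") + "/api/v1/query?"
            + urllib.parse.urlencode({"query": promql}))


def summarize_up(payload):
    """Map each job name to True (all its targets are up) or False."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    result = {}
    for sample in payload["data"]["result"]:
        job = sample["metric"].get("job", "")
        _timestamp, value = sample["value"]    # value is a string such as "1"
        result[job] = result.get(job, True) and float(value) == 1.0
    return result


def fetch_up(base_url="http://localhost:9090"):
    """Run the `up` query and summarize it (needs a reachable Prometheus)."""
    url = build_query_url(base_url, "up")
    with urllib.request.urlopen(url, timeout=5) as resp:
        return summarize_up(json.load(resp))
```

With the jobs configured above, `fetch_up()` would return a dictionary keyed by `prometheus`, `node`, `dcgm`, `cadvisor`, and `spring`. A `False` entry usually means the exporter is not running or the target address is wrong.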

3 PushGateway

Docker Hub: https://hub.docker.com/r/prom/pushgateway
GitHub: https://github.com/prometheus/pushgateway

    # Pull the image
    docker pull prom/pushgateway:latest
    # Start the container
    docker run -d \
      -p 9091:9091 \
      --name=pushgateway \
      --restart=always \
      prom/pushgateway:latest
    # Verify: metrics are pushed to a URL of the following form
    # http://localhost:9091/metrics/job/<JOB_NAME>{/<LABEL_NAME>/<LABEL_VALUE>}
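
To make the URL pattern above concrete, here is a minimal sketch of pushing a single gauge to the Pushgateway in the Prometheus text exposition format. The helper names are hypothetical, as are the job name, label, and metric in the usage note. Label values containing `/` need the Pushgateway's base64 path form and are rejected by this sketch rather than handled.

```python
import urllib.parse
import urllib.request


def _segment(value):
    """Encode one path segment; '/' and empty values are out of scope here."""
    value = str(value)
    if value == "" or "/" in value:
        raise ValueError("unsupported path segment: %r" % value)
    return urllib.parse.quote(value, safe="")


def push_url(base_url, job, grouping=None):
    """Build /metrics/job/<JOB_NAME>{/<LABEL_NAME>/<LABEL_VALUE>}."""
    parts = ["metrics", "job", _segment(job)]
    for name, value in (grouping or {}).items():
        parts += [name, _segment(value)]
    return base_url.rstrip("/") + "/" + "/".join(parts)


def render_gauge(name, value, help_text):
    """Render one gauge in the text exposition format."""
    return (f"# HELP {name} {help_text}\n"
            f"# TYPE {name} gauge\n"
            f"{name} {value}\n")       # the body must end with a newline


def push(base_url, job, body, grouping=None):
    """POST the metrics (needs a reachable Pushgateway)."""
    req = urllib.request.Request(push_url(base_url, job, grouping),
                                 data=body.encode(), method="POST")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

A batch job could then report its result with `push("http://localhost:9091", "backup", render_gauge("backup_last_success_timestamp_seconds", 1700000000, "Last successful backup."), {"instance": "db1"})`. Prometheus picks the value up on its next scrape of the Pushgateway, which requires a scrape job for port 9091 in `prometheus.yml`.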

4 Exporter

4.1 cAdvisor: container monitoring

Docker Hub: https://registry.hub.docker.com/r/google/cadvisor
GitHub: https://github.com/google/cadvisor
Grafana Dashboard:
https://grafana.com/grafana/dashboards/893
https://grafana.com/grafana/dashboards/315

    # Pull the image
    docker pull google/cadvisor:latest
    # Start the container (host port 8085 maps to container port 8080)
    docker run -d \
      --volume=/:/rootfs:ro \
      --volume=/var/run:/var/run:ro \
      --volume=/sys:/sys:ro \
      --volume=/var/lib/docker/:/var/lib/docker:ro \
      --volume=/dev/disk/:/dev/disk:ro \
      --publish=8085:8080 \
      --name=cadvisor \
      --privileged \
      --device=/dev/kmsg \
      google/cadvisor:latest
    # Verify (use the published host port and the /metrics endpoint)
    curl localhost:8085/metrics

4.2 node: host monitoring (CPU/MEM/DISK/NET…)

Docker Hub: https://registry.hub.docker.com/r/prom/node-exporter
GitHub: https://github.com/prometheus/node_exporter
Grafana Dashboard:
https://grafana.com/grafana/dashboards/1860
https://grafana.com/grafana/dashboards/11074

    # Pull the image
    docker pull prom/node-exporter:latest
    # Start the container (host network, so the exporter listens on host port 9100)
    docker run -d \
      --net="host" \
      --pid="host" \
      -v "/:/host:ro,rslave" \
      --name=node \
      prom/node-exporter:latest \
      --path.rootfs=/host
    # Verify
    curl localhost:9100/metrics
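
The exporter's `/metrics` endpoint returns plain text in the Prometheus exposition format, which is easy to inspect programmatically. The sketch below is a deliberately simplified parser, not a full implementation of the format: it assumes sample lines carry no trailing timestamp, which holds for node_exporter output but not for every exporter. The function names are hypothetical.

```python
def parse_metrics(text):
    """Parse exposition-format text into (series, value) pairs.

    Simplification: assumes no trailing timestamps on sample lines.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):    # skip blanks, HELP and TYPE lines
            continue
        series, _, raw_value = line.rpartition(" ")
        samples.append((series, float(raw_value)))
    return samples


def load_average(samples):
    """Pick out node_load1/5/15 as a dict, if present."""
    wanted = {"node_load1", "node_load5", "node_load15"}
    return {name: value for name, value in samples if name in wanted}
```

The text itself can be fetched with `urllib.request.urlopen("http://localhost:9100/metrics").read().decode()` and passed to `parse_metrics`. For anything beyond a quick check, querying Prometheus is usually the better route, since it already stores the parsed samples.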

4.3 dcgm: GPU monitoring

Docker Hub: https://registry.hub.docker.com/r/nvidia/dcgm-exporter
GitHub: https://github.com/NVIDIA/gpu-monitoring-tools
Grafana Dashboard: https://grafana.com/grafana/dashboards/12239

    # Pull the image
    docker pull nvidia/dcgm-exporter:latest
    # Start the container
    docker run -d \
      -p 9400:9400 \
      --gpus all \
      --name=dcgm \
      nvidia/dcgm-exporter:latest
    # Verify
    curl localhost:9400/metrics

4.4 Spring Boot

Grafana Dashboard: https://grafana.com/grafana/dashboards/4701

5 Kubernetes deployment

prometheus-cfg.yaml

    kind: ConfigMap
    apiVersion: v1
    metadata:
      labels:
        app: prometheus
      name: prometheus-config
      namespace: monitor-sa
    data:
      prometheus.yml: |
        global:
          scrape_interval: 15s
          scrape_timeout: 10s
          evaluation_interval: 1m
        scrape_configs:
        - job_name: 'kubernetes-node'
          kubernetes_sd_configs:
          - role: node
          relabel_configs:
          - source_labels: [__address__]
            regex: '(.*):10250'
            replacement: '${1}:9100'
            target_label: __address__
            action: replace
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
        - job_name: 'kubernetes-node-cadvisor'
          kubernetes_sd_configs:
          - role: node
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        - job_name: 'kubernetes-apiserver'
          kubernetes_sd_configs:
          - role: endpoints
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
        - job_name: 'kubernetes-service-endpoints'
          kubernetes_sd_configs:
          - role: endpoints
          relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name
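
The `replace` rules in this ConfigMap are the least obvious part: Prometheus joins the values of `source_labels` with `;`, matches the fully anchored `regex` against the result, and writes the expanded `replacement` into `target_label`. The sketch below approximates that behaviour with Python's `re` module so the rules can be tried out offline. It is not Prometheus' own implementation (which uses RE2), and the function name and the sample addresses in the usage note are made up for illustration.

```python
import re


def relabel_replace(source_values, regex, replacement="$1", separator=";"):
    """Approximate the `replace` relabel action.

    Returns the new label value, or None when the regex does not match
    (Prometheus then leaves the target label unchanged).
    """
    joined = separator.join(source_values)
    match = re.fullmatch(regex, joined)         # relabel regexes are anchored
    if match is None:
        return None

    def expand(ref):
        return match.group(int(ref.group(1))) or ""

    # Handles $1 and ${1}; named groups are not covered by this sketch.
    return re.sub(r"\$\{?(\d+)\}?", expand, replacement)
```

For example, `relabel_replace(["10.0.0.5:10250"], r"(.*):10250", "${1}:9100")` rewrites the kubelet address to the node exporter port, and `relabel_replace(["10.244.1.7:8080", "9102"], r"([^:]+)(?::\d+)?;(\d+)", "$1:$2")` swaps in the port taken from the `prometheus.io/port` annotation. When the annotation is absent the second source value is empty, the regex fails to match, and the address is left as discovered.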

prometheus-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server     # name of the Deployment
  namespace: monitor-sa       # namespace
  labels:
    app: prometheus           # label
spec:
  replicas: 1                 # number of replicas
  selector:
    matchLabels:
      app: prometheus         # Pod labels to select
      component: server
    #matchExpressions:
    #- {key: app, operator: In, values: [prometheus]}
    #- {key: component, operator: In, values: [server]}
  template:                   # Pod template
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'   # whether to scrape this Pod
    spec:
      nodeName: k8s-node              # schedule the Pod onto this node
      serviceAccountName: monitor     # ServiceAccount to run as
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        imagePullPolicy: IfNotPresent   # use the local image if present, otherwise pull from the registry
        command:
          - prometheus
          - --config.file=/etc/prometheus/prometheus.yml
          - --storage.tsdb.path=/prometheus
          - --storage.tsdb.retention=720h
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-config
          subPath: prometheus.yml
        - mountPath: /prometheus/
          name: prometheus-storage-volume
      volumes:   # mount the ConfigMap and the data directory into the container paths above
        - name: prometheus-config
          configMap:
            name: prometheus-config
            items:
              - key: prometheus.yml
                path: prometheus.yml
                mode: 0644
        - name: prometheus-storage-volume
          hostPath:
            path: /data          # must already exist on the node (type: Directory)
            type: Directory

prometheus-svc.yaml

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor-sa
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
    - port: 9090
      targetPort: 9090
      nodePort: 30000
      protocol: TCP
  selector:
    app: prometheus
    component: server
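
Apply the manifests, then check that the Pod is running:

    kubectl apply -f prometheus-cfg.yaml
    kubectl apply -f prometheus-deployment.yaml
    kubectl apply -f prometheus-svc.yaml
    kubectl get pod -n monitor-sa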

Reference: https://blog.csdn.net/liuchao666888/article/details/107636647