1. Prometheus service

[Figure 1: Prometheus workflow diagram]

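  • Multi-dimensional data model (a time series is identified by a metric name and a set of key/value pairs)
  • Flexible query language: PromQL
  • No dependency on distributed storage; a single server node works on its own
  • Data is collected by pulling over HTTP
  • Time series can also be pushed through the Pushgateway
  • Supports many kinds of graphs and dashboards through Grafana
  • Prometheus monitors remote machines through exporters installed on them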
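In this data model, the metric name plus the complete label set identifies one time series, and exporters expose their samples over HTTP in the plain-text exposition format that Prometheus pulls. As a minimal sketch, the hypothetical Bash helper `series_line` (not part of Prometheus) renders a single sample in that format, which makes it easy to see that the same metric name with different labels yields separate series:

```shell
# Render one sample in the Prometheus text exposition format:
#   <metric_name>{<label>="<value>",...} <sample_value>
# Usage: series_line NAME VALUE [label=value ...]
# Assumption: label values are NOT escaped, so keep them free of '"', '\' and newlines.
series_line() {
  local name=$1 value=$2
  shift 2
  local labels="" kv
  for kv in "$@"; do
    # ${kv%%=*} is the label name, ${kv#*=} is the label value;
    # ${labels:+,} adds a comma separator only after the first label
    labels+="${labels:+,}${kv%%=*}=\"${kv#*=}\""
  done
  if [ -n "$labels" ]; then
    printf '%s{%s} %s\n' "$name" "$labels" "$value"
  else
    printf '%s %s\n' "$name" "$value"
  fi
}

# Example (two distinct series of the same metric):
#   series_line node_network_up 1 device=docker0 instance=172.31.243.137:9100
#   -> node_network_up{device="docker0",instance="172.31.243.137:9100"} 1
#   series_line node_network_up 0 device=eth1 instance=172.31.243.137:9100
```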

1.1 Pull the image

```shell
# Pull the image
docker pull prom/prometheus:v2.11.0
```

1.2 Edit the configuration file

Create the configuration directory:

```shell
mkdir /etc/prometheus
```

Then edit `/etc/prometheus/prometheus.yml`:

```yaml
# my global config
global:
  scrape_interval: 15s      # Scrape targets every 15 seconds (default: 1 minute)
  evaluation_interval: 15s  # Evaluate rules every 15 seconds (default: 1 minute)
  # scrape_timeout is set to the global default (10s).

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/*.rules"
  # - "second_rules.yml"

# Scrape configuration: one entry per job.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'server'
    static_configs:
      - targets: ['172.31.243.137:9100']

# Alertmanager configuration: scheme and static_configs belong to the same list item
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets: ["172.31.243.137:9093"]
```
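A YAML indentation slip in this file only shows up when Prometheus fails to start, so it can help to validate it first. The sketch below assumes that `promtool` is available at `/bin/promtool` inside the `prom/prometheus` image and runs `promtool check config`, which also checks the rule files listed under `rule_files`. The function name `check_prom_config` and the `DRY_RUN` switch are hypothetical helpers for illustration:

```shell
# Validate prometheus.yml (and the rule files it references) with promtool,
# assumed to be shipped at /bin/promtool inside the Prometheus image.
# Set DRY_RUN=1 to print the docker command instead of running it.
PROM_IMAGE=${PROM_IMAGE:-prom/prometheus:v2.11.0}

check_prom_config() {
  local conf_dir=${1:-/etc/prometheus}
  local -a cmd=(docker run --rm --entrypoint /bin/promtool
    -v "${conf_dir}/:/etc/prometheus/" "$PROM_IMAGE"
    check config /etc/prometheus/prometheus.yml)
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Usage:
#   check_prom_config              # validate /etc/prometheus/prometheus.yml
#   DRY_RUN=1 check_prom_config    # only print the command
```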

1.3 Start Prometheus

```shell
# Start the container; /data/prometheus/ is the host directory that stores the TSDB data
docker run -d -p 9090:9090 \
  -v /etc/prometheus/:/etc/prometheus/ \
  -v /data/prometheus/:/prometheus \
  prom/prometheus:v2.11.0

# Check the listening port
netstat -nltup | grep 9090
tcp6   0   0 :::9090   :::*   LISTEN   3737/docker-proxy

# Check the container
docker ps -a | grep prometheus
4af17fb8fe53   prom/prometheus:v2.11.0   "/bin/prometheus -..."   About an hour ago   Up About an hour   0.0.0.0:9090->9090/tcp   kind_leakey
```
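Right after `docker run` the container can be up while Prometheus is still loading its configuration and TSDB. A minimal polling sketch, assuming the `/-/ready` endpoint that Prometheus 2.x serves (the function `wait_ready` and its defaults are hypothetical):

```shell
# Poll a readiness URL until it answers with a 2xx status.
# Usage: wait_ready URL [TRIES] [SLEEP_SECONDS]
wait_ready() {
  local url=$1 tries=${2:-30} pause=${3:-2} i
  for ((i = 1; i <= tries; i++)); do
    # -f: treat HTTP errors as failures, -s: silent; the response body is discarded
    if curl -fs -o /dev/null --max-time 5 "$url"; then
      echo "ready: $url"
      return 0
    fi
    # Sleep between attempts, but not after the last one
    [ "$i" -lt "$tries" ] && sleep "$pause"
  done
  echo "not ready after $tries attempts: $url" >&2
  return 1
}

# Example: wait_ready http://172.31.243.137:9090/-/ready
```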

Open http://<ip>:9090 in a browser to reach the Prometheus web UI.

1.4 Monitor hosts with node-exporter

Create `node_export.yaml`. It runs node-exporter as a DaemonSet, so one pod runs on every node (the toleration also admits master nodes). The manifest uses `apps/v1` because `extensions/v1beta1` DaemonSets were removed in Kubernetes 1.16; `apps/v1` also requires `spec.selector`.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    name: node-exporter
spec:
  selector:
    matchLabels:
      name: node-exporter
  template:
    metadata:
      labels:
        name: node-exporter
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v0.18.1
          ports:
            - containerPort: 9100
          resources:
            requests:
              cpu: 0.15
          securityContext:
            privileged: true
          args:
            - --path.procfs
            - /host/proc
            - --path.sysfs
            - /host/sys
            - --collector.filesystem.ignored-mount-points
            # Quote the regex only once; '"..."' would put literal double quotes into the pattern
            - "^/(sys|proc|dev|host|etc)($|/)"
          volumeMounts:
            - name: dev
              mountPath: /host/dev
            - name: proc
              mountPath: /host/proc
            - name: sys
              mountPath: /host/sys
            - name: rootfs
              mountPath: /rootfs
      tolerations:
        - key: "node-role.kubernetes.io/master"
          operator: "Exists"
          effect: "NoSchedule"
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
```

Deploy it and check that node-exporter listens on port 9100 of the host:

```shell
kubectl apply -f ./node_export.yaml

[root@hf-aipaas-172-31-243-137 home]# netstat -nltup | grep 9100
tcp6   0   0 :::9100   :::*   LISTEN   17093/node_exporter
```
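Before relying on Prometheus, the exporter's output can be spot-checked directly, since `/metrics` is plain text. A minimal sketch with a hypothetical helper `metric_value` that prints the value of one series read from stdin. It splits on whitespace, so it assumes the series (name plus any `{labels}`) contains no spaces:

```shell
# Print the value of the first sample whose series text (metric name plus
# optional {labels}) equals NAME exactly; # HELP / # TYPE lines are skipped.
# Exit status is 1 when no sample matches.
# Usage: curl -s http://172.31.243.137:9100/metrics | metric_value node_load1
metric_value() {
  awk -v series="$1" '
    /^#/ { next }
    $1 == series { print $2; found = 1; exit }
    END { exit !found }
  '
}
```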

2. Alertmanager service

2.1 Pull the Alertmanager image

```shell
# Pull the image
docker pull quay.io/prometheus/alertmanager:v0.20.0
```

2.2 Edit the configuration file

Edit `/etc/alertmanager/config.yml`:

```yaml
# Global settings
global:
  resolve_timeout: 5m                    # Declare an alert resolved if it has not been updated for this long (default 5m)
  smtp_smarthost: 'smtp.sina.com:25'     # SMTP server (host:port)
  smtp_from: '******@sina.com'           # Sender address
  smtp_auth_username: '******@sina.com'  # SMTP login user
  smtp_auth_password: '******'           # SMTP password or authorization code
  smtp_require_tls: false

# Notification template files (relative to this config file's directory)
templates:
  - 'template/*.tmpl'

# Routing tree
route:
  group_by: ['alertname']    # Labels used to group alerts
  group_wait: 10s            # How long to wait before sending the first notification for a new group
  group_interval: 1m         # How long to wait before notifying about new alerts added to an already-notified group
  repeat_interval: 1m        # How often to resend a still-firing alert; for email, don't set this too low or the SMTP server may reject the mail as too frequent
  receiver: 'mail-receiver'  # Default receiver; must match a name under receivers

# Receivers
receivers:
  - name: 'mail-receiver'
    email_configs:
      - to: 'ycli15@iflytek.com'                   # Recipient address
        html: '{{ template "test.html" . }}'        # Email body template
        headers: { Subject: "[WARN] Alert email" }  # Email subject
    # webhook_configs:
    #   - url: 'http://127.0.0.1:5001'
    #     send_resolved: true

# An inhibition rule mutes alerts that match the target matchers while an alert
# matching the source matchers is firing. Both alerts must have equal values for
# the labels listed in `equal`.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```

Template: edit `/etc/alertmanager/template/test.tmpl`:

```
{{ define "test.html" }}
<table border="1">
  <tr>
    <td>Alert</td>
    <td>Instance</td>
    <td>Value</td>
    <td>Start time</td>
  </tr>
  {{ range $i, $alert := .Alerts }}
  <tr>
    <td>{{ index $alert.Labels "alertname" }}</td>
    <td>{{ index $alert.Labels "instance" }}</td>
    <td>{{ index $alert.Annotations "value" }}</td>
    <td>{{ $alert.StartsAt }}</td>
  </tr>
  {{ end }}
</table>
{{ end }}
```
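As with `prometheus.yml`, the Alertmanager configuration can be checked before the container starts. This sketch assumes `amtool` is shipped at `/bin/amtool` in the `quay.io/prometheus/alertmanager` image and uses its `check-config` subcommand; `check_am_config` and `DRY_RUN` are hypothetical names:

```shell
# Validate config.yml (and the templates it references) with amtool,
# assumed to be shipped at /bin/amtool inside the Alertmanager image.
# Set DRY_RUN=1 to print the docker command instead of running it.
AM_IMAGE=${AM_IMAGE:-quay.io/prometheus/alertmanager:v0.20.0}

check_am_config() {
  local conf_dir=${1:-/etc/alertmanager}
  local -a cmd=(docker run --rm --entrypoint /bin/amtool
    -v "${conf_dir}/:/etc/alertmanager/" "$AM_IMAGE"
    check-config /etc/alertmanager/config.yml)
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Usage: check_am_config    # validate /etc/alertmanager/config.yml
```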

2.3 Start Alertmanager

```shell
docker run -d -p 9093:9093 \
  -v /etc/alertmanager/:/etc/alertmanager/ \
  quay.io/prometheus/alertmanager:v0.20.0 \
  --config.file=/etc/alertmanager/config.yml
```
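To check the email route and the `test.html` template without waiting for a real rule to fire, an alert can be posted straight to Alertmanager. A minimal sketch, assuming the `POST /api/v2/alerts` endpoint (a JSON array of alerts with `labels` and `annotations`); the helper names, the `AM_URL` variable and the example values are hypothetical. It sets the `alertname` and `instance` labels and the `value` annotation because those are the fields the template reads:

```shell
# Build a one-alert JSON payload for Alertmanager's /api/v2/alerts endpoint.
# Assumption: values are inserted without JSON escaping, so keep them free of '"' and '\'.
# Usage: test_alert_json ALERTNAME INSTANCE VALUE
test_alert_json() {
  printf '[{"labels":{"alertname":"%s","instance":"%s"},"annotations":{"value":"%s"}}]\n' \
    "$1" "$2" "$3"
}

# POST the payload to Alertmanager (AM_URL defaults to the host used in this guide).
send_test_alert() {
  local am_url=${AM_URL:-http://172.31.243.137:9093}
  test_alert_json "$@" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "${am_url}/api/v2/alerts"
}

# Example: send_test_alert ManualTest 172.31.243.137:9100 42
```

With `group_wait: 10s`, the email for a new group should go out roughly ten seconds after the alert arrives.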

2.4 Configure Prometheus alerting rules

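Edit `/etc/prometheus/alert.rules` (the name matches the `/etc/prometheus/*.rules` pattern under `rule_files`), then restart the Prometheus container so the rules are loaded:

```yaml
groups:
  - name: example
    rules:
      # Alert for any instance that has been unreachable for more than 5 minutes.
      # To force a test alert, temporarily change the expression to `up == 1`.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
          value: "{{ $value }}"  # Filled into the value column of the test.html email template

      # Demo alert on the docker0 interface: node_network_up is 1 while the interface is up,
      # so `> 0` fires whenever docker0 is up; use `== 0` to alert when it goes down.
      - alert: NODE_NETWORK_UP_DOWN
        expr: node_network_up{device="docker0",instance="172.31.243.137:9100",job="server"} > 0
        for: 10m
        annotations:
          summary: "Network interface {{ $labels.device }} on {{ $labels.instance }} is up"
          description: "{{ $labels.device }} on {{ $labels.instance }} has reported node_network_up = {{ $value }} for more than 10 minutes."
          value: "{{ $value }}"
```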
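Alert state can also be checked from the shell through the Prometheus HTTP API, which is handy on a server without a browser. This sketch assumes the instant-query endpoint `GET /api/v1/query` and uses `curl -G --data-urlencode` to URL-encode the PromQL expression; `prom_query` and `PROM_URL` are hypothetical names. Prometheus exposes pending and firing alerts as the `ALERTS` series:

```shell
# Run an instant PromQL query against Prometheus' HTTP API; prints the JSON response.
# -G sends the --data-urlencode data as a URL-encoded query string on a GET request.
# Usage: prom_query 'ALERTS{alertstate="firing"}'
prom_query() {
  local base=${PROM_URL:-http://172.31.243.137:9090}
  curl -fsS -G "${base}/api/v1/query" --data-urlencode "query=$1"
}

# Examples:
#   prom_query 'up == 0'                      # targets that are currently down
#   prom_query 'ALERTS{alertstate="firing"}'  # alerts that are currently firing
```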

Trigger the alert.

Check the alert in the Prometheus UI at <ip>:9090.

Check the alert in Alertmanager at <ip>:9093.


3. Pushgateway service

3.1 Pull the Pushgateway image

```shell
docker pull prom/pushgateway:v1.2.0
```

3.2 Deploy

```shell
docker run -d --name=pg -p 9091:9091 prom/pushgateway:v1.2.0
```

3.3 Configure Prometheus

Add a `pushgateway` job to `scrape_configs`, then restart the Prometheus container so the new job is loaded:

```yaml
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'server'
    static_configs:
      - targets: ['172.31.243.137:9100']

  - job_name: pushgateway
    static_configs:
      - targets: ['172.31.243.137:9091']
        labels:
          instance: pushgateway
```

Check the Prometheus monitoring UI at <ip>:9090.

3.4 Example: pushing data to the Pushgateway

Push data with curl (`--data-binary` sends the metrics in the body of an HTTP POST):

```shell
cat <<EOF | curl --data-binary @- http://172.31.243.137:9091/metrics/job/some_job/instance/some_instance
# TYPE some_metric counter
some_metric{label="val1"} 42
# TYPE another_metric gauge
# HELP another_metric Just an example.
another_metric 2398.283
EOF
```
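The path after `/metrics/job/` is the grouping key: the job name followed by optional label/value pairs. A POST replaces only the pushed metric names within that group, and an HTTP DELETE on the same URL removes the whole group. A minimal sketch of those calls; `push_url`, `push_metrics` and `delete_group` are hypothetical helpers, and label values are assumed not to need URL encoding:

```shell
# Build a Pushgateway grouping-key URL: <base>/metrics/job/<job>[/<label>/<value>...]
# Assumption: values contain no '/' or other characters that need URL encoding.
# A trailing label without a value is ignored.
# Usage: push_url BASE JOB [LABEL VALUE ...]
push_url() {
  local url="$1/metrics/job/$2"
  shift 2
  while [ "$#" -ge 2 ]; do
    url+="/$1/$2"
    shift 2
  done
  printf '%s\n' "$url"
}

# POST text-format metrics from stdin; only same-named metrics in the group are replaced.
push_metrics() {
  curl -fsS --data-binary @- "$(push_url "$@")"
}

# DELETE all metrics in the group.
delete_group() {
  curl -fsS -X DELETE "$(push_url "$@")"
}

# Example:
#   printf 'some_metric 42\n' | push_metrics http://172.31.243.137:9091 some_job instance some_instance
#   delete_group http://172.31.243.137:9091 some_job instance some_instance
```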

Then check the result in the web UI.

image.png