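Introduction:

  • node_exporter: collects metrics from each monitored host, such as real-time CPU, memory, and disk status;
  • prometheus: scrapes the data exposed by node_exporter, processes the raw samples, and evaluates the configured alerting rules;
  • grafana: queries prometheus and displays the monitoring data on custom dashboards;

Deployment: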

  1. node_exporter

```shell
# Download the release tarball
wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz

# Extract it and move the binary into place
tar zxvf node_exporter-1.1.2.linux-amd64.tar.gz
mv node_exporter-1.1.2.linux-amd64/node_exporter /usr/local/bin/node_exporter

# Register a systemd service
cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Enable the service at boot and start it now
systemctl enable --now node_exporter

# Check the status
systemctl status node_exporter
```

![image.png](https://cdn.nlark.com/yuque/0/2021/png/21704071/1622603382622-7e3f08fc-f7c8-4362-90f4-d39484719a12.png#clientId=u6783cde2-221c-4&from=paste&height=445&id=u50456b74&margin=%5Bobject%20Object%5D&name=image.png&originHeight=445&originWidth=1248&originalType=binary&size=79820&status=done&style=none&taskId=u5e53ff24-0db3-4e2d-8879-4c87a640b4b&width=1248)

Open port 9100 in a browser to see the exported metrics:

![image.png](https://cdn.nlark.com/yuque/0/2021/png/21704071/1622604081398-e0053a88-ee26-40a3-a803-b2948abb5153.png#clientId=u6783cde2-221c-4&from=paste&height=914&id=u02c9217d&margin=%5Bobject%20Object%5D&name=image.png&originHeight=914&originWidth=921&originalType=binary&size=85782&status=done&style=none&taskId=u027fc22f-713b-4a9e-8788-cfb112e0f90&width=921)

  2. prometheus

```shell
# Download the release tarball
wget https://github.com/prometheus/prometheus/releases/download/v2.27.1/prometheus-2.27.1.linux-amd64.tar.gz
# Extract it
tar -xvf prometheus-2.27.1.linux-amd64.tar.gz
cd prometheus-2.27.1.linux-amd64
# Register a systemd service
# (the unit below expects the binary, prometheus.yml and the data directory under /home/prometheus)
cat > /etc/systemd/system/prometheus.service << EOF
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/home/prometheus/prometheus --config.file=/home/prometheus/prometheus.yml --storage.tsdb.path=/home/prometheus/data --storage.tsdb.retention.time=60d
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
# Enable the service at boot and start it now
systemctl enable --now prometheus
```
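
Once both services are running, a quick command-line check can complement the browser view. The sketch below is only an illustration: `metric_value` is a hypothetical helper name, and the URLs in the comments assume node_exporter and Prometheus listen on their default ports on the local machine.

```shell
#!/usr/bin/env bash
# Print the value of an unlabelled metric read from the Prometheus text
# exposition format on stdin. Comment lines (# HELP / # TYPE) never match
# because their first field is "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Against a live exporter (assumes node_exporter on localhost:9100):
#   curl -s http://localhost:9100/metrics | metric_value node_load5
# Prometheus readiness endpoint (assumes the default port 9090):
#   curl -s http://localhost:9090/-/ready
```

The helper only handles metrics without labels, such as `node_load5`; labelled series like `node_cpu_seconds_total{...}` would need a different match.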

Reference configuration files:

prometheus.yml:

    # Global settings
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
    # Alertmanager address
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - 10.10.16.218:9093
    # Alerting rule files
    rule_files:
      - "rules.yml"
    # Scrape targets
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'cat-test-host'
        static_configs:
          - targets: ['localhost:9100']
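
Before restarting Prometheus with an edited configuration, it can help to validate the file with `promtool`, which ships in the same release tarball as the `prometheus` binary. The wrapper below is a minimal sketch: the function name and the `PROMTOOL` override variable are hypothetical, and the default path is taken from the systemd unit used in these notes.

```shell
#!/usr/bin/env bash
# Validate a Prometheus config file (promtool also checks the rule files
# it references). PROMTOOL is a hypothetical override for the binary path;
# by default promtool is looked up on PATH.
check_prometheus_config() {
  local config="${1:-/home/prometheus/prometheus.yml}"
  "${PROMTOOL:-promtool}" check config "$config"
}

# Usage:
#   check_prometheus_config /home/prometheus/prometheus.yml \
#     && systemctl restart prometheus
```

Chaining the restart with `&&` means a syntax error in `prometheus.yml` or `rules.yml` leaves the running service untouched.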

rules.yml:

    groups:
      - name: General alert templates
        rules:
          - alert: "Instance down"
            expr: up == 0
            for: 1m
            labels:
              severity: page
            annotations:
              summary: "Instance {{ $labels.instance }} is down"
              description: "Job {{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"
          - alert: "Free disk space below 10%"
            expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 90
            for: 30s
            annotations:
              summary: "Instance {{ $labels.instance }}: low disk space alert"
              description: "Disk {{ $labels.device }} on {{ $labels.instance }} has less than 10% free space, current value: {{ $value }}"
          - alert: "Available memory below 20%"
            expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes )) * 100 > 80
            for: 30s
            labels:
              severity: warning
            annotations:
              summary: "Instance {{ $labels.instance }}: low memory alert"
              description: "{{ $labels.instance }} has less than 20% memory available, current value: {{ $value }}"
          - alert: "CPU load average above 4"
            expr: node_load5 > 4
            for: 30s
            annotations:
              summary: "Instance {{ $labels.instance }}: CPU load alert"
              description: "The 5-minute load average on {{ $labels.instance }} has exceeded 4, current value: {{ $value }}"
          - alert: "Disk read I/O above 30MB/s"
            expr: irate(node_disk_read_bytes_total{device="sda"}[1m]) > 30000000
            for: 30s
            annotations:
              summary: "Instance {{ $labels.instance }}: disk read I/O alert"
              description: "The disk read rate on {{ $labels.instance }} has exceeded 30MB/s, current value: {{ $value }}"
          - alert: "Disk write I/O above 30MB/s"
            expr: irate(node_disk_written_bytes_total{device="sda"}[1m]) > 30000000
            for: 30s
            annotations:
              summary: "Instance {{ $labels.instance }}: disk write I/O alert"
              description: "The disk write rate on {{ $labels.instance }} has exceeded 30MB/s, current value: {{ $value }}"
          - alert: "Network transmit rate above 10MB/s"
            expr: irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) > 10000000
            for: 30s
            annotations:
              summary: "Instance {{ $labels.instance }}: network traffic alert"
              description: "Traffic on interface {{ $labels.device }} of {{ $labels.instance }} has exceeded 10MB/s, current value: {{ $value }}"
          - alert: "CPU usage above 90%"
            expr: 100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) *100) > 90
            for: 30s
            annotations:
              summary: "Instance {{ $labels.instance }}: CPU usage alert"
              description: "CPU usage on {{ $labels.instance }} has exceeded 90%, current value: {{ $value }}"
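
To make the disk rule's arithmetic concrete, the sketch below reproduces `(size - avail) / size * 100 > 90` outside Prometheus. The function names and the sample figures (a 500 GB filesystem with 40 GB available) are made up for illustration; they are not taken from a real host.

```shell
#!/usr/bin/env bash
# Percentage of a filesystem in use, mirroring
# (size - avail) / size * 100 from the disk rule.
disk_used_percent() {
  awk -v size="$1" -v avail="$2" \
    'BEGIN { printf "%.1f\n", (size - avail) / size * 100 }'
}

# Exit status 0 when usage exceeds the threshold (90 by default,
# the same value the rule compares against).
disk_alert_fires() {
  awk -v size="$1" -v avail="$2" -v limit="${3:-90}" \
    'BEGIN { exit !((size - avail) / size * 100 > limit) }'
}

# Hypothetical example: 500 GB filesystem with 40 GB available.
disk_used_percent 500000000000 40000000000   # prints 92.0
```

With these numbers the usage is 92%, so the expression is true and the alert would fire once the condition has held for the rule's `for: 30s` window.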

Reference configuration for Alertmanager and the DingTalk webhook plugin:

alertmanager/alertmanager.yml:

    global:
      resolve_timeout: 5m # Mark an alert as resolved if it has not been received again within 5 minutes
    route:
      group_by: [alertname] # Label used to group alerts
      group_wait: 10s # Wait 10 seconds so that alerts in a group are sent together
      group_interval: 10s # Interval between notifications for a group
      repeat_interval: 2m # Interval before a notification is repeated
      receiver: ops_notify # Default receiver
      #routes:
      #- receiver: ops_notify
      #  match_re:
      #    alertname: Instance alive alert|Disk usage alert # Match on the alert names defined in the rules
    receivers:
      - name: ops_notify
        webhook_configs:
          - url: http://localhost:8060/dingtalk/webhook_legacy/send # webhook_legacy is the target that handles the notification
            send_resolved: true # Also notify when an alert is resolved
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'firing'
        equal: ['alertname', 'dev', 'instance']
    #templates:
    #  - '/home/prometheus/alertmanager/template/default.tmpl'
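
One way to exercise the notification path without waiting for a real rule to fire is to post a hand-made alert to Alertmanager's v2 API. This is a sketch under assumptions: `test_alert_payload` is a hypothetical helper, the alert name, severity and instance are placeholder test values, and the usage comment assumes Alertmanager is reachable on `localhost:9093`.

```shell
#!/usr/bin/env bash
# Build a one-alert JSON payload for Alertmanager's POST /api/v2/alerts.
# The default label values are hypothetical test data. Arguments are
# inserted verbatim, so they must not contain double quotes.
test_alert_payload() {
  local name="${1:-TestAlert}" severity="${2:-warning}" instance="${3:-localhost:9100}"
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"description":"manual test alert"}}]\n' \
    "$name" "$severity" "$instance"
}

# Usage (assumes Alertmanager on localhost:9093):
#   test_alert_payload | curl -s -X POST -H 'Content-Type: application/json' \
#     --data @- http://localhost:9093/api/v2/alerts
```

If the route and receiver are set up as in the configuration above, the test alert should reach the `ops_notify` webhook after the `group_wait` delay.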

prometheus-webhook-dingtalk-1.4.0.linux-amd64/config.yml:

    ## Request timeout
    # timeout: 5s
    ## Notification template paths
    templates:
      - /home/prometheus/prometheus-webhook-dingtalk-1.4.0.linux-amd64/contrib/templates/legacy/template2.tmpl
    ## You can also override default template using `default_message`
    ## The following example to use the 'legacy' template from v0.3.0
    # default_message:
    #   title: '{{ template "legacy.title" . }}'
    #   text: '{{ template "legacy.content" . }}'
    ## Targets, previously was known as "profiles"
    targets:
      webhook1:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        # secret for signature
        secret: SEC000000000000000000000
      # Target referenced by the webhook URL in alertmanager.yml
      webhook_legacy:
        url: https://oapi.dingtalk.com/robot/send?access_token='your token'
        # Customize template content
        message:
          # Use legacy template
          title: '{{ template "ding.link.title" . }}'
          text: '{{ template "ding.link.content" . }}'
      webhook_mention_all:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        mention:
          all: true
      webhook_mention_users:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
        mention:
          mobiles: ['156xxxx8827', '189xxxx8325']
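
The target names in this file are what tie it to `alertmanager.yml`: the webhook URL configured there ends in `/dingtalk/<target>/send`. The small sketch below just assembles that URL from a target name; the function name is hypothetical and the default host is the `localhost:8060` address used in the Alertmanager configuration.

```shell
#!/usr/bin/env bash
# Assemble the Alertmanager webhook URL for a given DingTalk target name.
# Usage: dingtalk_webhook_url TARGET [HOST:PORT]
dingtalk_webhook_url() {
  local target="$1" host="${2:-localhost:8060}"
  printf 'http://%s/dingtalk/%s/send\n' "$host" "$target"
}

dingtalk_webhook_url webhook_legacy
# prints http://localhost:8060/dingtalk/webhook_legacy/send
```

Switching the Alertmanager receiver to another target, for example `webhook_mention_all`, would only require changing that path segment.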

prometheus-webhook-dingtalk-1.4.0.linux-amd64/contrib/templates/legacy/template2.tmpl:

    {{ define "__subject" }}[Linux basic monitoring alert: {{ .Alerts.Firing | len }}] {{ end }}
    {{ define "__text_list" }}{{ range . }}
    {{ range .Labels.SortedPairs }}
    {{ if eq .Name "instance" }}
    * Instance:
    {{ .Value | html }}{{ end }}
    {{ end }}
    {{ range .Labels.SortedPairs }}
    {{ if eq .Name "severity" }}
    * Severity:
    {{ .Value | html }}{{ end }}
    {{ if eq .Name "hostname" }}
    * Hostname:
    {{ .Value | html }}{{ end }}
    {{ end }}
    {{ range .Annotations.SortedPairs }}
    {{ if eq .Name "description" }}
    * Details:
    {{ .Value | html }}{{ end }}
    {{ end }}
    * Triggered at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{"-------------------------------------------"}}
    {{ end }}{{ end }}
    {{ define "ding.link.title" }}{{ template "__subject" . }}{{ end }}
    {{ define "ding.link.content" }}
    {{ if gt (len .Alerts.Firing) 0 }}### <font color=#FF0000>Alert firing</font> [{{ .Alerts.Firing | len }}]
    {{ template "__text_list" .Alerts.Firing }}{{ end }}
    {{ if gt (len .Alerts.Resolved) 0 }}### <font color=#32CD32>Alert resolved</font> [{{ .Alerts.Resolved | len }}]
    {{ end }}
    {{ end }}
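
The `28800e9` in the template is a duration in nanoseconds: 28800 seconds, or 8 hours, which shifts the UTC start time to UTC+8 before formatting. The sketch below reproduces the same shift in shell; `format_utc8` is a hypothetical name, and the `-d @SECONDS` form assumes GNU `date`.

```shell
#!/usr/bin/env bash
# Convert the nanosecond offset used in the template to hours:
# 28800e9 ns = 28800 s = 8 h.
offset_ns=28800000000000
echo "$(( offset_ns / 1000000000 / 3600 ))"   # prints 8

# Format a Unix timestamp shifted by 8 hours, using the same
# year-month-day hour:minute:second layout as the template.
# Requires GNU date for the -d @SECONDS syntax.
format_utc8() {
  date -u -d "@$(( $1 + 28800 ))" '+%Y-%m-%d %H:%M:%S'
}

format_utc8 0   # prints 1970-01-01 08:00:00
```

A host in a different time zone would need a different offset in the template, since the shift is hard-coded rather than derived from the local zone.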
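
  3. grafana

    Several installation methods are available: https://grafana.com/docs/grafana/latest/installation/
    Recommended dashboard: https://grafana.com/grafana/dashboards/8919
    Copy the dashboard ID and enter it in Grafana's import dialog to import the dashboard.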
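
Before a dashboard can show anything, Grafana needs Prometheus registered as a data source. This can be done in the web UI; as an alternative, the sketch below builds the JSON body for Grafana's `POST /api/datasources` endpoint. The helper name and the data source name `Prometheus` are arbitrary choices, and the usage comment assumes a fresh Grafana on `localhost:3000` with the default `admin:admin` login.

```shell
#!/usr/bin/env bash
# JSON body for Grafana's POST /api/datasources endpoint.
# The data source name "Prometheus" is an arbitrary choice; the URL
# defaults to the local Prometheus port used in these notes.
datasource_payload() {
  local url="${1:-http://localhost:9090}"
  printf '{"name":"Prometheus","type":"prometheus","url":"%s","access":"proxy"}\n' "$url"
}

# Usage (assumes Grafana on localhost:3000 with the default admin:admin login):
#   datasource_payload | curl -s -X POST -H 'Content-Type: application/json' \
#     --data @- http://admin:admin@localhost:3000/api/datasources
```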