Deploying Prometheus with Ansible

  1. ansible-playbook -i host_file service_deploy.yaml -e "tgz=prometheus-2.25.2.linux-amd64.tar.gz" -e "app=prometheus"
  2. Check the web UI

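The playbook `service_deploy.yaml` and the inventory are not shown here. A hypothetical minimal `host_file` in Ansible's YAML inventory format might look like the following (the group name and host addresses are assumptions, not taken from the original):

```yaml
# Hypothetical inventory; adjust group name and hosts to your environment
all:
  children:
    prometheus:
      hosts:
        172.16.58.79:
        172.16.58.78:
```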

Prometheus configuration file explained

```yaml
# Global configuration section
global:
  # Scrape interval
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  # Interval for evaluating alerting and recording (pre-aggregation) rules
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # Scrape timeout
  scrape_timeout: 10s
  # Query log, including per-stage timing statistics
  query_log_file: /opt/logs/prometheus_query_log
  # Global label set
  # These labels are attached to all series collected by this instance
  external_labels:
    account: 'huawei-main'
    region: 'beijng-01'
# Alertmanager section
alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets:
      - "localhost:9093"
# Alerting / recording (pre-aggregation) rule files section
rule_files:
  - /etc/prometheus/rules/record.yml
  - /etc/prometheus/rules/alert.yml
# Scrape configuration section
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
# Remote read section
remote_read:
  # prometheus
  - url: http://prometheus/v1/read
    read_recent: true
  # m3db
  - url: "http://m3coordinator-read:7201/api/v1/prom/remote/read"
    read_recent: true
# Remote write section
remote_write:
  - url: "http://m3coordinator-write:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 10000
      max_samples_per_send: 60000
    write_relabel_configs:
    - source_labels: [__name__]
      separator: ;
      # Drop series whose metric-name prefix matches this regex
      regex: '(kubelet_|apiserver_|container_fs_).*'
      replacement: $1
      action: drop
```
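The `write_relabel_configs` drop rule above can be sanity-checked in Python. Prometheus implicitly anchors relabel regexes at both ends (`^...$`), so `re.fullmatch` models the match; this is an illustrative sketch, not Prometheus's actual code path:

```python
import re

# The regex from the write_relabel_configs section above; fullmatch models
# Prometheus's implicitly anchored relabel regex.
DROP_RE = re.compile(r'(kubelet_|apiserver_|container_fs_).*')


def kept(metric_name):
    """Return True if a series with this __name__ would still be sent to remote storage."""
    return DROP_RE.fullmatch(metric_name) is None


print(kept("kubelet_running_pods"))    # dropped -> False
print(kept("node_cpu_seconds_total"))  # kept -> True
```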

A Prometheus instance can therefore be used in the following roles:

| Configuration sections | Purpose |
| --- | --- |
| Scrape section | Collector; data stored locally |
| Scrape + remote write sections | Collector + shipper; data stored locally and in remote storage |
| Remote read section | Querier; queries remote storage data |
| Scrape + remote read sections | Collector + querier; queries local data and remote storage data |
| Scrape + Alertmanager + alerting rule file sections | Collector + alert trigger; queries local data, generates alerts, and sends them to Alertmanager |
| Remote read + Alertmanager + alerting rule file sections | Remote alert trigger; queries remote data, generates alerts, and sends them to Alertmanager |
| Remote read + remote write + recording rule file sections | Pre-aggregation; the resulting series are written to remote storage |
  1. Prepare the Prometheus configuration file, scraping two node_exporter instances

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
alerting:
  alertmanagers:
  - scheme: http
    timeout: 10s
    api_version: v1
    static_configs:
    - targets: []
scrape_configs:
- job_name: prometheus
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9090
- job_name: node_exporter
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - 172.16.58.79:9100
    - 172.16.58.78:9100
```

  2. Hot-reload the configuration

```shell
# Requires Prometheus to be started with --web.enable-lifecycle
curl -X POST http://localhost:9090/-/reload
```

  3. Check the targets' up status in the web UI
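Beyond the web UI, target health can also be checked programmatically through the query API. The sketch below assumes Prometheus listens on localhost:9090 (adjust as needed); `down_instances` only parses the response body, so it can be exercised without a live server:

```python
import json
import urllib.request


def down_instances(query_result):
    """Given the JSON body of /api/v1/query?query=up, return instances whose up value is 0."""
    return [
        r["metric"].get("instance")
        for r in query_result["data"]["result"]
        if r["value"][1] == "0"  # sample values arrive as strings
    ]


def check_up(prom="localhost:9090"):
    # Hypothetical address; point this at your own Prometheus instance.
    url = "http://{}/api/v1/query?query=up".format(prom)
    with urllib.request.urlopen(url) as res:
        return down_instances(json.load(res))
```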

The targets page explained

  • Job: how targets are grouped by job
  • Endpoint: the instance address
  • State: whether the scrape succeeded
  • Labels: the label set
  • Last Scrape: time elapsed since the last scrape
  • Scrape Duration: how long the last scrape took
  • Error: the scrape error, if any

Fetching target details via the API

```python
# coding=UTF-8
import requests


def print_targets(targets):
    index = 1
    all = len(targets)
    for i in targets:
        scrapeUrl = i.get("scrapeUrl")
        state = i.get("health")
        labels = i.get("labels")
        lastScrape = i.get("lastScrape")
        lastScrapeDuration = i.get("lastScrapeDuration")
        lastError = i.get("lastError")
        if state == "up":
            up_type = "up"
        else:
            up_type = "down"
        msg = "status:{} num:{}/{} endpoint:{} state:{} labels:{} lastScrape:{} lastScrapeDuration:{} lastError:{}".format(
            up_type,
            index,
            all,
            scrapeUrl,
            state,
            str(labels),
            lastScrape,
            lastScrapeDuration,
            lastError,
        )
        print(msg)
        index += 1


def get_targets(t):
    try:
        uri = 'http://{}/api/v1/targets'.format(t)
        res = requests.get(uri)
        data = res.json().get("data")
        activeTargets = data.get("activeTargets")
        droppedTargets = data.get("droppedTargets")
        print_targets(activeTargets)
        print_targets(droppedTargets)
    except Exception as e:
        print(e)


get_targets("prometheus.master01.wiswrt.com:9090")
```