Prometheus - Getting Started (Prometheus Chinese Documentation)

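Downloading and running Prometheus
Configuring Prometheus to monitor itself
Starting Prometheus
Exploring scraped data with expressions
Using the graphing interface
Starting some sample targets
Configuring Prometheus to monitor the sample targets
Configuring rules for newly collected time series data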

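This guide is a "Hello World"-style tutorial that shows how to install, configure, and use a simple Prometheus instance. You will download and run Prometheus locally, configure it to scrape itself and a sample application, and then work with the collected data through queries, rules, and graphs.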

Downloading and running Prometheus

Download the latest release of Prometheus to your server, then extract it and change into its directory:

tar xvfz prometheus-*.tar.gz
cd prometheus-*

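Before starting Prometheus, let us configure it.

Configuring Prometheus to monitor itself

Prometheus collects metrics from targets by scraping their HTTP metrics endpoints. Because Prometheus exposes data about itself in the same way, it can also scrape and monitor its own health.
A Prometheus server that collects only data about itself is not very useful, but it makes a good starting example.
Save the following Prometheus configuration as a file named prometheus.yml: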

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

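For the complete specification of configuration options, see the configuration documentation.

Starting Prometheus

To start Prometheus with the newly created configuration file, change to the directory containing the Prometheus binary and run: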

# Start Prometheus.
# By default, Prometheus stores its database in ./data (flag --storage.tsdb.path).
./prometheus --config.file=prometheus.yml

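Prometheus should now be running. You should also be able to browse to its status page at `localhost:9090`. Give it a few seconds to collect data about itself from its own HTTP metrics endpoint.
You can also verify that Prometheus is serving metrics about itself by visiting its metrics endpoint directly: `localhost:9090/metrics`.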
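
The same check can be done from a terminal. The sketch below counts the sample lines in text-format metrics output, skipping blank lines and `#` comment lines. `count_samples` is a hypothetical helper name, not part of Prometheus, and the commented usage assumes the server is listening on `localhost:9090` as configured in this guide.

```shell
# count_samples: read Prometheus text-format metrics on stdin and print
# the number of sample lines (blank lines and "# HELP"/"# TYPE" comments are skipped).
count_samples() {
  grep -v -e '^#' -e '^[[:space:]]*$' | wc -l | tr -d ' '
}

# Example against a running server (assumes localhost:9090):
#   curl -s http://localhost:9090/metrics | count_samples
```

A non-zero count suggests that the endpoint is serving metrics; the exact number depends on the Prometheus version.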

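Exploring scraped data with expressions

Let us explore the data that Prometheus has collected about itself. To use Prometheus's built-in expression browser, navigate to http://localhost:9090/graph and choose the "Console" view within the "Graph" tab.
As you can see at localhost:9090/metrics, one metric that Prometheus exports about itself is prometheus_target_interval_length_seconds (the actual amount of time between target scrapes). Enter the following into the expression console and click "Execute":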

prometheus_target_interval_length_seconds

This returns a number of different time series (along with the latest value recorded for each), all with the metric name `prometheus_target_interval_length_seconds` but with different labels. These labels designate different quantiles and target group intervals.

Example (the interval is the same for every series here, 1s):

prometheus_target_interval_length_seconds{instance="localhost:9090", interval="1s", job="prometheus", quantile="0.01"} 0.999806468
prometheus_target_interval_length_seconds{instance="localhost:9090", interval="1s", job="prometheus", quantile="0.05"} 0.999885933
prometheus_target_interval_length_seconds{instance="localhost:9090", interval="1s", job="prometheus", quantile="0.5"} 1.000016556
prometheus_target_interval_length_seconds{instance="localhost:9090", interval="1s", job="prometheus", quantile="0.9"} 1.000114677
prometheus_target_interval_length_seconds{instance="localhost:9090", interval="1s", job="prometheus", quantile="0.99"} 1.000221273

If we are interested only in the 0.99 quantile, we can query:

prometheus_target_interval_length_seconds{quantile="0.99"}
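
Outside the expression browser, a similar label filter can be imitated on plain text, which may help when inspecting saved output. The sketch below picks the value of one quantile out of sample lines; `quantile_value` is a hypothetical helper name, and the two embedded lines are taken from the example output above.

```shell
# quantile_value METRIC QUANTILE: read sample lines on stdin and print the value
# of the series named METRIC whose quantile label equals QUANTILE.
quantile_value() {
  grep "^$1{" | grep "quantile=\"$2\"" | awk '{ print $NF }'
}

# Two of the example series, embedded so the sketch runs without a server.
samples='prometheus_target_interval_length_seconds{instance="localhost:9090", interval="1s", job="prometheus", quantile="0.5"} 1.000016556
prometheus_target_interval_length_seconds{instance="localhost:9090", interval="1s", job="prometheus", quantile="0.99"} 1.000221273'

# Prints 1.000221273
printf '%s\n' "$samples" | quantile_value prometheus_target_interval_length_seconds 0.99
```

This is only a text filter: it assumes the value is the last whitespace-separated field on the line, whereas the PromQL selector matches on real labels.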

To count the number of returned time series, query:

count(prometheus_target_interval_length_seconds)

For more about the expression language, see the [expression language documentation](https://www.yuque.com/qinjunhang/cn-prometheus/fg1c9p).
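
Expressions can also be evaluated without the browser, through the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch: `prom_query` and the `PROM_URL` variable are names chosen here for illustration, and the default address assumes the server from this guide.

```shell
# PROM_URL is an assumption: the address the server in this guide listens on.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query EXPR: run an instant query through the HTTP API and print the JSON reply.
# -G sends the data as URL query parameters; --data-urlencode escapes
# characters such as { } " that appear in PromQL expressions.
prom_query() {
  curl -s -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Example (needs a running server):
#   prom_query 'count(prometheus_target_interval_length_seconds)'
```

The reply is a JSON document; its `data.result` field holds the same series that the "Console" view displays.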

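Using the graphing interface

To graph expressions, navigate to http://localhost:9090/graph and use the "Graph" tab.
For example, enter the following expression to graph the per-second rate at which this Prometheus instance creates chunks: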

rate(prometheus_tsdb_head_chunks_created_total[1m])

Experiment with the graph range parameters and other settings.
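
To make the `rate(...[1m])` expression more concrete, the underlying arithmetic can be sketched by hand: take two samples of a counter and divide the increase by the elapsed seconds. This is a simplification, since the real `rate()` function also accounts for counter resets and extrapolates to the window boundaries. `simple_rate` is a hypothetical helper and the sample numbers are made up.

```shell
# simple_rate T1 V1 T2 V2: per-second increase of a counter between two samples.
# T1/T2 are timestamps in seconds, V1/V2 the counter values at those times.
simple_rate() {
  awk -v t1="$1" -v v1="$2" -v t2="$3" -v v2="$4" \
    'BEGIN { printf "%.2f\n", (v2 - v1) / (t2 - t1) }'
}

# Hypothetical samples: the counter grew from 1200 to 1500 over 60 seconds.
# Prints 5.00
simple_rate 0 1200 60 1500
```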

Starting some sample targets

Let us add some more targets for Prometheus to scrape.
Node Exporter serves as the example target. For details about it, see the instructions below.

tar -xzvf node_exporter-*.*.tar.gz
cd node_exporter-*.*
# Start 3 example targets in separate terminals:
./node_exporter --web.listen-address 127.0.0.1:8080
./node_exporter --web.listen-address 127.0.0.1:8081
./node_exporter --web.listen-address 127.0.0.1:8082

You should now have example targets listening on `http://localhost:8080/metrics`, `http://localhost:8081/metrics`, and `http://localhost:8082/metrics`.
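
A quick way to confirm that all three targets answer is to request each metrics URL in a loop. The sketch below assumes the ports used above; `target_urls` and `check_targets` are hypothetical helper names.

```shell
# target_urls: print the metrics URL of each example target started above.
target_urls() {
  for port in 8080 8081 8082; do
    echo "http://localhost:$port/metrics"
  done
}

# check_targets: request each URL and report whether it answered.
# -f makes curl return a non-zero status on HTTP errors; -o /dev/null drops the body.
check_targets() {
  target_urls | while read -r url; do
    if curl -sf -o /dev/null "$url"; then
      echo "$url up"
    else
      echo "$url DOWN"
    fi
  done
}

# Example (needs the three node_exporter processes running):
#   check_targets
```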

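Configuring Prometheus to monitor the sample targets

Now we will configure Prometheus to scrape these new targets. Let us group all three endpoints into one job called `node`. We will imagine that the first two endpoints are production targets, while the third one represents a canary instance. To model this in Prometheus, we can add several groups of endpoints to a single job and attach extra labels to each group of targets. In this example, we add the `group="production"` label to the first group of targets and `group="canary"` to the second.
To achieve this, add the following job definition to the `scrape_configs` section of your prometheus.yml and restart your Prometheus instance: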

scrape_configs:
  - job_name:       'node'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'
      - targets: ['localhost:8082']
        labels:
          group: 'canary'

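Go to the expression browser and verify that Prometheus now has information about the time series that these example endpoints expose, such as `node_cpu_seconds_total`.

Configuring rules for newly collected time series data

Though not a problem in our example, queries that aggregate over thousands of time series can become slow when computed ad hoc. To make this more efficient, Prometheus can prerecord expressions into new persisted time series via configured recording rules. Suppose we are interested in the per-second rate of CPU time, averaged over all CPUs of each instance and measured over a 5-minute window, while keeping the `job`, `instance`, and `mode` labels. We could write this as: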

avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))
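
To see what the `avg by (...)` step does with the per-CPU series, the grouping can be imitated offline with `awk`: rows that share a grouping key are averaged, and the per-CPU dimension disappears. The input rows below are invented rates, not real measurements, and `avg_by_key` is a hypothetical helper. Prometheus groups on labels, while this sketch uses whitespace-separated columns with the `mode` value as the key.

```shell
# avg_by_key: read "KEY VALUE" rows on stdin and print the mean VALUE per KEY.
avg_by_key() {
  awk '{ sum[$1] += $2; n[$1]++ }
       END { for (k in sum) printf "%s %.2f\n", k, sum[k] / n[k] }' | sort
}

# Hypothetical per-CPU rates for one instance; two CPUs per mode.
printf '%s\n' \
  'idle 0.90' \
  'idle 0.70' \
  'user 0.05' \
  'user 0.15' | avg_by_key
```

For this input the sketch prints `idle 0.80` and `user 0.10`: one averaged row per mode, just as the PromQL expression yields one series per `job`, `instance`, and `mode` combination.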

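Try graphing the `avg by (job, instance, mode)` expression above.
To record the time series resulting from this expression into a new metric called `job_instance_mode:node_cpu_seconds:avg_rate5m`, create a file with the following recording rule and save it as `prometheus.rules.yml`: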

groups:
- name: cpu-node
  rules:
  - record: job_instance_mode:node_cpu_seconds:avg_rate5m
    expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))

To make Prometheus pick up this new rule, add a `rule_files` statement to your prometheus.yml. The configuration should now look like this:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.
  # Attach these extra labels to all timeseries collected by this Prometheus instance.
  external_labels:
    monitor: 'codelab-monitor'
rule_files:
  - 'prometheus.rules.yml'
scrape_configs:
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name:       'node'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'
      - targets: ['localhost:8082']
        labels:
          group: 'canary'

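Restart Prometheus with the new configuration and verify that a new time series with the metric name `job_instance_mode:node_cpu_seconds:avg_rate5m` is now available, either by querying it in the expression browser or by graphing it.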
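
If the restarted server rejects the configuration or the new series does not appear, the `promtool` utility that ships in the Prometheus tarball can check both files for errors. The sketch below wraps its `check config` and `check rules` subcommands; `check_files` is a hypothetical helper name, and it assumes `promtool` is on `PATH` (use `./promtool` from the extracted directory otherwise).

```shell
# check_files: validate the main configuration, then the rule file, with promtool.
# Assumes promtool is on PATH and both files are in the current directory.
check_files() {
  promtool check config prometheus.yml &&
  promtool check rules prometheus.rules.yml
}

# Example (run from the directory that holds both files):
#   check_files
```

The `&&` stops at the first failing check, so the exit status of `check_files` can gate a restart script.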