Sources of Monitoring Data

Your choice of monitoring system(s) will be informed by the specific sources of monitoring data you’ll use. This section discusses two common sources of monitoring data: logs and metrics. There are other valuable monitoring sources that we won’t cover here, like distributed tracing and runtime introspection.

Metrics are numerical measurements representing attributes and events, typically harvested via many data points at regular time intervals. Logs are an append-only record of events. This chapter’s discussion focuses on structured logs that enable rich query and aggregation tools as opposed to plain-text logs.
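
As a concrete illustration of the distinction, a sketch follows; the field names are hypothetical and not tied to any particular logging schema or metrics library.

  import json, time

  # A structured log entry: one append-only record per event, carrying
  # whatever detail the application chose to emit (field names illustrative).
  log_entry = {
      "timestamp": time.time(),
      "severity": "ERROR",
      "http_status": 500,
      "url": "/checkout",
      "message": "backend timeout",
  }
  print(json.dumps(log_entry))

  # A metric, by contrast, is a numeric sample taken at regular intervals,
  # for example the current value of a request counter at collection time.
  metric_sample = ("requests_total", time.time(), 1042)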

Google’s logs-based systems process large volumes of highly granular data. There’s some inherent delay between when an event occurs and when it is visible in logs. For analysis that’s not time-sensitive, these logs can be processed with a batch system, interrogated with ad hoc queries, and visualized with dashboards. An example of this workflow would be using Cloud Dataflow to process logs, BigQuery for ad hoc queries, and Data Studio for the dashboards.
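
As a rough sketch of the ad hoc query step in such a pipeline, the snippet below uses the BigQuery Python client against a hypothetical table of exported request logs (my_project.logs.requests with status and timestamp columns); the table name and schema are assumptions, so adapt them to whatever your logs processing job actually writes.

  from google.cloud import bigquery

  client = bigquery.Client()
  query = """
      SELECT status, COUNT(*) AS requests
      FROM `my_project.logs.requests`
      WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
      GROUP BY status
      ORDER BY requests DESC
  """
  # Batch/ad hoc analysis like this tolerates the inherent logs delay;
  # the same query can also back a dashboard.
  for row in client.query(query).result():
      print(row["status"], row["requests"])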

By contrast, our metrics-based monitoring system, which collects a large number of metrics from every service at Google, provides much less granular information, but in near real time. These characteristics are fairly typical of other logs and metrics-based monitoring systems, although there are exceptions, such as real-time logs systems or high-cardinality metrics.

Our alerts and dashboards typically use metrics. The real-time nature of our metrics-based monitoring system means that engineers can be notified of problems very rapidly. We tend to use logs to find the root cause of an issue, as the information we need is often not available as a metric.

When reporting isn’t time-sensitive, we often generate detailed reports using logs processing systems because logs will nearly always produce more accurate data than metrics.

If you’re alerting based on metrics, it might be tempting to add more alerting based on logs—for example, if you need to be notified when even a single exceptional event happens. We still recommend metrics-based alerting in such cases: you can increment a counter metric when a particular event happens, and configure an alert based on that metric’s value. This strategy keeps all alert configuration in one place, making it easier to manage (see “Managing Your Monitoring System” on page 67).
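
For instance, using the Prometheus Python client as a stand-in for whatever metrics library you run (the metric name, placeholder validation check, and port below are illustrative), the pattern looks like this:

  from prometheus_client import Counter, start_http_server

  corrupt_upload_total = Counter(
      "corrupt_upload_total",
      "Uploads rejected because the payload failed validation.",
  )

  def is_corrupt(blob: bytes) -> bool:
      return not blob  # placeholder validation logic

  def handle_upload(blob: bytes) -> None:
      if is_corrupt(blob):
          # Incrementing the counter is what makes the single event alertable:
          # an alerting rule such as increase(corrupt_upload_total[5m]) > 0
          # (PromQL, shown only as an illustration) can then notify on it.
          corrupt_upload_total.inc()

  start_http_server(8000)  # expose /metrics for the monitoring system to scrape

The alert itself then lives in the metrics system's configuration alongside every other alert, which is exactly the consolidation the recommendation is after.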

Examples

The following real-world examples illustrate how to reason through the process of choosing between monitoring systems.

Move information from logs to metrics

Problem. The HTTP status code is an important signal to App Engine customers debugging their errors. This information was available in logs, but not in metrics. The metrics dashboard could provide only a global rate of all errors, and did not include any information about the exact error code or the cause of the error. As a result, the workflow to debug an issue involved:

  1. Looking at the global error graph to find a time when an error occurred.
  2. Reading log files to look for lines containing an error.
  3. Attempting to correlate errors in the log file to the graph.

The logging tools did not give a sense of scale, making it hard to know if an error seen in one log line was occurring frequently. The logs also contained many other irrelevant lines, making it hard to track down the root cause.

Proposed solution. The App Engine dev team chose to export the HTTP status code as a label on the metric (e.g., requests_total{status=404} versus requests_total{status=500}). Because the number of different HTTP status codes is relatively limited, this did not increase the volume of metric data to an impractical size, but did make the most pertinent data available for graphing and alerting.
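
A sketch of that change, again using the Prometheus Python client purely as an illustration of the labeled-counter idea:

  from prometheus_client import Counter

  # 'status' takes only a small, bounded set of values, so it is safe to use
  # as a metric label without exploding the number of time series.
  requests_total = Counter(
      "requests_total",
      "HTTP requests served, partitioned by response status code.",
      ["status"],
  )

  def record_response(status_code: int) -> None:
      requests_total.labels(status=str(status_code)).inc()

  record_response(404)
  record_response(500)

With the label in place, separate alerting thresholds for client and server errors reduce to filtering on it (for example, matching 4xx versus 5xx values) in the alerting rules.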

Outcome. This new label meant the team could upgrade the graphs to show separate lines for different error categories and types. Customers could now quickly form conjectures about possible problems based on the exposed error codes. We could now also set different alerting thresholds for client and server errors, making the alerts trigger more accurately.

Improve both logs and metrics

Problem. One Ads SRE team maintained ~50 services, which were written in a number of different languages and frameworks. The team used logs as the canonical source of truth for SLO compliance. To calculate the error budget, each service used a logs processing script with many service-specific special cases. Here’s an example script to process a log entry for a single service:

  If the HTTP status code was in the range (500, 599)
  AND the 'SERVER ERROR' field of the log is populated
  AND DEBUG cookie was not set as part of the request
  AND the url did not contain '/reports'
  AND the 'exception' field did not contain 'com.google.ads.PasswordException'
  THEN increment the error counter by 1

These scripts were hard to maintain and also used data that wasn’t available to the metrics-based monitoring system. Because metrics drove alerts, sometimes the alerts would not correspond to user-facing errors. Every alert required an explicit triage step to determine if it was user-facing, which slowed down response time.

Proposed solution. The team created a library that hooked into the logic of the framework languages of each application. The library decided if the error was impacting users at request time. The instrumentation wrote this decision in logs and exported it as a metric at the same time, improving consistency. If the metric showed that the service had returned an error, the logs contained the exact error, along with request-related data to help reproduce and debug the issue. Correspondingly, any SLO-impacting error that manifested in the logs also changed the SLI metrics, which the team could then alert on.
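
A rough Python sketch of the idea follows; the real library hooked into each framework's request path, and the field names and exclusions below simply echo the earlier pseudocode for illustration.

  import json
  import logging
  from prometheus_client import Counter

  logging.basicConfig(level=logging.INFO)

  slo_errors_total = Counter(
      "slo_errors_total",
      "Requests counted as user-facing errors for SLO purposes.",
      ["service"],
  )

  def record_result(service: str, status: int, url: str,
                    exception: str = "", debug_cookie: bool = False) -> None:
      # Decide once, at request time, whether this error counts against the SLO.
      user_facing = (
          500 <= status <= 599
          and not debug_cookie
          and "/reports" not in url
          and exception != "com.google.ads.PasswordException"
      )
      # Write the decision into the structured log entry...
      logging.info(json.dumps({
          "service": service,
          "status": status,
          "url": url,
          "slo_error": user_facing,
      }))
      # ...and export the very same decision as a metric the team can alert on.
      if user_facing:
          slo_errors_total.labels(service=service).inc()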

Outcome. By introducing a uniform control surface across multiple services, the team reused tooling and alerting logic instead of implementing multiple custom solutions. All services benefited from removing the complicated, service-specific logs processing code, which resulted in increased scalability. Once alerts were directly tied to SLOs, they were more clearly actionable, so the false-positive rate decreased significantly.

Keep logs as the data source

Problem. While investigating production issues, one SRE team would often look at the affected entity IDs to determine user impact and root cause. As with the earlier App Engine example, this investigation required data that was available only in logs. The team had to perform one-off log queries for this while they were responding to incidents. This step added time to incident recovery: a few minutes to correctly put together the query, plus the time to query the logs.

Proposed solution. The team initially debated whether a metric should replace their log tools. Unlike in the App Engine example, the entity ID could take on millions of different values, so it would not be practical as a metric label.

Ultimately, the team decided to write a script to perform the one-off log queries they needed, and documented which script to run in the alert emails. They could then copy the command directly into a terminal if necessary.
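
A minimal sketch of such a script is shown below; the logs_query command-line tool and its flags are placeholders for whatever log querying interface your environment provides.

  #!/usr/bin/env python3
  """Run the documented one-off log query for an affected entity ID."""
  import argparse
  import subprocess

  def main() -> None:
      parser = argparse.ArgumentParser()
      parser.add_argument("entity_id", help="entity ID copied from the alert email")
      parser.add_argument("--lookback", default="1h", help="how far back to search")
      args = parser.parse_args()

      # Assembling the filter here removes the error-prone manual step of
      # composing the query by hand in the middle of an incident.
      query = f'entity_id="{args.entity_id}" AND severity>=ERROR'
      subprocess.run(["logs_query", "--freshness", args.lookback, query], check=True)

  if __name__ == "__main__":
      main()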

Outcome. The team no longer had the cognitive load of managing the correct one-off log query. Accordingly, they could get the results they needed faster (although not as quickly as a metrics-based approach). They also had a backup plan: they could run the script automatically as soon as an alert triggered, and use a small server to query the logs at regular intervals to constantly retrieve semifresh data.
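
The backup plan might look like the loop below, reusing the same hypothetical logs_query tool; the filter, refresh interval, and output handling are all assumptions.

  import subprocess
  import time

  REFRESH_SECONDS = 300
  QUERY = 'severity>=ERROR'  # placeholder filter; a real deployment would scope it further

  while True:
      # Rerun the documented query on a schedule so that "semifresh" results
      # are already on hand by the time an alert fires.
      subprocess.run(["logs_query", "--freshness", "10m", QUERY], check=False)
      time.sleep(REFRESH_SECONDS)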
