Managing Your Monitoring System

监控系统管理


Your monitoring system is as important as any other service you run. As such, it should be treated with the appropriate level of care and attention.

您的监视系统与您运行的任何其他服务一样重要。因此, 应以适当的维护和注意平等对待。

Treat Your Configuration as Code

把配置当做代码一样对待

Treating system configuration as code and storing it in the revision control system are common practices that provide some obvious benefits: change history, links from specific changes to your task tracking system, easier rollbacks and linting checks, and enforced code review procedures.

将系统配置作为代码处理并将其存储在修订版本控制系统中是常见的做法, 它们提供了一些明显的好处: 修改记录、链接特定更改和任务跟踪系统、更容易进行回滚、滚动检查以及强制代码审查过程。

We strongly recommend also treating monitoring configuration as code (for more on configuration, see Chapter 14). A monitoring system that supports intent-based configuration is preferable to systems that only provide web UIs or CRUD-style APIs. This configuration approach is standard for many open source binaries that only read a configuration file. Some third-party solutions like grafanalib enable this approach for components that are traditionally configured with a UI.

我们还强烈建议将监控配置作为代码处理 (有关配置的更多内容, 请参见14章)。监视系统支持自定义配置比仅提供 web ui 或 CRUD 样式 api 的系统更可取。此配置方法对于许多只读取配置文件的开源二进制代码来说是标准的。某些第三方解决方案 (如 grafanalib) 为传统使用 UI 配置的组件启用此方法。

Encourage Consistency

鼓励一致性

Large companies with multiple engineering teams who use monitoring need to strike a fine balance: a centralized approach provides consistency, but on the other hand, individual teams may want full control over the design of their configuration.

拥有多个工程师团队的大型企业使用监控需要进行一些细微的平衡: 集中化的方法提供了一致性, 但另一方面, 各个团队又可能希望完全控制其配置的设计。

The right solution depends on your organization. Google’s approach has evolved over time toward convergence on a single framework run centrally as a service. This solution works well for us for a few reasons. A single framework enables engineers to ramp up faster when they switch teams, and makes collaboration during debugging easier. We also have a centralized dashboarding service, where each team’s dashboards are discoverable and accessible. If you easily understand another team’s dashboard, you can debug both your issues and theirs more quickly.

正确的解决方案取决于您的企业。Google的方案随着时间的推移而逐渐演变为一个单一的框架, 以服务的方式集中运行。这个解决方案对我们很有效, 原因有几个。单个框架使得工程师能够在不同团队协作时更好地提高效率, 并且在调试过程中更容易进行协作。我们还有一个集中式的仪表盘服务, 其中每个团队的仪表盘是可发现和易于访问的。如果你很容易理解另一个团队的仪表盘, 你将可以更快的和他们一起调试你的问题。

If possible, make basic monitoring coverage effortless. If all your services export a consistent set of basic metrics, you can automatically collect those metrics across your entire organization and provide a consistent set of dashboards. This approach means that any new component you launch automatically has basic monitoring. Many teams across your company—even nonengineering teams—can use this monitoring data.

如果可能的话, 让基础的监控能轻易普及。如果所有服务都导出了一组一致的基本度量指标, 则可以在整个组织中自动收集这些度量, 并提供一组一致的仪表板。此方法意味着您启动的任何新组件都将自动进行基本监视。在您的公司许多研发团队甚至非研发团队都可以使用此监视数据。

Prefer Loose Coupling

优先松耦合

Business requirements change, and your production system will look different a year from now. Similarly, your monitoring system needs to evolve over time as the services it monitors evolve through different patterns of failure.

业务需求发生变化, 您的生产系统将在一年后看起来和现在不同。同样,您的监控系统也需要随着时间的推移而发展,因为它监控的服务会通过不断的故障模式进行发展。

We recommend keeping the components of your monitoring system loosely coupled. You should have stable interfaces for configuring each component and passing monitoring data. Separate components should be in charge of collecting, storing, alerting, and visualizing your monitoring. Stable interfaces make it easier to swap out any given component for a better alternative.

我们建议您保持监控系统组件的松耦合。您应该有稳定的接口来配置每个组件并传递监视数据。独立的组件应该负责收集、存储、报警和可视化您的监视。稳定的接口使得更换任何给定组件更容易,可以获得更好的替代方案。

Splitting functionality into individual components is becoming popular in the open source world. A decade ago, monitoring systems like Zabbix combined all functions into a single component. Modern design usually involves separating collection and rule evaluation (with a solution like Prometheus server), long-term time series storage (InfluxDB), alert aggregation (Alertmanager), and dashboarding (Grafana).

在开源世界里将功能拆分为各个组件变得越来越流行。十年前, 像 Zabbix 这样的监控系统将所有功能合并为一个组件。而现代设计通常涉及分散收集和规则评估 (如Prometheus的解决方案)、长期时间序列存储 (InfluxDB)、警报聚合 (Alertmanager) 和 dashboarding (Grafana)。

As of this writing, there are at least two popular open standards for instrumenting your software and exposing metrics:

在这篇文章中, 至少有两个流行的开放标准用于检测软件和公开指标:

  • statsd

    The metric aggregation daemon initially written by Etsy and now ported to a majority of programming languages.

    度量指标聚合守护进程最初由Etsy编写,现在已经移植到大多数编程语言。

  • Prometheus

    An open source monitoring solution with a flexible data model, support for metric labels, and robust histogram functionality. Other systems are now adopting the Prometheus format, and it is being standardized as OpenMetrics.

    一个开放源码的监控解决方案, 具有灵活的数据模型, 支持 度量指标和强健的直方图功能。其他系统现在采用了普罗米修斯的格式, 它被标准化为 OpenMetrics。

A separate dashboarding system that can use multiple data sources provides a central and unified overview of your service. Google recently saw this benefit in practice: our legacy monitoring system (Borgmon3) combined dashboards in the same configuration as alerting rules. While migrating to a new system (Monarch), we decided to move dashboarding into a separate service (Viceroy). Because Viceroy was not a component of Borgmon or Monarch, Monarch had fewer functional requirements. Since users could use Viceroy to display graphs based on data from both monitoring systems, they could gradually migrate from Borgmon to Monarch.

可以使用多个数据源的单独仪表盘系统可提供对服务的集中统一描述。Google 最近在实践中发现了这一好处: 我们的遗留监视系统 (Borgmon3) 将仪表盘与警报规则的配置组合在一起。在迁移到一个新的系统 (Monarch), 我们决定把 dashboarding 迁移到一个单独的服务 (Viceroy)。由于Viceroy不是 Borgmon 或 Monarch的组成部分, Monarch的功能要求更少。由于用户可以使用Viceroy根据两个监控系统的数据显示图表, 他们可以逐渐从 Borgmon 迁移到Monarch。