ADR 011: Monitoring

Changelog

08-06-2018: Initial draft 11-06-2018: Reorg after @xla comments 13-06-2018: Clarification about usage of labels

Context

In order to bring more visibility into Tendermint, we would like it to report metrics and, maybe later, traces of transactions and RPC queries. See https://github.com/tendermint/tendermint/issues/986.

A few solutions were considered:

  1. Prometheus a) Prometheus API b) go-kit metrics package as an interface plus Prometheus c) telegraf d) new service, which will listen to events emitted by pubsub and report metrics
  2. OpenCensus

1. Prometheus

Prometheus seems to be the most popular product out there for monitoring. It has a Go client library, powerful queries, alerts.

a) Prometheus API

We can commit to using Prometheus in Tendermint, but I think Tendermint users should be free to choose whatever monitoring tool they feel will better suit their needs (if they don’t have existing one already). So we should try to abstract interface enough so people can switch between Prometheus and other similar tools.

b) go-kit metrics package as an interface

metrics package provides a set of uniform interfaces for service instrumentation and offers adapters to popular metrics packages:

https://godoc.org/github.com/go-kit/kit/metrics#pkg-subdirectories

Comparing to Prometheus API, we’re losing customisability and control, but gaining freedom in choosing any instrument from the above list given we will extract metrics creation into a separate function (see “providers” in node/node.go).

c) telegraf

Unlike already discussed options, telegraf does not require modifying Tendermint source code. You create something called an input plugin, which polls Tendermint RPC every second and calculates the metrics itself.

While it may sound good, but some metrics we want to report are not exposed via RPC or pubsub, therefore can’t be accessed externally.

d) service, listening to pubsub

Same issue as the above.

2. opencensus

opencensus provides both metrics and tracing, which may be important in the future. It’s API looks different from go-kit and Prometheus, but looks like it covers everything we need.

Unfortunately, OpenCensus go client does not define any interfaces, so if we want to abstract away metrics we will need to write interfaces ourselves.

List of metrics

Name Type Description
A consensus_height Gauge
A consensus_validators Gauge Number of validators who signed
A consensus_validators_power Gauge Total voting power of all validators
A consensus_missing_validators Gauge Number of validators who did not sign
A consensus_missing_validators_power Gauge Total voting power of the missing validators
A consensus_byzantine_validators Gauge Number of validators who tried to double sign
A consensus_byzantine_validators_power Gauge Total voting power of the byzantine validators
A consensus_block_interval Timing Time between this and last block (Block.Header.Time)
consensus_block_time Timing Time to create a block (from creating a proposal to commit)
consensus_time_between_blocks Timing Time between committing last block and (receiving proposal creating proposal)
A consensus_rounds Gauge Number of rounds
consensus_prevotes Gauge
consensus_precommits Gauge
consensus_prevotes_total_power Gauge
consensus_precommits_total_power Gauge
A consensus_num_txs Gauge
A mempool_size Gauge
A consensus_total_txs Gauge
A consensus_block_size Gauge In bytes
A p2p_peers Gauge Number of peers node’s connected to

A - will be implemented in the fist place.

Proposed solution

Status

Proposed.

Consequences

Positive

Better visibility, support of variety of monitoring backends

Negative

One more library to audit, messing metrics reporting code with business domain.

Neutral

-