Monitoring and alerting are separate systems. Alerting system: PagerDuty.
Legacy tool: Nagios.

Monitoring

Business monitoring, system monitoring, network monitoring, log monitoring, application monitoring.

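Business monitoring: QPS, DAU (daily active users), conversion rate, API endpoints.
System monitoring: CPU, memory, disk, TCP connections, traffic. (Prometheus)
Network monitoring: switches, routers, firewalls, VPN. Packet loss rate, latency.
Log monitoring: application logs, syslog, network devices, user behavior. (ELK)
Application monitoring: developers embed an SDK that calls an interface to report data.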
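
To make the "embedded SDK" idea concrete, here is a minimal sketch of an in-process counter that renders itself in the Prometheus text exposition format. The `Counter` class and the metric name `orders_total` are hypothetical and written only for illustration; this is not the API of any real SDK, and label values are assumed to be plain strings that need no escaping.

```python
import threading


class Counter:
    """Hypothetical in-process counter (illustration only, not a real SDK class)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = {}  # sorted label tuple -> running total
        self._lock = threading.Lock()

    def inc(self, amount=1.0, **labels):
        # Sort labels so the same label set always maps to the same key.
        key = tuple(sorted(labels.items()))
        with self._lock:
            self._values[key] = self._values.get(key, 0.0) + amount

    def render(self):
        # Text exposition format: HELP line, TYPE line, then one sample per label set.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            for key, value in sorted(self._values.items()):
                label_str = ",".join(f'{k}="{v}"' for k, v in key)
                suffix = "{" + label_str + "}" if label_str else ""
                lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    orders = Counter("orders_total", "Orders placed.")
    orders.inc(path="/buy")
    print(orders.render(), end="")
```

In practice the rendered text would be served on an HTTP endpoint for scraping or pushed to a gateway; an official client library is likely the better choice over hand-rolled code like this.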

Pitfalls of daemon-based collection: memory leaks, zombie processes, performance bottlenecks.
Bridge-style collection

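--storage.tsdb.path: directory where data is stored, default data/. Point it at another directory to use external storage.
--storage.tsdb.retention.time: how long data is kept before it is cleaned up, default 15 days
--storage.tsdb.retention.size: experimental; maximum size of storage blocks to retain, excluding WAL files, e.g. 512MB
--storage.tsdb.retention: deprecated; use --storage.tsdb.retention.time instead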
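
When choosing values for the retention flags, a rough capacity estimate helps. The sketch below applies the rule of thumb from the Prometheus storage documentation (disk needed = retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The function name and the 100,000 samples/s ingestion rate are assumptions for illustration only.

```python
def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough disk estimate for a given retention window.

    bytes_per_sample=2.0 is the pessimistic end of the 1-2 bytes/sample
    rule of thumb; real usage depends on the data.
    """
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Default 15-day retention, assumed ingestion rate of 100,000 samples/s.
    needed = estimate_disk_bytes(15, 100_000)
    print(f"{needed / 1e9:.1f} GB")
```

With the default 15-day retention and the assumed rate, the estimate comes to about 259 GB, which gives a starting point for --storage.tsdb.retention.size.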

Memory-related metrics

prometheus_tsdb_head_chunks
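
A minimal sketch of reading this metric through the Prometheus HTTP API instant-query endpoint (`/api/v1/query`), using only the standard library. The base URL `http://localhost:9090` and the helper names are assumptions; the parser expects the usual instant-vector response shape and keys results by the `instance` label.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Build an instant-query URL for the given PromQL expression."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_instant_vector(payload):
    """Map instance label -> sample value from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    result = {}
    for sample in payload["data"]["result"]:
        instance = sample["metric"].get("instance", "")
        _timestamp, value = sample["value"]  # value arrives as a string
        result[instance] = float(value)
    return result


def fetch_head_chunks(base_url="http://localhost:9090", timeout=10):
    """Query prometheus_tsdb_head_chunks (base_url is an assumed address)."""
    url = build_query_url(base_url, "prometheus_tsdb_head_chunks")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

Calling `fetch_head_chunks()` against a running server returns a dict such as instance name to current head chunk count, which can be tracked over time alongside process memory.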

https://www.robustperception.io/why-does-prometheus-use-so-much-ram
https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion

Memory improvements in 2.15.0
https://www.robustperception.io/new-features-in-prometheus-2-15-0
Since Prometheus v2.19.0, we are not storing all the chunks in the memory.

Articles about the TSDB

https://fabxc.org/tsdb/

Links to the rest of the series are at the bottom of the article
https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/

Pitfalls

context deadline exceeded

This error shows up for the Pushgateway target when the scrape times out.
https://github.com/prometheus/prometheus/issues/1438
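
One way to check whether a target is simply too slow is to time a manual fetch of its metrics endpoint and compare it with the scrape timeout. The sketch below assumes a 10-second budget (the Prometheus default `scrape_timeout`) and a Pushgateway at `http://localhost:9091/metrics`; the `time_scrape` helper and its `fetch` parameter are hypothetical, added so the timing logic can be exercised without a network.

```python
import time
import urllib.request


def time_scrape(url, scrape_timeout_s=10.0, fetch=None):
    """Fetch url once and return (elapsed_seconds, within_budget)."""
    if fetch is None:
        def fetch(u):
            # Allow extra time so a slow target is measured rather than cut off.
            with urllib.request.urlopen(u, timeout=scrape_timeout_s * 3) as resp:
                resp.read()
    start = time.monotonic()
    fetch(url)
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= scrape_timeout_s


if __name__ == "__main__":
    # Assumed Pushgateway address; adjust to the real target.
    elapsed, ok = time_scrape("http://localhost:9091/metrics")
    print(f"scrape took {elapsed:.2f}s, within timeout: {ok}")
```

If the fetch regularly takes longer than the budget, the usual options are raising `scrape_timeout` for that job (it must not exceed `scrape_interval`) or reducing the amount of data the target exposes.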

References

https://prometheus.io/

https://www.cnblogs.com/vovlie/p/7709312.html
https://www.robustperception.io/