Default Alerts for Cluster Monitoring


When you create a cluster, some alert rules are predefined. These alerts notify you about signs that the cluster could be unhealthy. To receive them, configure a notifier for them.

Several of the alerts use a Prometheus expression as the metric that triggers the alert. For more information on how expressions work, refer to the Rancher documentation about Prometheus expressions or the Prometheus documentation about querying metrics.

Alerts for etcd

Etcd is the key-value store that contains the state of the Kubernetes cluster. Rancher provides default alerts if the built-in monitoring detects a potential problem with etcd. You don't have to enable monitoring to receive these alerts.

A leader is the node that handles all client requests that need cluster consensus. For more information, refer to this explanation of how etcd works.

The leader of the etcd cluster can change in response to certain events. It is normal for the leader to change, but too many changes can indicate a problem with the network or a high CPU load. When latencies are high, the default etcd configuration may cause frequent heartbeat timeouts, each of which triggers a new leader election.

ALERT | EXPLANATION
A high number of leader changes within the etcd cluster are happening | A warning alert is triggered when the leader changes more than three times in one hour.
Database usage close to the quota 500M | A warning alert is triggered when the size of etcd exceeds 500M.
Etcd is unavailable | A critical alert is triggered when etcd becomes unavailable.
Etcd member has no leader | A critical alert is triggered when the etcd cluster does not have a leader for at least three minutes.
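The leader-change rule in the table above can be illustrated with a small sketch. It assumes that leader changes are exposed as a cumulative counter sampled over time. The function names, the sample layout, and the handling of counter resets are illustrative assumptions, not the expression Rancher actually evaluates.

```python
from typing import List, Tuple

# Assumed sample layout: (unix timestamp in seconds, cumulative leader-change count).
Sample = Tuple[float, float]


def leader_changes_in_window(
    samples: List[Sample], now: float, window_seconds: float = 3600.0
) -> float:
    """Return how much the counter grew inside the window that ends at `now`."""
    values = [v for t, v in samples if now - window_seconds <= t <= now]
    increase = 0.0
    for prev, cur in zip(values, values[1:]):
        # A drop in the counter is treated as a restart that began again from zero.
        increase += cur - prev if cur >= prev else cur
    return increase


def should_warn(samples: List[Sample], now: float, threshold: int = 3) -> bool:
    """Warn when the leader changed more than `threshold` times in the last hour."""
    return leader_changes_in_window(samples, now) > threshold


if __name__ == "__main__":
    # Hypothetical samples taken every ten minutes.
    history = [(0.0, 10.0), (600.0, 11.0), (1200.0, 12.0), (1800.0, 14.0), (2400.0, 15.0)]
    print(should_warn(history, now=2400.0))
```

The comparison is strict, so exactly three changes in an hour does not raise the warning, which matches the "more than three times" wording.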

Alerts for Kubernetes Components

Rancher provides alerts when core Kubernetes system components become unhealthy.

Controllers update Kubernetes resources based on changes in etcd. The controller manager monitors the cluster's desired state through the Kubernetes API server and makes the necessary changes to the current state to reach the desired state.

The scheduler service is a core component of Kubernetes. It is responsible for scheduling cluster workloads to nodes, based on various configurations, metrics, resource requirements, and workload-specific requirements.

ALERT | EXPLANATION
Controller Manager is unavailable | A critical alert is triggered when the cluster's controller-manager becomes unavailable.
Scheduler is unavailable | A critical alert is triggered when the cluster's scheduler becomes unavailable.

Alerts for Events

Kubernetes events are objects that provide insight into what is happening inside a cluster, such as what decisions the scheduler made or why some pods were evicted from a node. In the Rancher UI, from the project view, you can see events for each workload.

ALERT | EXPLANATION
Get warning deployment event | A warning alert is triggered when a warning event happens on a deployment.
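As a rough illustration of what "a warning event on a deployment" means, the sketch below filters a list of Kubernetes event objects represented as plain dictionaries. The `type` and `involvedObject.kind` fields are part of the Kubernetes core/v1 Event object. The function name and the sample events are hypothetical, and how Rancher matches events internally is not described here.

```python
from typing import Any, Dict, Iterable, List


def warning_deployment_events(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only events of type Warning whose involved object is a Deployment."""
    return [
        event
        for event in events
        if event.get("type") == "Warning"
        and event.get("involvedObject", {}).get("kind") == "Deployment"
    ]


if __name__ == "__main__":
    # Hypothetical events, reduced to the fields the filter inspects.
    sample_events = [
        {"type": "Normal", "involvedObject": {"kind": "Deployment", "name": "web"}},
        {"type": "Warning", "involvedObject": {"kind": "Deployment", "name": "web"}},
        {"type": "Warning", "involvedObject": {"kind": "Pod", "name": "web-abc"}},
    ]
    for event in warning_deployment_events(sample_events):
        print(event["involvedObject"]["name"], event["type"])
```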

Alerts for Nodes

Alerts can be triggered based on node metrics. Each computing resource in a Kubernetes cluster is called a node. Nodes can be either bare-metal servers or virtual machines.

ALERT | EXPLANATION
High CPU load | A warning alert is triggered if the node uses more than 100 percent of the node's available CPU seconds for at least three minutes.
High node memory utilization | A warning alert is triggered if the node uses more than 80 percent of its available memory for at least three minutes.
Node disk is running full within 24 hours | A critical alert is triggered if the disk space on the node is expected to run out in the next 24 hours based on the disk growth over the last 6 hours.
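The disk alert extrapolates recent growth into the future. One common way to do this is a least-squares line through recent samples, similar in spirit to the Prometheus `predict_linear` function. The sketch below works under that assumption. The helper names, the sample layout, and the choice of "predicted free space below zero" as the trigger are illustrative, not Rancher's confirmed expression.

```python
from typing import List, Tuple

# Assumed sample layout: (unix timestamp in seconds, free disk space in bytes).
Sample = Tuple[float, float]


def predict_linear(samples: List[Sample], seconds_ahead: float) -> float:
    """Fit a least-squares line and evaluate it `seconds_ahead` after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


def disk_full_within_24h(samples: List[Sample]) -> bool:
    """True when the fitted trend predicts no free space 24 hours from now."""
    return predict_linear(samples, 24 * 3600.0) < 0


if __name__ == "__main__":
    gib = 1024 ** 3
    # Hypothetical hourly samples over 6 hours: free space falls by 1 GiB per hour.
    last_six_hours = [(h * 3600.0, (20 - h) * gib) for h in range(7)]
    print(disk_full_within_24h(last_six_hours))
```

Passing only the samples from the last 6 hours mirrors the window described in the table. A longer window would smooth out short bursts of growth but react more slowly.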

Project-level Alerts

When you enable monitoring for a project, some project-level alerts are provided. For details, refer to the section on project-level alerts.