6. Monitoring - 5.14 Monitoring Metrics - Kubernetes

Cluster
Node
Workspace
Namespace
Workload
Pod
Container
Component

Cluster

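Metric	Description	Unit
cluster_cpu_utilisation	Cluster CPU utilisation
cluster_cpu_usage	Cluster CPU usage	Core
cluster_cpu_total	Total cluster CPU	Core
cluster_load1	Cluster 1-minute CPU load average [1]
cluster_load5	Cluster 5-minute CPU load average
cluster_load15	Cluster 15-minute CPU load average
cluster_memory_utilisation	Cluster memory utilisation
cluster_memory_available	Available cluster memory	Byte
cluster_memory_total	Total cluster memory	Byte
cluster_memory_usage_wo_cache	Cluster memory usage [2]	Byte
cluster_net_utilisation	Cluster network data transfer rate	Byte/s
cluster_net_bytes_transmitted	Cluster network data transmit rate	Byte/s
cluster_net_bytes_received	Cluster network data receive rate	Byte/s
cluster_disk_read_iops	Cluster disk reads per second	count/s
cluster_disk_write_iops	Cluster disk writes per second	count/s
cluster_disk_read_throughput	Cluster disk data read per second	Byte/s
cluster_disk_write_throughput	Cluster disk data written per second	Byte/s
cluster_disk_size_usage	Cluster disk usage	Byte
cluster_disk_size_utilisation	Cluster disk utilisation
cluster_disk_size_capacity	Total cluster disk capacity	Byte
cluster_disk_size_available	Available cluster disk space	Byte
cluster_disk_inode_total	Total number of inodes in the cluster
cluster_disk_inode_usage	Number of inodes in use in the cluster
cluster_disk_inode_utilisation	Cluster inode utilisation
cluster_node_online	Number of online nodes in the cluster
cluster_node_offline	Number of offline nodes in the cluster
cluster_node_offline_ratio	Ratio of offline nodes in the cluster
cluster_node_total	Total number of nodes in the cluster
cluster_pod_count	Number of scheduled [3] Pods in the cluster
cluster_pod_quota	Sum of the maximum Pod capacity [4] of all nodes in the cluster
cluster_pod_utilisation	Utilisation of the cluster's maximum Pod capacity
cluster_pod_running_count	Number of Pods in the Running phase [5] in the cluster
cluster_pod_succeeded_count	Number of Pods in the Succeeded phase in the cluster
cluster_pod_abnormal_count	Number of abnormal Pods [6] in the cluster
cluster_pod_abnormal_ratio	Ratio of abnormal Pods [7] in the cluster
cluster_ingresses_extensions_count	Number of Ingresses in the cluster
cluster_cronjob_count	Number of CronJobs in the cluster
cluster_pvc_count	Number of PersistentVolumeClaims in the cluster
cluster_daemonset_count	Number of DaemonSets in the cluster
cluster_deployment_count	Number of Deployments in the cluster
cluster_endpoint_count	Number of Endpoints in the cluster
cluster_hpa_count	Number of Horizontal Pod Autoscalers in the cluster
cluster_job_count	Number of Jobs in the cluster
cluster_statefulset_count	Number of StatefulSets in the cluster
cluster_replicaset_count	Number of ReplicaSets in the cluster
cluster_service_count	Number of Services in the cluster
cluster_secret_count	Number of Secrets in the cluster
cluster_namespace_count	Number of Namespaces in the cluster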

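Notes
[1] The average number of processes per CPU that are in a runnable or uninterruptible state in the run queue over the given period. A value greater than 1 means the CPU cannot keep up with demand and processes are waiting.
[2] Excludes buffers and cache.
[3] The Pod has been scheduled to a node, i.e. status.conditions.PodScheduled = true. See: Pod Lifecycle
[4] The maximum Pod capacity of a node defaults to 110 Pods. See: kubelet Options
[5] The Running phase means the Pod has been bound to a node and all of its containers have been created. At least one container is running, or is in the process of starting or restarting. See: Pod Lifecycle
[6] Abnormal Pods: if the status.conditions.ContainersReady field of a Pod is false, the Pod is unavailable. When deciding whether a Pod is abnormal, we also have to account for Pods that are still in the ContainerCreating state or that have already reached the Succeeded phase. Taking these cases together, the total number of abnormal Pods is computed as: Abnormal Pods = Total Pods - ContainersReady Pods - ContainerCreating Pods - Succeeded Pods.
[7] Ratio of abnormal Pods: number of abnormal Pods / number of non-Succeeded Pods.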
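
The two formulas in the notes above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the platform's implementation: the function and parameter names are hypothetical, and returning 0.0 when there are no non-Succeeded Pods is an assumption made here to avoid dividing by zero.

```python
def abnormal_pod_count(total, containers_ready, container_creating, succeeded):
    """Abnormal Pods = Total - ContainersReady - ContainerCreating - Succeeded."""
    return total - containers_ready - container_creating - succeeded


def abnormal_pod_ratio(total, containers_ready, container_creating, succeeded):
    """Abnormal Pods / non-Succeeded Pods."""
    non_succeeded = total - succeeded
    if non_succeeded == 0:
        # Assumption: report 0.0 when every Pod has already succeeded.
        return 0.0
    abnormal = abnormal_pod_count(
        total, containers_ready, container_creating, succeeded
    )
    return abnormal / non_succeeded


if __name__ == "__main__":
    # 50 Pods: 40 ready, 3 still creating containers, 5 succeeded.
    print(abnormal_pod_count(50, 40, 3, 5))  # 2
    print(abnormal_pod_ratio(50, 40, 3, 5))  # 2 / 45
```

Note that the denominator of the ratio excludes Succeeded Pods, so completed Jobs do not dilute the ratio.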

Node

Metric	Description	Unit
node_cpu_utilisation	Node CPU utilisation
node_cpu_total	Total node CPU	Core
node_cpu_usage	Node CPU usage	Core
node_load1	Node 1-minute CPU load average
node_load5	Node 5-minute CPU load average
node_load15	Node 15-minute CPU load average
node_memory_utilisation	Node memory utilisation
node_memory_usage_wo_cache	Node memory usage [1]	Byte
node_memory_available	Available node memory	Byte
node_memory_total	Total node memory	Byte
node_net_utilisation	Node network data transfer rate	Byte/s
node_net_bytes_transmitted	Node network data transmit rate	Byte/s
node_net_bytes_received	Node network data receive rate	Byte/s
node_disk_read_iops	Node disk reads per second	count/s
node_disk_write_iops	Node disk writes per second	count/s
node_disk_read_throughput	Node disk data read per second	Byte/s
node_disk_write_throughput	Node disk data written per second	Byte/s
node_disk_size_capacity	Total node disk capacity	Byte
node_disk_size_available	Available node disk space	Byte
node_disk_size_usage	Node disk usage	Byte
node_disk_size_utilisation	Node disk utilisation
node_disk_inode_total	Total number of inodes on the node
node_disk_inode_usage	Number of inodes in use on the node
node_disk_inode_utilisation	Node inode utilisation
node_pod_count	Number of scheduled Pods on the node
node_pod_quota	Maximum Pod capacity of the node
node_pod_utilisation	Utilisation of the node's maximum Pod capacity
node_pod_running_count	Number of Pods in the Running phase on the node
node_pod_succeeded_count	Number of Pods in the Succeeded phase on the node
node_pod_abnormal_count	Number of abnormal Pods on the node
node_pod_abnormal_ratio	Ratio of abnormal Pods on the node

Notes
[1] Excludes buffers and cache.

Workspace

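Metric	Description	Unit
workspace_cpu_usage	Workspace CPU usage	Core
workspace_memory_usage	Workspace memory usage (including cache)	Byte
workspace_memory_usage_wo_cache	Workspace memory usage	Byte
workspace_net_bytes_transmitted	Workspace network data transmit rate	Byte/s
workspace_net_bytes_received	Workspace network data receive rate	Byte/s
workspace_pod_count	Number of Pods in non-terminated phases [1] in the workspace
workspace_pod_running_count	Number of Pods in the Running phase in the workspace
workspace_pod_succeeded_count	Number of Pods in the Succeeded phase in the workspace
workspace_pod_abnormal_count	Number of abnormal Pods in the workspace
workspace_pod_abnormal_ratio	Ratio of abnormal Pods in the workspace
workspace_ingresses_extensions_count	Number of Ingresses in the workspace
workspace_cronjob_count	Number of CronJobs in the workspace
workspace_pvc_count	Number of PersistentVolumeClaims in the workspace
workspace_daemonset_count	Number of DaemonSets in the workspace
workspace_deployment_count	Number of Deployments in the workspace
workspace_endpoint_count	Number of Endpoints in the workspace
workspace_hpa_count	Number of Horizontal Pod Autoscalers in the workspace
workspace_job_count	Number of Jobs in the workspace
workspace_statefulset_count	Number of StatefulSets in the workspace
workspace_replicaset_count	Number of ReplicaSets in the workspace
workspace_service_count	Number of Services in the workspace
workspace_secret_count	Number of Secrets in the workspace
workspace_all_project_count	Total number of projects in the workspace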

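Notes
[1] Pods in non-terminated phases are those in the Pending, Running, or Unknown phase. Pods that terminated successfully, or that the system terminated after they exited with a non-zero status, are not included. See: Pod Lifecycle

If the Workspace Monitoring API is called with the query parameter type set to statistics, it returns workspace statistics: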

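Metric	Description	Unit
workspace_all_organization_count	Total number of workspaces in the cluster
workspace_all_account_count	Total number of accounts in the cluster
workspace_all_project_count	Total number of projects in the cluster
workspace_all_devops_project_count	Total number of DevOps projects in the cluster [1]
workspace_namespace_count	Total number of projects in the workspace
workspace_devops_project_count	Total number of DevOps projects in the workspace
workspace_member_count	Number of workspace members
workspace_role_count	Number of workspace roles [2]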

Notes
[1] The first four metrics apply to /kapis/devops.kubesphere.io/v1alpha2/workspaces
[2] The last four metrics apply to /kapis/devops.kubesphere.io/v1alpha2/workspaces/{workspace}

Namespace

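Metric	Description	Unit
namespace_cpu_usage	Project CPU usage	Core
namespace_memory_usage	Project memory usage (including cache)	Byte
namespace_memory_usage_wo_cache	Project memory usage	Byte
namespace_net_bytes_transmitted	Project network data transmit rate	Byte/s
namespace_net_bytes_received	Project network data receive rate	Byte/s
namespace_pod_count	Number of Pods in non-terminated phases in the project
namespace_pod_running_count	Number of Pods in the Running phase in the project
namespace_pod_succeeded_count	Number of Pods in the Succeeded phase in the project
namespace_pod_abnormal_count	Number of abnormal Pods in the project
namespace_pod_abnormal_ratio	Ratio of abnormal Pods in the project
namespace_cronjob_count	Number of CronJobs in the project
namespace_pvc_count	Number of PersistentVolumeClaims in the project
namespace_daemonset_count	Number of DaemonSets in the project
namespace_deployment_count	Number of Deployments in the project
namespace_endpoint_count	Number of Endpoints in the project
namespace_hpa_count	Number of Horizontal Pod Autoscalers in the project
namespace_job_count	Number of Jobs in the project
namespace_statefulset_count	Number of StatefulSets in the project
namespace_replicaset_count	Number of ReplicaSets in the project
namespace_service_count	Number of Services in the project
namespace_secret_count	Number of Secrets in the project
namespace_ingresses_extensions_count	Number of Ingresses in the project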

Workload

Metric	Description	Unit
workload_pod_cpu_usage	Workload [1] CPU usage	Core
workload_pod_memory_usage	Workload memory usage (including cache)	Byte
workload_pod_memory_usage_wo_cache	Workload memory usage	Byte
workload_pod_net_bytes_transmitted	Workload network data transmit rate	Byte/s
workload_pod_net_bytes_received	Workload network data receive rate	Byte/s
workload_deployment_replica	Desired number of Deployment replicas
workload_deployment_replica_available	Number of available Deployment replicas [2]
workload_deployment_unavailable_replicas_ratio	Ratio of unavailable Deployment replicas [3]
workload_statefulset_replica	Desired number of StatefulSet replicas
workload_statefulset_replica_available	Number of available StatefulSet replicas
workload_statefulset_unavailable_replicas_ratio	Ratio of unavailable StatefulSet replicas
workload_daemonset_replica	Desired number of DaemonSet replicas
workload_daemonset_replica_available	Number of available DaemonSet replicas
workload_daemonset_unavailable_replicas_ratio	Ratio of unavailable DaemonSet replicas

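Notes
[1] The workload types currently supported are Deployment, StatefulSet, and DaemonSet.
[2] An available replica is a Pod created by the workload that is in an available state, i.e. the Pod's status.conditions.ContainersReady field is true.
[3] Ratio of unavailable replicas: number of unavailable replicas / desired number of replicas.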
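
As a rough illustration of the ratio defined in the notes above, the following Python sketch derives it from the desired and available replica counts. The function name is hypothetical. Treating the unavailable count as desired minus available, and returning 0.0 when no replicas are desired, are both assumptions made for this sketch and are not stated in the source.

```python
def unavailable_replicas_ratio(desired: int, available: int) -> float:
    """Unavailable replicas / desired replicas."""
    if desired <= 0:
        # Assumption: a workload scaled to zero has no unavailable replicas.
        return 0.0
    # Assumption: unavailable = desired - available, never negative.
    unavailable = max(desired - available, 0)
    return unavailable / desired


if __name__ == "__main__":
    # A Deployment that wants 4 replicas but has only 3 available.
    print(unavailable_replicas_ratio(desired=4, available=3))  # 0.25
```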

Pod

Metric	Description	Unit
pod_cpu_usage	Pod CPU usage	Core
pod_memory_usage	Pod memory usage (including cache)	Byte
pod_memory_usage_wo_cache	Pod memory usage	Byte
pod_net_bytes_transmitted	Pod network data transmit rate	Byte/s
pod_net_bytes_received	Pod network data receive rate	Byte/s

Container

Metric	Description	Unit
container_cpu_usage	Container CPU usage	Core
container_memory_usage	Container memory usage (including cache)	Byte
container_memory_usage_wo_cache	Container memory usage	Byte

Component

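Metric	Description	Unit
etcd_server_list	List of etcd cluster nodes [1]
etcd_server_total	Total number of etcd cluster nodes
etcd_server_up_total	Number of online etcd cluster nodes
etcd_server_has_leader	Whether each etcd cluster node has a leader [2]
etcd_server_leader_changes	Number of leader changes seen by each etcd cluster node (within 1h)
etcd_server_proposals_failed_rate	Average rate of failed proposals [3] across etcd cluster nodes	count/s
etcd_server_proposals_applied_rate	Average rate of applied proposals across etcd cluster nodes	count/s
etcd_server_proposals_committed_rate	Average rate of committed proposals across etcd cluster nodes	count/s
etcd_server_proposals_pending_count	Average number of pending proposals across etcd cluster nodes
etcd_mvcc_db_size	Average database size across etcd cluster nodes	Byte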
etcd_network_client_grpc_received_bytes	Rate of data the etcd cluster receives from gRPC clients	Byte/s
etcd_network_client_grpc_sent_bytes	Rate of data the etcd cluster sends to gRPC clients	Byte/s
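etcd_grpc_call_rate	etcd cluster gRPC request rate	count/s
etcd_grpc_call_failed_rate	etcd cluster gRPC failed request rate	count/s
etcd_grpc_server_msg_received_rate	etcd cluster gRPC stream message receive rate	count/s
etcd_grpc_server_msg_sent_rate	etcd cluster gRPC stream message send rate	count/s
etcd_disk_wal_fsync_duration	Average WAL fsync duration across etcd cluster nodes	Second
etcd_disk_wal_fsync_duration_quantile	Average WAL fsync duration of the etcd cluster (by quantile) [4]	Second
etcd_disk_backend_commit_duration	Average backend commit duration [5] across etcd cluster nodes	Second
etcd_disk_backend_commit_duration_quantile	Average backend commit duration across etcd cluster nodes (by quantile)	Second
apiserver_up_sum	Number of online APIServer [6] instances
apiserver_request_rate	Requests received by the APIServer per second
apiserver_request_by_verb_rate	Requests received by the APIServer per second (grouped by HTTP request method)
apiserver_request_latencies	Average APIServer request latency	Second
apiserver_request_by_verb_latencies	Average APIServer request latency (grouped by HTTP request method)	Second
scheduler_up_sum	Number of online scheduler [7] instances
scheduler_schedule_attempts	Cumulative number of scheduling attempts [8]
scheduler_schedule_attempt_rate	Scheduler scheduling rate	count/s
scheduler_e2e_scheduling_latency	Scheduler scheduling latency	Second
scheduler_e2e_scheduling_latency_quantile	Scheduler scheduling latency (by quantile)	Second
controller_manager_up_sum	Number of online Controller Manager [9] instances
coredns_up_sum	Number of online CoreDNS instances
coredns_cache_hits	CoreDNS cache hit rate	count/s
coredns_cache_misses	CoreDNS cache miss rate	count/s
coredns_dns_request_rate	CoreDNS requests per second
coredns_dns_request_duration	CoreDNS request duration	Second
coredns_dns_request_duration_quantile	CoreDNS request duration (by quantile)	Second
coredns_dns_request_by_type_rate	CoreDNS requests per second (grouped by request type)
coredns_dns_request_by_rcode_rate	CoreDNS requests per second (grouped by rcode)
coredns_panic_rate	CoreDNS panic rate	count/s
coredns_proxy_request_rate	CoreDNS proxy requests per second
coredns_proxy_request_duration	CoreDNS proxy request duration	Second
coredns_proxy_request_duration_quantile	CoreDNS proxy request duration (by quantile)	Second
prometheus_up_sum	Number of online Prometheus instances
prometheus_tsdb_head_samples_appended_rate	Samples stored by Prometheus per second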

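Notes
[1] A value of 1 for a node means that etcd node is online; 0 means it is offline.
[2] A value of 0 for a node means that node has no leader and is therefore unusable. If no node in the cluster has a leader, the whole cluster is unavailable.
[3] Terminology: consensus proposals, failed proposals, committed proposals, applied proposals, pending proposals.
[4] Three quantiles are supported: the 99th percentile, the 90th percentile, and the median.
[5] Reflects disk I/O latency. A high value usually indicates a disk problem.
[6] Refers to kube-apiserver.
[7] Refers to kube-scheduler.
[8] Grouped by scheduling result: error (number of Pods that could not be scheduled because of a scheduler error), scheduled (number of Pods scheduled successfully), unschedulable (number of Pods that could not be scheduled).
[9] Refers to kube-controller-manager.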