Problem Background

Environment: my own VM test environment (the kind that only gets powered on when needed)

Cluster version: v1.18.3

Today I started up my locally deployed k8s cluster to run some tests, but noticed that some pods were in an abnormal state. The root cause turned out to be an expired kubelet client certificate.

In fact, if your cluster has been running continuously and your k8s version is 1.8 or later, this kubelet certificate expiry problem should never occur: the kubelet proactively renews its client certificate shortly before it expires. See the official docs: Configure Certificate Rotation for the Kubelet.
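If you want to confirm that rotation is actually enabled on a node, here is a minimal check, assuming the default kubeadm file layout (kubelet config at /var/lib/kubelet/config.yaml, certificates under /var/lib/kubelet/pki); on a kubeadm-deployed node rotateCertificates is typically true:

$ grep -i rotate /var/lib/kubelet/config.yaml   # rotateCertificates: true enables client cert rotation
rotateCertificates: true
$ ls -l /var/lib/kubelet/pki/kubelet-client-current.pem   # a symlink that rotation repoints at the newest cert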
OK, enough preamble; on to the diagnosis and the fix. If you only care about the fix, jump straight to Problem Resolution below.

$ kubectl get nodes -o wide    # check the node status
NAME          STATUS     ROLES    AGE    VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
centos-20-2   Ready      master   392d   v1.18.3   192.168.20.2   <none>        CentOS Linux 7 (Core)   5.4.111-1.el7.elrepo.x86_64   docker://19.3.8
centos-20-3   NotReady   <none>   392d   v1.18.3   192.168.20.3   <none>        CentOS Linux 7 (Core)   5.4.111-1.el7.elrepo.x86_64   docker://19.3.8
centos-20-4   NotReady   <none>   392d   v1.18.3   192.168.20.4   <none>        CentOS Linux 7 (Core)   5.4.111-1.el7.elrepo.x86_64   docker://19.3.8
# So I checked the status of the key pods and found the calico-node on the master node abnormal
$ kubectl get pod -n kube-system -o wide | egrep "calico|etcd|kube-"
calico-kube-controllers-5b8b769fcd-gbd2z     1/1   Running   5   392d   10.100.78.153    centos-20-2   <none>   <none>
calico-node-c7xr9                            1/1   Running   4   392d   192.168.20.3     centos-20-3   <none>   <none>
calico-node-g2j88                            0/1   Running   5   392d   192.168.20.2     centos-20-2   <none>   <none>
calico-node-nvtck                            1/1   Running   2   392d   192.168.20.4     centos-20-4   <none>   <none>
etcd-centos-20-2                             1/1   Running   5   392d   192.168.20.2     centos-20-2   <none>   <none>
kube-apiserver-centos-20-2                   1/1   Running   5   392d   192.168.20.2     centos-20-2   <none>   <none>
kube-controller-manager-centos-20-2          1/1   Running   5   392d   192.168.20.2     centos-20-2   <none>   <none>
kube-proxy-dmdkh                             1/1   Running   4   392d   192.168.20.3     centos-20-3   <none>   <none>
kube-proxy-qmqq4                             1/1   Running   5   392d   192.168.20.2     centos-20-2   <none>   <none>
kube-proxy-vqkpw                             1/1   Running   2   392d   192.168.20.4     centos-20-4   <none>   <none>
kube-scheduler-centos-20-2                   1/1   Running   6   392d   192.168.20.2     centos-20-2   <none>   <none>
monitor-kube-state-metrics-b7b7ccf8c-dzjl4   2/2   Running   0   151d   10.100.238.207   centos-20-3   <none>   <none>
# describe the pod for details; the following errors show up:
$ kubectl describe pod/calico-node-g2j88 -n kube-system
Normal   Created    6m56s   kubelet, centos-20-2   Created container calico-node
Warning  Unhealthy  6m47s   kubelet, centos-20-2   Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.4
  2022-05-11 09:24:58.449 [INFO][181] health.go 156: Number of node(s) with BGP peering established = 0
Warning  Unhealthy  6m37s   kubelet, centos-20-2   Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.4
  2022-05-11 09:25:08.435 [INFO][258] health.go 156: Number of node(s) with BGP peering established = 0
Warning  Unhealthy  6m27s   kubelet, centos-20-2   Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.4
  2022-05-11 09:25:18.457 [INFO][291] health.go 156: Number of node(s) with BGP peering established = 0
Warning  Unhealthy  6m17s   kubelet, centos-20-2   Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.4
  2022-05-11 09:25:28.403 [INFO][330] health.go 156: Number of node(s) with BGP peering established = 0
Warning  Unhealthy  6m7s    kubelet, centos-20-2   Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.4
  2022-05-11 09:25:38.414 [INFO][359] health.go 156: Number of node(s) with BGP peering established = 0
Warning  Unhealthy  5m57s   kubelet, centos-20-2   Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.4
  2022-05-11 09:25:48.415 [INFO][385] health.go 156: Number of node(s) with BGP peering established = 0
Warning  Unhealthy  5m47s   kubelet, centos-20-2   Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.4
  2022-05-11 09:25:58.472 [INFO][420] health.go 156: Number of node(s) with BGP peering established = 0
Warning  Unhealthy  5m37s   kubelet, centos-20-2   Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.4
  2022-05-11 09:26:08.433 [INFO][447] health.go 156: Number of node(s) with BGP peering established = 0
Warning  Unhealthy  5m27s   kubelet, centos-20-2   Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.4
  2022-05-11 09:26:18.409 [INFO][474] health.go 156: Number of node(s) with BGP peering established = 0
Warning  Unhealthy  117s (x21 over 5m17s)  kubelet, centos-20-2  (combined from similar events): Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 192.168.20.3,192.168.20.4
  2022-05-11 09:29:48.412 [INFO][1099] health.go 156: Number of node(s) with BGP peering established = 0
# From the errors above we can guess that nodes 192.168.20.3 and 192.168.20.4 are the ones misbehaving,
# so check the kubelet service on those two nodes: it keeps failing, and restarting it doesn't help
$ systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: activating (auto-restart) (Result: exit-code) since 2022-05-11 17:35:35 CST; 8s ago
     Docs: https://kubernetes.io/docs/
  Process: 8341 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
 Main PID: 8341 (code=exited, status=255)
May 11 17:35:35 centos-20-3 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
May 11 17:35:35 centos-20-3 systemd[1]: Unit kubelet.service entered failed state.
May 11 17:35:35 centos-20-3 systemd[1]: kubelet.service failed.
# Check the kubelet errors with the command below (/var/log/messages works too);
# note that journalctl -r prints the newest entries first
$ journalctl -r -u kubelet | less
-- Logs begin at 2022-05-11 17:24:18 CST, end at 2022-05-11 17:40:33 CST. --
May 11 17:40:33 centos-20-3 systemd[1]: kubelet.service failed.
May 11 17:40:33 centos-20-3 systemd[1]: Unit kubelet.service entered failed state.
May 11 17:40:33 centos-20-3 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
May 11 17:40:33 centos-20-3 kubelet[8752]: F0511 17:40:33.096373 8752 server.go:274] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory
May 11 17:40:33 centos-20-3 kubelet[8752]: E0511 17:40:33.096297 8752 bootstrap.go:265] part of the existing bootstrap client certificate is expired: 2022-04-14 08:15:03 +0000 UTC
May 11 17:40:33 centos-20-3 kubelet[8752]: I0511 17:40:33.082709 8752 server.go:837] Client rotation is on, will bootstrap in background
May 11 17:40:33 centos-20-3 kubelet[8752]: I0511 17:40:33.082679 8752 plugins.go:100] No cloud provider specified.
May 11 17:40:33 centos-20-3 kubelet[8752]: I0511 17:40:33.082351 8752 server.go:417] Version: v1.18.3
# The errors above show that the kubelet client certificate has expired: it was only valid until 2022-04-14 08:15:03
# Confirm the kubelet client certificate validity period
$ openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
notBefore=Apr 14 08:15:03 2021 GMT
notAfter=Apr 14 08:15:03 2022 GMT
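As an aside, on a kubeadm cluster you can list the expiry of all control-plane certificates in one go; note this covers the certs under /etc/kubernetes/pki, not the kubelet's own client certificate. A sketch for v1.18, where the subcommand still lives under alpha (later kubeadm releases graduated it to kubeadm certs check-expiration):

$ kubeadm alpha certs check-expiration    # run on a master node; lists each certificate with its expiry and residual time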

With the diagnosis above, it is clear where the problem lies, so the fix is fairly simple.

Problem Resolution

# Set the system time back to one day before the certificate expired
$ date -s 2022-04-13     # run on all master nodes and on every node whose kubelet cert has expired; run them as close together as possible so the cluster's clocks stay consistent
$ systemctl restart kubelet     # restart the kubelet service on the faulty node
# After the restart, watch /var/log/messages and confirm that this node's kubelet certificate has been rotated automatically
$ openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
notBefore=Apr 12 15:55:35 2022 GMT
notAfter=Apr 12 15:55:35 2023 GMT
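If you want to see the rotation happen in real time, you can tail the kubelet log during the restart; once the kubelet re-bootstraps, messages about the certificate rotation should show up. A quick sketch (the grep pattern is just a guess at what the relevant lines contain):

$ journalctl -u kubelet -f | grep -i certificate    # watch for rotation/bootstrap messages after the restart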


# Restore the system time (run on all nodes)
$ ntpdate -u ntp.aliyun.com
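If ntpdate is not installed, chronyd (the default time service on CentOS 7) can step the clock just as well; a minimal alternative, assuming chronyd is running:

$ chronyc makestep    # force an immediate clock correction instead of slow slewing
$ timedatectl         # confirm the system time and NTP sync status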


# Finally, on the master, confirm that all nodes are back in the Ready state
$ kubectl get nodes
NAME          STATUS   ROLES    AGE    VERSION
centos-20-2   Ready    master   363d   v1.18.3
centos-20-3   Ready    <none>   363d   v1.18.3
centos-20-4   Ready    <none>   363d   v1.18.3
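Under the hood, client rotation works by the kubelet submitting a CertificateSigningRequest that the control plane auto-approves, so as a final sanity check you can confirm the new certs really were issued this way:

$ kubectl get csr    # each re-bootstrapped node should show a recent CSR with the Approved,Issued condition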