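When deploying an etcd cluster, use an odd number of etcd instances. A cluster of N instances keeps serving requests as long as a majority (quorum) is healthy, so it tolerates up to (N-1)/2 failed instances. If more than (N-1)/2 instances fail, the cluster loses quorum and has to be restored from backed-up etcd data.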
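The quorum arithmetic can be checked with a few lines of shell. This is a minimal sketch; the function names `quorum` and `fault_tolerance` are illustrative and not part of etcd.

```bash
#!/bin/bash
# Quorum and fault tolerance for an etcd cluster of N members.
# quorum = floor(N/2) + 1; tolerated failures = N - quorum.
quorum() { echo $(( $1 / 2 + 1 )); }
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

Going from 3 to 4 members raises the quorum without raising the number of tolerated failures, which is why odd sizes are preferred.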

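1. Generate certificates (optional)

The etcd v3 certificates in this setup are bound to node IPs, so the certificates have to be reissued every time an etcd node is added.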

2. Back up etcd data

    mkdir -p /data/backup

Backup script

    #!/bin/bash
    # Back up etcd with a v3 snapshot and prune old backups.
    BACKUPDIR=/data/backup
    # etcdctl v3.4+ accepts only one endpoint for "snapshot save";
    # on those versions, list a single member here.
    ENDPOINTS="https://10.4.7.12:2379,https://10.4.7.21:2379,https://10.4.7.22:2379"
    timestamp=$(date +%Y%m%d%H%M%S)

    if [ ! -d "$BACKUPDIR" ]; then
      echo "making dir $BACKUPDIR"
      mkdir -p "$BACKUPDIR"
    fi

    ETCDCTL_API=3 /opt/etcd/etcdctl \
      --endpoints="$ENDPOINTS" \
      --cacert=/opt/etcd/certs/ca.pem \
      --cert=/opt/etcd/certs/etcd-peer.pem \
      --key=/opt/etcd/certs/etcd-peer-key.pem \
      snapshot save "$BACKUPDIR/snapshot_$timestamp.db"

    # Keep 5 days of backups; delete older snapshot files.
    find "$BACKUPDIR" -name '*.db' -mtime +5 -exec rm -f {} \;
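Before relying on a backup, it helps to confirm that the newest snapshot is readable. The following is a minimal sketch, assuming the /data/backup directory and the snapshot_<timestamp>.db naming used by the backup script; the function names are illustrative.

```bash
#!/bin/bash
# Print the newest snapshot in a directory. File names embed a
# YYYYmmddHHMMSS timestamp, so lexical order equals chronological order.
latest_snapshot() {
  local dir=$1
  find "$dir" -maxdepth 1 -name 'snapshot_*.db' | sort | tail -n 1
}

# Show hash, revision, key count and size of a snapshot file.
# Fails early if the file is missing or empty.
verify_snapshot() {
  local file=$1
  [ -s "$file" ] || { echo "missing or empty snapshot: $file" >&2; return 1; }
  ETCDCTL_API=3 /opt/etcd/etcdctl snapshot status "$file" -w table
}

# Usage:
#   verify_snapshot "$(latest_snapshot /data/backup)"
```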

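3. Restore etcd data (cluster unavailable, disaster recovery)

3.1. Stop the kube-apiserver service on every master node (example for node 7-21; use each node's own program name)

supervisorctl stop kube-apiserver-7-21

3.2. Stop every etcd service in the cluster

supervisorctl stop etcd-server-7-21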

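3.3. Move away the data directory (/data/etcd/etcd-server) of every etcd instance

Do not re-create the data directory by hand afterwards; the restore creates it and reports an error if it already exists.

mv /data/etcd/etcd-server /data/etcd/etcd-server_bakbackup

3.4. Prepare the snapshot (.db) backup file

All members must be restored from the same snapshot file, so copy it to every etcd node.

3.5. Run the restore command. The parameter values can be found in the etcd configuration file.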
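A restore command for this step might look like the sketch below. The member names etcd-server-7-12 and etcd-server-7-22, the cluster token, and the snapshot file name are assumptions; take the real values from each node's etcd startup configuration. The `restore_member` helper and the `DRY_RUN` switch are illustrative only.

```bash
#!/bin/bash
# Assumed member names and peer URLs; replace with the values from
# the etcd configuration of this cluster.
INITIAL_CLUSTER="etcd-server-7-12=https://10.4.7.12:2380,etcd-server-7-21=https://10.4.7.21:2380,etcd-server-7-22=https://10.4.7.22:2380"
CLUSTER_TOKEN="etcd-cluster"   # hypothetical; must be identical on all nodes
DATA_DIR=/data/etcd/etcd-server

# restore_member NAME PEER_URL SNAPSHOT
# Run on each node with that node's own name and peer URL.
# With DRY_RUN=1 the command is printed instead of executed.
restore_member() {
  local name=$1 peer_url=$2 snapshot=$3
  local cmd=(/opt/etcd/etcdctl snapshot restore "$snapshot"
    --name "$name"
    --initial-cluster "$INITIAL_CLUSTER"
    --initial-cluster-token "$CLUSTER_TOKEN"
    --initial-advertise-peer-urls "$peer_url"
    --data-dir "$DATA_DIR")
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "ETCDCTL_API=3 ${cmd[*]}"
  else
    ETCDCTL_API=3 "${cmd[@]}"
  fi
}

# Example for node 10.4.7.21 (the snapshot file name is a placeholder):
#   restore_member etcd-server-7-21 https://10.4.7.21:2380 /data/backup/snapshot_<timestamp>.db
```

The restore works offline on the snapshot file, so no endpoints or client certificates are passed.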

2020-12-02 10:09:09.374575 I | mvcc: restore compact to 948592
2020-12-02 10:09:09.380472 I | etcdserver/membership: added member 988139385f78284 [https://10.4.7.22:2380] to cluster d18e21fd9780915b
2020-12-02 10:09:09.380498 I | etcdserver/membership: added member 5a0ef2a004fc4349 [https://10.4.7.21:2380] to cluster d18e21fd9780915b
2020-12-02 10:09:09.380508 I | etcdserver/membership: added member f4a0cb0a765574a8 [https://10.4.7.12:2380] to cluster d18e21fd9780915b

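Possible error:

```bash
Error:  --initial-cluster must include etcd-server-7-21=https://10.4.7.12:2380 given --initial-advertise-peer-urls=https://10.4.7.12:2380
```

This means --initial-advertise-peer-urls was given the wrong address. Here the member name etcd-server-7-21 is paired with the peer URL of 10.4.7.12, presumably instead of the node's own URL https://10.4.7.21:2380. The name=URL pair has to match an entry in --initial-cluster.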
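The name/URL mismatch behind this error can be caught before running the restore. The following is a minimal sketch; `check_member` is an illustrative helper, and the member names for 10.4.7.12 and 10.4.7.22 in the usage example are assumptions.

```bash
#!/bin/bash
# check_member INITIAL_CLUSTER NAME PEER_URL
# Succeeds only if "NAME=PEER_URL" is one of the comma-separated
# entries of the --initial-cluster value.
check_member() {
  local cluster=$1 name=$2 peer_url=$3
  case ",$cluster," in
    *",$name=$peer_url,"*) return 0 ;;
    *) echo "no entry $name=$peer_url in --initial-cluster" >&2; return 1 ;;
  esac
}

# Usage (assumed member names):
#   cluster="etcd-server-7-12=https://10.4.7.12:2380,etcd-server-7-21=https://10.4.7.21:2380,etcd-server-7-22=https://10.4.7.22:2380"
#   check_member "$cluster" etcd-server-7-21 https://10.4.7.21:2380 && echo ok
```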

3.6. Set ownership of the data directory

chown -R etcd:etcd /data/etcd/etcd-server

3.7. Start etcd on every node

supervisorctl start etcd-server-7-21

3.8. Start the apiserver

supervisorctl start kube-apiserver-7-21

3.9. Check the etcd cluster status

ETCDCTL_API=3  /opt/etcd/etcdctl \
--endpoints=https://10.4.7.12:2379,https://10.4.7.21:2379,https://10.4.7.22:2379 \
--cacert=/opt/etcd/certs/ca.pem \
--cert=/opt/etcd/certs/etcd-peer.pem \
--key=/opt/etcd/certs/etcd-peer-key.pem endpoint health

2020-12-02 10:28:30.838102 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2020-12-02 10:28:30.838709 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2020-12-02 10:28:30.839112 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
https://10.4.7.12:2379 is healthy: successfully committed proposal: took = 5.062842ms
https://10.4.7.21:2379 is healthy: successfully committed proposal: took = 2.342584ms
https://10.4.7.22:2379 is healthy: successfully committed proposal: took = 4.670528ms
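In a script, the health output above can be reduced to a single pass/fail check. The following is a minimal sketch, assuming the "is healthy" wording shown in the output; `count_healthy` is an illustrative helper.

```bash
#!/bin/bash
# count_healthy: read `etcdctl endpoint health` output on stdin and
# print the number of endpoints reported as healthy.
# "|| true" keeps the exit status 0 when grep finds no match.
count_healthy() {
  grep -c ' is healthy' || true
}

# Usage (2>&1 also captures lines that etcdctl writes to stderr):
#   out=$(ETCDCTL_API=3 /opt/etcd/etcdctl --endpoints=... endpoint health 2>&1)
#   [ "$(printf '%s\n' "$out" | count_healthy)" -eq 3 ] || echo "cluster degraded" >&2
```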

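4. Problems

4.1. After the etcd restore, web pages load only intermittently

Symptoms:
An unexpected host shutdown corrupted the etcd data. After etcd was restored from the backup, the services exposed through traefik responded correctly at some times and failed at others, which disrupted business traffic.
Analysis:
Intermittent failures suggested a link to the load balancer's round-robin, so each traefik node was commented out of the load balancer in turn.
The problem persisted in every case, which raised the question of whether the cluster network was broken. A test of the flannel component showed that cross-node communication did not work.
Querying flannel's configuration in etcd returned an error: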

/opt/etcd/etcdctl get /coreos.com/network/config
Error:  100: Key not found (/coreos.com) [10]

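After the damaged etcd was restored, flannel's configuration was missing from etcd, so the cluster network could not work. A likely reason is that flannel stores this key through the etcd v2 API (hence the get/set commands), while a v3 snapshot contains only v3 data, so the key was never part of the backup.
Write the configuration again: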

/opt/etcd/etcdctl set /coreos.com/network/config '{"Network": "172.7.0.0/16", "Backend": {"Type": "host-gw"}}'

/opt/etcd/etcdctl get /coreos.com/network/config
{"Network": "172.7.0.0/16", "Backend": {"Type": "host-gw"}}

After this change the network test passes again and the web pages load normally.
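To make this fix repeatable after a future restore, the check and the rewrite can be combined. The following is a minimal sketch, assuming the same key and value as above and etcdctl's v2 commands (get/set); the function name and the `ETCDCTL` override variable are illustrative.

```bash
#!/bin/bash
# ETCDCTL can be overridden; the default is the path used in this article.
ETCDCTL=${ETCDCTL:-/opt/etcd/etcdctl}
FLANNEL_KEY=/coreos.com/network/config
FLANNEL_CONFIG='{"Network": "172.7.0.0/16", "Backend": {"Type": "host-gw"}}'

# Re-create the flannel network config only if the key is missing.
# ETCDCTL_API=2 selects the v2 commands (get/set) explicitly.
ensure_flannel_config() {
  if ETCDCTL_API=2 "$ETCDCTL" get "$FLANNEL_KEY" >/dev/null 2>&1; then
    echo "flannel config present"
  else
    ETCDCTL_API=2 "$ETCDCTL" set "$FLANNEL_KEY" "$FLANNEL_CONFIG" >/dev/null \
      && echo "flannel config restored"
  fi
}

# Usage:
#   ensure_flannel_config
```

Running it once after step 3.9 would surface the missing key before users notice network failures.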