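When deploying an etcd cluster, use an odd number of etcd members. A cluster of N members (N odd) keeps serving requests as long as no more than (N-1)/2 members fail. Once more than (N-1)/2 members have failed, the cluster loses quorum and has to be recovered from a backup of the etcd data.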
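The arithmetic behind that recommendation can be sketched in a few lines of shell. The function names `quorum` and `tolerance` are illustrative helpers, not part of etcd.

```shell
#!/bin/bash
# Quorum size and tolerated failures for an N-member cluster (integer arithmetic).
quorum()    { echo $(( $1 / 2 + 1 )); }      # majority of N
tolerance() { echo $(( ($1 - 1) / 2 )); }    # members that may fail

for n in 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerance "$n")"
done
```

Going from 3 to 4 members raises the quorum without raising the number of tolerated failures, which is why an even member count buys nothing.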
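1. Certificate creation (optional)
The etcd (v3) certificates are issued for specific node IPs, so they have to be regenerated whenever a new etcd node is added.
2. Back up the etcd data
mkdir /data/backup
Backup script
#!/bin/bash
BACKUPDIR=/data/backup
# Note: etcdctl 3.4 and later accept only a single endpoint for "snapshot save".
ENDPOINTS="https://10.4.7.12:2379,https://10.4.7.21:2379,https://10.4.7.22:2379"
timestamp=$(date +%Y%m%d%H%M%S)
# Create the backup directory if it does not exist yet.
if [ ! -d "$BACKUPDIR" ]; then
    echo "making dir $BACKUPDIR"
    mkdir -p "$BACKUPDIR"
fi
# Save a snapshot named after the current timestamp.
ETCDCTL_API=3 /opt/etcd/etcdctl \
    --endpoints="$ENDPOINTS" \
    --cacert=/opt/etcd/certs/ca.pem \
    --cert=/opt/etcd/certs/etcd-peer.pem \
    --key=/opt/etcd/certs/etcd-peer-key.pem \
    snapshot save "$BACKUPDIR/snapshot_$timestamp.db"
# Keep 5 days of snapshots and delete older ones.
find "$BACKUPDIR" -name '*.db' -mtime +5 -exec rm -f {} \;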
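A backup is only useful if the snapshot file is actually intact, so a quick check after each run can help. The sketch below is an assumption about how one might do that: `check_snapshot` is a hypothetical helper, and it relies on the `snapshot status` subcommand of etcdctl v3 when the binary is present at the path used by the backup script.

```shell
#!/bin/bash
# Sanity-check a snapshot file before relying on it for recovery.
ETCDCTL=${ETCDCTL:-/opt/etcd/etcdctl}   # path taken from the backup script

check_snapshot() {
  local file=$1
  # A missing or zero-byte file means the backup did not succeed.
  if [ ! -s "$file" ]; then
    echo "snapshot missing or empty: $file" >&2
    return 1
  fi
  # If etcdctl is installed, let it print hash, revision, key count and size.
  if [ -x "$ETCDCTL" ]; then
    ETCDCTL_API=3 "$ETCDCTL" snapshot status "$file" -w table || return 1
  fi
  echo "ok: $file"
}
```

For example, `check_snapshot /data/backup/snapshot_20201202094154.db` could be appended to the backup script right after the `snapshot save` call.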
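3. Restore the etcd data (cluster unavailable, disaster recovery)
3.1. Stop the kube-apiserver service on every master node
supervisorctl stop kube-apiserver-7-21
3.2. Stop every etcd service in the cluster
supervisorctl stop etcd-server-7-21
3.3. Move away the data directory of every etcd instance (/data/etcd/etcd-server)
Do not recreate the data directory by hand; the restore creates it and reports an error if it already exists.
mv /data/etcd/etcd-server /data/etcd/etcd-server_bakbackup
3.4. Prepare the db backup file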
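Preparing the backup file usually means picking the newest snapshot and placing it on every etcd node. A minimal sketch of the selection step follows; `latest_snapshot` is a hypothetical helper, and it assumes the `snapshot_<timestamp>.db` naming produced by the backup script, where names sort in chronological order.

```shell
#!/bin/bash
# Print the newest snapshot in a directory, based on the timestamp in its name.
latest_snapshot() {
  local dir=$1 newest="" f
  for f in "$dir"/snapshot_*.db; do
    [ -e "$f" ] || continue   # the glob matched nothing
    newest=$f                 # globs expand in sorted order, so the last one wins
  done
  [ -n "$newest" ] && echo "$newest"
}

# Possible usage (assumes SSH access between nodes, which the original does not state):
#   scp "$(latest_snapshot /data/backup)" 10.4.7.21:/data/etcd/
```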
3.5. Run the restore command (the parameter values can be taken from the etcd configuration file)
- Remember to change --name and --initial-advertise-peer-urls on each node
```bash
# Run on every node
ETCDCTL_API=3 /opt/etcd/etcdctl snapshot restore /data/etcd/snapshot_20201202094154.db \
  --cacert=/opt/etcd/certs/ca.pem \
  --cert=/opt/etcd/certs/etcd-peer.pem \
  --key=/opt/etcd/certs/etcd-peer-key.pem \
  --name etcd-server-7-12 \
  --initial-cluster etcd-server-7-12=https://10.4.7.12:2380,etcd-server-7-21=https://10.4.7.21:2380,etcd-server-7-22=https://10.4.7.22:2380 \
  --initial-advertise-peer-urls https://10.4.7.12:2380 \
  --data-dir=/data/etcd/etcd-server
2020-12-02 10:09:09.374575 I | mvcc: restore compact to 948592
2020-12-02 10:09:09.380472 I | etcdserver/membership: added member 988139385f78284 [https://10.4.7.22:2380] to cluster d18e21fd9780915b
2020-12-02 10:09:09.380498 I | etcdserver/membership: added member 5a0ef2a004fc4349 [https://10.4.7.21:2380] to cluster d18e21fd9780915b
2020-12-02 10:09:09.380508 I | etcdserver/membership: added member f4a0cb0a765574a8 [https://10.4.7.12:2380] to cluster d18e21fd9780915b
```
Error:
```bash
Error: --initial-cluster must include etcd-server-7-21=https://10.4.7.12:2380 given --initial-advertise-peer-urls=https://10.4.7.12:2380
```
Cause: --initial-advertise-peer-urls was given the wrong address. It must be the peer URL that --initial-cluster lists for this node's --name.
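The error above comes from editing --name without editing --initial-advertise-peer-urls to match. One way to avoid that mismatch is to derive the URL from the member list instead of typing it per node. The sketch below only prints the command; `peer_url_for` and `restore_cmd` are hypothetical helpers, and the TLS flags are left out on the assumption that a restore reads a local file and does not contact the cluster.

```shell
#!/bin/bash
# Member list copied from the restore command: name=peer-URL pairs.
INITIAL_CLUSTER="etcd-server-7-12=https://10.4.7.12:2380,etcd-server-7-21=https://10.4.7.21:2380,etcd-server-7-22=https://10.4.7.22:2380"

# Print the peer URL that INITIAL_CLUSTER assigns to the given member name.
peer_url_for() {
  local name=$1 entry
  local IFS=,                       # split the member list on commas
  for entry in $INITIAL_CLUSTER; do
    if [ "${entry%%=*}" = "$name" ]; then
      echo "${entry#*=}"
      return 0
    fi
  done
  echo "member $name is not listed in INITIAL_CLUSTER" >&2
  return 1
}

# Print (do not run) the restore command for one member.
restore_cmd() {
  local name=$1 snapshot=$2 url
  url=$(peer_url_for "$name") || return 1
  echo "ETCDCTL_API=3 /opt/etcd/etcdctl snapshot restore $snapshot" \
       "--name $name" \
       "--initial-cluster $INITIAL_CLUSTER" \
       "--initial-advertise-peer-urls $url" \
       "--data-dir=/data/etcd/etcd-server"
}

restore_cmd etcd-server-7-21 /data/etcd/snapshot_20201202094154.db
```

Because the name is the only per-node input, the name and the advertised URL cannot drift apart.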
3.6. Set ownership of the data directory
chown -R etcd:etcd /data/etcd/etcd-server
3.7. Start etcd on every node
supervisorctl start etcd-server-7-21
3.8. Start the apiserver
supervisorctl start kube-apiserver-7-21
3.9. Check the etcd cluster status
ETCDCTL_API=3 /opt/etcd/etcdctl \
--endpoints=https://10.4.7.12:2379,https://10.4.7.21:2379,https://10.4.7.22:2379 \
--cacert=/opt/etcd/certs/ca.pem \
--cert=/opt/etcd/certs/etcd-peer.pem \
--key=/opt/etcd/certs/etcd-peer-key.pem endpoint health
2020-12-02 10:28:30.838102 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2020-12-02 10:28:30.838709 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2020-12-02 10:28:30.839112 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
https://10.4.7.12:2379 is healthy: successfully committed proposal: took = 5.062842ms
https://10.4.7.21:2379 is healthy: successfully committed proposal: took = 2.342584ms
https://10.4.7.22:2379 is healthy: successfully committed proposal: took = 4.670528ms
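If the check is to run from a script rather than by eye, the output can be counted. This is a minimal sketch: `all_healthy` is a hypothetical helper that matches the "is healthy:" wording shown in the output above, and the suggested `2>&1` is there because I am not certain every etcdctl version writes these lines to stdout.

```shell
#!/bin/bash
# Succeed only if the `endpoint health` output read from stdin reports
# exactly the expected number of healthy endpoints.
all_healthy() {
  local expected=$1 healthy
  # grep -c prints the match count; "|| true" keeps a zero count from failing the assignment.
  healthy=$(grep -c ' is healthy: ' || true)
  [ "$healthy" -eq "$expected" ]
}

# Demo with the output shown above; in practice pipe the real command:
#   ETCDCTL_API=3 /opt/etcd/etcdctl ... endpoint health 2>&1 | all_healthy 3
if all_healthy 3 <<'EOF'
2020-12-02 10:28:30.838102 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
https://10.4.7.12:2379 is healthy: successfully committed proposal: took = 5.062842ms
https://10.4.7.21:2379 is healthy: successfully committed proposal: took = 2.342584ms
https://10.4.7.22:2379 is healthy: successfully committed proposal: took = 4.670528ms
EOF
then
  echo "all 3 endpoints healthy"
else
  echo "cluster degraded" >&2
fi
```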
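4. Problems
4.1. After the etcd restore, pages load only intermittently
Symptom in detail:
An unexpected host shutdown corrupted the etcd data. After etcd was restored from the backup, the services exposed through traefik responded correctly at some times and failed at others, which disrupted business traffic.
Analysis:
Intermittent success pointed to the load balancer's round-robin, so the traefik nodes were commented out of the load balancer one at a time.
The problem persisted with either node removed, so the next suspect was the cluster network. Testing the flannel component showed that communication over the cluster network failed.
Querying flannel's configuration in etcd returned an error:
/opt/etcd/etcdctl get /coreos.com/network/config
Error: 100: Key not found (/coreos.com) [10]
After the damaged etcd nodes were restored, flannel's configuration was missing from etcd, so flannel had nothing to apply and the cluster network could not communicate.
Write the configuration again:
/opt/etcd/etcdctl set /coreos.com/network/config '{"Network": "172.7.0.0/16", "Backend": {"Type": "host-gw"}}'
/opt/etcd/etcdctl get /coreos.com/network/config
{"Network": "172.7.0.0/16", "Backend": {"Type": "host-gw"}}
After that, the network test passed and the web pages loaded normally.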
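A plausible explanation, offered as an assumption rather than something the incident confirmed: the `get` and `set` commands above are etcd v2 API commands, and a v3 `snapshot save` covers only the v3 keyspace, so keys that flannel wrote through the v2 API would not be in the snapshot. Under that assumption, saving the flannel key next to each snapshot avoids retyping it after a restore. `save_flannel_config` and `restore_flannel_config` are hypothetical helpers.

```shell
#!/bin/bash
ETCDCTL=${ETCDCTL:-/opt/etcd/etcdctl}   # path used throughout this document
FLANNEL_KEY=/coreos.com/network/config

# Save the flannel network config (read through the v2 API) to a file.
save_flannel_config() {
  local out=$1 value
  value=$(ETCDCTL_API=2 "$ETCDCTL" get "$FLANNEL_KEY") || return 1
  printf '%s\n' "$value" > "$out"
}

# Write a previously saved config back after a restore.
restore_flannel_config() {
  local src=$1
  if [ ! -s "$src" ]; then
    echo "no saved config: $src" >&2
    return 1
  fi
  ETCDCTL_API=2 "$ETCDCTL" set "$FLANNEL_KEY" "$(cat "$src")"
}
```

For example, the backup script could call `save_flannel_config /data/backup/flannel_config.json`, and the recovery procedure could call `restore_flannel_config` with the same path once etcd is healthy again. The file name here is illustrative.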
