1. Cluster maintenance

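The etcd startup flags `--initial-advertise-peer-urls`, `--initial-cluster`, `--initial-cluster-state` and `--initial-cluster-token` are only used when bootstrapping a new member. They have no effect on a member that has already joined the cluster, so in most cases there is no need to change them in an existing etcd start command. For the same reason, their values do not reflect the real state of the cluster members. Cluster maintenance mainly consists of the following operations:

  • Adding cluster nodes: for example growing from 3 to 5 nodes, to improve client read performance
  • Removing cluster nodes: for example shrinking from 5 to 3 nodes, to improve client write performance
  • Node migration and maintenance: disk failures, node configuration upgrades, or system upgrades that require taking a node out of service
  • Cluster disaster recovery: a majority of nodes is unavailable and the cluster has to be rebuilt from old data
  • etcd version upgrade: check the official documentation for the differences between versions, then upgrade the members one by one
  • etcd certificate replacement: replace certificates when they expire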

1.1. Adding cluster nodes

Add etcd-4 (https://10.4.7.124:2380) and etcd-5 (https://10.4.7.125:2380) to the current cluster. The steps are as follows:

  1. Check the current cluster membership and make sure the cluster is serving requests normally

    [root@duduniao etcd]# etc member list --write-out=table
    +------------------+---------+--------+-------------------------+-------------------------+------------+
    |        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
    +------------------+---------+--------+-------------------------+-------------------------+------------+
    | 4fe2b98ed7b794f7 | started | etcd-3 | https://10.4.7.123:2380 | https://10.4.7.123:2379 |      false |
    | bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
    | c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
    +------------------+---------+--------+-------------------------+-------------------------+------------+

  2. Issue certificates for the new nodes

Reference: 05-2-1-etcd入门 (etcd introduction)

    # Re-issue the server certificate, because the original one does not cover etcd-4 and etcd-5.
    # Since v3.2.0 etcd reloads the server and peer certificates on every client connection,
    # so old certificates can be replaced dynamically.
    [root@duduniao ssl]# cat server.json
    {
        "CN": "local-etcd.duduniao.com",
        "hosts": [
            "10.4.7.121",
            "10.4.7.122",
            "10.4.7.123",
            "10.4.7.124",
            "10.4.7.125",
            "127.0.0.1",
            "etcd-1",
            "etcd-2",
            "etcd-3",
            "etcd-4",
            "etcd-5",
            "localhost"
        ],
        "key": {
            "algo": "ecdsa",
            "size": 256
        },
        "names": [
            {
                "C": "CN",
                "L": "Shanghai",
                "ST": "Shanghai"
            }
        ]
    }
    [root@duduniao ssl]# rm -f server.csr server*.pem
    [root@duduniao ssl]# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=server server.json | cfssljson -bare server

    # Issue the peer certificate, using etcd-4 as the example
    [root@duduniao ssl]# cat etcd-4.json
    {
        "CN": "local-etcd-4.duduniao.com",
        "hosts": [
            "10.4.7.124",
            "etcd-4"
        ],
        "key": {
            "algo": "ecdsa",
            "size": 256
        },
        "names": [
            {
                "C": "CN",
                "L": "Shanghai",
                "ST": "Shanghai"
            }
        ]
    }
    [root@duduniao ssl]# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=peer etcd-4.json | cfssljson -bare etcd-4
    [root@duduniao ssl]# ll etcd-4*.pem etcd-5*.pem
    -rw------- 1 root root  227 2021-10-19 23:06:14 etcd-4-key.pem
    -rw-r--r-- 1 root root 1147 2021-10-19 23:06:14 etcd-4.pem
    -rw------- 1 root root  227 2021-10-19 23:06:26 etcd-5-key.pem
    -rw-r--r-- 1 root root 1147 2021-10-19 23:06:26 etcd-5.pem

    # Distribute the etcd certificates
    [root@duduniao ssl]# ssh 10.4.7.124 "mkdir -pv /data/etcd/{ssl,data}"
    mkdir: created directory '/data/etcd'
    mkdir: created directory '/data/etcd/ssl'
    mkdir: created directory '/data/etcd/data'
    [root@duduniao ssl]# scp ca.pem etcd-4.pem etcd-4-key.pem 10.4.7.124:/data/etcd/ssl/
    # Sync the server certificate to all nodes. This is only for easier management; the old nodes do not have to be updated.
    [root@duduniao ssl]# for i in 10.4.7.12{1..5};do echo $i ; scp server.pem server-key.pem $i:/data/etcd/ssl/ ;done
  3. Add the new member

```
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 member add etcd-4 --peer-urls https://10.4.7.124:2380
Member 79b3746506cf2fc1 added to cluster 23ce29301256c4ff

ETCD_NAME="etcd-4"
ETCD_INITIAL_CLUSTER="etcd-3=https://10.4.7.123:2380,etcd-4=https://10.4.7.124:2380,etcd-1=https://10.4.7.121:2380,etcd-2=https://10.4.7.122:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.4.7.124:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

[root@duduniao etcd]# etc member list --write-out=table # the new member has not been started yet
+------------------+-----------+--------+-------------------------+-------------------------+------------+
|        ID        |  STATUS   |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+-----------+--------+-------------------------+-------------------------+------------+
| 4fe2b98ed7b794f7 |   started | etcd-3 | https://10.4.7.123:2380 | https://10.4.7.123:2379 |      false |
| 79b3746506cf2fc1 | unstarted |        | https://10.4.7.124:2380 |                         |      false |
| bbd6739258f69625 |   started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd |   started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
+------------------+-----------+--------+-------------------------+-------------------------+------------+

# Check the etcd log on any node:
Oct 19 15:17:05 ubuntu-1804-121 etcd[23741]: {"level":"warn","ts":"2021-10-19T15:17:05.569Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"79b3746506cf2fc1","rtt":"0s","error":"dial tcp 10.4.7.124:2380: connect: connection refused"}
Oct 19 15:17:05 ubuntu-1804-121 etcd[23741]: {"level":"warn","ts":"2021-10-19T15:17:05.569Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"79b3746506cf2fc1","rtt":"0s","error":"dial tcp 10.4.7.124:2380: connect: connection refused"}
```

  4. Start the new node (etcd-4)

```
[root@duduniao etcd]# cat etcd-4.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos

[Service]
Type=notify
WorkingDirectory=/data/etcd
Environment=ETCD_NAME="etcd-4"
Environment=ETCD_INITIAL_CLUSTER="etcd-3=https://10.4.7.123:2380,etcd-4=https://10.4.7.124:2380,etcd-1=https://10.4.7.121:2380,etcd-2=https://10.4.7.122:2380"
Environment=ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.4.7.124:2380"
Environment=ETCD_INITIAL_CLUSTER_STATE="existing"

ExecStart=/usr/local/bin/etcd \
  --listen-peer-urls https://10.4.7.124:2380 \
  --listen-client-urls https://10.4.7.124:2379,https://127.0.0.1:2379 \
  --advertise-client-urls https://10.4.7.124:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --client-cert-auth \
  --cert-file ssl/server.pem \
  --key-file ssl/server-key.pem \
  --trusted-ca-file ssl/ca.pem \
  --peer-client-cert-auth \
  --peer-trusted-ca-file ssl/ca.pem \
  --peer-cert-file ssl/etcd-4.pem \
  --peer-key-file ssl/etcd-4-key.pem \
  --data-dir data \
  --snapshot-count 50000 \
  --auto-compaction-retention 1 \
  --auto-compaction-mode periodic \
  --max-request-bytes 10485760 \
  --quota-backend-bytes 8589934592
Restart=always
RestartSec=15
LimitNOFILE=65536
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target

[root@duduniao etcd]# scp etcd-4.service 10.4.7.124:/lib/systemd/system/etcd.service
[root@duduniao etcd]# scp etcd-v3.5.1-linux-amd64/etcd* 10.4.7.124:/usr/local/bin/
[root@duduniao etcd]# ssh 10.4.7.124 "systemctl daemon-reload && systemctl enable etcd && systemctl start etcd"

[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.124:2379 endpoint status --write-out table
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.4.7.124:2379 | 79b3746506cf2fc1 |   3.5.1 |   20 kB |     false |      false |         2 |         76 |                 76 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.124:2379 member list --write-out table
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 4fe2b98ed7b794f7 | started | etcd-3 | https://10.4.7.123:2380 | https://10.4.7.123:2379 |      false |
| 79b3746506cf2fc1 | started | etcd-4 | https://10.4.7.124:2380 | https://10.4.7.124:2379 |      false |
| bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+
```

  5. Add etcd-5

Repeat steps 3 and 4.

```
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.125:2379 endpoint status --write-out table
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.4.7.125:2379 | d0756e0778ff59b4 |   3.5.1 |   20 kB |     false |      false |         2 |         78 |                 78 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.124:2379 member list --write-out table
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 4fe2b98ed7b794f7 | started | etcd-3 | https://10.4.7.123:2380 | https://10.4.7.123:2379 |      false |
| 79b3746506cf2fc1 | started | etcd-4 | https://10.4.7.124:2380 | https://10.4.7.124:2379 |      false |
| bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
| d0756e0778ff59b4 | started | etcd-5 | https://10.4.7.125:2380 | https://10.4.7.125:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+
```
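When repeating the procedure for etcd-5, the environment values for the new unit file have to list every existing member plus the new one. The following is a minimal sketch of how that value could be assembled; `initial_cluster` is a hypothetical helper, not part of etcd, and the addresses are the ones used in this walkthrough. In practice the authoritative values are the ones printed by `etcdctl member add`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of etcd): join "name=peer-url" arguments
# into the comma-separated value used by ETCD_INITIAL_CLUSTER.
initial_cluster() {
  local IFS=,
  printf '%s\n' "$*"
}

# Members present after etcd-4 has joined, plus the new etcd-5.
ETCD_INITIAL_CLUSTER="$(initial_cluster \
  "etcd-1=https://10.4.7.121:2380" \
  "etcd-2=https://10.4.7.122:2380" \
  "etcd-3=https://10.4.7.123:2380" \
  "etcd-4=https://10.4.7.124:2380" \
  "etcd-5=https://10.4.7.125:2380")"

# Print the four settings that go into the new member's unit file.
echo "ETCD_NAME=\"etcd-5\""
echo "ETCD_INITIAL_CLUSTER=\"${ETCD_INITIAL_CLUSTER}\""
echo "ETCD_INITIAL_ADVERTISE_PEER_URLS=\"https://10.4.7.125:2380\""
echo "ETCD_INITIAL_CLUSTER_STATE=\"existing\""
```

The order of the entries does not matter, as the differing order in the `member add` output above suggests.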

<a name="o2nHI"></a>
## 1.2. Removing cluster nodes
When the cluster needs to be scaled down, existing members have to be removed. For example, to shrink from 5 nodes to 3:

1. Check the cluster status

```
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 member list --write-out table
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 4fe2b98ed7b794f7 | started | etcd-3 | https://10.4.7.123:2380 | https://10.4.7.123:2379 |      false |
| 79b3746506cf2fc1 | started | etcd-4 | https://10.4.7.124:2380 | https://10.4.7.124:2379 |      false |
| bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
| d0756e0778ff59b4 | started | etcd-5 | https://10.4.7.125:2380 | https://10.4.7.125:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+
```

2. Remove the members

```
# Remove etcd-5
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 member remove d0756e0778ff59b4
Member d0756e0778ff59b4 removed from cluster 23ce29301256c4ff

# Remove etcd-4
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 member remove 79b3746506cf2fc1
Member 79b3746506cf2fc1 removed from cluster 23ce29301256c4ff

[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 member list --write-out table
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 4fe2b98ed7b794f7 | started | etcd-3 | https://10.4.7.123:2380 | https://10.4.7.123:2379 |      false |
| bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+
```

3. Stop the etcd service on the removed nodes

```
[root@duduniao etcd]# ssh 10.4.7.124 "systemctl stop etcd && systemctl disable etcd"
[root@duduniao etcd]# ssh 10.4.7.125 "systemctl stop etcd && systemctl disable etcd"
```
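`member remove` takes a member ID rather than a name, so the ID has to be looked up first. The sketch below shows one way to do that from a script. `member_id_by_name` is a hypothetical helper, and it assumes the default comma-separated output of `etcdctl member list` (ID, status, name, peer URLs, client URLs, learner flag); the sample rows are re-typed from the table above in that layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `etcdctl member list` output (default
# comma-separated format, assumed here) on stdin and print the ID of
# the member whose name equals $1.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Sample rows in the assumed default format, values taken from the table above.
sample='4fe2b98ed7b794f7, started, etcd-3, https://10.4.7.123:2380, https://10.4.7.123:2379, false
d0756e0778ff59b4, started, etcd-5, https://10.4.7.125:2380, https://10.4.7.125:2379, false'

printf '%s\n' "$sample" | member_id_by_name etcd-5
```

Against a live cluster the same function would be fed by `etcdctl ... member list` without `--write-out table`, and its output passed to `member remove`.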

<a name="uJvlC"></a>
## 1.3. Node migration and maintenance
If the node only needs a configuration upgrade, a server reboot or similar work, simply stop the service, do the work and start it again; no special handling is required. For a failed data disk, or for moving a node's data to a new machine, there are two options:

- Remove the old member as described in 1.2, then add the new node as described in 1.1. If the new node keeps the old IP, no new certificates have to be issued.
- If the data set is large (more than 50 MB) and the old node's data is intact, migrate the member instead.

The following demonstrates the data-migration case by moving etcd-3 to 10.4.7.124:

```
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 4fe2b98ed7b794f7 | started | etcd-3 | https://10.4.7.123:2380 | https://10.4.7.123:2379 |      false |
| bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+
```

1. Simulate a failure of etcd-3

```
root@ubuntu-1804-123:~# systemctl stop etcd
```

2. Copy the etcd data from 10.4.7.123 to 10.4.7.124

```
[root@duduniao etcd]# ssh 10.4.7.123 "cd /data/etcd && tar -zcf etcd-3.tar.gz data"
[root@duduniao etcd]# scp 10.4.7.123:/data/etcd/etcd-3.tar.gz ./

[root@duduniao etcd]# scp etcd-3.tar.gz 10.4.7.124:/tmp/
[root@duduniao etcd]# ssh 10.4.7.124 "rm -fr /data/etcd/data ; mkdir -pv /data/etcd/ssl ; tar -xf /tmp/etcd-3.tar.gz -C /data/etcd && rm -f /tmp/etcd-3.tar.gz && ls -l /data/etcd"
```

3. Generate the etcd server and peer certificates for 10.4.7.124

```
# Follow the certificate steps in 1.1. The end result looks like this:
[root@duduniao etcd]# ssh 10.4.7.124 "ls -l /data/etcd/ssl"
total 20
-rw-r--r-- 1 root root 1387 Oct 19 15:31 ca.pem
-rw------- 1 root root  227 Oct 19 15:31 etcd-3-key.pem
-rw-r--r-- 1 root root 1147 Oct 19 15:31 etcd-3.pem
-rw------- 1 root root  227 Oct 19 15:37 server-key.pem
-rw-r--r-- 1 root root 1245 Oct 19 15:37 server.pem
```

4. Update the member information in the cluster

```
# NOTE: this session was recorded with port 2379 in --peer-urls. A peer URL should normally
# point at the peer port (2380, matching --listen-peer-urls in step 5). The PEER ADDRS column
# in the tables that follow shows the value as it was recorded.
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 member update 4fe2b98ed7b794f7 --peer-urls="https://10.4.7.124:2379"
Member 4fe2b98ed7b794f7 updated in cluster 23ce29301256c4ff

# The client URL is only updated once the etcd process has started on the new node
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 member list --write-out table
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 4fe2b98ed7b794f7 | started | etcd-3 | https://10.4.7.124:2379 | https://10.4.7.123:2379 |      false |
| bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+
```

5. Start the etcd service on the new node

```
# /lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
Documentation=https://github.com/coreos

[Service]
Type=notify
WorkingDirectory=/data/etcd
Environment=ETCD_NAME="etcd-3"

ExecStart=/usr/local/bin/etcd \
  --listen-peer-urls https://10.4.7.124:2380 \
  --listen-client-urls https://10.4.7.124:2379,https://127.0.0.1:2379 \
  --advertise-client-urls https://10.4.7.124:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --client-cert-auth \
  --cert-file ssl/server.pem \
  --key-file ssl/server-key.pem \
  --trusted-ca-file ssl/ca.pem \
  --peer-client-cert-auth \
  --peer-trusted-ca-file ssl/ca.pem \
  --peer-cert-file ssl/etcd-3.pem \
  --peer-key-file ssl/etcd-3-key.pem \
  --data-dir data \
  --snapshot-count 50000 \
  --auto-compaction-retention 1 \
  --auto-compaction-mode periodic \
  --max-request-bytes 10485760 \
  --quota-backend-bytes 8589934592
Restart=always
RestartSec=15
LimitNOFILE=65536
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target

[root@duduniao etcd]# ssh 10.4.7.124 "systemctl start etcd && systemctl enable etcd"

# The member's client URL has now changed
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 member list --write-out table
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 4fe2b98ed7b794f7 | started | etcd-3 | https://10.4.7.124:2379 | https://10.4.7.124:2379 |      false |
| bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+

[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.124:2379 endpoint status --write-out table
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.4.7.124:2379 | 4fe2b98ed7b794f7 |   3.5.1 |   20 kB |     false |      false |         2 |        135 |                135 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```

1.4. Cluster disaster recovery

etcd elects its leader by majority vote, so split-brain cannot occur. When a majority of members are unavailable, the cluster cannot serve requests, and the failed members should be brought back as quickly as possible. As soon as more than half of the members are running again, the cluster recovers automatically:

  1. When a minority of nodes is unavailable

```
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 member list --write-out table
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 4fe2b98ed7b794f7 | started | etcd-3 | https://10.4.7.124:2379 | https://10.4.7.124:2379 |      false |
| bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+

# Stop etcd-3 to simulate the failure of a minority of nodes
[root@duduniao etcd]# ssh 10.4.7.124 "systemctl stop etcd"

# Member status is normal
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 member list --write-out table
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 4fe2b98ed7b794f7 | started | etcd-3 | https://10.4.7.124:2379 | https://10.4.7.124:2379 |      false |
| bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 endpoint health --write-out table
+-------------------------+--------+----------+-------+
|        ENDPOINT         | HEALTH |   TOOK   | ERROR |
+-------------------------+--------+----------+-------+
| https://10.4.7.121:2379 |   true | 6.6047ms |       |
+-------------------------+--------+----------+-------+

# Cluster reads and writes work
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 put k1 v1
OK
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 get k1
k1
v1
```
  2. Simulate a majority of nodes going down

```
[root@duduniao etcd]# ssh 10.4.7.122 "systemctl stop etcd"

# Member status and cluster status are abnormal
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 endpoint health --write-out table
{"level":"warn","ts":1634910316.3861823,"logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004308c0/10.4.7.121:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
+-------------------------+--------+-----------+---------------------------+
|        ENDPOINT         | HEALTH |   TOOK    |           ERROR           |
+-------------------------+--------+-----------+---------------------------+
| https://10.4.7.121:2379 |  false | 5.001016s | context deadline exceeded |
+-------------------------+--------+-----------+---------------------------+
Error: unhealthy cluster

# Reads and writes fail
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 put k1 v1
{"level":"warn","ts":"2021-10-22T21:46:29.288+0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000622540/10.4.7.121:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 get k1
{"level":"warn","ts":"2021-10-22T21:46:13.352+0800","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004348c0/10.4.7.121:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
```

  3. Bring a majority of nodes back up

```
[root@duduniao etcd]# ssh 10.4.7.124 "systemctl start etcd"

# Cluster status and member status are back to normal
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 endpoint health --write-out table
+-------------------------+--------+----------+-------+
|        ENDPOINT         | HEALTH |   TOOK   | ERROR |
+-------------------------+--------+----------+-------+
| https://10.4.7.121:2379 |   true | 6.8057ms |       |
+-------------------------+--------+----------+-------+
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 member list --write-out table
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 4fe2b98ed7b794f7 | started | etcd-3 | https://10.4.7.124:2379 | https://10.4.7.124:2379 |      false |
| bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 put k1 v1
OK
```
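The three steps above follow directly from the majority rule: a cluster of N members needs floor(N/2) + 1 of them running to serve requests. A minimal sketch of that arithmetic, with `quorum` and `has_quorum` as hypothetical helper names:

```shell
#!/usr/bin/env bash
# quorum N: smallest number of members that forms a majority of N.
quorum() { echo $(( $1 / 2 + 1 )); }

# has_quorum TOTAL HEALTHY: succeeds when HEALTHY members are a majority of TOTAL.
has_quorum() { [ "$2" -ge "$(quorum "$1")" ]; }

# The 3-member cluster from the walkthrough with 3, 2 and 1 members running.
for healthy in 3 2 1; do
  if has_quorum 3 "$healthy"; then
    echo "3 members, ${healthy} healthy: quorum kept, reads and writes work"
  else
    echo "3 members, ${healthy} healthy: quorum lost, requests time out"
  fi
done
```

With one member stopped (step 1) two of three remain, which is still a majority; with two stopped (step 2) the single remaining member cannot form one, which matches the `context deadline exceeded` errors shown.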

Consider an extreme case: most nodes cannot start etcd at all. The data then has to be restored from a snapshot or an earlier backup, and a new cluster is formed from it:

1. Stop most of the nodes so that the cluster can no longer serve requests

```
[root@duduniao etcd]# ssh 10.4.7.122 "systemctl stop etcd"
[root@duduniao etcd]# ssh 10.4.7.124 "systemctl stop etcd"
```

2. Take a v3 snapshot from a node that is still available. If no node is available, recover from `member/snap/db` in the etcd data directory, or from an earlier backup

The v2 recovery procedure differs from v3; see the [official documentation](https://etcd.io/docs/v2.3/admin_guide#disaster-recovery).

```
[root@duduniao etcd]# ETCDCTL_API=3 etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 snapshot save snapshot.db
[root@duduniao etcd]# ll -h snapshot.db
-rw------- 1 root root 21K 2021-10-22 22:08:43 snapshot.d
```

3. Stop the etcd process on every node and move the old data directory aside

```
[root@duduniao etcd]# ssh 10.4.7.121 "systemctl stop etcd ; mv /data/etcd/data /data/etcd/data.20211022.bak"
[root@duduniao etcd]# ssh 10.4.7.122 "systemctl stop etcd ; mv /data/etcd/data /data/etcd/data.20211022.bak"
[root@duduniao etcd]# ssh 10.4.7.124 "systemctl stop etcd ; mv /data/etcd/data /data/etcd/data.20211022.bak"

[root@duduniao etcd]# scp snapshot.db 10.4.7.121:/tmp/
[root@duduniao etcd]# scp snapshot.db 10.4.7.122:/tmp/
[root@duduniao etcd]# scp snapshot.db 10.4.7.124:/tmp/
```

4. Rebuild the cluster

```
root@ubuntu-1804-121:~# ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db --data-dir /data/etcd/data --initial-advertise-peer-urls https://10.4.7.121:2380 --initial-cluster etcd-1=https://10.4.7.121:2380,etcd-2=https://10.4.7.122:2380,etcd-3=https://10.4.7.124:2380 --initial-cluster-token etcd-cluster-1 --name etcd-1
root@ubuntu-1804-122:~# ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db --data-dir /data/etcd/data --initial-advertise-peer-urls https://10.4.7.122:2380 --initial-cluster etcd-1=https://10.4.7.121:2380,etcd-2=https://10.4.7.122:2380,etcd-3=https://10.4.7.124:2380 --initial-cluster-token etcd-cluster-1 --name etcd-2
root@ubuntu-1804-124:~# ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db --data-dir /data/etcd/data --initial-advertise-peer-urls https://10.4.7.124:2380 --initial-cluster etcd-1=https://10.4.7.121:2380,etcd-2=https://10.4.7.122:2380,etcd-3=https://10.4.7.124:2380 --initial-cluster-token etcd-cluster-1 --name etcd-3

root@ubuntu-1804-121:~# systemctl start etcd
root@ubuntu-1804-122:~# systemctl start etcd
root@ubuntu-1804-124:~# systemctl start etcd

# Verify the cluster
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 member list --write-out table
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 8bb2a873a59fd89b | started | etcd-3 | https://10.4.7.124:2380 | https://10.4.7.124:2379 |      false |
| bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+
[root@duduniao etcd]# etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem --endpoints https://10.4.7.121:2379 get k1
k1
v1
```

1.5. etcd version upgrade

To be added once a new version is released and the upgrade has been carried out.

1.6. etcd certificate replacement

etcd certificate replacement falls into three types:

  1. Server certificate: since v3.2.0 the certificate is reloaded on every client connection, so replacing it is straightforward
  2. Peer certificate: since v3.2.0 the certificate is reloaded on every client connection, so replacing it is straightforward
  3. CA certificate: replacing the CA is more involved and requires a maintenance window. With proper preparation, the downtime can be kept under 1 minute

The first two cases are easy to handle: issue the new certificates and distribute them. On older etcd versions, restart the services one by one. The following walks through the third case:

  1. Generate the new certificates

Reference: 05-2-1-etcd入门 (etcd introduction)

  2. Distribute the certificates and restart etcd

```
[root@duduniao etcd]# scan_host.sh cmd -h 10.4.7.121 10.4.7.122 10.4.7.124 "cp -r /data/etcd/ssl /data/etcd/ssl-20211021.bak"
[root@duduniao ssl-new]# scan_host.sh cmd -h 10.4.7.121 10.4.7.122 10.4.7.124 "mkdir /data/etcd/ssl-new"
[root@duduniao ssl-new]# scp ca.pem server.pem server-key.pem etcd-1.pem etcd-1-key.pem 10.4.7.121:/data/etcd/ssl-new/
[root@duduniao ssl-new]# scp ca.pem server.pem server-key.pem etcd-2.pem etcd-2-key.pem 10.4.7.122:/data/etcd/ssl-new/
[root@duduniao ssl-new]# scp ca.pem server.pem server-key.pem etcd-3.pem etcd-3-key.pem 10.4.7.124:/data/etcd/ssl-new/

[root@duduniao etcd]# scan_host.sh cmd -h 10.4.7.121 10.4.7.122 10.4.7.124 "systemctl stop etcd"
[root@duduniao etcd]# scan_host.sh cmd -h 10.4.7.121 10.4.7.122 10.4.7.124 "rm -fr /data/etcd/ssl ; mv /data/etcd/ssl-new /data/etcd/ssl"
[root@duduniao etcd]# scan_host.sh cmd -h 10.4.7.121 10.4.7.122 10.4.7.124 "systemctl start etcd"

[root@duduniao etcd]# etcdctl --cacert ssl-new/ca.pem --cert ssl-new/client.pem --key ssl-new/client-key.pem --endpoints https://10.4.7.121:2379 member list --write-out table
+------------------+---------+--------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |  NAME  |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------+-------------------------+-------------------------+------------+
| 8bb2a873a59fd89b | started | etcd-3 | https://10.4.7.124:2380 | https://10.4.7.124:2379 |      false |
| bbd6739258f69625 | started | etcd-1 | https://10.4.7.121:2380 | https://10.4.7.121:2379 |      false |
| c5542f3740ec56cd | started | etcd-2 | https://10.4.7.122:2380 | https://10.4.7.122:2379 |      false |
+------------------+---------+--------+-------------------------+-------------------------+------------+
[root@duduniao etcd]# etcdctl --cacert ssl-new/ca.pem --cert ssl-new/client.pem --key ssl-new/client-key.pem --endpoints https://10.4.7.121:2379 endpoint status --write-out table
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.4.7.121:2379 | bbd6739258f69625 |   3.5.1 |   20 kB |     false |      false |         7 |         37 |                 37 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@duduniao etcd]# etcdctl --cacert ssl-new/ca.pem --cert ssl-new/client.pem --key ssl-new/client-key.pem --endpoints https://10.4.7.122:2379 endpoint status --write-out table
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.4.7.122:2379 | c5542f3740ec56cd |   3.5.1 |   20 kB |     false |      false |         7 |         37 |                 37 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@duduniao etcd]# etcdctl --cacert ssl-new/ca.pem --cert ssl-new/client.pem --key ssl-new/client-key.pem --endpoints https://10.4.7.124:2379 endpoint status --write-out table
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.4.7.124:2379 | 8bb2a873a59fd89b |   3.5.1 |   20 kB |      true |      false |         7 |         37 |                 37 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```
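Certificate replacement is easier to plan when upcoming expiry is detected ahead of time. The sketch below uses `openssl x509 -checkend` for that. `cert_expires_within` is a hypothetical helper, the 30-day window is an arbitrary assumption, and the paths follow the `/data/etcd/ssl` layout used in this walkthrough.

```shell
#!/usr/bin/env bash
# cert_expires_within PEM DAYS
#   returns 0 if the certificate expires (or has expired) within DAYS days,
#           also when openssl cannot parse the file,
#   returns 1 if it is still valid after that window,
#   returns 2 if the file cannot be read.
cert_expires_within() {
  local pem="$1" days="$2"
  [ -r "$pem" ] || return 2
  # -checkend exits 0 when the certificate is still valid after N seconds.
  if openssl x509 -in "$pem" -noout -checkend "$(( days * 86400 ))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Check the CA and server certificates on an etcd node; unreadable files are skipped.
for pem in /data/etcd/ssl/ca.pem /data/etcd/ssl/server.pem; do
  if cert_expires_within "$pem" 30; then
    echo "${pem}: expires within 30 days"
  fi
done
```

The peer certificates (for example `etcd-3.pem`) could be added to the loop in the same way.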

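2. Backup and restore

Most etcd clusters hold very little data, so an hourly backup is a reasonable choice. Taking the backup from a single node is enough; there is no need to back up every node. Restoring means rebuilding the cluster from an existing snapshot. Both procedures are covered in 1.4. Cluster disaster recovery.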
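An hourly backup from one node could be scripted roughly as follows. This is a sketch under assumptions rather than a tested production script: the backup directory, the retention count, the script path in the cron comment and the function names are all hypothetical, while the `snapshot save` invocation mirrors the one used in 1.4.

```shell
#!/usr/bin/env bash
BACKUP_DIR="${BACKUP_DIR:-/data/etcd/backup}"   # hypothetical backup location
KEEP="${KEEP:-48}"                              # hypothetical retention: newest 48 snapshots

# Save one snapshot from a single member, named after the current hour.
take_snapshot() {
  mkdir -p "$BACKUP_DIR"
  ETCDCTL_API=3 etcdctl --cacert ssl/ca.pem --cert ssl/client.pem --key ssl/client-key.pem \
    --endpoints https://10.4.7.121:2379 \
    snapshot save "${BACKUP_DIR}/snapshot-$(date +%Y%m%d%H).db"
}

# prune_snapshots DIR KEEP: delete all but the KEEP newest snapshot-*.db files.
# File names embed the timestamp, so a reverse name sort is newest-first.
prune_snapshots() {
  local dir="$1" keep="$2"
  ls -1 "$dir"/snapshot-*.db 2>/dev/null | sort -r | tail -n +"$(( keep + 1 ))" |
    while IFS= read -r old; do rm -f -- "$old"; done
}

# Example cron entry (hypothetical script path):
#   0 * * * * /usr/local/bin/etcd-backup.sh run
if [ "${1:-}" = "run" ]; then
  take_snapshot && prune_snapshots "$BACKUP_DIR" "$KEEP"
fi
```

Pruning only runs after a successful snapshot, so a failing backup never removes the existing copies.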


3. 监控和告警

3.1. 监控配置

  1. 配置prometheus

    # Scrape job for etcd metrics; a client certificate is required
    - job_name: 'etcd'
      scrape_interval: 15s
      scheme: https
      tls_config:
        ca_file: /etc/prometheus/ssl/ca.pem
        cert_file: /etc/prometheus/ssl/client.pem
        key_file: /etc/prometheus/ssl/client-key.pem
      static_configs:
        - targets: ['10.4.7.121:2379','10.4.7.122:2379','10.4.7.124:2379']
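
Before relying on the scrape job, it can help to confirm that each target really serves `/metrics` with the same client certificate. The sketch below assumes `curl` is installed on the Prometheus host; the function name `check_etcd_metrics` and the `SSL_DIR` variable are hypothetical, and the default certificate directory is taken from the `tls_config` above.

```shell
# check_etcd_metrics HOST:PORT...: report whether each target serves /metrics over mutual TLS.
check_etcd_metrics() {
    local ssl_dir="${SSL_DIR:-/etc/prometheus/ssl}"   # same directory as tls_config above
    local target rc=0
    for target in "$@"; do
        if curl --silent --fail --max-time 5 --output /dev/null \
            --cacert "${ssl_dir}/ca.pem" \
            --cert "${ssl_dir}/client.pem" \
            --key "${ssl_dir}/client-key.pem" \
            "https://${target}/metrics"; then
            echo "${target} ok"
        else
            echo "${target} FAILED"
            rc=1
        fi
    done
    # Non-zero if any target failed
    return "${rc}"
}
```

For example: `check_etcd_metrics 10.4.7.121:2379 10.4.7.122:2379 10.4.7.124:2379`.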
    


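  2. Configure the Grafana dashboard template

The etcd project provides an official Grafana dashboard template for etcd monitoring, but the template has problems and displays nothing after import, so it needs adjusting. You can import the attachment below into Grafana instead.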
grafana.json

  3. Load-test the etcd cluster and watch the monitoring curves

    [root@duduniao ssl-new]# go get go.etcd.io/etcd/v3/tools/benchmark
    [root@duduniao ssl-new]# benchmark --cacert ca.pem --cert client.pem --key client-key.pem --clients 5 --conns 100 --endpoints $END_POINTS put --val-size 256 --total 100000
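
The command above uses `$END_POINTS` without defining it. A minimal sketch of one way to set it, assuming it should hold the comma-separated client URLs of the three current members (addresses taken from the Prometheus targets above); the helper name `join_endpoints` is hypothetical:

```shell
# join_endpoints HOST...: print comma-separated https client URLs on port 2379.
join_endpoints() {
    local out="" host
    for host in "$@"; do
        # Prepend a comma only when out is already non-empty
        out="${out:+${out},}https://${host}:2379"
    done
    printf '%s\n' "${out}"
}

# Assumed member list: etcd-1, etcd-2 and the member at 10.4.7.124
END_POINTS="$(join_endpoints 10.4.7.121 10.4.7.122 10.4.7.124)"
```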
    


3.2. Alerting configuration

  1. Configure Prometheus rules

    # prometheus.yml
    ......
    alerting:
      alertmanagers:
        - timeout: 10s
          static_configs:
            - targets: ["alertmanager:9093"]
    rule_files:
      - /etc/prometheus/rules.d/etcd_alerting.yaml
    

    The etcd project provides official alerting rules, which you can adjust to your needs:

    ```yaml
    # etcd_alerting.yaml
    groups:
      - name: etcd
        rules:
          - alert: etcdMembersDown
            annotations:
              description: 'etcd cluster "{{ $labels.job }}": members are down ({{ $value }}).'
              summary: etcd cluster members are down.
            expr: |
              max without (endpoint) (
                sum without (instance) (up{job=~".*etcd.*"} == bool 0)
              or
                count without (To) (
                  sum without (instance) (rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[120s])) > 0.01
                )
              )
              > 0
            for: 10m
            labels:
              severity: critical
          - alert: etcdInsufficientMembers
            annotations:
              description: 'etcd cluster "{{ $labels.job }}": insufficient members ({{ $value }}).'
              summary: etcd cluster has insufficient number of members.
            expr: |
              sum(up{job=~".*etcd.*"} == bool 1) without (instance) < ((count(up{job=~".*etcd.*"}) without (instance) + 1) / 2)
            for: 3m
            labels:
              severity: critical
          - alert: etcdNoLeader
            annotations:
              description: 'etcd cluster "{{ $labels.job }}": member {{ $labels.instance }} has no leader.'
              summary: etcd cluster has no leader.
            expr: |
              etcd_server_has_leader{job=~".*etcd.*"} == 0
            for: 1m
            labels:
              severity: critical
          - alert: etcdHighNumberOfLeaderChanges
            annotations:
              description: 'etcd cluster "{{ $labels.job }}": {{ $value }} leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.'
              summary: etcd cluster has high number of leader changes.
            expr: |
              increase((max without (instance) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m]) >= 4
            for: 5m
            labels:
              severity: warning
          - alert: etcdHighNumberOfFailedGRPCRequests
            annotations:
              description: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
              summary: etcd cluster has high number of failed grpc requests.
            expr: |
              100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) without (grpc_type, grpc_code)
                /
              sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
                > 1
            for: 10m
            labels:
              severity: warning
          - alert: etcdHighNumberOfFailedGRPCRequests
            annotations:
              description: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
              summary: etcd cluster has high number of failed grpc requests.
            expr: |
              100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) without (grpc_type, grpc_code)
                /
              sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
                > 5
            for: 5m
            labels:
              severity: critical
          - alert: etcdGRPCRequestsSlow
            annotations:
              description: 'etcd cluster "{{ $labels.job }}": gRPC requests to {{ $labels.grpc_method }} are taking {{ $value }}s on etcd instance {{ $labels.instance }}.'
              summary: etcd grpc requests are slow
            expr: |
              histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_type="unary"}[5m])) without(grpc_type))
              > 0.15
            for: 10m
            labels:
              severity: critical
          - alert: etcdMemberCommunicationSlow
            annotations:
              description: 'etcd cluster "{{ $labels.job }}": member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}.'
              summary: etcd cluster member communication is slow.
            expr: |
              histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m]))
              > 0.15
            for: 10m
            labels:
              severity: warning
          - alert: etcdHighNumberOfFailedProposals
            annotations:
              description: 'etcd cluster "{{ $labels.job }}": {{ $value }} proposal failures within the last 30 minutes on etcd instance {{ $labels.instance }}.'
              summary: etcd cluster has high number of proposal failures.
            expr: |
              rate(etcd_server_proposals_failed_total{job=~".*etcd.*"}[15m]) > 5
            for: 15m
            labels:
              severity: warning
          - alert: etcdHighFsyncDurations
            annotations:
              description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
              summary: etcd cluster 99th percentile fsync durations are too high.
            expr: |
              histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
              > 0.5
            for: 10m
            labels:
              severity: warning
          - alert: etcdHighFsyncDurations
            annotations:
              message: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
            expr: |
              histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
              > 1
            for: 10m
            labels:
              severity: critical
          - alert: etcdHighCommitDurations
            annotations:
              description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.'
              summary: etcd cluster 99th percentile commit durations are too high.
            expr: |
              histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
              > 0.25
            for: 10m
            labels:
              severity: warning
          - alert: etcdBackendQuotaLowSpace
            annotations:
              message: 'etcd cluster "{{ $labels.job }}": database size exceeds the defined quota on etcd instance {{ $labels.instance }}, please defrag or increase the quota as the writes to etcd will be disabled when it is full.'
            expr: |
              (etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100 > 95
            for: 10m
            labels:
              severity: critical
          - alert: etcdExcessiveDatabaseGrowth
            annotations:
              message: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes leading to 50% increase in database size over the past four hours on etcd instance {{ $labels.instance }}, please check as it might be disruptive.'
            expr: |
              increase(((etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100)[240m:1m]) > 50
            for: 10m
            labels:
              severity: warning
    ```
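
Because indentation mistakes in rule files are easy to make, it may be worth validating both files before reloading Prometheus. The sketch below assumes `promtool` (shipped with Prometheus) is on the PATH; the function name `validate_prometheus` is hypothetical and the `/etc/prometheus/prometheus.yml` default is an assumed location, while the rules path comes from `rule_files` above.

```shell
# validate_prometheus [CONFIG] [RULES]: check the rule file first, then the main config.
validate_prometheus() {
    local config="${1:-/etc/prometheus/prometheus.yml}"              # assumed config path
    local rules="${2:-/etc/prometheus/rules.d/etcd_alerting.yaml}"   # path from rule_files
    promtool check rules "${rules}" && promtool check config "${config}"
}
```

Running `validate_prometheus` with no arguments checks the default paths and returns non-zero if either check fails.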
  2. Configure Alertmanager, using the configuration in the following document as a reference

02-4-告警管理 (Alert Management)