Ceph - 13. Ceph High Availability and Performance Testing - 《运维机器人》

1. High availability verification
2. Performance testing

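1. High availability verification

Every node role in a Ceph cluster can be scaled out horizontally, so none of them has to be a single point of failure; it is therefore recommended to deploy every role redundantly. The environment below is used to verify the cluster's high availability.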

Role	Hostname	Cluster Network	Public Network	OSD
admin, client	ceph-admin.yull.cc	10.37.129.14	10.10.5.25
mon, osd, mgr, mds	ceph-mon1.yull.cc	10.37.129.12	10.10.5.27	sdb, sdc
mon, osd, mgr, rgw	ceph-mon2.yull.cc	10.37.129.11	10.10.5.28	sdb, sdc
mon, osd, rgw	ceph-mon3.yull.cc	10.37.129.13	10.10.5.29	sdb, sdc
new node: mon, osd	ceph-stor3.yull.cc	10.37.129.14	10.10.5.177	sdb, sdc

Current cluster status:

]$ ceph -s
  cluster:
    id:     817bb5ff-1b92-4734-baa8-f21ede4cb9c2   
    health: HEALTH_OK   
  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 46m)
    mgr: ceph-mon1(active, since 36m), standbys: ceph-mon2
    osd: 6 osds: 6 up (since 16m), 6 in (since 16m)
  data:
    pools:   1 pools, 16 pgs
    objects: 1 objects, 23 B
    usage:   6.0 GiB used, 378 GiB / 384 GiB avail
    pgs:     16 active+clean

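Test items:
- Add the storage node 10.10.5.177 and observe the backfill state of the cluster.
- Simulate a single-OSD failure on the storage node 10.10.5.29 and observe the cluster state.
- Remove the storage node 10.10.5.29 and observe the cluster recovery state.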

1.1 Adding a storage node

Initialize the environment on the new node

]$ crontab -l    # time synchronization job (added with `crontab -e`)
*/1 * * * * ntpdate cn.pool.ntp.org
]$ cat /etc/hosts    # hosts entry for the new node
10.10.5.177     ceph-mon4.stbchina.cn ceph-mon4 ceph-stor4.stbchina.cn ceph-stor4
]$ systemctl stop firewalld.service  # stop the firewall
]$ systemctl disable firewalld.service   # do not start it at boot
]$ sed -i 's@^\(SELINUX=\).*@\1disabled@' /etc/selinux/config  # disable SELinux permanently (edit the real file, not the /etc/sysconfig/selinux symlink)
]$ setenforce 0   # switch SELinux to permissive mode until the next reboot
]$ useradd cephadm && echo "1234qwer" | passwd --stdin cephadm   # create the cephadm user
]$ echo "cephadm ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/cephadm   # grant passwordless sudo
]$ chmod 0440 /etc/sudoers.d/cephadm

Set up passwordless SSH from the ceph-deploy host to the new node; the steps are not repeated here.

Install the Ceph base packages

]$ rpm -ivh https://mirrors.aliyun.com/ceph/rpm-nautilus/el7/noarch/ceph-release-1-1.el7.noarch.rpm
]$ yum install ceph

Copy the Ceph configuration file to the new node
```
]$ ceph-deploy admin ceph-stor4
```

Add the new node's disks as OSDs

]$ ceph-deploy disk list ceph-stor4
[ceph-stor1][INFO  ] Disk /dev/sdb: 68.7 GB, 68719476736 bytes, 134217728 sectors
[ceph-stor1][INFO  ] Disk /dev/sdc: 68.7 GB, 68719476736 bytes, 134217728 sectors
]$ ceph-deploy disk zap ceph-stor4 /dev/sdb
]$ ceph-deploy disk zap ceph-stor4 /dev/sdc
]$ ceph-deploy osd create ceph-stor4 --data /dev/sdb
]$ ceph-deploy osd create ceph-stor4 --data /dev/sdc

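Check the cluster status

When a new node joins the cluster, Ceph starts rebalancing part of the existing data onto the newly added OSDs. The process can be observed with the commands below.

After the new OSDs join, CRUSH assigns PGs to them: the OSDs have to accept the reassigned PGs, and a share of the load moves onto them.
The affected PGs enter the active+remapped+backfilling state.
While backfill is in progress, several states may appear:
- backfill_wait: a backfill is pending but has not started yet (the PG is waiting for its turn to backfill).
- backfilling: a backfill is in progress.
- backfill_toofull: a backfill was requested but cannot complete because the target OSD does not have enough free space.

Backfill starts as soon as the new node joins, which affects cluster performance. It can be tuned with the following options:
- osd_max_backfills: the maximum number of concurrent backfills to or from a single OSD; the default is 1 in Nautilus (older releases defaulted to 10).
- osd_backfill_full_ratio: lets an OSD refuse backfill requests once it is this full, default 85%. On Luminous and later this is the cluster-wide backfillfull ratio, set with ceph osd set-backfillfull-ratio, default 90%.
- osd_backfill_retry_interval: if an OSD refuses a backfill request, the request is retried after this interval, 10 seconds by default.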
- osd_backfill_scan_min and osd_backfill_scan_max: the minimum and maximum number of objects per backfill scan (defaults 64 and 512).

```bash
]$ ceph -s
  cluster:
    id:     597abcf5-e6ce-4c68-b47b-ec4e33d66694
    health: HEALTH_OK

  services:
    mon: 4 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3,ceph-stor4 (age 12h)
    mgr: ceph-mon1(active, since 4d), standbys: ceph-mon2
    osd: 12 osds: 12 up (since 9m), 12 in (since 11h); 28 remapped pgs
    rgw: 3 daemons active (ceph-mon1, ceph-mon2, ceph-mon3)

  task status:

  data:
    pools:   7 pools, 256 pgs
    objects: 2.15M objects, 794 GiB
    usage:   1.7 TiB used, 8.8 TiB / 10 TiB avail
    pgs:     2141189/12872880 objects misplaced (16.633%)
             228 active+clean
             27  active+remapped+backfill_wait
             1   active+remapped+backfilling

  io:
    recovery: 2.2 MiB/s, 5 objects/s

]$ ceph osd tree
ID CLASS WEIGHT  TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       9.46489 root default
-3       3.00000     host ceph-mon1
 0   hdd 1.00000         osd.0           up  1.00000 1.00000
 1   hdd 1.00000         osd.1           up  1.00000 1.00000
 2   hdd 1.00000         osd.2           up  1.00000 1.00000
-5       3.00000     host ceph-mon2
 3   hdd 1.00000         osd.3           up  1.00000 1.00000
 4   hdd 1.00000         osd.4           up  1.00000 1.00000
 5   hdd 1.00000         osd.5           up  1.00000 1.00000
-7       2.00000     host ceph-mon3
 6   hdd 1.00000         osd.6           up  1.00000 1.00000
 7   hdd 1.00000         osd.7           up  1.00000 1.00000
 8   hdd 1.00000         osd.8           up  1.00000 1.00000
-9       1.46489     host ceph-stor4
 9   hdd 0.48830         osd.9           up  1.00000 1.00000
10   hdd 0.48830         osd.10          up  1.00000 1.00000
11   hdd 0.48830         osd.11          up  1.00000 1.00000
```
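
The backfill throttles described above can be changed while the cluster is running. The following is a minimal sketch, assuming a Nautilus-era cluster where `ceph config set` and the `nobackfill` OSD flag are available. The function names and the `CEPH` override variable are illustrative and not part of Ceph.

```shell
# Hypothetical helpers for throttling backfill while a new node fills up.
# CEPH may be overridden (for example CEPH="echo ceph") to print the
# commands as a dry run instead of executing them.
CEPH="${CEPH:-ceph}"

# Store a backfill concurrency limit for all OSDs in the monitor config database.
set_max_backfills() {
    local n="${1:-1}"
    $CEPH config set osd osd_max_backfills "$n"
}

# Pause or resume backfill cluster-wide with the nobackfill flag.
pause_backfill()  { $CEPH osd set nobackfill; }
resume_backfill() { $CEPH osd unset nobackfill; }
```

For example, `set_max_backfills 1` keeps client I/O responsive during business hours, and a higher value can be set off-peak to finish the rebalance sooner.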

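If a PG stays stuck in the active+remapped+backfill_wait state:
- Run ceph health detail (or ceph pg dump_stuck) to find the ID of the stuck PG.
- Run sudo ceph pg PGID query to inspect that PG.
- Look at the backfill_targets field in the output; it lists the OSDs the PG is waiting to backfill to.
- Restart that OSD.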
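
The lookup of backfill_targets can be scripted. The sketch below uses only standard text tools and assumes the field is a JSON array of plain quoted strings; the function name is hypothetical, and a real JSON parser such as jq would be more robust where it is installed.

```shell
# Hypothetical helper: read the JSON printed by `ceph pg <pgid> query` on
# stdin and print the entries of the first backfill_targets array, one per line.
backfill_targets() {
    tr -d ' \t\n' \
        | grep -o '"backfill_targets":\[[^]]*]' \
        | head -n 1 \
        | sed -e 's/.*\[//' -e 's/]//' -e 's/"//g' \
        | tr ',' '\n' \
        | sed '/^$/d'
}
```

Usage: `sudo ceph pg PGID query | backfill_targets`, then restart the listed OSD on its host with `systemctl restart ceph-osd@ID`.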

1.1.1 Simulating a single-OSD failure

Mark the OSD out of the cluster; osd.6 is used here.

]$ ceph osd out osd.6
marked out osd.6.

Stop the OSD process
```
]$ systemctl stop ceph-osd@6
```

Check the OSD status

]$ ceph osd tree
......   
6   hdd 1.00000         osd.6         down  0 0  
......

Remove the OSD from the CRUSH map, delete its auth key, and remove it from the OSD map

]$ ceph osd crush rm osd.6   # remove from the CRUSH map
removed item id 6 name 'osd.6' from crush map
]$ ceph auth del osd.6   # delete the cephx key
updated
]$ ceph osd rm osd.6   # remove from the OSD map
removed osd.6

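At this point some PGs enter the degraded state and the cluster migrates the data that lived on the removed OSD. Once the migration finishes, all PGs return to active+clean.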
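
The removal steps in this section can be bundled into one function. This is a sketch under stated assumptions rather than a definitive procedure: the function name and the `CEPH` and `SYSTEMCTL` override variables are hypothetical, and the `systemctl` step must run on the host that owns the OSD.

```shell
# Hypothetical wrapper for retiring one OSD. CEPH and SYSTEMCTL can be
# overridden (e.g. CEPH="echo ceph") to print the commands as a dry run.
CEPH="${CEPH:-ceph}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

remove_osd() {
    local id="$1"
    [ -n "$id" ] || { echo "usage: remove_osd <osd-id>" >&2; return 1; }
    $CEPH osd out "osd.$id" &&            # stop mapping data to the OSD
    $SYSTEMCTL stop "ceph-osd@$id" &&     # stop the daemon on its host
    $CEPH osd crush remove "osd.$id" &&   # drop it from the CRUSH map
    $CEPH auth del "osd.$id" &&           # delete its cephx key
    $CEPH osd rm "osd.$id"                # delete it from the OSD map
}
```

On Luminous and later, `ceph osd purge <id> --yes-i-really-mean-it` combines the last three steps into a single command.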

1.1.2 Single storage node down during continuous writes

Power off one storage node directly and observe the cluster status

]$ ceph -s
cluster:
  id:     597abcf5-e6ce-4c68-b47b-ec4e33d66694
  health: HEALTH_WARN
          Reduced data availability: 1 pg inactive, 1 pg incomplete
          Degraded data redundancy: 3866506/12872892 objects degraded (30.036%), 96 pgs degraded, 96 pgs undersized
          1/4 mons down, quorum ceph-mon1,ceph-mon2,ceph-stor4

services:
  mon: 4 daemons, quorum ceph-mon1,ceph-mon2,ceph-stor4 (age 7m), out of quorum: ceph-mon3
  mgr: ceph-mon1(active, since 7m), standbys: ceph-mon2
  osd: 9 osds: 9 up (since 6m), 9 in (since 59m); 98 remapped pgs
  rgw: 2 daemons active (ceph-mon1, ceph-mon2)

task status:

data:
  pools:   7 pools, 256 pgs
  objects: 2.15M objects, 794 GiB
  usage:   1.1 TiB used, 6.3 TiB / 7.5 TiB avail
  pgs:     0.391% pgs not active
           3866506/12872892 objects degraded (30.036%)
           2287935/12872892 objects misplaced (17.773%)
           157 active+clean
           94  active+undersized+degraded+remapped+backfill_wait
           2   active+undersized+degraded+remapped+backfilling
           2   active+remapped+backfill_wait
           1   incomplete

io:
  recovery: 4.2 MiB/s, 13 objects/s
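
To read the status output above, note that each percentage is the ratio of the two counts printed before it, and the PG state lines can be summed to see how many PGs are still not active+clean. The helpers below are a sketch with hypothetical names; they assume the plain-text layout shown here, and JSON output is the safer choice for real automation.

```shell
# Percentage with three decimals, in the same form `ceph -s` prints it.
pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.3f\n", n * 100 / d }'
}

# Sum the PGs that are not active+clean from `ceph -s` text on stdin.
count_unclean_pgs() {
    awk '
        { sub(/^[ \t]*pgs:[ \t]*/, "") }   # drop a leading "pgs:" label
        NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[a-z_+]+$/ && $2 != "active+clean" { total += $1 }
        END { print total + 0 }
    '
}
```

With the numbers above, `pct 3866506 12872892` prints 30.036, matching the degraded ratio, and `ceph -s | count_unclean_pgs` would report 99 (256 PGs minus 157 active+clean).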

1.1.3 Two storage nodes down

1.2 Simulating a MON failure

1.3 Simulating a host outage

1.4 Simulating an RGW node failure

1.5 Simulating PG state faults

2. Performance testing

2.1 Physical machine

Block size	I/O pattern	Data size	Threads	IOPS	Bandwidth	Object I/O
4K	random write	4G	64	1312	5252KiB/s	1.68k op/s wr
4K	random read	4G	64	54.6k	213MiB/s	7.46k op/s rd
512K	sequential write	4G	64	16	8704KiB/s	27 op/s wr
512K	sequential read	4G	64	20.5k	10.0GiB/s	327 op/s rd
1M	sequential read	4G	16	10.8k	10.6GiB/s	180 op/s rd
4M	sequential write	4G	16	3158	12.3GiB/s	57 op/s rd
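
As a sanity check when reading these tables, bandwidth should be roughly IOPS multiplied by block size. The helper below is a small sketch of that arithmetic; the function name is hypothetical, and the block size is given in KiB.

```shell
# Expected bandwidth in KiB/s for a given IOPS figure and block size in KiB.
bw_kib() {
    awk -v iops="$1" -v bs="$2" 'BEGIN { printf "%d\n", iops * bs }'
}
```

For the 4K random write row, `bw_kib 1312 4` prints 5248, close to the reported 5252KiB/s. Small differences are expected because the IOPS column is rounded.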

2.2 Virtual machine

Block size	I/O pattern	Data size	Threads	IOPS	Bandwidth	Object I/O
4K	random write	4G	64	1427	5709KiB/s	2.00k op/s wr
4K	random read	4G	64	35.4k	138MiB/s	4.74k op/s rd
4K	sequential read	4G	64	115k	448MiB/s	11.3k op/s rd
512K	sequential write	4G	64	199	99.8MiB/s	50 op/s wr
512K	sequential read	4G	64	2344	1172MiB/s	60 op/s rd
2M	sequential read	4G	16	55	111MiB/s	58 op/s wr
4M	sequential write	4G	16	29	109MiB/s	52 op/s wr

2.3 SSD

Block size	I/O pattern	Data size	Threads	IOPS	Bandwidth
64K	sequential write	2G	32	1353	84.6MiB/s
64K	sequential read	2G	32	3930	246MiB/s
1M	sequential write	2G	32	142	143MiB/s
1M	sequential read	2G	32	258	258MiB/s
2M	sequential read	4G	16	129	260MiB/s
