1、Background

The Ceph cluster originally had 9 OSDs and had recently started running out of space, so 5 more OSDs were added. Once they joined, the cluster entered recovery and backfill, which takes a long time to complete: `ceph -s` showed a recovery rate of only 18 objects/s. The recovery speed therefore needed to be tuned.

```bash
]$ ceph -s
  cluster:
    id:     5adf323c-bef2-42b4-8eff-7a164be1c7fa
    health: HEALTH_WARN
            Degraded data redundancy: 2246888/205599948 objects degraded (1.093%), 10 pgs degraded, 10 pgs undersized

  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 11m)
    mgr: ceph-mon1(active, since 6d), standbys: ceph-mon2
    mds: cephfs:1 {0=ceph-mon2=up:active} 1 up:standby
    osd: 14 osds: 14 up (since 33h), 14 in (since 41h); 114 remapped pgs
    rgw: 3 daemons active (ceph-mon1, ceph-mon2, ceph-mon3)

  task status:

  data:
    pools:   8 pools, 352 pgs
    objects: 34.27M objects, 2.9 TiB
    usage:   5.4 TiB used, 7.3 TiB / 13 TiB avail
    pgs:     2246888/205599948 objects degraded (1.093%)
             77284550/205599948 objects misplaced (37.590%)
             238 active+clean
             96  active+remapped+backfill_wait
             10  active+undersized+degraded+remapped+backfilling
             8   active+remapped+backfilling

  io:
    client:   1.1 MiB/s rd, 14 op/s rd, 0 op/s wr
    recovery: 1.3 MiB/s, 18 objects/s

  progress:
    Rebalancing after osd.13 marked in
      [=============================.]
    Rebalancing after osd.10 marked in
      [=======================.......]
    Rebalancing after osd.11 marked in
      [===========================...]
    Rebalancing after osd.12 marked in
      [===========================...]
    Rebalancing after osd.9 marked in
      [==================............]
```
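For reference, the recovery rate can be watched continuously while tuning. A minimal sketch using plain `watch` and `grep` over the human-readable output (the exact line layout may differ slightly between releases):

```bash
# Refresh the client/recovery throughput lines every 10 seconds
watch -n 10 "ceph -s | grep -E 'client|recovery'"
```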

2、Viewing and tuning the parameters

2.1、Viewing the recovery and backfill related parameters

```bash
]$ ceph daemon osd.0 config show |grep -E 'backfill|recovery'
    "bluefs_replay_recovery": "false",
    "bluefs_replay_recovery_disable_compact": "false",
    "mon_osd_backfillfull_ratio": "0.900000",
    "osd_allow_recovery_below_min_size": "true",
    "osd_async_recovery_min_cost": "100",
    "osd_backfill_retry_interval": "30.000000",
    "osd_backfill_scan_max": "512",
    "osd_backfill_scan_min": "64",
    "osd_debug_pretend_recovery_active": "false",
    "osd_debug_reject_backfill_probability": "0.000000",
    "osd_debug_skip_full_check_in_backfill_reservation": "false",
    "osd_debug_skip_full_check_in_recovery": "false",
    "osd_force_recovery_pg_log_entries_factor": "1.300000",
    "osd_kill_backfill_at": "0",
    "osd_max_backfills": "1",
    "osd_min_recovery_priority": "0",
    "osd_recovery_cost": "20971520",
    "osd_recovery_delay_start": "0.000000",
    "osd_recovery_max_active": "3",
    "osd_recovery_max_chunk": "8388608",
    "osd_recovery_max_omap_entries_per_chunk": "8096",
    "osd_recovery_max_single_start": "1",
    "osd_recovery_op_priority": "3",
    "osd_recovery_op_warn_multiple": "16",
    "osd_recovery_priority": "5",
    "osd_recovery_retry_interval": "30.000000",
    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.100000",
    "osd_recovery_sleep_hybrid": "0.025000",
    "osd_recovery_sleep_ssd": "0.100000",
    "osd_repair_during_recovery": "false",
    "osd_scrub_during_recovery": "false",

2.2、Parameter descriptions

  • osd_max_backfills
    • The maximum number of backfill operations allowed on a single OSD.
    • Default: 1
  • osd_recovery_max_active
    • The maximum number of active recovery requests per OSD.
    • Default: 3
  • osd_recovery_sleep_hdd
    • Introduced after Luminous (earlier releases only had osd_recovery_sleep); the sleep time inserted before each recovery or backfill operation on HDD-backed OSDs (see the rough calculation after this list).
    • Default: 0.1
  • osd_backfill_scan_min
    • The minimum number of objects scanned per backfill scan.
    • Default: 64
  • osd_backfill_scan_max
    • The maximum number of objects scanned per backfill scan.
    • Default: 512
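A back-of-the-envelope illustration of why the defaults produce such a low rate, assuming the sleep is applied before each recovery/backfill operation on an HDD-backed OSD and that osd_max_backfills caps concurrent backfill reservations per OSD:

```bash
# osd_recovery_sleep_hdd = 0.1 s  ->  roughly a 10 ops/s ceiling per OSD
awk 'BEGIN { print 1 / 0.1 }'
# osd_max_backfills = 1           ->  at most one backfill reservation per OSD,
# so only a handful of PGs backfill at the same time, which is consistent
# with the ~18 objects/s observed above
```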

2.3、Parameter tuning

Tuning depends on the specific cluster: test and validate to find the best values. Blindly changing these parameters can make performance worse rather than better, and can even impact normal client read/write requests. The values used here:

```bash
osd max backfills = 10        # 8 or 10; setting it too high will cause slow requests
osd recovery max active = 15
osd_recovery_sleep_hdd = 0
osd_recovery_sleep_ssd = 0
```
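Because aggressive values can starve client I/O, keep an eye out for slow requests while ramping these up; a minimal check:

```bash
# Any slow-ops / blocked-request warnings will show up in health detail
]$ ceph health detail | grep -iE 'slow|blocked'
```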
      

2.4、Commands to apply the changes

```bash
]$ ceph tell osd.\* injectargs '--osd_max_backfills=10'
]$ ceph tell osd.\* injectargs '--osd_recovery_max_active=15'
]$ ceph tell osd.\* injectargs '--osd_recovery_sleep_hdd=0'
]$ ceph tell osd.\* injectargs '--osd_recovery_sleep_ssd=0'
```
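Note that injectargs only changes the running daemons; the values revert when an OSD restarts. If the settings should persist, they can also be written to the monitors' central config database (Nautilus and later); a sketch:

```bash
]$ ceph config set osd osd_max_backfills 10
]$ ceph config set osd osd_recovery_max_active 15
]$ ceph config set osd osd_recovery_sleep_hdd 0
]$ ceph config set osd osd_recovery_sleep_ssd 0
```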
      

3、Parameters and recovery speed after tuning

```bash
]$ ceph daemon osd.0 config show |grep -E 'backfill|recovery'
    "bluefs_replay_recovery": "false",
    "bluefs_replay_recovery_disable_compact": "false",
    "mon_osd_backfillfull_ratio": "0.900000",
    "osd_allow_recovery_below_min_size": "true",
    "osd_async_recovery_min_cost": "100",
    "osd_backfill_retry_interval": "30.000000",
    "osd_backfill_scan_max": "512",
    "osd_backfill_scan_min": "64",
    "osd_debug_pretend_recovery_active": "false",
    "osd_debug_reject_backfill_probability": "0.000000",
    "osd_debug_skip_full_check_in_backfill_reservation": "false",
    "osd_debug_skip_full_check_in_recovery": "false",
    "osd_force_recovery_pg_log_entries_factor": "1.300000",
    "osd_kill_backfill_at": "0",
    "osd_max_backfills": "10",
    "osd_min_recovery_priority": "0",
    "osd_recovery_cost": "20971520",
    "osd_recovery_delay_start": "0.000000",
    "osd_recovery_max_active": "15",
    "osd_recovery_max_chunk": "8388608",
    "osd_recovery_max_omap_entries_per_chunk": "8096",
    "osd_recovery_max_single_start": "1",
    "osd_recovery_op_priority": "3",
    "osd_recovery_op_warn_multiple": "16",
    "osd_recovery_priority": "5",
    "osd_recovery_retry_interval": "30.000000",
    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.000000",
    "osd_recovery_sleep_hybrid": "0.025000",
    "osd_recovery_sleep_ssd": "0.000000",
    "osd_repair_during_recovery": "false",
    "osd_scrub_during_recovery": "false",
```

```bash
]$ ceph -s
  cluster:
    id:     5adf323c-bef2-42b4-8eff-7a164be1c7fa
    health: HEALTH_WARN
            Degraded data redundancy: 1862088/205599948 objects degraded (0.906%), 10 pgs degraded, 10 pgs undersized

  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 51m)
    mgr: ceph-mon1(active, since 6d), standbys: ceph-mon2
    mds: cephfs:1 {0=ceph-mon2=up:active} 1 up:standby
    osd: 14 osds: 14 up (since 34h), 14 in (since 42h); 114 remapped pgs
    rgw: 3 daemons active (ceph-mon1, ceph-mon2, ceph-mon3)

  task status:

  data:
    pools:   8 pools, 352 pgs
    objects: 34.27M objects, 2.9 TiB
    usage:   5.4 TiB used, 7.3 TiB / 13 TiB avail
    pgs:     1862088/205599948 objects degraded (0.906%)
             75829821/205599948 objects misplaced (36.882%)
             238 active+clean
             96  active+remapped+backfill_wait
             10  active+undersized+degraded+remapped+backfilling
             8   active+remapped+backfilling

  io:
    client:   435 KiB/s rd, 11 op/s rd, 0 op/s wr
    recovery: 41 MiB/s, 460 objects/s

  progress:
    Rebalancing after osd.13 marked in
      [=============================.]
    Rebalancing after osd.10 marked in
      [=======================.......]
    Rebalancing after osd.11 marked in
      [===========================...]
    Rebalancing after osd.12 marked in
      [===========================...]
    Rebalancing after osd.9 marked in
      [==================............]
```

As the output shows, the recovery rate has reached 460 objects/s.
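Once backfill completes and the cluster returns to HEALTH_OK, it is worth rolling the values back so that client I/O is not penalized during the next recovery event; a sketch that restores the defaults shown in section 2.1:

```bash
]$ ceph tell osd.\* injectargs '--osd_max_backfills=1'
]$ ceph tell osd.\* injectargs '--osd_recovery_max_active=3'
]$ ceph tell osd.\* injectargs '--osd_recovery_sleep_hdd=0.1'
]$ ceph tell osd.\* injectargs '--osd_recovery_sleep_ssd=0.1'
```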