When a disk goes bad, or an osd itself needs to be upgraded or modified individually, the osd has to be rebuilt. Rebuilding an osd triggers pg migration and therefore data movement across the cluster; when the amount of data is large, this movement should be avoided as much as possible.
There are plenty of articles online about removing and rebuilding osds, including some high-quality posts by well-known people, for example 删除osd的正确方式 (the right way to remove an osd) by the blogger Xu, and Ceph: properly remove an OSD by Ceph expert Sébastien Han.
After reading quite a few of these high-quality articles, I decided to actually verify the methods on a cluster I have at hand.
The most conventional approach
Our 1.6PB cluster once had a disk failure that forced several osds to be rebuilt. The cluster held close to two billion images at the time and the production workload kept writing non-stop, so the pressure was considerable.
Mark the osd out
Run ceph osd out x.
Marking the osd out of the cluster triggers the first data migration. It tells the cluster that this osd is abnormal and should no longer receive writes, and its pgs are migrated out and redistributed onto other osds.
Wait for the data migration to finish... it took three hours.
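For reference, this step might look like the following sketch (osd.12 is a hypothetical id standing in for the failed osd):
sudo ceph osd out 12        # stop new writes to this osd and start draining its pgs
sudo ceph -s                # re-run until all pgs report active+clean again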
Stop the osd process
The osd process has to be stopped before the following steps can proceed.
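On the osd's host, stopping the daemon goes through systemd (again assuming osd.12):
sudo systemctl stop ceph-osd@12.service
sudo ceph osd tree | grep 'osd.12'    # the osd should now show as down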
Remove the osd from the crushmap
Run ceph osd crush remove osd.x to remove the osd. This not only changes the weight of the host the osd sits on; more importantly, it changes the number of osds, and the osd count is a major input when CRUSH computes data placement, so this triggers the second data migration.
Wait for the data migration to finish... another three hours.
Delete the osd from the cluster
At this point the osd is still registered in the cluster and has not been cleaned up completely. Run ceph auth del osd.x to delete the osd's authentication key, then run ceph osd rm x to delete the osd from the cluster.
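Put together, the crushmap removal and the final cleanup above might look like this sketch (osd.12 again being hypothetical):
sudo ceph osd crush remove osd.12   # drop it from the crushmap, which triggers the second migration
# wait for the rebalance to finish, then remove what is left of its registration
sudo ceph auth del osd.12           # delete its cephx key
sudo ceph osd rm 12                 # delete it from the osdmap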
Recreate the osd
Once the osd is recreated, the weight of its host changes once more, and the change in osd count again changes the base of the CRUSH calculation, triggering the third data migration.
Wait for the data migration to finish... three more hours.
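How the osd gets recreated depends on the deployment tooling; with ceph-volume and bluestore a minimal sketch could be the single command below (the device name is hypothetical), after which the third migration begins:
sudo ceph-volume lvm create --data /dev/sdx   # prepares, activates and starts a brand-new osd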
The steps above are the osd removal procedure recommended by the official Ceph documentation. Note that it involves three rounds of data migration, which is exhausting. In practice, quite a few operators delete the osd right after marking it out, without waiting for the rebalance to finish, which saves one round of migration. We have done that ourselves when the cluster held little data, but never on the production cluster, so I am not sure what the consequences would be -.-
The approach recommended by the experts
The posts by Xu and Han mentioned above both suggest first lowering the osd's reweight to 0 so that its data is migrated off, and only then deleting the osd. Let's walk through this approach live.
First reweight the osd to 0. As soon as the command is issued, the osd is automatically marked out of the cluster:
[tanweijie@ceph-2-52 ~]$ sudo ceph osd reweight 0 0
[tanweijie@ceph-2-52 ~]$ sudo ceph -s
cluster:
id: 56c04287-4aed-435d-a1a4-d30392ff15ee
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
1096488/37818166 objects misplaced (2.899%)
services:
mon: 3 daemons, quorum ceph-2-52,ceph-2-53,ceph-2-54
mgr: ceph-2-55(active), standbys: ceph-2-54
osd: 48 osds: 48 up, 47 in; 475 remapped pgs
flags noscrub,nodeep-scrub
rgw: 4 daemons active
data:
pools: 7 pools, 8864 pgs
objects: 18.91M objects, 4.32TiB
usage: 9.81TiB used, 178TiB / 188TiB avail
pgs: 1096488/37818166 objects misplaced (2.899%)
8389 active+clean
468 active+remapped+backfill_wait
7 active+remapped+backfilling
io:
recovery: 15.2MiB/s, 62objects/s
At this point ceph osd df shows that the osd's weight and capacity figures have been cleared; it is just waiting for its pgs to be migrated off:
0 hdd 5.29999 0 0B 0B 0B 0 0 475
1 hdd 5.29999 0.81580 5.30TiB 291GiB 5.02TiB 5.35 1.03 477
2 hdd 5.29999 0.86830 5.30TiB 290GiB 5.02TiB 5.34 1.02 478
3 hdd 5.29999 0.83260 5.30TiB 291GiB 5.02TiB 5.36 1.03 477
......
After the rebalance finishes, the cluster returns to normal:
[tanweijie@ceph-2-52 ~]$ sudo ceph -s
cluster:
id: 56c04287-4aed-435d-a1a4-d30392ff15ee
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum ceph-2-52,ceph-2-53,ceph-2-54
mgr: ceph-2-55(active), standbys: ceph-2-54
osd: 48 osds: 48 up, 47 in
flags noscrub,nodeep-scrub
rgw: 4 daemons active
data:
pools: 7 pools, 8864 pgs
objects: 18.91M objects, 4.32TiB
usage: 9.81TiB used, 178TiB / 188TiB avail
pgs: 8864 active+clean
Checking ceph osd df again shows that all of its pgs have been migrated off:
0 hdd 5.29999 0 0B 0B 0B 0 0 0
1 hdd 5.29999 0.81580 5.30TiB 311GiB 5.00TiB 5.72 1.07 510
2 hdd 5.29999 0.86830 5.30TiB 309GiB 5.00TiB 5.69 1.06 509
3 hdd 5.29999 0.83260 5.30TiB 309GiB 5.00TiB 5.69 1.06 506
......
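As an extra check that is not part of the original procedure, one can confirm that nothing is mapped to the osd any more before touching the crushmap:
sudo ceph pg ls-by-osd 0    # should list no pgs once everything has moved off osd.0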
Following the experts' recommendation, the osd was then removed from the crushmap, and pg migration showed up again:
[cephfsd@ceph-2-52 ceph-deploy]$ sudo ceph osd crush remove osd.0
removed item id 0 name 'osd.0' from crush map
[cephfsd@ceph-2-52 ceph-deploy]$ sudo ceph -s
cluster:
id: 56c04287-4aed-435d-a1a4-d30392ff15ee
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
Reduced data availability: 79 pgs inactive, 80 pgs peering
services:
mon: 3 daemons, quorum ceph-2-52,ceph-2-53,ceph-2-54
mgr: ceph-2-55(active), standbys: ceph-2-54
osd: 48 osds: 47 up, 47 in; 991 remapped pgs
flags noscrub,nodeep-scrub
rgw: 4 daemons active
data:
pools: 7 pools, 8864 pgs
objects: 18.73M objects, 4.28TiB
usage: 9.80TiB used, 173TiB / 183TiB avail
pgs: 1.929% pgs not active
8693 active+clean
171 peering
What followed was a large-scale rebalance; once the data migration completed, the cluster returned to normal:
[cephfsd@ceph-2-52 ceph-deploy]$ sudo ceph -s
cluster:
id: 56c04287-4aed-435d-a1a4-d30392ff15ee
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum ceph-2-52,ceph-2-53,ceph-2-54
mgr: ceph-2-55(active), standbys: ceph-2-54
osd: 48 osds: 47 up, 47 in
flags noscrub,nodeep-scrub
rgw: 4 daemons active
data:
pools: 7 pools, 8864 pgs
objects: 18.91M objects, 4.32TiB
usage: 9.80TiB used, 173TiB / 183TiB avail
pgs: 8864 active+clean
io:
client: 2.65KiB/s rd, 3op/s rd, 0op/s wr
At this point, going on to delete the osd's auth entry and to rm the osd causes no data migration; later, when a new osd is created and joins the cluster, a rebalance is of course unavoidable.
In summary, even with the osd reweighted to 0, removing it from the crushmap still triggers data migration. The amount moved is smaller than for the out step, but it still involves a lot of data.
The reason is that ceph osd reweight only zeroes the osd's override reweight; its CRUSH weight (5.29999 in the osd df output above) is untouched. Removing the osd from the crushmap therefore still changes the total weight of its host bucket, and it also changes the set and number of osds CRUSH chooses from when computing data placement, so adding or removing any osd in the crushmap inevitably triggers a rebalance.
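To gauge how large that effect would be without touching the live cluster, the change can be simulated offline against a copy of the osdmap and the resulting pg mappings compared; the sketch below uses standard Ceph tooling, with arbitrary file names:
# grab the current osdmap and extract its crushmap
sudo ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --export-crush /tmp/crush.bin
crushtool -d /tmp/crush.bin -o /tmp/crush.txt
# edit /tmp/crush.txt to delete the osd from its host bucket, then recompile it
crushtool -c /tmp/crush.txt -o /tmp/crush.new
# dump the pg mappings before and after importing the modified crushmap
osdmaptool /tmp/osdmap --test-map-pgs-dump > /tmp/before.txt
osdmaptool /tmp/osdmap --import-crush /tmp/crush.new
osdmaptool /tmp/osdmap --test-map-pgs-dump > /tmp/after.txt
diff /tmp/before.txt /tmp/after.txt    # every differing line is a pg that would move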
Note: the two posts cited above were written quite early, possibly against a 10.x release, whereas the destroy method described below was introduced in 12.x, so keep the version difference in mind.
The rebuild method that proved best in testing
The officially recommended way to replace a disk is to rebuild the osd via destroy. Let's put this method into practice.
First reweight the osd to 0, then stop its process with systemctl stop ceph-osd@1.service. The number of up and in osds drops by one (this test continues from the operations above) and the cluster starts rebalancing; note that the data migration off the osd begins as soon as the reweight is set to 0:
[cephfsd@ceph-2-52 backup]$ sudo ceph -s
cluster:
id: 56c04287-4aed-435d-a1a4-d30392ff15ee
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
services:
mon: 3 daemons, quorum ceph-2-52,ceph-2-53,ceph-2-54
mgr: ceph-2-55(active), standbys: ceph-2-54
osd: 47 osds: 46 up, 46 in
flags noscrub,nodeep-scrub
rgw: 4 daemons active
data:
pools: 7 pools, 8864 pgs
objects: 18.91M objects, 4.32TiB
usage: 9.80TiB used, 168TiB / 177TiB avail
pgs: 5247072/777007182 objects degraded (0.675%)
4786498/777007182 objects misplaced (0.616%)
8607 active+clean
106 active+remapped+backfill_wait
95 active+undersized+degraded+remapped+backfill_wait
50 active+undersized+degraded+remapped+backfilling
6 active+remapped+backfilling
io:
client: 35.0MiB/s rd, 2.91MiB/s wr, 3.02kop/s rd, 3.11kop/s wr
recovery: 60.4MiB/s, 889objects/s
The osd is still in the cluster but out, and the total osd count is unchanged. Next, destroy this osd:
[cephfsd@ceph-2-52 backup]$ sudo ceph osd destroy 1 --yes-i-really-mean-it
Now osd.1 should be in the destroyed state. Next, zap the osd's disk. Here the osd is rebuilt on the original disk and partition; in practice the disk may well need to be replaced:
[cephfsd@ceph-2-52 ceph-deploy]$ sudo ceph-volume lvm zap /dev/sdf3
Running command: /usr/sbin/cryptsetup status /dev/mapper/ced1e258-f1cd-430b-a0da-bb930d7dfcf2
stdout: /dev/mapper/ced1e258-f1cd-430b-a0da-bb930d7dfcf2 is inactive.
--> Zapping: /dev/sdf3
Running command: wipefs --all /dev/sdf3
Running command: dd if=/dev/zero of=/dev/sdf3 bs=1M count=10
stderr: 10+0 records in
10+0 records out
10485760 bytes (10 MB) copied
stderr: , 0.052715 s, 199 MB/s
--> Zapping successful for: /dev/sdf3
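If the physical disk is actually being swapped for a new one rather than reused, the zap would target the replacement device instead; ceph-volume can also tear down any leftover LVM metadata with --destroy (a sketch, assuming the new disk shows up as /dev/sdf):
sudo ceph-volume lvm zap /dev/sdf --destroy   # wipe the whole replacement device, including old LVs/VGs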
With that, the disk intended for the osd is ready. Now let's rebuild the osd. Following the official procedure we first prepare it; in fact create could be used to do everything in one step:
[cephfsd@ceph-2-52 ceph-deploy]$ sudo ceph-volume lvm prepare --osd-id 1 --data /dev/sdf3 --block.wal /dev/sdd1 --block.db /dev/sdd2
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 13550fd6-6bf3-42ba-ad07-a24796379805 1
Running command: vgcreate --force --yes ceph-8ca0b72d-cbc0-428d-84c1-b167d01d0e31 /dev/sdf3
stdout: Physical volume "/dev/sdf3" successfully created.
stdout: Volume group "ceph-8ca0b72d-cbc0-428d-84c1-b167d01d0e31" successfully created
Running command: lvcreate --yes -l 100%FREE -n osd-block-13550fd6-6bf3-42ba-ad07-a24796379805 ceph-8ca0b72d-cbc0-428d-84c1-b167d01d0e31
stdout: Logical volume "osd-block-13550fd6-6bf3-42ba-ad07-a24796379805" created.
Running command: /bin/ceph-authtool --gen-print-key
Running command: mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
Running command: restorecon /var/lib/ceph/osd/ceph-1
Running command: chown -h ceph:ceph /dev/ceph-8ca0b72d-cbc0-428d-84c1-b167d01d0e31/osd-block-13550fd6-6bf3-42ba-ad07-a24796379805
Running command: chown -R ceph:ceph /dev/dm-7
Running command: ln -s /dev/ceph-8ca0b72d-cbc0-428d-84c1-b167d01d0e31/osd-block-13550fd6-6bf3-42ba-ad07-a24796379805 /var/lib/ceph/osd/ceph-1/block
Running command: ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-1/activate.monmap
stderr: got monmap epoch 2
Running command: ceph-authtool /var/lib/ceph/osd/ceph-1/keyring --create-keyring --name osd.1 --add-key AQBlDO1bl+weAhAA6/LCeG6AEdvOQgYo38NZuw==
stdout: creating /var/lib/ceph/osd/ceph-1/keyring
stdout: added entity osd.1 auth auth(auid = 18446744073709551615 key=AQBlDO1bl+weAhAA6/LCeG6AEdvOQgYo38NZuw== with 0 caps)
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/keyring
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/
Running command: chown -R ceph:ceph /dev/sdd1
Running command: chown -R ceph:ceph /dev/sdd2
Running command: /bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 1 --monmap /var/lib/ceph/osd/ceph-1/activate.monmap --keyfile - --bluestore-block-wal-path /dev/sdd1 --bluestore-block-db-path /dev/sdd2 --osd-data /var/lib/ceph/osd/ceph-1/ --osd-uuid 13550fd6-6bf3-42ba-ad07-a24796379805 --setuser ceph --setgroup ceph
--> ceph-volume lvm prepare successful for: /dev/sdf3
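Note that prepare by itself does not start anything; as far as I understand, the new osd still has to be activated before its daemon can run (create would have covered both steps). A sketch, where the fsid is the one printed in the prepare output above:
sudo ceph-volume lvm list /dev/sdf3    # confirm the osd id / fsid pair that was just prepared
sudo ceph-volume lvm activate 1 13550fd6-6bf3-42ba-ad07-a24796379805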
At this point the osd has been rebuilt, but it has not yet been marked in, so for now it is a free-floating osd even though it already holds a cluster osd id:
-2 40.41992 host ceph-2-52
1 hdd 5.29999 osd.1 down 0 1.00000
2 hdd 5.29999 osd.2 up 0.86830 1.00000
3 hdd 5.29999 osd.3 up 0.83260 1.00000
Here, first run ceph osd reweight 1 0.xxx to set the rebuilt osd's reweight back to its pre-rebuild value, then start the osd and mark it in; the cluster begins rebalancing.
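Concretely, this step might look like the following sketch, where 0.81580 is the reweight osd.1 had before the rebuild, as recorded in the earlier ceph osd df output:
sudo ceph osd reweight 1 0.81580        # restore the pre-rebuild reweight before it takes pgs
sudo systemctl start ceph-osd@1.service
sudo ceph osd in 1                      # let the osd accept pgs again
Once the osd is back up and in, ceph -s shows the rebalance in progress: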
[cephfsd@ceph-2-52 ceph-deploy]$ sudo ceph -s
cluster:
id: 56c04287-4aed-435d-a1a4-d30392ff15ee
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
1423242/37818166 objects misplaced (3.763%)
Degraded data redundancy: 2/37818166 objects degraded (0.000%), 1 pg degraded
services:
mon: 3 daemons, quorum ceph-2-52,ceph-2-53,ceph-2-54
mgr: ceph-2-55(active), standbys: ceph-2-54
osd: 47 osds: 47 up, 47 in; 617 remapped pgs
flags noscrub,nodeep-scrub
rgw: 4 daemons active
data:
pools: 7 pools, 8864 pgs
objects: 18.91M objects, 4.32TiB
usage: 9.80TiB used, 173TiB / 183TiB avail
pgs: 0.056% pgs not active
2/37818166 objects degraded (0.000%)
1423242/37818166 objects misplaced (3.763%)
8245 active+clean
613 active+remapped+backfill_wait
3 activating+remapped
1 active+remapped+backfilling
1 activating+degraded
1 activating
io:
recovery: 67.0MiB/s, 279objects/s
After the migration completes, check the osd info:
1 hdd 5.29999 0.81580 5.30TiB 291GiB 5.02TiB 5.35 1.03 477
2 hdd 5.29999 0.86830 5.30TiB 290GiB 5.02TiB 5.34 1.02 478
3 hdd 5.29999 0.83260 5.30TiB 291GiB 5.02TiB 5.36 1.03 477
With that, the osd rebuild is complete, and the osd was never removed from the crushmap, so the migration that removal triggers never happened. Thinking about it, do we really need to remove the osd from the crushmap at all? If a disk goes bad, the goal is to replace the disk and let the data backfill back in; if the osd needs an upgrade, that simply means rebuilding it. In the vast majority of cases the cluster's osd count does not shrink, so taking the osd out of the crushmap is an unnecessary step.
Comparing osd.1's pg distribution with what it was before the rebuild also makes it clear why the reweight had to be set before marking the osd in.
In fact, the osds' reweights should already be balanced out before the cluster goes into service, because an uneven pg distribution causes all sorts of problems. Once the reweights are balanced, the ceph osd df output is well worth recording, so that it can serve as the reweight reference when an osd is rebuilt later. Imagine the rebuilt osd joining the cluster with a reweight of 1.000: that would be the highest reweight in the whole cluster, so it would end up with the most pgs, turning a previously balanced distribution into an unbalanced one. That is also a very bad moment to tune the reweight repeatedly, because every adjustment triggers yet another pg redistribution, which is a real hassle. Knowing the pre-rebuild reweight in advance and setting it before marking the osd in means that even if the pg distribution cannot match the original exactly, it will still stay well balanced.
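A simple way to keep such a reference is to snapshot the weights regularly, for example from a cron job (a sketch; the output path is arbitrary):
sudo ceph osd df tree > ~/osd-df-$(date +%F).txt   # dated record of crush weight, reweight and pg counts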
Summary
Osd rebuilds are fairly common in day-to-day Ceph operations. The recovery they cause grows with the amount of data in the cluster, and the recovery traffic also affects normal client reads and writes, so the operation needs to be done with great care. Based on experiments on version 12.2.8, this post concludes that rebuilding an osd via destroy is the approach with the fewest rounds of data migration. Corrections and comments are welcome.
References
ADDING/REMOVING OSDS
深入浅出BlueStore的OSD创建与启动 (an in-depth look at BlueStore osd creation and startup)