Hi there,
I had a bit of a mishap when trying to replace a disk in my zpool…
I use vdev aliases to identify the disks in my NetApp shelves. One of the disks died; it was identified by the vdev alias “0a-14”, which pointed to “/dev/sdo”. I replaced it and ran “zpool replace aggr0 0a-14”, which *ucked everything up… I got errors on all devices in the zpool, it was hanging, etc… I ended up pulling the new disk in “0a-14” and rebooting the server, which brought me back to a resilvering zpool that complains about insufficient replicas…
Here is what I see now…
root@boxen:/home/hwa# zpool status aggr0 -v
pool: aggr0
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sat Jul 23 11:26:50 2022
45.3T scanned at 1.87G/s, 37.9T issued at 1.56G/s, 162T total
0B resilvered, 23.46% done, 22:29:36 to go
config:
NAME STATE READ WRITE CKSUM
aggr0 DEGRADED 0 0 0
raidz3-0 DEGRADED 0 0 0
0a-0 ONLINE 0 0 0
0a-1 ONLINE 0 0 0
0a-2 ONLINE 0 0 0
0a-3 ONLINE 0 0 0
0a-4 ONLINE 0 0 0
0a-5 ONLINE 0 0 0
0a-6 ONLINE 0 0 0
0a-7 ONLINE 0 0 0
0a-8 ONLINE 0 0 0
0a-9 ONLINE 0 0 0
0a-10 ONLINE 0 0 0
0a-11 ONLINE 0 0 0
0a-12 ONLINE 0 0 0
0a-13 ONLINE 0 0 0
replacing-14 UNAVAIL 0 0 0 insufficient replicas
10561013914315785071 UNAVAIL 0 0 0 was /dev/disk/by-vdev/0a-14-part1/old
4140092911830402290 UNAVAIL 0 0 0 was /dev/disk/by-vdev/0a-14-part1
0a-15 ONLINE 0 0 0
0a-16 ONLINE 0 0 0
0a-17 ONLINE 0 0 0
0a-18 ONLINE 0 0 0
0a-19 ONLINE 0 0 0
0a-20 ONLINE 0 0 0
0a-21 ONLINE 0 0 0
0a-22 ONLINE 0 0 0
0a-23 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme01-part1 ONLINE 0 0 0
nvme02-part1 ONLINE 0 0 0
I am not sure what caused the problem when I replaced the broken disk.
Before I did the zpool replace, the disk was reformatted from 520-byte to 512-byte blocks with sg_format (because the disk is a NetApp disk…) and that completed OK…
I can see from the logs that the newly inserted disk took over /dev/sdo as its device name.
So I would think that “zpool replace aggr0 0a-14” would be OK? Or should it have been “zpool replace aggr0 0a-14 0a-14” or “zpool replace aggr0 0a-14 /dev/sdo”?
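To double-check that mapping before I retry, I am thinking of something along these lines. It is only a sketch: `resolve_alias` is a helper name I made up, and it assumes the aliases are plain symlinks under /dev/disk/by-vdev.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel device that an alias symlink
# (for example /dev/disk/by-vdev/0a-14) currently points to.
resolve_alias() {
    alias_path="$1"
    # Refuse anything that is not a symlink, so a typo is noticed early.
    if [ ! -L "$alias_path" ]; then
        echo "not a symlink: $alias_path" >&2
        return 1
    fi
    # readlink -f follows the link chain and prints the final target.
    readlink -f -- "$alias_path"
}
```

Running `resolve_alias /dev/disk/by-vdev/0a-14` should then print /dev/sdo if the new disk really took over that name, and `blockdev --getss` on the result should report 512 if the sg_format took effect.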
And when the current resilver has completed, and I give this another try, which device do I put as the first one?
Is it “zpool replace aggr0 replacing-14 /dev/newdevice”? Or “zpool replace aggr0 0a-14 /dev/newdevice”?
I tried to investigate why this went so terribly wrong, because I still do not understand it. If the old faulted disk on “0a-14 → /dev/sdo” was gone, shouldn't a zpool replace command pointing to this device just fail, and not wreak havoc? The command completed “ok”, but “zpool status” then showed both read and write errors on all but the special devices… and it stated that it was resilvering…
In fact… I just found some logs of the process…
Here I have replaced the disk…
root@boxen:~/# zpool replace aggr0 0a-14
root@boxen:~/# zpool status
pool: aggr0
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sat Jul 23 11:26:50 2022
310G scanned at 38.7G/s, 471M issued at 58.9M/s, 162T total
0B resilvered, 0.00% done, 33 days 07:29:09 to go
config:
NAME STATE READ WRITE CKSUM
aggr0 DEGRADED 0 0 0
raidz3-0 DEGRADED 0 0 0
0a-0 ONLINE 0 0 0
0a-1 ONLINE 0 0 0
0a-2 ONLINE 0 0 0
0a-3 ONLINE 0 0 0
0a-4 ONLINE 0 0 0
0a-5 ONLINE 0 0 0
0a-6 ONLINE 0 0 0
0a-7 ONLINE 0 0 0
0a-8 ONLINE 0 0 0
0a-9 ONLINE 0 0 0
0a-10 ONLINE 0 0 0
0a-11 ONLINE 0 0 0
0a-12 ONLINE 0 0 0
0a-13 ONLINE 0 0 0
replacing-14 DEGRADED 0 0 0
old FAULTED 22 0 0 too many errors
0a-14 ONLINE 0 0 0
0a-15 ONLINE 0 0 0
0a-16 ONLINE 0 0 0
0a-17 ONLINE 0 0 0
0a-18 ONLINE 0 0 0
0a-19 ONLINE 0 0 0
0a-20 ONLINE 0 0 0
0a-21 ONLINE 0 0 0
0a-22 ONLINE 0 0 0
0a-23 ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
nvme01-part1 ONLINE 0 0 0
nvme02-part1 ONLINE 0 0 0
errors: No known data errors
And then a few seconds later it looked like this…
root@boxen:~/# zpool status aggr0
pool: aggr0
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sat Jul 23 11:26:50 2022
368G scanned at 11.5G/s, 495M issued at 15.5M/s, 162T total
0B resilvered, 0.00% done, 126 days 18:17:30 to go
config:
NAME STATE READ WRITE CKSUM
aggr0 DEGRADED 0 0 0
raidz3-0 DEGRADED 264 12 0
0a-0 ONLINE 214 12 0
0a-1 ONLINE 179 12 0
0a-2 ONLINE 214 12 0
0a-3 ONLINE 208 12 0
0a-4 ONLINE 215 12 0
0a-5 ONLINE 215 12 0
0a-6 ONLINE 215 12 0
0a-7 ONLINE 208 12 0
0a-8 ONLINE 210 12 0
0a-9 ONLINE 209 12 0
0a-10 ONLINE 111 12 0
0a-11 ONLINE 178 12 0
0a-12 ONLINE 282 12 0
0a-13 ONLINE 184 12 0
replacing-14 DEGRADED 264 224 0
old FAULTED 22 0 0 too many errors
0a-14 ONLINE 6 224 0
0a-15 ONLINE 209 12 0
0a-16 ONLINE 181 12 0
0a-17 ONLINE 493 307 0
0a-18 ONLINE 112 12 0
0a-19 ONLINE 209 12 0
0a-20 ONLINE 210 12 0
0a-21 ONLINE 209 12 0
0a-22 ONLINE 214 12 0
0a-23 ONLINE 213 12 0
special
mirror-1 ONLINE 0 0 0
nvme01-part1 ONLINE 0 0 0
nvme02-part1 ONLINE 0 0 0
errors: 276 data errors, use '-v' for a list
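For the next attempt I would like to spot climbing counters quickly instead of scanning the whole table by eye. A rough filter like the one below might do. `errored_devices` is a name I invented, and it assumes the plain `zpool status` column layout shown above (NAME STATE READ WRITE CKSUM).

```shell
#!/bin/sh
# Hypothetical filter: read `zpool status` text on stdin and print only
# the rows whose READ, WRITE or CKSUM counter is not "0".
errored_devices() {
    awk 'NF >= 5 &&
         $2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ &&
         ($3 != "0" || $4 != "0" || $5 != "0") {
             # name, read errors, write errors, checksum errors
             print $1, $3, $4, $5
         }'
}
```

It would be used as `zpool status aggr0 | errored_devices`. The state check in the second field is there so that header and progress lines are skipped.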
This is on Ubuntu 22.04
Any help is very welcome…