Zpool replace command?

Hi there,

I had a bit of a mishap when trying to replace a disk in my zpool…
I use vdev aliases to identify my disks in my NetApp shelves. One of the disks died; it was identified by vdev as “0a-14”, which pointed to “/dev/sdo”. I replaced it and ran “zpool replace aggr0 0a-14”, which *ucked everything up… I got errors on all devices in the zpool, it was hanging, etc. I ended up pulling the new disk from “0a-14” and rebooting the server, which brought me back to a resilvering zpool that complains about insufficient replicas…
Here is what I see now…

root@boxen:/home/hwa# zpool status aggr0 -v
  pool: aggr0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jul 23 11:26:50 2022
	45.3T scanned at 1.87G/s, 37.9T issued at 1.56G/s, 162T total
	0B resilvered, 23.46% done, 22:29:36 to go
config:

	NAME                        STATE     READ WRITE CKSUM
	aggr0                       DEGRADED     0     0     0
	  raidz3-0                  DEGRADED     0     0     0
	    0a-0                    ONLINE       0     0     0
	    0a-1                    ONLINE       0     0     0
	    0a-2                    ONLINE       0     0     0
	    0a-3                    ONLINE       0     0     0
	    0a-4                    ONLINE       0     0     0
	    0a-5                    ONLINE       0     0     0
	    0a-6                    ONLINE       0     0     0
	    0a-7                    ONLINE       0     0     0
	    0a-8                    ONLINE       0     0     0
	    0a-9                    ONLINE       0     0     0
	    0a-10                   ONLINE       0     0     0
	    0a-11                   ONLINE       0     0     0
	    0a-12                   ONLINE       0     0     0
	    0a-13                   ONLINE       0     0     0
	    replacing-14            UNAVAIL      0     0     0  insufficient replicas
	      10561013914315785071  UNAVAIL      0     0     0  was /dev/disk/by-vdev/0a-14-part1/old
	      4140092911830402290   UNAVAIL      0     0     0  was /dev/disk/by-vdev/0a-14-part1
	    0a-15                   ONLINE       0     0     0
	    0a-16                   ONLINE       0     0     0
	    0a-17                   ONLINE       0     0     0
	    0a-18                   ONLINE       0     0     0
	    0a-19                   ONLINE       0     0     0
	    0a-20                   ONLINE       0     0     0
	    0a-21                   ONLINE       0     0     0
	    0a-22                   ONLINE       0     0     0
	    0a-23                   ONLINE       0     0     0
	special
	  mirror-1                  ONLINE       0     0     0
	    nvme01-part1            ONLINE       0     0     0
	    nvme02-part1            ONLINE       0     0     0

I am not sure what caused the problem when I replaced the broken disk.
Before I did the zpool replace, the disk was reformatted from 520-byte to 512-byte sectors with sg_format (because it is a NetApp disk…), and that completed OK…
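From memory, the sg_format invocation was roughly this (not an exact transcript, so take the exact flags with a grain of salt):

sg_format --format --size=512 /dev/sdo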
I can see from the logs that the newly inserted disk took over /dev/sdo as its device id.
So I would think that “zpool replace aggr0 0a-14” would be OK? Or should it have been something like “zpool replace aggr0 0a-14 0a-14” or “zpool replace aggr0 0a-14 /dev/sdo”?

And when the current resilver has completed and I give this another try, which device do I put as the first one?
Is it “zpool replace aggr0 replacing-14 /dev/newdevice”? Or “zpool replace aggr0 0a-14 /dev/newdevice”?
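For reference, as I read the zpool-replace man page, the general form is:

zpool replace [-f] <pool> <device> [<new_device>]

where <new_device> defaults to <device> when it is omitted, i.e. the new disk is assumed to sit in the same location as the old one.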

I tried to investigate why this went so terribly wrong, because I still do not understand it. If the old faulted disk on “0a-14 → /dev/sdo” was gone, shouldn’t a zpool replace command pointing to that device simply fail, rather than wreak havoc? The command completed “ok”, but “zpool status” showed both read and write errors on all but the special devices… and it stated that it was resilvering…

In fact… I just found some logs of the process…
This is right after I replaced the disk…

root@boxen:~/# zpool replace aggr0 0a-14
root@boxen:~/# zpool status
  pool: aggr0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jul 23 11:26:50 2022
        310G scanned at 38.7G/s, 471M issued at 58.9M/s, 162T total
        0B resilvered, 0.00% done, 33 days 07:29:09 to go
config:

        NAME              STATE     READ WRITE CKSUM
        aggr0             DEGRADED     0     0     0
          raidz3-0        DEGRADED     0     0     0
            0a-0          ONLINE       0     0     0
            0a-1          ONLINE       0     0     0
            0a-2          ONLINE       0     0     0
            0a-3          ONLINE       0     0     0
            0a-4          ONLINE       0     0     0
            0a-5          ONLINE       0     0     0
            0a-6          ONLINE       0     0     0
            0a-7          ONLINE       0     0     0
            0a-8          ONLINE       0     0     0
            0a-9          ONLINE       0     0     0
            0a-10         ONLINE       0     0     0
            0a-11         ONLINE       0     0     0
            0a-12         ONLINE       0     0     0
            0a-13         ONLINE       0     0     0
            replacing-14  DEGRADED     0     0     0
              old         FAULTED     22     0     0  too many errors
              0a-14       ONLINE       0     0     0
            0a-15         ONLINE       0     0     0
            0a-16         ONLINE       0     0     0
            0a-17         ONLINE       0     0     0
            0a-18         ONLINE       0     0     0
            0a-19         ONLINE       0     0     0
            0a-20         ONLINE       0     0     0
            0a-21         ONLINE       0     0     0
            0a-22         ONLINE       0     0     0
            0a-23         ONLINE       0     0     0
        special
          mirror-1        ONLINE       0     0     0
            nvme01-part1  ONLINE       0     0     0
            nvme02-part1  ONLINE       0     0     0

errors: No known data errors

And then a few seconds later it looked like this…

root@boxen:~/# zpool status aggr0
  pool: aggr0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jul 23 11:26:50 2022
        368G scanned at 11.5G/s, 495M issued at 15.5M/s, 162T total
        0B resilvered, 0.00% done, 126 days 18:17:30 to go
config:

        NAME              STATE     READ WRITE CKSUM
        aggr0             DEGRADED     0     0     0
          raidz3-0        DEGRADED   264    12     0
            0a-0          ONLINE     214    12     0
            0a-1          ONLINE     179    12     0
            0a-2          ONLINE     214    12     0
            0a-3          ONLINE     208    12     0
            0a-4          ONLINE     215    12     0
            0a-5          ONLINE     215    12     0
            0a-6          ONLINE     215    12     0
            0a-7          ONLINE     208    12     0
            0a-8          ONLINE     210    12     0
            0a-9          ONLINE     209    12     0
            0a-10         ONLINE     111    12     0
            0a-11         ONLINE     178    12     0
            0a-12         ONLINE     282    12     0
            0a-13         ONLINE     184    12     0
            replacing-14  DEGRADED   264   224     0
              old         FAULTED     22     0     0  too many errors
              0a-14       ONLINE       6   224     0
            0a-15         ONLINE     209    12     0
            0a-16         ONLINE     181    12     0
            0a-17         ONLINE     493   307     0
            0a-18         ONLINE     112    12     0
            0a-19         ONLINE     209    12     0
            0a-20         ONLINE     210    12     0
            0a-21         ONLINE     209    12     0
            0a-22         ONLINE     214    12     0
            0a-23         ONLINE     213    12     0
        special
          mirror-1        ONLINE       0     0     0
            nvme01-part1  ONLINE       0     0     0
            nvme02-part1  ONLINE       0     0     0

errors: 276 data errors, use '-v' for a list

This is on Ubuntu 22.04

Any help is very welcome…

Update:
The resilvering completed, and I re-inserted the new disk; ZFS found it and started resilvering… (I guess that was because it had been used before)…
No issues whatsoever…
When I inserted the new disk it was identified as /dev/sdbc, and vdev also found it and correctly added it as /dev/disk/by-vdev/0a-14.

So I’m not sure if I did something wrong the first time?
Maybe it was because I inserted the new disk while it still had a block size of 520 (NetApp disk). I am 100% sure it got the device id “/dev/sdo”, because I formatted it with sg_format… Once the format completed I just issued “zpool replace aggr0 0a-14” without any “destination disk”… maybe this caused the issues? Yet it clearly found the correct disk and started the resilvering… only with a lot of errors as described…
I guess next time I will do an sg_format, then re-insert the disk before I run the zpool replace… and I will do it like “zpool replace aggr0 0a-14 sdo”, i.e. with both source and destination disk…
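So roughly, the plan for next time would be something like this (device names just as an example):

sg_format --format --size=512 /dev/sdX     # reformat the NetApp disk to 512-byte sectors
# pull and re-insert the disk (or rescan), wait for the 0a-14 alias to appear, then:
zpool replace aggr0 0a-14 /dev/disk/by-vdev/0a-14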

What a mess :slight_smile: The good thing is that resilvering seems to have become a bit faster, so resilvering a 24-disk raidz3 (8 TB disks) takes about 30 hours, which I find acceptable… I know many will tell me that mirrors or smaller raid groups are the way to go, but in this case, and with mirrored special devices on NVMe, I disagree :slight_smile:

Again, if anyone can find the cause for this from my ramblings, please let me know :slight_smile:

1 Like

Just FYI, you should not go above 12 disks per vdev, because you only have 3 disks’ worth of parity across 24 drives.

Meaning that your risk tolerance is only 3 disks. If you lose 4 disks, your pool is toast.

With 24 drives in your pool and only 1/8 of them allowed to fail before total loss, you are taking a big risk with your data. Also, ZFS performance comes from having more vdevs.

So you have one drive’s worth of throughput and a high risk of data loss. I would not recommend this topology.

2 Likes

Hi again,

I don’t want to start a fight :slight_smile:

But… I’m not sure you are correct about the one drive’s worth of throughput?
I can easily push over 500 MB/s of writes to my single raidz3 vdev… which would not be possible if there really were a bottleneck of one device…

Because the resilver time of a failed device is about 48 hours, I see no issues with a RAIDZ3 with 24 drives.
I haven’t done the math, but you may be right that a pool of two 12-disk raidz2 vdevs may be “more” secure… but somehow I sleep better while the system is resilvering knowing that two more disks can still fail… Also, I’m not sure that resilvering a 12-disk raidz2 is that much faster than a 24-disk raidz3?
In general, raidz resilvering is one of the not-so-great “features” of ZFS, and I have read “old” papers by some developers trying to remedy this, but it still seems far off in the distant future :slight_smile:

I also use enterprise-grade NL-SAS disks from NetApp, and the day that 4 disks die within two days in the same system is yet to come; I have been working with NetApp storage for the last 15 years… In general their hardware is very robust, both the disk shelves and the disks, which they of course do not build themselves, but they select them, use their own firmware, and set aside more space for failed blocks compared to common consumer disks.

Anyway, I think we mostly agree, apart from the performance :wink:

1 Like

AFAIK the ZFS rule of thumb is: the more vdevs, the better the performance. But if you’re happy with the speed then we’re good :+1:

Below this line is me just getting on a soapbox.


It’s not just about resilver times, it’s about what you can lose before total failure. And the way you’ve implemented it goes against best practice for risk mitigation.

It will be, because when you write data and it’s spread across the drives it won’t have to wait for all 24 drives to flush, and you’ve got an additional vdev to handle IOPS. So depending on your workload you might not see much of a difference; however, you would be far safer with a 2 x 12 RZ2, because you have 4 disks of parity split between two vdevs, and resilver times might be faster as well.
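Just to illustrate, if you were building the pool from scratch, a 2 x 12 RZ2 layout would be created roughly like this (pool name is a placeholder, disk paths follow your by-vdev naming and bash brace expansion is only for brevity):

zpool create tank \
  raidz2 /dev/disk/by-vdev/0a-{0..11} \
  raidz2 /dev/disk/by-vdev/0a-{12..23}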

If I were you I’d configure the 24-drive pool into 5 RZ2 vdevs with 4 drives each, plus 4 hot spares. That way you’d be covered against a total of 12 drive failures, with a maximum tolerance of 2 drives per vdev.

1 Like

:slight_smile: Thanks for the input. If the resilver really is that much faster with more vdevs, it may make sense… I would however like to test this (and I will)… The only thing is that a proper test also means filling up the filesystems first, otherwise a resilver test won’t tell me much… I have the hardware to test, I just need the time :slight_smile:
But with 5 RZ2 vdevs of 4 drives each as you suggest, you might as well just mirror everything… it’s close to 50% parity, and sorry, that just doesn’t make sense to me because of the lost capacity and/or the extra power I’d need to reach the same capacity with 4-drive vdevs.

Anyway… I still haven’t found out why my pool failed the way it did when I tried to replace a disk… even though I have a backup, it would take many days to restore my pool :wink:

Yeah, 4-wide Z2 is debatable :slight_smile: But there is the argument that losing any two drives is fine, while with mirrors it can be fatal depending on which drives fail.

Best practice is not to make vdevs larger than 12 disks. I’ve seen a bunch of pools with 8-wide Z2 vdevs. It will cost some capacity (75% storage efficiency instead of 87.5%), but I would consider it safer, and you should see some performance increase because you get 3 vdevs; although with 24 disks the scaling isn’t as linear as it would be with only a couple of drives. Resilvering will be much easier.

1 Like

I just remembered a recent feature addition to ZFS: I imagine dRAID might be interesting to you.

dRAID integrates spares directly into a RAIDZ-like vdev and improves resilver time quite considerably by resilvering sequentially. And the spare isn’t some cold disk that never gets used. I don’t have enough drives to properly test this myself, but for 24 disks, as an alternative to the usual RAIDZ, it is a very interesting feature: logical RAIDs within a RAID vdev, the new magic from 2021.

https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html
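Purely as a sketch (I haven’t run this myself, and the pool name is a placeholder), a 24-disk dRAID3 vdev with 4 data disks per redundancy group and 3 distributed spares would be declared along these lines, using your by-vdev names and bash brace expansion for brevity:

zpool create tank draid3:4d:24c:3s /dev/disk/by-vdev/0a-{0..23}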

1 Like

Did you have to offline the drive before it let you replace it?

In the past, my pool didn’t like it if I didn’t offline a degraded drive before I replaced it.
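I.e. something along these lines, borrowing the names from this thread purely as an illustration:

zpool offline aggr0 0a-14
# physically swap the disk, then:
zpool replace aggr0 0a-14 /dev/disk/by-vdev/0a-14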