How to slow down the resilvering rate for a RAIDZ2 pool

Hello, I have ended up in the unpleasant situation of having a 6-disk RAIDZ2 pool with 3 failing drives. The drives are 5+ years old, so it is not that strange that they are failing, but I never expected 3 of them to go at the same time. Anyway…
Two of the drives are not completely dead: they are still detected by the system after boot, and the pool shows up as degraded but available.
I tried to copy a snapshot of the pool to another pool. Then I added a new disk in order to resilver, but both attempts failed at about 60-70% in.
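
For anyone following along, kicking off a replacement like that is normally a single zpool replace; the device ids below are the ones that appear in the status output further down the thread and stand in here only as placeholders:

    # rebuild one failing member of zpod onto a new disk
    sudo zpool replace zpod wwn-0x50000395142862db wwn-0x5000cca827d0fc7c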

I think if I slow down the resilvering process it should give me enough time to resilver at least one new device before the bad ones start “clicking”.
The question is: how do I slow down the resilvering so that the bad drives are not hammered too hard and a new drive hopefully finishes resilvering before they give up? (A sketch of what I mean is just below.)
Any other suggestions on how to resilver the failing pool?
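
A rough sketch of the kind of throttling being asked about, assuming ZFS on Linux 0.8 or newer, where the resilver rate is governed by module parameters under /sys/module/zfs/parameters (exact names and defaults vary between releases, so check what your module actually exposes before writing anything):

    # check the current values first
    cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms
    cat /sys/module/zfs/parameters/zfs_vdev_scrub_max_active

    # shrink the per-txg time budget reserved for resilver I/O (default is around 3000 ms)
    echo 500 | sudo tee /sys/module/zfs/parameters/zfs_resilver_min_time_ms

    # allow only one in-flight scrub/resilver I/O per vdev
    echo 1 | sudo tee /sys/module/zfs/parameters/zfs_vdev_scrub_max_active

    # queue less scan data per top-level vdev at a time (value is in bytes)
    echo 1048576 | sudo tee /sys/module/zfs/parameters/zfs_scan_vdev_limit

Whether a gentler resilver actually buys a mechanically dying drive any extra life is a separate question, as the replies below point out.
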
Thanks.

I doubt it will help you unless there’s a heat problem; heat would certainly contribute to drive failure.

But anyway, there will be the same number of head seeks for a complete resilver no matter how fast or slow it goes. By what mechanism would going slower help?

Anyway, I can’t be of any help here. Just commenting.

I think you’re going to have data loss. You’re at a 3-drive failure on RAIDZ2. Let it go, and rebuild from backup.

If you have no parity left, forget about resilvering. Just back up what you can and build a new pool with at least 3 new drives to replace the dead/dying ones.
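
A minimal sketch of that salvage step, assuming a second healthy pool is already imported (here called backup, a placeholder, with default mountpoints):

    # take a rescue snapshot and try a full replicated send first
    sudo zfs snapshot -r zpod@rescue
    sudo zfs send -R zpod@rescue | sudo zfs receive -Fu backup/zpod-rescue

    # if the send aborts on damaged blocks, fall back to copying whatever
    # files are still readable, and cross-check against 'zpool status -v'
    sudo rsync -aHAX /zpod/ /backup/zpod-rescue/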

I am actually a bit confused by the state of the pool.
I started replacing one of the disks. The other two were marked as “FAULTED” at around 65% of the resilver, but the process kept going; only the error counts started growing.
This morning it looks like a second round of resilvering is underway, and I am at roughly 15M checksum errors and counting.
I guess there is still a chance for the pool, but I am wondering whether resilvering with all these checksum errors is not just screwing it up for good.

sudo zpool status zpod        # arch-NAS, Sat Jun 20 11:15:58 2020

  pool: zpod
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jun 20 08:49:13 2020
        2.46T scanned at 293M/s, 2.20T issued at 261M/s, 2.46T total
        1.87G resilvered, 89.28% done, 0 days 00:17:37 to go
config:

        NAME                          STATE     READ WRITE CKSUM
        zpod                          DEGRADED     0     0     0
          raidz2-0                    DEGRADED     0     0     0
            wwn-0x50014ee65883b62b    FAULTED    407     4   157  too many errors
            wwn-0x50014ee60396329f    FAULTED    131    22    20  too many errors
            replacing-2               DEGRADED     0     0 18.2M
              2720453720837388431     UNAVAIL      0     0     0  was /dev/disk/by-id/wwn-0x50000395142862db-part1
              wwn-0x5000cca827d0fc7c  ONLINE       0     0     0  (resilvering)
            wwn-0x50014ee659a51023    ONLINE       0     0 18.2M
            wwn-0x50014ee6044fa6ed    ONLINE       0     0 18.2M
            wwn-0x5000cca827ce1b14    ONLINE       0     0 18.2M

errors: 4055422 data errors, use '-v' for a list
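
For what it’s worth, enumerating the damaged files, as the output itself suggests, gives a clearer picture of what a salvage copy could still recover:

    # list the individual files with unrecoverable errors (expect a very long list here)
    sudo zpool status -v zpod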

This is exactly the failure scenario that gets mentioned so frequently.

When you have a relatively small number of drives, you should always opt for mirrors instead of RAIDZ2, because resilver times are then limited only by the throughput of the drives and not by parity reconstruction.

Parity is great for lots of small (<2 TiB) drives, but once drives grow larger, a resilver that has to reconstruct from parity can take days.

So just food for thought for the future.
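
For a future rebuild, a six-disk pool of mirrors would be created roughly like this (the pool name and device ids are placeholders):

    # three 2-way mirror vdevs instead of one 6-disk raidz2 vdev
    sudo zpool create tank \
        mirror /dev/disk/by-id/wwn-DISK1 /dev/disk/by-id/wwn-DISK2 \
        mirror /dev/disk/by-id/wwn-DISK3 /dev/disk/by-id/wwn-DISK4 \
        mirror /dev/disk/by-id/wwn-DISK5 /dev/disk/by-id/wwn-DISK6

Usable capacity drops to half of raw instead of the two-thirds raidz2 gives, and, as the next reply points out, losing both disks of the same mirror still takes the pool with it.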

Regarding your issue, in my opinion it’s best to just salvage what you can.

Mirrors are riskier though. In a 4-disk RAID 10, if you lose a drive and then the mirror of that drive, you’re screwed; with RAIDZ2 any two drives can fail and you’re still fine.