ZFS pool unavailable after accidentally adding a metadata drive and removing it

Here is the current layout of the pool status. I caused this situation by force-offlining the drive and then removing it from the system to see if I could import the pool without it. Before doing this I verified the drive only held around 5 MB of metadata, which I believed would not cause any issues beyond some data loss. After the drive faulted, the pool has been unable to import. When I import it read-only, none of the data is missing from the pool. I still have the drive with the metadata. The problem is that I can't clear the fault without importing the pool with write access, which isn't available. All of the other metadata is on the mirror vdevs, and the data is under the main nas1 dataset.

I was wondering if there is anything I can do to just read the data from the drives, or to have the pool recognize the drive that it shows as "external device fault". I've spent a few hours trying to learn more about ZFS, as I obviously did not know enough. I got the pool imported in read-only mode, which is shown below, but that's as far as I got.
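For anyone in a similar spot, a read-only import can be done along these lines (a sketch using my pool name; the `-f` is only needed if the pool looks like it was last used by another system):

```shell
# Import the pool read-only so ZFS never tries to write to the
# faulted special vdev; readonly must be set at import time.
zpool import -o readonly=on nas1

# If the pool was not cleanly exported, force may also be required:
# zpool import -f -o readonly=on nas1
```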

Any help would be appreciated.

pool: nas1
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: resilvered 10.8M in 00:00:18 with 0 errors on Sat Jul 20 20:57:40 2024
config:

    NAME                                      STATE     READ WRITE CKSUM
    nas1                                      DEGRADED     0     0     0
      raidz2-0                                ONLINE       0     0     0
        sdh2                                  ONLINE       0     0     0
        sdk2                                  ONLINE       0     0     0
        sdm2                                  ONLINE       0     0     0
        sdi2                                  ONLINE       0     0     0
        sdl2                                  ONLINE       0     0     0
        sdg2                                  ONLINE       0     0     0
        sda2                                  ONLINE       0     0     0
        sdj2                                  ONLINE       0     0     0
    special
      mirror-1                                ONLINE       0     0     0
        nvme2n1p1                             ONLINE       0     0     0
        nvme3n1p1                             ONLINE       0     0     0
      mirror-2                                ONLINE       0     0     0
        3d38b9ee-d2f1-11ed-8992-90e2ba3a4f1b  ONLINE       0     0     0
        3d378717-d2f1-11ed-8992-90e2ba3a4f1b  ONLINE       0     0     0
      ff14c02b-6b27-471a-a6ab-4e7bd904ccc7    FAULTED      0     0     0  external device fault

So I got some sleep, and looking at the situation now I see what I did. I didn't realize I was messing with top-level vdevs, and that's kind of how this started.
I was trying to extend mirror-1 and mirror-2 by one drive each. Instead, I made a third mirror, mirror-3. After creating mirror-3, I realized my mistake and offlined and detached one of its drives. That's when I took the remaining drive offline and caused this mess.
Another question I have: is there a way to use that other drive that was part of mirror-3?
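For anyone reading later, the distinction that bit me is between `zpool attach` (grows an existing mirror) and `zpool add` (creates a brand-new top-level vdev). Device names below are illustrative, not my actual disks:

```shell
# Intended: attach a new disk next to an existing member of the
# special mirror, turning the 2-way mirror into a 3-way mirror.
zpool attach nas1 nvme2n1p1 /dev/nvme4n1

# What actually happened: adding devices creates a NEW top-level
# special vdev (mirror-3) instead of widening mirror-1/mirror-2.
# zpool add nas1 special mirror /dev/nvme4n1 /dev/nvme5n1
```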

Hi, I'm not an expert, but it looks like having any form of raidz in a pool means the mirror removal won't work.

I don't know your setup, but I personally would send the whole pool to backup, destroy the current one, and restore from backup.

I am aware that if the whole pool (metadata + data) is made up of mirrors, then there is a way to remove a top-level vdev permanently.
Apparently this is still a little messy, and results in some thrashing for re-mapped data blocks, but it does apparently work.
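The removal I'm referring to is `zpool remove` on a top-level vdev, which (as I understand it) only works when no raidz vdev exists in the pool. A sketch, with a made-up pool name:

```shell
# On a pool made only of mirrors, a top-level vdev can be evacuated;
# its blocks are re-mapped onto the remaining vdevs, which is where
# the "thrashing" comes from.
zpool remove tank mirror-2

# Removal runs in the background; progress is visible in:
zpool status tank
```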

If you can’t rebuild, then perhaps bring the removed drive back and attach another to make a third mirror?

Then the system would be stable and fully online till you can work out a backup plan?
I am aware this would be "throwing good money after bad", but at least SSDs are fast enough that a failed one can be replaced quickly, with little worry about further failures during resilver, unlike spinning rust over 10 TB?


So there is a way! But only with mirrors, which is good to know.

So the drive that ZFS says is faulted is already connected, but I can only import the pool read-only.
Unfortunately, when I do that there is basically no data left, something like 1 MB when there were TBs. This also means I can't change anything about the pool, since it's read-only. ZFS won't let me import it normally, because running zpool status while the pool is exported shows that drive with corrupted data.

I'm starting to restore from backups onto another drive while I research whether there is a way to use that extra drive that was part of the mirror, or to recover this one in some way. Looking at it logically, I kind of just blew up the metadata vdev by accident. I have learned a few things, like checkpoints (which I will use from now on) and zdb, which I'm still looking into.
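For the record, checkpoints are what would have saved me here, since they capture the vdev layout too. A sketch of how I plan to use them next time:

```shell
# Take a checkpoint before any risky pool surgery; it preserves the
# entire pool state, including the top-level vdev configuration.
zpool checkpoint nas1

# If a change goes wrong, export and rewind to the checkpoint:
# zpool export nas1
# zpool import --rewind-to-checkpoint nas1

# Once the change is confirmed good, discard the checkpoint:
# zpool checkpoint -d nas1
```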
