Corrupted ZFS metadata on pool

This morning, my 1-vdev, 1-drive ZFS pool on an EXTERNAL USB hard drive that I was working on last night failed to auto-import. It is a zstd-10 compressed and aes-256-gcm encrypted pool. I got a message about CORRUPTED metadata (I don't remember the specific code) and the suggested action was to DESTROY the pool and restore from backup. It didn't give a hint of hope. This is a non-mirrored drive.

You can imagine the panic. After calming down a bit, I looked the error up and, on the suggestion of a Server Fault thread, tried:

zpool import [pool] -F -X

After roughly the time it would take to scrub the pool (single vdev, single disk), the pool was imported, and then I realized that zpool was also SCRUBBING the disk, with a start timestamp of yesterday evening!
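
For reference, this is roughly how I kept an eye on the scrub afterwards; [pool] is just a placeholder for the actual pool name:

# show pool state, scrub progress/result, and any per-device error counters
zpool status -v [pool]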

The scrub has now finished with 0B repaired and 0 errors found. Everything seems perfectly fine.

So all of this leads me to some very serious (for me) questions:

  1. Can I be sure that my drive is 100% back to its original state, since 0 bytes were found corrupt and no errors were reported by the latest scrub?
  2. Is it possible that this corruption happened because a scrub was initiated yesterday evening (the scrub start timestamp seems valid, as I was working on the PC at that time) and then interrupted by the shutdown (this is a USB drive we're talking about, so that adds more possible error vectors)?
  3. I did not (unless my memory is failing me) initiate last night's scrub. Can I check what did? I don't know what output to check (see the sketch after this list).
  4. Why on earth would ZFS suggest total pool destruction, when (apparently) the simplest of actions returned the pool to 100% health? The encryption key loads fine, the dataset mounts normally, and I can see yesterday's work. If something had gone wrong, it surely wouldn't let me open ANY files, right (again, this is an encrypted and compressed dataset, at the pool level)?
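
For question 3, a sketch of the places one could check, assuming the scrub was started by a command or a distro timer; [pool] is a placeholder, and the Debian cron path is my assumption based on the stock zfsutils-linux package:

# list administrative commands run against the pool (zpool scrub invocations show up here)
zpool history [pool]
# -l adds the user and host that issued each command, -i adds internal events
zpool history -il [pool]
# check for a distro-provided periodic scrub (Debian ships a monthly scrub job)
systemctl list-timers | grep -i zfs
cat /etc/cron.d/zfsutils-linux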

I am willing to read/learn if you can provide documentation.
I wish @wendell would shed some light on this, as I was in awe of his video and he surely has input here.

I am running Debian testing and ZFS on Linux (OpenZFS) 2.0. Please spare me the scolding for my setup; I know it is well-meaning, but this was supposed to be temporary anyway…

zfs-2.0.3-1
zfs-kmod-2.0.3-1

UPDATE May 15th:

I have verified through external means that indeed no data seems to be corrupted (checksumming approx. 1 TB of data from the backup and verifying it against the zpool). This is to be expected since ZFS reports no errors, but it's a nice reassurance to have.
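
For completeness, the verification was along these lines; a rough sketch, with /mnt/backup and /pool/dataset standing in for the real mount points:

# hash every file in the backup copy...
cd /mnt/backup && find . -type f -exec sha256sum {} + > /tmp/backup.sha256
# ...then re-check those hashes against the files on the zpool
cd /pool/dataset && sha256sum --quiet -c /tmp/backup.sha256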

I’m presuming ZFS is ignoring the fact that the drive is connected via USB; having the entire pool suddenly die is not expected behaviour.
It might have marked the data it was trying to access as corrupt if the USB connection dropped partway through a read or write.

If a zpool clear and a subsequent scrub come up fine, as was the case here, then you're obviously fine to carry on.
Just for your own peace of mind, please ensure your backup is working and of good quality.
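
In command form, that recovery path looks something like the following; [pool] is a placeholder for your pool name:

# clear the faulted state / error counters, then re-verify every block in the pool
zpool clear [pool]
zpool scrub [pool]
# once the scrub finishes, confirm 0 errors were found
zpool status [pool]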

The suggestion to destroy and restore comes from the logical standpoint of expecting all devices to always be available.
In that situation, if a device in a pool dies or becomes unresponsive and there are other drives, it would give appropriate advice, like simply replacing the failed drive.
With just a single provider it can't suggest a replacement, because the data would be gone. It also can't suggest using a hot spare, because there is no other device.

I have a couple of systems with just a single USB provider, and every now and then the whole pool dies if the USB connection drops or whatever, and yeah, I have to re-connect, clear and scrub.
But it's not the end of the world, and at least it still works.
I'm just glad it would rather serve NO data than potentially BAD data.

It’s hard to imagine a situation where data is corrupted and ZFS doesn’t detect it during a scrub. That said, there have occasionally been serious bugs in bleeding-edge ZFS, but rarely (two that I know of in recent history).