ZFS Pool is not healthy without disk errors

PatrikK · November 24, 2023, 10:41pm

Hi!

I have a few questions about my ZFS pool, which (suddenly) shows errors.

Background
My TrueNAS SCALE system froze. On the IPMI console I saw a message that systemd-journald.service had been killed (kill -9), and I had to force a reboot.

After it started up, my main pool came online, but with an error:

"Pool HDD state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected."

I’m a bit surprised by this, since I thought ZFS would almost always be able to reach a consistent state (although the most recent writes might be lost).

I was not (intentionally, at least) writing to the pool when TrueNAS SCALE crashed.

None of the hard drives report errors, and a short SMART test showed no errors.

When I run zpool status HDD -v

I get the following information:

 pool: HDD
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Fri Nov 24 23:05:55 2023
	11.2T scanned at 8.55G/s, 1.06T issued at 830M/s, 65.4T total
	0B repaired, 1.63% done, 22:34:01 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        HDD                                       ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            uuid-1                                ONLINE       0     0     0
            uuid-2                                ONLINE       0     0     0
            uuid-3                                ONLINE       0     0     0
            uuid-4                                ONLINE       0     0     0
            uuid-5                                ONLINE       0     0     0
            uuid-6                                ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        HDD/Main@auto-2023-02-21_00-00:<0x1>
        HDD/Old_NAS@auto-2023-02-10_00-00:<0x1>
        HDD/Main@auto-2023-02-22_00-00:<0x1>
        HDD/Main@auto-2023-02-20_00-00:<0x1>
        HDD/Main@auto-2023-02-19_00-00:<0x1>
        HDD/Main@auto-2023-02-23_00-00:<0x1>
        HDD/Old_NAS@auto-2023-02-09_00-00:<0x1>

Note that none of the disks reports any errors.

All of these are old snapshots that I was trying to read before the crash (I don’t know whether I looked at all of them).

For some reason they are listed as “files”, with :<0x1> appended to the snapshot name.

This pool mainly contains backups of data from other places, but some data exists only here, so I would like to be able to recover it. I don’t need the affected snapshots.
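
On the `:<0x1>` suffix: as far as the OpenZFS and older Solaris documentation describe it, `zpool status -v` prints the dataset or snapshot name followed by an object number in hexadecimal when it cannot resolve the damaged object to a file path. The entries therefore point at an object inside each snapshot, not at a named file. A minimal Python sketch for splitting such entries apart, assuming the `dataset@snapshot:<0xN>` layout shown above; `ErrorEntry` and `parse_error_entries` are hypothetical names made up for illustration, not part of any ZFS tooling:

```python
import re
from typing import List, NamedTuple, Optional


class ErrorEntry(NamedTuple):
    dataset: str
    snapshot: Optional[str]  # None when the entry names a live dataset
    object_id: int           # object number, parsed from the hex value


# Assumed layout: "<dataset>[@<snapshot>]:<0xHEX>" on a line of its own.
_ENTRY = re.compile(
    r"^(?P<dataset>[^@\s]+?)(?:@(?P<snapshot>\S+?))?:<(?P<obj>0x[0-9a-fA-F]+)>$"
)


def parse_error_entries(status_text: str) -> List[ErrorEntry]:
    """Collect object-number style error entries from `zpool status -v` text."""
    entries = []
    for line in status_text.splitlines():
        match = _ENTRY.match(line.strip())
        if match:
            entries.append(
                ErrorEntry(match["dataset"], match["snapshot"], int(match["obj"], 16))
            )
    return entries


if __name__ == "__main__":
    sample = "        HDD/Main@auto-2023-02-21_00-00:<0x1>\n"
    for entry in parse_error_entries(sample):
        print(entry.dataset, entry.snapshot, hex(entry.object_id))
```

Entries that show a real file path instead of an object number are deliberately ignored by this sketch.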

Questions

How could this have happened?
Is it safe to (try to) delete the snapshots/“files”?
Can I trust the rest of my data?
Can this be recovered from (after deleting the affected files)?

wendell · November 25, 2023, 12:01am

This type of error is usually a ZFS bug, or possibly system-induced. Removing the affected files is usually a way to recover. What version of the system is this, and how old is the pool? Sometimes upgrading is a good course of action… sometimes not.

PatrikK · November 25, 2023, 9:24am

The system was recently reinstalled (with settings and keys restored from a configuration backup) on a new SSD to see if that helped with web UI stability issues.

I’m running TrueNAS SCALE 22.12.4.2.

The pool was created (with TrueNAS SCALE) when the server was first built, in late December last year or early January this year.

I ran a scrub on the 21st that showed no issues, and the HDDs show no issues on a short SMART test.

ZFS version installed:
zfs-2.1.12-1
zfs-kmod-2.1.12-1

Exard3k · November 25, 2023, 11:35am

If you can read the data, it will be 100% correct data. ZFS verifies checksums on read and won’t return data (it throws an I/O error instead) if it can’t confirm the data is in the state you wrote it.

I’m not sure what went wrong with your pool. Seeing snapshots listed there seems strange, and it’s not something I’ve encountered so far.
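
A minimal Python sketch of what that looks like from an application's side, assuming an unrecoverable block surfaces as an `EIO` error on read (the usual behaviour on Linux, but treat it as an assumption here). The helper name `is_readable` is made up for illustration:

```python
import errno


def is_readable(path: str, chunk_size: int = 1 << 20) -> bool:
    """Read the whole file; return False if the kernel reports an I/O error."""
    try:
        with open(path, "rb") as handle:
            # Read in chunks so large files are not loaded into memory at once.
            while handle.read(chunk_size):
                pass
    except OSError as exc:
        if exc.errno == errno.EIO:
            return False
        raise  # Missing file, permission problems, etc. are a different issue.
    return True
```

A `True` result only says that every block of that file could be read back; it says nothing about snapshots or other files.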

diizzy · November 25, 2023, 11:38am

The scrub isn’t done yet; it’ll probably display more information once it has finished.

PatrikK · November 25, 2023, 12:16pm

These errors showed up after the reboot, before starting the scrub.

I’ll see if anything more is listed when the scrub is done.

wendell · November 25, 2023, 1:38pm

Do let us know the final results of the scrub. Based on what you’ve said so far, this was likely a ZFS bug.

PatrikK · November 25, 2023, 6:21pm

The scrub has now finished, strangely without any errors and without repairing anything, and the previous errors have disappeared.


  pool: HDD
 state: ONLINE
  scan: scrub repaired 0B in 20:01:35 with 0 errors on Sat Nov 25 19:07:30 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        HDD                                       ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            uuid-1                                ONLINE       0     0     0
            uuid-2                                ONLINE       0     0     0
            uuid-3                                ONLINE       0     0     0
            uuid-4                                ONLINE       0     0     0
            uuid-5                                ONLINE       0     0     0
            uuid-6                                ONLINE       0     0     0

errors: No known data errors

I have not deleted the affected files.
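
The before/after comparison can also be done mechanically. A minimal Python sketch, assuming the `zpool status` text layout pasted in this thread; `pool_looks_clean` is a hypothetical helper name, and the list of device states is an assumption about which values can appear in the STATE column:

```python
import re

# One row of the config table: NAME STATE READ WRITE CKSUM.
# The header row is skipped because "STATE" is not one of the listed states.
_ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)


def pool_looks_clean(status_text: str) -> bool:
    """True if every device is ONLINE with zero counters and no errors are listed."""
    if "errors: No known data errors" not in status_text:
        return False
    for line in status_text.splitlines():
        match = _ROW.match(line)
        if not match:
            continue
        state = match.group(2)
        counters = match.group(3, 4, 5)  # READ, WRITE, CKSUM as printed
        if state != "ONLINE" or any(value != "0" for value in counters):
            return False
    return True
```

Fed the first output in this thread it returns `False` because of the "Permanent errors" list, even though all counters are zero; fed the output after the scrub it returns `True`.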

Exard3k · November 25, 2023, 6:25pm

wendell · November 25, 2023, 8:08pm

This now smells like flipped RAM bits, possibly, especially if you didn’t run zpool upgrade to change ZFS versions.

PatrikK · November 25, 2023, 9:43pm

I do use ECC RAM (with a Ryzen, though), so that should be unlikely.
I have not (intentionally, at least) done a zpool upgrade.