Hi!
I have a few questions about my ZFS pool, which has (suddenly) started showing errors.
** Background **
My TrueNAS SCALE system froze. On the IPMI console I saw a message saying that systemd-journald.service had been killed with kill -9. I had to force a reboot.
After it started up, my main pool came online, but with this error:
"Pool HDD state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected."
I’m a bit surprised by this, since I thought ZFS would almost always be able to reach a safe state (although the latest data might not be written).
I was not (intentionally, at least) writing to the pool when TrueNAS SCALE crashed.
None of the hard drives report errors, and a short SMART test on each showed no errors.
When I run zpool status HDD -v I get the following output:
  pool: HDD
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Fri Nov 24 23:05:55 2023
        11.2T scanned at 8.55G/s, 1.06T issued at 830M/s, 65.4T total
        0B repaired, 1.63% done, 22:34:01 to go
config:
        NAME          STATE     READ WRITE CKSUM
        HDD           ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            uuid-1    ONLINE       0     0     0
            uuid-2    ONLINE       0     0     0
            uuid-3    ONLINE       0     0     0
            uuid-4    ONLINE       0     0     0
            uuid-5    ONLINE       0     0     0
            uuid-6    ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:
        HDD/Main@auto-2023-02-21_00-00:<0x1>
        HDD/Old_NAS@auto-2023-02-10_00-00:<0x1>
        HDD/Main@auto-2023-02-22_00-00:<0x1>
        HDD/Main@auto-2023-02-20_00-00:<0x1>
        HDD/Main@auto-2023-02-19_00-00:<0x1>
        HDD/Main@auto-2023-02-23_00-00:<0x1>
        HDD/Old_NAS@auto-2023-02-09_00-00:<0x1>
Note that none of the disks shows any read, write, or checksum errors in this output.
All of the listed entries are old snapshots that I was trying to read before the crash (I don’t know whether I actually looked at all of them).
For some reason they are listed as “files”, with :<0x1> appended to the snapshot name.
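To make the structure of those entries explicit, here is a minimal Python sketch that splits each one into dataset, snapshot, and the numeric suffix. It assumes every entry has the form dataset@snapshot:<0xN> and that the hex value is an object number, which is what zpool status prints when it cannot resolve an entry to a file path; that reading of the format is my assumption, and the names ErrorEntry and parse_error_entry are hypothetical helpers, not part of any ZFS tooling.

```python
import re
from typing import NamedTuple


class ErrorEntry(NamedTuple):
    dataset: str    # e.g. "HDD/Main"
    snapshot: str   # e.g. "auto-2023-02-21_00-00"
    object_id: int  # numeric value of the <0x...> suffix


# Assumed entry format: <dataset>@<snapshot>:<0xHEX>
_ENTRY_RE = re.compile(
    r"^(?P<dataset>[^@\s]+)@(?P<snapshot>[^:\s]+):<0x(?P<obj>[0-9a-fA-F]+)>$"
)


def parse_error_entry(line: str) -> ErrorEntry:
    """Split one line of the 'errors:' list into its three parts."""
    match = _ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised error entry: {line!r}")
    return ErrorEntry(
        dataset=match["dataset"],
        snapshot=match["snapshot"],
        object_id=int(match["obj"], 16),
    )


# A few of the entries copied from the zpool status output above.
raw_entries = [
    "HDD/Main@auto-2023-02-21_00-00:<0x1>",
    "HDD/Old_NAS@auto-2023-02-10_00-00:<0x1>",
    "HDD/Main@auto-2023-02-22_00-00:<0x1>",
]

entries = [parse_error_entry(line) for line in raw_entries]

# Unique snapshot names, i.e. the things that would actually be deleted.
affected_snapshots = sorted({f"{e.dataset}@{e.snapshot}" for e in entries})

for name in affected_snapshots:
    print(name)
```

This only reads the text of the report; it does not touch the pool. Its purpose is to show that each “file” is really a whole snapshot plus an object number, not a path inside the snapshot.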
This pool mainly contains backups of data from other places, but some of the data exists only here, so I would like to be able to recover it. I don’t need the affected snapshots.
** Questions **
- How could this have happened?
- Is it safe to (try to) delete the snapshots/“files”?
- Can I trust the rest of my data?
- Can this be recovered from (after deleting the affected files)?