After a system crash, likely one caused by corrupt memory as I was using non-ECC DIMMs at the time, my zpool no longer imports. Booting from a USB drive and running zpool import -af causes an immediate kernel panic, as, unfortunately, does zpool import -aFf.
However, with zpool import -Fn the pool does display correctly (and without any reported errors). This suggests to me that I may be able to solve the issue by specifying how far back to roll. I am also able to import the pool read-only with zpool import -af -o readonly=on, so maybe the issue is actually something else? I have no reason to doubt any of the disks, and I’ve since moved them all over to my backup NAS (which has ECC memory).
I have backups, and I can just zfs send from the readonly mount to rescue whatever work hasn’t been backed up, so my level of concern here is minimal. But, if it’s possible, obviously I would prefer not to have to recreate the pool, since that means many days of spinning rust slowly ticking away while my main workstation has to boot from USB and do all its work over the backup NAS’ slo-o-ow gigabit NIC. Anyone got some suggestions how I can import the pool read-write so I can do a scrub?
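For reference, the read-only rescue would be something like this (the dataset name here is just a placeholder, not my actual layout):

zpool import -o readonly=on -R /mnt -f idunn                     # read-only import under an altroot
zfs list -t snapshot -o name,creation -s creation idunn/work     # see which snapshots postdate the last backup

Here’s how the pool currently shows up: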
   pool: idunn
     id: 8659442495485374424
  state: ONLINE
 status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and
         the '-f' flag.
    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
 config:

        idunn                                          ONLINE
          raidz1-0                                     ONLINE
            wwn-0x5000039af8d12fe3                     ONLINE
            wwn-0x5000039a58c98079                     ONLINE
            wwn-0x5000039a68cb99ac                     ONLINE
            wwn-0x5000039a58c98bab                     ONLINE
        special
          mirror-1                                     ONLINE
            nvme-eui.e8238fa6bf530001001b444a48bfbe79  ONLINE
            nvme-eui.e8238fa6bf530001001b444a49a5fe67  ONLINE
          mirror-2                                     ONLINE
            nvme-WDS200T1X0E-00AFY0_21469J440105_1     ONLINE
            nvme-eui.e8238fa6bf530001001b444a49c68889  ONLINE
No hardware issues as far as I can tell: the read-only import shows every device and vdev with zero issues. This is the same result I get from zpool import -Fn. However, zpool import -f idunn kernel panics both the original machine and the machine the pool is currently plugged into.
-f
Forces import, even if the pool appears to be potentially active.
-F
Recovery mode for a non-importable pool. Attempt to return the pool to an importable state by discarding the last few transactions. Not all damaged pools can be recovered by using this option. If successful, the data from the discarded transactions is irretrievably lost. This option is ignored if the pool is importable or already imported.
-R root
Sets the cachefile property to none and the altroot property to root.
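For anyone following along, these options combine roughly like this (the altroot path is just an example):

zpool import -f -F -n idunn       # dry run: check whether discarding the last few TXGs would make it importable
zpool import -f -F -R /mnt idunn  # actually roll back and import under /mnt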
Have you tried the even bigger hammer? This may roll back more TXGs and will take a long time.
zpool import -XF -f -R /mnt POOLNAME # "extreme" version
-X
Used with the -F recovery option. Determines whether extreme measures to find a valid txg should take place. This allows the pool to be rolled back to a txg which is no longer guaranteed to be consistent. Pools imported at an inconsistent txg may contain uncorrectable checksum errors. For more details about pool recovery mode, see the -F option, above. WARNING: This option can be extremely hazardous to the health of your pool and should only be used as a last resort.
No, I have not tried this, and I don’t think I want to. An inconsistent pool sounds worse to me than just recreating it entirely? I can import it read-only, send the most recent snapshots to my other pool, and then recreate this pool. It’ll take a few days, but wouldn’t the result be preferable?
EDIT: The difference between my last backup and the current state was tiny, so I tried this anyway. It’s not like it could make things worse.
But I just got another kernel panic.
Not sure what else I can do to help. If you can mount read-only but not with a forced read-write import, most likely some metadata is corrupted. I’d be more inclined to believe there was some sort of wild power outage or other random event in addition to the non-ECC RAM. But something “else” at the hardware level may also be a problem. You haven’t included your specs, so any and all of this is guesswork.
-X will basically do a “database log replay”.
You can:
Try and import on another system
Copy the data over somewhere else while the pool is mounted as read only.
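For the second option, a send along these lines works from the read-only import (the dataset, snapshot, and host names are placeholders):

zfs send -R -I idunn/data@last-backup idunn/data@latest | ssh backup-nas zfs receive -u tank/rescue/data   # incremental send of existing snapshots (you can’t create new ones on a read-only pool)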
The RAIDZ drives are Samsung QVO 870s, the mirrors are Western Digital SN850s.
Yes, I’ve already imported it read only and sent the last snapshots to my backup drive. Any kind of read-write import seems to immediately cause a kernel panic, though, so I guess I’ll just have to destroy and recreate it. Ugh.
What diagnostics would you suggest? I’ve run smartctl self-tests; all the drives claim to be in good health.
Thanks for the link. I know how ZFS handles transactions, or at least the basics of it, but I’ll need lots to read while the spinning rust ticks away.
Not sure what’s going on. tar up the contents of /var/log and SCP them down to your PC. I can take a quick look…but not sure it will help you much.
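Something like this should do it (user and hostname are placeholders):

tar czf workstation-logs.tar.gz /var/log
scp workstation-logs.tar.gz you@your-pc:~/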
No idea what specs you are running. A cascading failure from a problem with the RAM, CPU, PSU, or mobo can lead to unpredictable behavior and logs that “don’t always mean what you think they mean”. So can bad backplanes or cables…
Just as an example: if you’re using a crummy SATA controller from China or something, weird things happen and you may not even have a good indicator of what went wrong.
Could be the free space map got borked; that would explain why the read-only import works without issue, and also the kernel panics.
It seems fairly plausible, because the free space map gets appended to at seemingly random times, and if that write is interrupted just right it will corrupt itself.
If that’s the case, there isn’t anything you can do with ZFS’s built-in tools to fix it.
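If you want to poke at that theory without risking a read-write import, zdb can dump the metaslab and space map information straight from the exported pool; it’s purely diagnostic and won’t change anything on disk:

zdb -e -m idunn | less    # -e reads an exported/unimported pool, -m dumps metaslabs and space maps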
Oh, that sucks.
Guess I’ll just recreate the pool. If it breaks again I’ll look into hardware issues, but I’ve never had any problems with these disks and I get identical crashes on both computers I’ve had the pool plugged into so…
Ran another set of long smartctl tests overnight, no errors. Tried to import it with FreeBSD, but ran into the same problem there. The pool does seem to be dead. The experience has really soured me on ZFS, though. send|receive means I’ll stick with it, but just losing an entire pool like this really sucks.
I suspected it would be similar given the shared codebase. Since it’s readable, at least it’s not completely lost. If you want to do some simple stress testing, 7-Zip’s benchmark is easy to use, and you can also adjust its memory utilization without leaving the OS.
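For example (thread count and dictionary size are just illustrative; a larger dictionary uses more RAM per thread):

7z b                 # default benchmark run
7z b -mmt8 -md26     # 8 threads, 2^26-byte (64 MiB) dictionary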
Aaand it’s happened again.
I’m thinking it might be related to ARC write error makes pool unusable · Issue #15466 · openzfs/zfs · GitHub, so at least it’s being worked on. Still annoying, though. Since the last time, I’ve taken to piping journalctl --follow over ssh to another machine, so for once I’ve actually got the error the PC produces when it breaks a pool, rather than just the one it produces when it tries to import the already-broken pool.
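The capture is nothing fancier than this, run on the workstation (user and hostname are placeholders):

journalctl --follow | ssh me@backup-nas 'cat >> workstation-journal.log'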
I guess don’t update your ZFS if you want things to work. Ugh! The Linux world needs to stop copying web devs’ “let’s make things worse” update policy; rebuilding my pools all the time is really annoying.