Arch on ZFS crashing

Hello, I have been having an issue with a server I have set up at my summer home.

The server is a Dell R720 running Arch with the 6.1.15-1-lts kernel.

After a couple of weeks of uptime, VMs start crashing, podman containers keep restarting, and the system becomes generally unstable.

This is particularly problematic as the server is running a pfSense VM which keeps restarting and is bringing the network up and down constantly. The only way to fix it remotely is to issue a hard reset via iDRAC. If I try to gracefully restart the server it hangs and then I have to drive there and force reset it on site.

Please help me troubleshoot! Is this ZFS related? Here is a dmesg paste:

Looks like you already found this issue: task txg_sync blocked for more than 120 seconds. · Issue #9130 · openzfs/zfs · GitHub
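If it helps anyone comparing logs, here is a minimal Python sketch for pulling those "blocked for more than N seconds" warnings out of a saved dmesg dump. It assumes the usual kernel hung-task wording ("INFO: task NAME:PID blocked for more than N seconds."); the function name `find_hung_tasks` and the script name are made up for illustration, and it is only a quick filter, not a diagnosis.

```python
import re
import sys

# Assumed format of the kernel hung-task warning, e.g.
#   "INFO: task txg_sync:<pid> blocked for more than 120 seconds."
# The non-greedy name group lets task names that contain ':' still match.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a list of (task_name, pid, seconds) for each hung-task warning."""
    hits = []
    for line in log_text.splitlines():
        m = HUNG_TASK.search(line)
        if m:
            hits.append((m.group("name"), int(m.group("pid")), int(m.group("secs"))))
    return hits


# Only runs when a log file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as f:
        for name, pid, secs in find_hung_tasks(f.read()):
            print(f"{name} (pid {pid}) blocked for more than {secs}s")
```

Something like `dmesg > dmesg.txt` followed by `python3 hung_tasks.py dmesg.txt` should list which tasks are stalling. If txg_sync is the first or only one, that points toward the pool's I/O path rather than the VMs themselves.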

I looked at your logs and yeah I’m not clear what the underlying issue is.

To be clear, it appears you’re saying the server starts and then runs fine for weeks, until everything suddenly starts falling apart?

Exactly. And the funny thing is that I have an almost identical server running at my main home, and I dd’d (and zfs sent) its boot drives to the summer home one, changed a couple of things and set it up because I was bored setting up everything from scratch. The main home server has never exhibited the same issue and apart from being almost identical in software, they are almost identical in hardware as well (Two SSDs in ZFS mirror, same H310 HBA, summer home is R720, main home is R720xd).

Are you trying to delete received encrypted snapshots?

Not manually, but I do have syncoid run once a day to back up each server to the other and some datasets are encrypted.

Is this an issue?

The answer is “maybe”. There are currently issues with native encryption, see my comment here: What's the common setup for ZFS encrypted home server - #11 by Log

I have absolutely no idea how to work out whether you’re running into one of those bugs, or something else. My gut feeling is to suspect something else, not that that’s very helpful.

If you haven’t already, consider hitting up the zfs mailing list. There’s a nice official frontend for it here: Topicbox

There are a couple of threads on TrueNAS Community and a link from there to a PR on the openzfs GitHub, maybe in a comment by ikarlo (sp?)

If you attempt to manually do a delete of said snapshot and the box crashes, that is a very strong hint, in which case I would be adjusting sanoid not to trim snapshots for the moment. Then you need to think how that will affect you in the short to medium term.