Hello, I have been having an issue with a server I have set up at my summer home.
The server is a Dell R720 running Arch Linux with the 6.1.15-1-lts kernel.
After a couple of weeks of uptime, something happens: VMs start crashing, podman containers keep restarting, and the system becomes unstable in general.
This is particularly problematic as the server is running a pfSense VM which keeps restarting and is bringing the network up and down constantly. The only way to fix it remotely is to issue a hard reset via iDRAC. If I try to gracefully restart the server it hangs and then I have to drive there and force reset it on site.
Please help me troubleshoot! Is this ZFS related? Here is a dmesg paste:
Exactly. And the funny thing is that I have an almost identical server running at my main home, and I dd’d (and zfs sent) its boot drives to the summer home one, changed a couple of things and set it up that way because I couldn't be bothered to set everything up from scratch. The main home server has never exhibited the same issue, and apart from being almost identical in software, they are almost identical in hardware as well (two SSDs in a ZFS mirror, same H310 HBA; summer home is an R720, main home is an R720xd).
I have absolutely no idea how to work out whether you're running into one of those bugs or something else. My gut feeling is to suspect something else, not that that’s very helpful.
If you haven’t already, consider hitting up the zfs mailing list. There’s a nice official frontend for it here: Topicbox
There are a couple of threads on TrueNAS Community and a link from there to a PR on the openzfs GitHub, possibly in a comment by ikarlo (sp?)
If you attempt to manually delete said snapshot and the box crashes, that is a very strong hint, in which case I would adjust sanoid not to trim snapshots for the moment. Then you need to think about how that will affect you in the short to medium term.
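For what it's worth, here is a rough sketch of what "not trimming" could look like in sanoid.conf. The dataset name `tank/vms` and the template name `production` are placeholders I made up, so substitute whatever your config actually uses. As far as I know, `autoprune` is the option that controls whether sanoid deletes expired snapshots, and setting it on the dataset section should override the template.

```
# /etc/sanoid/sanoid.conf (fragment; dataset and template names are hypothetical)
[tank/vms]
        use_template = production
        # keep taking snapshots, but stop sanoid from destroying old ones for now
        autoprune = no
```

Keep in mind that with pruning off, snapshots accumulate until you turn it back on, so keep an eye on pool free space in the meantime.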