TrueNAS Core - System Instability/Freezing

OS: TrueNAS Core
Motherboard: ASUS PRIME X370-PRO
BIOS: 6042 - Up to Date
CPU: AMD Ryzen 5 1600 Six-Core Processor
RAM: Kingston 9905701-143.A00G 16GB DDR4 2666MT/s DIMM (running at 2400MT/s) x 2
RAID Card: LSI 9201-16i 6Gbps 16-port SAS HBA, P19 IT mode firmware
M.2 Adapter: ASUS HYPER M.2 X16 CARD V2
M.2s: Intel Optane SSD P1600X SSDPEK1A118GA01 M.2 2280 x 4
PSU: CORSAIR RMx Series (2021) RM850x CP-9020200-NA 850 W

The issue I am having with this setup is that after some time the system just freezes and fails to respond until I reboot it. The system was working fine when the RAID card was in PCIe slot 1 with the parts above, minus the M.2 card. The only changes I have made since then are the UEFI settings needed for the M.2 card to show all of its M.2 devices: setting PCIe slot 1 to RAID Mode, which splits the PCIe lanes x4/x4/x4/x4, and moving the RAID card to slot 3. The server runs fine for a few days until it stops responding. I can't find anything in the logs or on the syslog server I set up, so I'm at a loss.

Does anyone have an idea what is causing this, and/or changes I can make to help find the cause of the problem and fix it?

Added this later since I noticed I missed it in the comments:
The OS is installed on two SSDs in a ZFS mirror connected to the SATA ports.
All the HDDs go through the LSI card.
The M.2s in the Hyper M.2 card are for setting up a metadata vdev in at least a three-way mirror, but since the system has been unstable I haven't done that yet.
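For reference, once the system is stable the metadata vdev step itself is a single zpool operation (TrueNAS generally prefers pool changes to go through the UI; the sketch below just shows what it amounts to). The pool name and device names are placeholders, and it assumes the Optanes show up as nvd0/nvd1/nvd2 on Core:

```python
# Sketch only: adds a 3-way mirrored special (metadata) vdev to an existing pool.
# The pool name and device names are placeholders; check "zpool status" first.
# Note that a special vdev generally cannot be removed later if the pool
# contains raidz vdevs, so the dry run below is worth doing.
import subprocess

POOL = "tank"                            # placeholder pool name
SPECIAL_DEVS = ["nvd0", "nvd1", "nvd2"]  # placeholder Optane device names on Core

# "-n" is a dry run: it prints the layout that would result without changing anything.
subprocess.run(["zpool", "add", "-n", POOL, "special", "mirror", *SPECIAL_DEVS], check=True)

# Uncomment once the dry-run output shows the three drives under a "special" mirror:
# subprocess.run(["zpool", "add", POOL, "special", "mirror", *SPECIAL_DEVS], check=True)
```

The dry run is a cheap way to confirm the drives land as a special vdev mirror and not as a regular data vdev.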

My first thought was a lack of PCIe lanes, but the manual suggests slot 3 is off the chipset (an x16-size slot running at x4), which should still be enough for an HBA feeding spinning disks. I'm assuming you are using motherboard bifurcation for the SSDs and running ZFS across those… if there really is some "RAID mode" I would investigate that next. After that, I would investigate the RAM: since you don't have ECC, you will get less evidence of anything being wrong. Try single sticks in different slots.

Starting to clutch at straws, I'd then disconnect the drives from the LSI card and leave the system running. Then I would put the LSI card in slot 2 - this would require running the Hyper M.2 in x8 mode (remove two drives and rebuild the OS).

You could also try TrueNAS Scale (Linux-based) - at least that would be a fairly quick and easy change, but I wouldn't expect any difference…

Changing the OS is likely not going to fix the issue here, but oh well…

@obvious-potatoes
Do what everyone else does during troubleshooting:
Disconnect everything unnecessary, stress test… then add one piece of hardware, test again, and so on.
I wouldn't rule out possible power issues (PSU or motherboard) either.
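To make the stress-test step concrete, a staged run along these lines works as a basic burn-in. This is only a sketch: it assumes stress-ng is available somewhere (e.g. a live Linux USB or a jail/VM, since TrueNAS doesn't necessarily ship it), and the durations and worker counts are arbitrary starting points:

```python
# Rough burn-in sketch. Assumes stress-ng is available (e.g. booted from a live
# Linux USB, or installed in a jail/VM); durations and worker counts are only
# starting points, not a recommendation.
import subprocess

STAGES = [
    # CPU only first, then memory pressure, then a mixed load with some disk I/O.
    ["stress-ng", "--cpu", "0", "--timeout", "30m", "--metrics-brief"],
    ["stress-ng", "--vm", "2", "--vm-bytes", "75%", "--timeout", "30m", "--metrics-brief"],
    ["stress-ng", "--cpu", "0", "--vm", "2", "--hdd", "2", "--timeout", "1h", "--metrics-brief"],
]

for cmd in STAGES:
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop at the first stage that reports a failure
```

If the freeze reproduces under the memory stage but not the CPU stage, that would point back at the RAM/memory controller; if it only reproduces with everything attached, it points more toward the PCIe/bifurcation side.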

I had a similar issue to this on TrueNAS when I ran it off a failing SSD. It took ages to figure out what the issue was. A new install (backup the config and restore) on a new SSD solved the problem, and it ran fine on the same system for years afterwards.

If you have access to another m.2 boot drive I’d suggest trying a backup and reinstall.


The way ASUS does it on the X370: when you set "RAID Mode" on PCIe slot 1, it bifurcates that slot x4/x4/x4/x4, but it also uses the lanes for PCIe slot 2, so nothing can be placed in that slot, which is why the LSI card is in slot 3. The Hyper M.2 card wouldn't work without this setting, since it expects those lanes to be split across the M.2 drives. If you do place something in PCIe slot 2, only two of the drives show up when the card is in slot 1.
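A quick way to confirm the bifurcation actually took effect is just to count the NVMe controllers the OS sees; a small sketch (nothing here is TrueNAS-specific, it only looks at device nodes):

```python
# Counts the NVMe controllers the OS can see. /dev/nvme0../nvme3 are the controller
# nodes on both Core (FreeBSD) and Scale (Linux); the block devices are nvd*/nvme*n1.
import glob

controllers = sorted(glob.glob("/dev/nvme[0-9]"))
print(f"NVMe controllers found: {len(controllers)} -> {controllers}")
# Expect 4 with the Hyper M.2 fully populated and x4/x4/x4/x4 active; only two
# showing up would match the "something in slot 2" behavior described above.
```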

Is there a guide I can use for stress testing beyond the sketch above? I did run memtest before and the RAM passed.
The system was stable before I changed the BIOS settings, back when the LSI card was in PCIe slot 1 and none of the UEFI changes had been made. It ran fine for over a week and a half.
The reason I'm trying to find a solution with the current config is the Hyper M.2 card requirement described above.

Also, the PSU I have is the CORSAIR RMx Series (2021) RM850x CP-9020200-NA 850 W, which should be sufficient.

So I thought this was the issue as well. The system is already installed on two SSDs that are in a mirror through the SATA connections. The cables are in the picture, but the case with the drives and SSDs is right next to it.

At this point it's an option, but I was trying to stay on Core since I've had no issues with it (until now), I don't plan on using any of Scale's features, and I have other machines running Core with cloud syncs over FTP that were failing when I migrated one of them over; I'm guessing due to differences in FTP versions, which I couldn't resolve, so I rolled back.

I agree with sticking with Core - that is what I do. However, a Linux version may produce some logs that Core does not, and if it holds up for a week or two, at least you have narrowed it down. As long as you don't upgrade the ZFS version, you can flip between the two…

This is a good point. I’ll do this.

here.txt (195.2 KB)
Here are the logs. The system crashed three times on 2023-02-01. This was on Scale, but I still don't see anything in the logs.
The crash times were 16:06:41, 19:49:04, and 21:21:03, to line them up with the log timestamps.
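Since nothing is landing in the logs before the hang, one option is to leave a small snapshot logger running so the last entry before the next freeze carries some state with it. A rough sketch (Python 3; the log path, interval, and command list are placeholders):

```python
# Minimal "leave it running" snapshot logger: appends a timestamped health snapshot
# every minute so the last entry before a freeze still says something. Run as root;
# the log path, interval, and command list are placeholders.
import os
import subprocess
import time
from datetime import datetime

LOGFILE = "/mnt/tank/freeze-watch.log"   # placeholder: somewhere that survives a reboot
INTERVAL = 60                            # seconds between snapshots
COMMANDS = [["uptime"], ["zpool", "status", "-x"], ["dmesg"]]

while True:
    with open(LOGFILE, "a") as log:
        log.write(f"\n===== {datetime.now().isoformat()} =====\n")
        for cmd in COMMANDS:
            out = subprocess.run(cmd, capture_output=True, text=True)
            tail = "\n".join(out.stdout.splitlines()[-15:])  # keep chatty output short
            log.write(f"--- {' '.join(cmd)} ---\n{tail}\n")
        log.flush()
        os.fsync(log.fileno())           # push it to disk in case the box dies mid-write
    time.sleep(INTERVAL)
```

Whatever shows up in the final snapshot before a hang (a degraded pool, a flood of kernel messages, a climbing load average) at least narrows where to look next.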
Thank you all for the help and everything. While I'm still open to options/answers, I'm thinking about starting to look for a server motherboard that supports AM4/AM5, since it "should" have better support for the system config that I want. If anyone has any recommendations I would appreciate it.

Disclaimer: I’m a relative amateur with TrueNAS Core

It does sound like TrueNAS doesn't like the board's RAID config. I'm not saying it's the right thing to do, but in your position I would:

Try a test install of Scale
Failing that, try another board, picking one after watching @wendell's many videos on the subject of NVMe drives and Core.

Sorry I can't help further - you've shown some good info for anyone who might be able to help, though.

Good luck!

At the risk of getting way out of my depth, here are some more thoughts…

Taking these lines from your logs:

Feb 1 19:49:27 freenas kernel: Adding 2094076k swap on /dev/mapper/md127. Priority:-3 extents:1 across:2094076k FS
Feb 1 19:49:28 freenas kernel: Adding 2094076k swap on /dev/mapper/md126. Priority:-4 extents:1 across:2094076k FS
Feb 1 19:49:28 freenas kernel: md: md126: resync interrupted.
Feb 1 19:49:29 freenas kernel: md: resync of RAID array md126
Feb 1 19:49:29 freenas kernel: md: md126: resync interrupted.
Feb 1 19:49:29 freenas kernel: md: resync of RAID array md126
Feb 1 19:49:29 freenas kernel: md: md126: resync done.
Feb 1 19:49:29 freenas kernel: md126: detected capacity change from 4188160 to 0

Not sure what the “adding swap” bit is - perhaps a Scale thing? But Linux swap doesn’t work well on ZFS I believe. Maybe a red herring?

Also (and this could be a Scale thing, I suppose?) what is the mdraid about? On my NixOS system with ZFS, I don't see any obvious log entries like this at all.
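For what it's worth, I believe Scale creates a small swap partition (about 2 GiB, matching the 2094076k in the log) on each data disk and mirrors those with Linux md, with a device-mapper layer on top, so the md126/md127 lines are probably just the swap mirrors being assembled rather than anything touching the ZFS pools. A quick way to check on Scale:

```python
# Quick look at what those md devices actually are (run on Scale).
# /proc/mdstat and /proc/swaps are standard Linux interfaces, nothing TrueNAS-specific.
from pathlib import Path

print("--- /proc/mdstat ---")
print(Path("/proc/mdstat").read_text())

print("--- /proc/swaps ---")
print(Path("/proc/swaps").read_text())
# Expect md126/md127 assembled from small partitions on the data disks, with the
# swap entries pointing at /dev/mapper devices layered on top of them.
```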

So just a sanity check: you really do have the LSI card in HBA/IT mode (not IR/RAID), and what exactly do you mean by "set PCIe slot 1 to RAID Mode"? Just checking that you haven't also enabled the board's SATA RAID options.

The reason for the sanity check is that I read the logs as a pair of drives failing a resync, which makes me wonder if it is your SATA boot pair.

Having done the sanity check, you could use the M.2 card drives as your pool, leaving the LSI card in place but unplugging all the HDDs (I assume they are in a ZFS pool…).

If I understand your config, I don’t think that AM4/5 will make any difference, so I’d keep going a little while longer with this (unless you want the excuse of course!)

Hopefully this helps move things along a little - it's bugging me, and my remote analysis feels under par this week.

So what I did when I upgraded to Scale was "unpair" my SATA boot mirror for easy rollback.
This drive was running Scale, and the other one, left unplugged, still had Core. That way, all I had to do to roll back was plug the Core drive back in, and if I stayed on Scale I just needed to re-add the other mirror drive.

So you disconnected one of the mirror drives… sorry, I could have been clearer in a previous post.

For future reference, switching between Scale and Core is just a boot environment (System → Boot). I have three Core entries on my system (13.0-U3 and 13.0-U3.1) along with an old Scale entry (22.02.0.1). Just select the version you want to activate from the menu and reboot! A reminder that the Linux/Scale ZFS version is (or can be) newer than the BSD/Core version; if you upgrade the pool to that version, you can't go back.
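One way to double-check that the door back to Core is still open: running zpool upgrade with no arguments is read-only and simply lists any pool that doesn't have every supported feature enabled. A small sketch:

```python
# "zpool upgrade" with no arguments is read-only: it only lists pools that do not
# have every supported feature enabled, which is exactly the state you want to
# preserve if you still intend to boot back into Core.
import subprocess

out = subprocess.run(["zpool", "upgrade"], capture_output=True, text=True, check=True)
print(out.stdout)
# If a pool is listed here, leave it alone: do not run "zpool upgrade -a" or accept
# the pool-upgrade prompt in the UI while you want the option of the older Core ZFS.
```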