Since 28 November I’ve been going through my worst hardware troubleshooting nightmare ever. I hope someone here can help me get to the bottom of this.
The hardware / software
The problem occurs on my one-year-old FreeNAS / TrueNAS server, which has the following hardware:
- OS: FreeNAS11U3 through TrueNAS 12U1 (the upgrade to TrueNAS12 was around the beginning of November)
- Case: Fractal Design Define R6 USB-C
- PSU: Fractal Design ION+ 660W Platinum
- Mobo: ASRock Rack X470D4U2-2T
- NIC: Intel X550-AT2 (onboard)
- CPU: AMD Ryzen 5 3600
- RAM: 32GB DDR4 ECC (2x Kingston KSM26ED8/16ME), which was replaced by 64GB DDR4 ECC (2x Samsung M391A4G43MB1-CTD) at the beginning of November
- HBA: Dell H310 ( = LSI SAS 9211-8i), which was replaced by an LSI SAS 9207-8i at the beginning of 2021 as part of the troubleshooting
- HDDs: 8x WD Ultrastar DC HC510 10TB (RAID-Z2)
- Boot disk: Intel Postville X25-M 160GB
- SLOG: Intel Optane 900P 280GB (added at the beginning of October)
This server ran for almost a year without issues, until 28 November, and was properly burned in using Memtest86, Prime95, solnet-array-test, badblocks, etc.
The (horror) issue
I have monthly scrubs scheduled on my server on the 28th of the month. On 28 November I got 6 reboots during the scrub run, and it produced 8 CKSUM errors on 5 different HDDs. Before this, I had never had issues with scrubs…
I’ve thoroughly checked the logs and nothing can be found regarding the reboots or data corruption.
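With 30-50 scrub runs to compare, it can help to tally the per-disk CKSUM counts automatically instead of reading them off by eye. Below is a minimal sketch that parses text in the usual `zpool status` column layout (NAME, STATE, READ, WRITE, CKSUM). The sample text, the pool name `tank` and the device names `da0`…`da3` are made up for illustration and are not taken from the server described here.

```python
import re

# Hypothetical sample in the usual `zpool status` layout; the pool and
# device names are placeholders, not the real ones from this server.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da0     ONLINE       0     0     2
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     1
            da3     ONLINE       0     0     3

errors: No known data errors
"""

# One config row: name, state, then the three error counters.
# Assumption: counters are plain integers; abbreviated counts such as
# "1.2K" would need extra handling.
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def cksum_errors(status_text):
    """Return {name: cksum_count} for every row with a non-zero CKSUM column.

    Pool and vdev rows are included too if their counter is non-zero.
    """
    errors = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match and int(match.group(5)) > 0:
            errors[match.group(1)] = int(match.group(5))
    return errors

result = cksum_errors(SAMPLE)
print(result)
print("total CKSUM errors:", sum(result.values()))
```

On a real system the text would come from running `zpool status` on the pool after each scrub; saving one dictionary per run makes it easy to see whether the errors follow particular disks or are spread randomly.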
Recent hardware / software changes
As mentioned above, I replaced the RAM and upgraded from FreeNAS11 to TrueNAS12 in the same month as the errors started occurring. One month before that, I also added an Intel Optane as SLOG (but the scrub on 28 October, after that change, completed without problems).
So those are of course “likely suspects”…
A downgrade from TrueNAS12 back to FreeNAS11 isn’t easy, as I’ve already upgraded my pool’s feature flags, so I would need to destroy my pool and restore all data from my (non-redundant) offline backup (which is something I’m hesitant to do).
Something weird about the reboots
The reboots occur “in pairs” with about 4m30s / 4m35s in between. I’ve discovered that I can “prevent” the 2nd reboot by manually rebooting before 4m30s / 4m35s pass.
The 2nd reboot also occurs while the pool is still locked (I have encryption on my pool), so while nothing, or hardly anything, is actually happening on the pool.
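The "in pairs" pattern can be checked mechanically once the boot times are collected. The sketch below flags consecutive boots whose gap falls in a window around the observed 4m30s / 4m35s. The timestamps and the 4-to-5-minute window are assumptions chosen for illustration; they are not real data from this server.

```python
from datetime import datetime, timedelta

def find_reboot_pairs(boot_times,
                      low=timedelta(minutes=4),
                      high=timedelta(minutes=5)):
    """Return (first, second, gap) for consecutive boots with low <= gap <= high."""
    times = sorted(boot_times)
    pairs = []
    for first, second in zip(times, times[1:]):
        gap = second - first
        if low <= gap <= high:
            pairs.append((first, second, gap))
    return pairs

# Hypothetical boot timestamps, for illustration only.
boots = [
    datetime(2020, 11, 28, 1, 10, 0),
    datetime(2020, 11, 28, 1, 14, 32),  # 4m32s after the previous boot
    datetime(2020, 11, 28, 5, 40, 0),
    datetime(2020, 11, 28, 5, 44, 35),  # 4m35s after the previous boot
    datetime(2020, 11, 28, 9, 0, 0),    # a lone boot with no partner
]

for first, second, gap in find_reboot_pairs(boots):
    print(first, "->", second, "gap", gap)
```

If nearly every unexpected reboot has a partner inside the window, that supports the idea of a fixed timer (a watchdog or a scheduled task, for example) rather than a random hardware glitch.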
What I’ve tried so far (unsuccessful)
- Since 28 November I’ve run about 30-50 scrubs for my troubleshooting, each of which takes about 12 hours. For each thing I try, I need to run scrub at least twice, because the first run can still find CKSUM errors that were created before the change.
- At first I wasn’t able to reproduce the reboots. Since mid-December I’ve discovered that setting sync=always without the Optane SLOG increases the chance of reboots. The reboots still occur at random, though (but always in pairs).
- In Windows, I’ve run an extended self-test on all 8 HDDs simultaneously, while also running Prime95 blended at the same time. This should stress the HDDs, the PSU, the CPU and the RAM all at the same time. No reboots occurred and all HDDs were found to be 100% healthy.
- Also the SMART values of all HDDs are 100% healthy.
- I’ve confirmed that memory ECC reporting works properly with the 64GB RAM (I had already confirmed this with my 32GB RAM before the issues started occurring, but, just to make sure, I reconfirmed it with the 64GB RAM).
- I underclocked the RAM (ran it below spec). This didn’t help.
- I’ve tried triggering the issue on my Intel Optane by creating a single disk pool on the Optane, constantly running scrub / writing data to it, for a whole day. I couldn’t reproduce the issue on the Intel Optane. This shifted my suspicion to the HBA, as the issue only seemed to occur when using a pool on the HBA.
- I’ve completely removed the Intel Optane from my server. This did not solve the issue either. (This is when I discovered the impact of sync=always on the likelihood of reboots.)
- I’ve forced “Power Supply Idle Control” to “Typical Current Idle” in the BIOS and confirmed that CPU C-States were already disabled. This did not help either.
- I’ve upgraded my BIOS and the IPMI, and upgraded from TrueNAS12 to TrueNAS12U1. This did not help either.
- I’ve re-inserted the HBA and re-attached the SFF-8087 cable, both on the HDD and HBA side. Didn’t help.
- I’ve (temporarily) added a screamingly loud Delta datacenter 120mm fan. My wife almost kicked me out of the house because of the noise, but, as it again didn’t help, I’m quite sure it is not temperature related.
- I’ve (temporarily) replaced the SFF-8087 cable with an old one. Didn’t help.
- I’ve (temporarily) replaced the PSU with an old 850W Seagate PSU. Didn’t help.
- I’ve bought a new HBA (LSI 9207-8i instead of Dell H310), installed an Intel CPU cooler on it, to make sure it is properly cooled and tried again. Didn’t help.
What I’ve tried so far (slightly successful)
- I’ve moved the HBA from PCI-e slot 6 to PCI-e slot 4. Both of these slots come directly from the CPU (not from the chipset), so it shouldn’t really matter, but it did make a difference… There were noticeably fewer CKSUM errors during the many scrub runs I’ve tried, but some CKSUM errors still occurred (about 1 per run, and once even 0). So I’m getting closer (finally).
- I’ve replaced the Ryzen 3600 in my server with my desktop’s Ryzen 3900XT and ran scrub 3x without errors with the HBA in slot 6 and 2x without errors with the HBA in slot 4. So it seems like my issue is related to my CPU! I also discovered (I think) a tiny bit of thermal paste covering about 2-3 pins on the side of my Ryzen 3600 CPU, so with a soft brush and lots of patience / IPA I cleaned it off.
- Finally, I re-inserted my cleaned Ryzen 3600 and ran scrub some more. The first run completed without CKSUM errors, but on the 2nd run I again got 1 error. So although it certainly seems better than before, my problem still isn’t solved.
Conclusions so far
- It seems like the issue is related to PCI-e / the CPU
- It could be related to the thermal paste covering the pins, but then it is still very strange that
  - the problem only started occurring after almost a year
  - cleaning it didn’t solve it completely
- When googling for PCI-e errors, I found, for example, this thread from wendell: “Threadripper & PCIe Bus Errors”. Two “remarkable” things I notice in this thread:
  - PCI-e errors should be detected
  - PCI-e errors can even be corrected sometimes
- The differences are that
  - I’m using Ryzen (vs Threadripper), but I don’t expect this to matter, as both are using the same Zen cores
  - I’m using a BSD flavor (TrueNAS12) (vs Linux), but here too I would expect the experience to be the same, as TrueNAS12 has “added” support for Ryzen (FreeNAS11 didn’t have this) and properly reports, for example, MCA errors (like memory ECC errors). I validated this myself.
- That the problem didn’t occur with my Intel Optane, but only with the HBA, could be related to the Optane being in PCI-e slot 4, while the HBA was in PCI-e slot 6
- It still blows my mind that the reboots occur in pairs. This smells like a “software-issue”, but everything else clearly points at a “hardware-issue”
- Why did cleaning my CPU only help a little bit?
  - Is my CPU broken?
  - Should I try to clean it differently? It was only an extremely small amount of thermal paste (I was lucky to spot it) and I was very thorough in cleaning it…
- Why am I not seeing PCI-e errors in /var/log/messages? @wendell do you perhaps have an idea?
  - If PCI-e errors are being corrected (without being reported), then I suppose the problem is even worse, as only the ones that fail to be corrected would show up as CKSUM errors in scrub.
- Am I missing something?
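
For the /var/log/messages question, here is a minimal sketch of how the log search could be scripted so it can be repeated after every scrub. The keyword list is an assumption: the exact wording of PCI-e and machine-check messages depends on the OS version and drivers, so it should be adjusted to whatever the system really logs. The sample lines are hypothetical placeholders, not real log output.

```python
import re

# Assumed keywords; adjust to the messages your OS and drivers really emit.
# "correctable" also matches "uncorrectable".
PATTERNS = re.compile(
    r"\bMCA\b|machine check|PCI-?e\b|\bAER\b|correctable",
    re.IGNORECASE,
)

def suspicious_lines(lines):
    """Return the lines (without trailing newline) that match any keyword."""
    return [line.rstrip("\n") for line in lines if PATTERNS.search(line)]

# Hypothetical placeholder lines, for illustration only.
sample = [
    "Jan  1 00:00:01 nas kernel: hypothetical line about a PCIe correctable error\n",
    "Jan  1 00:00:02 nas kernel: hypothetical line about a USB device\n",
    "Jan  1 00:00:03 nas kernel: hypothetical MCA: bank 0 line\n",
]

for hit in suspicious_lines(sample):
    print(hit)
```

To scan the real log, pass an open file object instead of the sample list, for example `suspicious_lines(open("/var/log/messages"))`. An empty result only shows that nothing matching these keywords was logged, not that no PCI-e errors happened.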