ZFS data corruption and random reboots during scrubs on TrueNAS12

Mastakilla · January 13, 2021, 10:07am

Hi all,

Since 28 November I’ve been going through my worst hardware troubleshooting nightmare ever. I hope someone here can help me get to the bottom of this.

The hardware / software

The problem occurs with my 1-year-old FreeNAS / TrueNAS server, which has the following hardware:

OS: FreeNAS11U3 up to TrueNAS12U1 (the upgrade to TrueNAS12 was somewhere in the beginning of November)
Case: Fractal Design Define R6 USB-C
PSU: Fractal Design ION+ 660W Platinum
Mobo: ASRock Rack X470D4U2-2T
NIC: Intel X550-AT2 (onboard)
CPU: AMD Ryzen 5 3600
RAM: 32GB DDR4 ECC (2x Kingston KSM26ED8/16ME) which was replaced by 64GB DDR4 ECC (2x Samsung M391A4G43MB1-CTD) in the beginning of November
HBA: Dell H310 ( = LSI SAS 9211-8i) which was replaced by LSI SAS 9207-8i in the beginning of 2021 as part of the troubleshooting
HDDs: 8x WD Ultrastar DC HC510 10TB (RAID-Z2)
Boot disk: Intel Postville X25-M 160GB
SLOG: Intel Optane 900P 280GB (added beginning of October)

This server ran for almost a year without issues, until 28 November, and was properly burned-in by using Memtest86, Prime95, solnet-array-test, badblocks, etc…

The (horror) issue

I have monthly scrubs scheduled on my server on the 28th of the month. On 28 November I got 6 reboots during the scrub-run and it produced 8 CKSUM errors on 5 different HDDs. Before this, I never had issues with scrub…
I’ve thoroughly checked the logs and nothing can be found regarding the reboots or data corruption.
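
To keep track of which disks pick up CKSUM errors from one scrub to the next, the per-device counters printed by `zpool status` can be tallied with a small script. The following is only a minimal sketch: the sample text is made up (the pool name "tank" and the disk labels are hypothetical), and it assumes the counters are printed as plain integers.

```python
# Made-up sample in the usual `zpool status` layout. The pool name "tank" and
# the disk labels are hypothetical, not taken from the server in this thread.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME             STATE     READ WRITE CKSUM
        tank             ONLINE       0     0     0
          raidz2-0       ONLINE       0     0     0
            gptid/disk1  ONLINE       0     0     2
            gptid/disk2  ONLINE       0     0     0
            gptid/disk3  ONLINE       0     0     1

errors: No known data errors
"""


def cksum_errors(status_text):
    """Map each device name in the config table to its CKSUM counter."""
    counts = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # header row found, device rows follow
            continue
        if in_config:
            if not fields:
                break  # a blank line ends the config table
            # Columns: NAME STATE READ WRITE CKSUM (only integer counters handled)
            if len(fields) >= 5 and fields[4].isdigit():
                counts[fields[0]] = int(fields[4])
    return counts


# Keep only the devices that actually reported checksum errors.
failing = {name: n for name, n in cksum_errors(SAMPLE).items() if n > 0}
print(failing)
```

On a real system the text would come from running `zpool status` for the pool after each scrub. Abbreviated counters (for example "1.2K") are skipped by this sketch, so exact numbers would be needed for large counts.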

Recent hardware / software changes

As you could read earlier, I have replaced the RAM and upgraded from FreeNAS11 to TrueNAS12 in the same month as the errors started occurring. One month before that, I also added an Intel Optane as SLOG (but with this change in place I did have a problem-free scrub on 28 October).
So those are of course “likely suspects”…
A downgrade from TrueNAS12 back to FreeNAS11 isn’t easy, as I’ve already upgraded my pool’s feature flags, so I would need to destroy my pool and restore all data from my (non-redundant) offline backup (which is something I’m hesitant to do).

Something weird about the reboots

The reboots occur “in pairs” with about 4m30s / 4m35s in between. I’ve discovered that I can “prevent” the 2nd reboot by manually rebooting before 4m30s / 4m35s pass.
The 2nd reboot also occurs when the pool is still locked (I have encryption on my pool). So it happens even while nothing, or hardly anything, is actually happening with the pool.
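
To check whether reboots really come in pairs, a list of boot times can be scanned for consecutive boots that are roughly 4m30s apart. This is a minimal sketch under stated assumptions: the timestamps below are invented for illustration, and the 30-second tolerance is an arbitrary choice.

```python
from datetime import datetime, timedelta


def find_reboot_pairs(boot_times,
                      gap=timedelta(minutes=4, seconds=30),
                      tolerance=timedelta(seconds=30)):
    """Return (first, second) boots whose spacing is within `tolerance` of `gap`."""
    ordered = sorted(boot_times)
    pairs = []
    i = 0
    while i < len(ordered) - 1:
        delta = ordered[i + 1] - ordered[i]
        if abs(delta - gap) <= tolerance:
            pairs.append((ordered[i], ordered[i + 1]))
            i += 2  # both boots are consumed by this pair
        else:
            i += 1
    return pairs


# Invented boot timestamps, for illustration only.
BOOTS = [
    datetime(2020, 11, 28, 1, 10, 0),
    datetime(2020, 11, 28, 1, 14, 32),   # 4m32s after the previous boot
    datetime(2020, 11, 28, 5, 40, 10),
    datetime(2020, 11, 28, 5, 44, 43),   # 4m33s after the previous boot
    datetime(2020, 11, 28, 9, 0, 0),     # a single boot without a partner
]

for first, second in find_reboot_pairs(BOOTS):
    print(first, "->", second, "gap:", second - first)
```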

What I’ve tried so far (unsuccessful)

Since 28 November I’ve run about 30-50 scrubs for my troubleshooting, which take about 12 hours per pass. For each thing I try, I need to run scrub at least twice, because the first run can still find CKSUM errors that were created before the change.
At first I wasn’t able to reproduce the reboots. Since mid December I’ve discovered that setting sync=always without the Optane SLOG increases the chance of reboots. But the reboots still occur very randomly (but always in pairs).
In Windows, I’ve run an extended self-test on all 8 HDDs simultaneously, while also running Prime95 blended at the same time. This should stress the HDDs, the PSU, the CPU and the RAM all at the same time. No reboots occurred and all HDDs were found to be 100% healthy.
Also the SMART values of all HDDs are 100% healthy.
I’ve confirmed that memory ECC reporting works properly with the 64GB RAM (I already confirmed this with my 32GB RAM before the issues started occurring, but, just to make sure, I reconfirmed it with the 64GB RAM).
I underclocked the RAM (ran it below spec). This didn’t help.
I’ve tried triggering the issue on my Intel Optane by creating a single disk pool on the Optane, constantly running scrub / writing data to it, for a whole day. I couldn’t reproduce the issue on the Intel Optane. This shifted my suspicion to the HBA, as the issue only seemed to occur when using a pool on the HBA.
I’ve completely removed the Intel Optane from my server. This also did not solve the issue. (here I did discover the impact of sync=always on the likelihood of reboots)
I’ve forced “Power Supply Idle Control” to “Typical Current Idle” in the BIOS and confirmed that CPU C-States already were disabled. Also this did not help.
I’ve upgraded my BIOS, the IPMI and upgraded from TrueNAS12 to TrueNAS12U1. Also this did not help.
I’ve re-inserted the HBA and re-attached the SFF-8087 cable, both on the HDD and HBA side. Didn’t help.
I’ve (temporarily) added a screamingly loud Delta datacenter 120mm fan. My wife almost kicked me out of the house because of the noise, but, as it again didn’t help, I’m quite sure it is not temperature related.
I’ve (temporarily) replaced the SFF-8087 cable with an old one. Didn’t help.
I’ve (temporarily) replaced the PSU with an old 850W Seagate PSU. Didn’t help.
I’ve bought a new HBA (LSI 9207-8i instead of Dell H310), installed an Intel CPU cooler on it, to make sure it is properly cooled and tried again. Didn’t help.

What I’ve tried so far (slightly successful)

I’ve moved the HBA from PCI-e slot 6 to PCI-e slot 4. Both of these slots come directly from the CPU (not from the chipset), so it shouldn’t really matter, but it did make a difference… There were noticeably fewer CKSUM errors during the many scrub runs I’ve tried. But still some CKSUM errors did occur (about 1 per run and once even 0). So I’m getting closer (finally).
I’ve replaced the Ryzen 3600 in my server with my desktop’s Ryzen 3900XT and ran scrub 3x without errors with the HBA in slot 6 and 2x without errors with the HBA in slot 4. So it seems like my issue is related to my CPU! I also discovered (I think) a tiny little bit of thermal paste covering about 2-3 pins on the side of my Ryzen 3600 CPU. So with a soft brush and lots of patience / IPA I cleaned it off.
Finally, I re-inserted my cleaned Ryzen 3600 and ran scrub some more. The first run completed without CKSUM errors, but on the 2nd run I again got 1 error. So although it certainly seems better than before, my problem still isn’t solved.

Conclusions so far

It seems like the issue is related to PCI-e / the CPU
It could be related to the thermal paste covering the pins, but then it is still very strange that
- the problem only started occurring after almost a year
- cleaning it didn’t help to solve it completely
When googling for PCI-e errors, I found for example this thread from wendell: Threadripper & PCIe Bus Errors
2 “remarkable” things I notice in this thread:
- PCI-e errors should be detected
- PCI-e errors can even be corrected sometimes
- Difference is that
  - I’m using Ryzen (vs Threadripper), but I don’t expect this to matter, as both are using the same Zen cores
  - I’m using a BSD flavor (TrueNAS12) (vs Linux), but also here I would expect the experience to be the same, as TrueNAS12 has “added” support for Ryzen (FreeNAS11 didn’t have this) and properly reports, for example, MCA errors (like memory ECC errors). I validated this myself.
That the problem didn’t occur with my Intel Optane, but only with the HBA, could be related to the Optane being in PCI-e slot 4, while the HBA was in PCI-e slot 6
It still blows my mind that the reboots occur in pairs. This smells like a “software-issue”, but everything else clearly points at a “hardware-issue”

Questions

Why did cleaning my CPU only help a little bit?
Is my CPU broken?
Should I try to clean it differently? It was only extremely little thermal paste (I was lucky to spot it) and I was very thorough in cleaning it…
Why am I not seeing PCI-e errors in /var/log/messages? @wendell do you perhaps have an idea?
If PCI-e errors are being corrected (without being reported), then I suppose the problem is even worse, as only the ones that it fails to correct form CKSUM errors in scrub.
Am I missing something?

Anti-ctrl · January 13, 2021, 10:42am

Wild guess here: Could it be, that maybe the Mainboard is toast?

Reason why I think that (as I said, wild guess): You took out the 3600, inserted the new one and remounted the cooler. Now you have different pressure characteristics on the board, which could potentially mean that the PCIe lanes are “bound” perfectly to correct the problem. The problem then reappears after you put in the old CPU again, because the pressure is different and this time “not perfect”.

Maybe you can confirm this by trying the following: Get yourself a big-ass cooler, like an Alpenföhn Brocken or something, to put strain on the board and check if that “breaks” it. If so, you found the problem. It’s possible that this doesn’t do anything, but it might be worth a shot.

EDIT: Otherwise, maybe check if some paste is still in the socket, but I doubt it.

marelooke · January 13, 2021, 10:57am

There is apparently a data corruption issue with TrueNAS 12, see this ticket, so while it could still be hardware related, having a look through that bug report certainly couldn’t hurt, I’d say.

Mastakilla · January 13, 2021, 7:59pm

I’m waiting for Noctua to send me an AM4 mounting kit for their NH-D14 cooler, then I can try with that. I might try another swap before that. I tried to check the socket for paste as well, but it is very hard to see…

Mastakilla · January 13, 2021, 8:06pm

I see no mention of scrub anywhere in this topic / ticket. So I think they are having their corruption occur earlier, before the checksum is calculated. So it might be something else.
But perhaps I should also try to create a bug report. At least the reboots are very strange…

edit:
I’ve created the bug report below:
https://jira.ixsystems.com/projects/NAS/issues/NAS-108980?filter=allopenissues

Mastakilla · January 14, 2021, 7:50pm

Since yesterday evening I’ve been trying to trigger PCI Express errors on Linux using my Optane. I’m doing this to figure out whether the problem “that none of these errors are reported in TrueNAS” is limited to TrueNAS or is OS independent…

I’ve done the following:

In the BIOS I found a PCIe AER setting. I’ve changed this from “auto” to “enabled”.
I’ve swapped my HBA and Optane. Earlier my HBA was in the “more sensitive” PCIe slot 6. But since my HBA contains my pool with all my data, I can’t really use it for testing in Linux. So I swapped both cards so that now my Optane is in the “more sensitive” PCIe slot 6 and I can do Linux testing with that.
In Linux (Fedora Rawhide with Linux 5.11 kernel) I validated that AER was properly enabled (in dmesg it is confirmed to be active and enabled)
I created an XFS partition on the Optane and mounted it. If these really are PCIe errors, then the partition shouldn’t really matter.
I used bonnie++ in a loop to do lots of reading / writing to the Optane, while running “tail -f /var/log/messages | grep -i pcie” in another window.
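
The log watching from the last step could also be done with a slightly broader filter that catches lines mentioning “AER” as well as “pcie”. The following is only a minimal sketch: the sample lines are illustrative and not copied from a real log, and the pattern is an assumption about what such kernel messages usually contain.

```python
import re

# Match "pcie", "pcieport", "PCIe" and so on, or the standalone word "AER",
# case-insensitively. The word boundaries keep "aerial" from matching.
PATTERN = re.compile(r"\b(pcie\w*|aer)\b", re.IGNORECASE)


def pcie_lines(lines):
    """Return only the log lines that look PCIe / AER related."""
    return [line for line in lines if PATTERN.search(line)]


# Illustrative log lines, not copied from a real system.
SAMPLE_LOG = [
    "kernel: pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0",
    "kernel: usb 1-1: new high-speed USB device number 2 using xhci_hcd",
    "kernel: nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected",
    "kernel: aerial photography daemon started",
]

for line in pcie_lines(SAMPLE_LOG):
    print(line)
```

The same function can be applied to lines read from the log file while the stress test runs.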

So far I wasn’t able to trigger any error…

marelooke · January 16, 2021, 8:34pm

As a PSA, the issue in the linked ticket has been fixed (by reverting the offending patch) in bugfix release 12.0-U1.1. The underlying issue will be dealt with in this new ticket, for those interested.

@Mastakilla might be worth upgrading, if at all possible, just in case. If nothing else it will be undeniable proof that the issue is not related

Mastakilla · January 16, 2021, 11:50pm

That is indeed a very interesting patch… I just aborted a scrub run to get it installed.

After trying to trigger AER errors in Linux, I left the Optane in PCIe slot 6 and the HBA in PCIe slot 4 and tried running scrub twice. In total there were 24 CKSUM errors for these 2 scrubs (I think the most I ever had). So where PCIe slot 4 used to be the slot with a lot less CKSUM errors, after swapping the CPU and then swapping it back, PCIe slot 4 now is the weaker PCIe slot.

So I swapped the Optane to the “now-weaker” PCIe slot 4 and the HBA to the “now-better” PCIe slot 6. Then I retried to trigger AER errors in Linux using the Optane in the weaker PCIe slot 4, but still I wasn’t able to trigger any errors on Linux.

Next I’ll run some scrubs with the HBA in the “hopefully-still-better” slot to confirm.

(Or, who knows, confirm that the patch actually solves my problem too…)

edit: I just had 2 CKSUM errors during the 2nd scrub after upgrading to TrueNAS12U1.1, so it didn’t help with my issue…

Mastakilla · January 17, 2021, 10:54am

After this round of scrubs is completed, I wanted to swap the CPUs again, to reconfirm that my 3900XT still doesn’t have the problem (to check if the previous swap wasn’t just an accidental “good mount” or something).

I also wanted to try to reproduce errors with my “possibly-broken” Ryzen 3600 on my Win10 desktop. That would be useful in case I need to do an RMA. I’ll need to be able to tell them an easy way to reproduce the issue.

I don’t have an HBA in my desktop, only a GPU and a NIC (but the NIC is in a chipset PCIe slot). So I’ll need to figure out a way to trigger the issue using my GPU (Radeon 5700).

I’m thinking of the following requirements for the test:

Runs in Windows 10
Transfers a lot of data over PCIe, so not something that does mainly calculations and few PCIe transfers
It needs to be able to discover data corruption (by using checksums or validating the result over the internet)
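
The checksum requirement can be illustrated with a simple write-then-verify loop. This is only a sketch of the principle, not a GPU test: it pushes random data through a file, and the block count and block size are arbitrary assumptions.

```python
import hashlib
import os
import tempfile


def write_and_verify(path, blocks=8, block_size=1 << 20):
    """Write random blocks to `path`, read them back, return indexes that differ."""
    digests = []
    with open(path, "wb") as f:
        for _ in range(blocks):
            data = os.urandom(block_size)
            # Remember the checksum of what we intended to write.
            digests.append(hashlib.sha256(data).hexdigest())
            f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device

    bad_blocks = []
    with open(path, "rb") as f:
        for index, expected in enumerate(digests):
            actual = hashlib.sha256(f.read(block_size)).hexdigest()
            if actual != expected:
                bad_blocks.append(index)
    return bad_blocks


# Small demo run in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    result = write_and_verify(os.path.join(tmp, "probe.bin"), blocks=4, block_size=4096)
    print("corrupted blocks:", result)
```

Note that the read-back may be served from the OS page cache instead of the device, so a real test would need far more data than fits in RAM, or some way to bypass the cache.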

Possible candidates I think of are

Folding@home (not sure if this transfers a lot of data over PCIe though?)
Mining ethereum (not sure if this transfers a lot of data over PCIe though?)
Calculating something with the GPU in some Adobe product (not sure if this is able to check the result)
etc

Does anyone have an idea which software could be a good candidate for this test?

Mastakilla · January 17, 2021, 4:16pm

I just had another pair of unexpected reboots. This time I had a SOL session open while they occurred.
The TrueNAS bootscreen overwrites the previous lines, so I wasn’t able to see if anything was logged just before the first reboot.
But I was in time to fully observe the 2nd reboot… And just before the reboot, nothing was written to /dev/console.

This means that the “no log entries just before the reboot” isn’t because the log filesystem is unreachable or something, as even over network, no log entries are written…

Mastakilla · January 21, 2021, 12:22pm

In the TrueNAS bug report, it was advised to boot with the debug kernel and see if the unexpected reboots would log anything with that kernel.
But it seems like the debug kernel prevents the issue from happening completely… I did 2 scrubs with the HBA in slot 6 and 1 scrub with the HBA in slot 4 and had 0 CKSUM errors.

So either the debug kernel stresses the broken component in my CPU less,
or the debug kernel does something different in software that keeps the issue from being triggered at all.

Not sure which of the 2 it is…

Next step is to swap CPUs again… No one here has a suggestion on how to try to trigger the issue in Windows with a GPU?

Mastakilla · February 4, 2021, 9:50am

fyi:

It seems like the root cause was a broken CPU. After replacing the CPU (first temporarily with my desktop’s Ryzen, later by RMAing it for a new Ryzen 3600), the problems are gone.

So the problem was not caused by TrueNAS <-> AMD incompatibility. Although perhaps not perfect, my experience with TrueNAS <-> AMD compatibility has not been a bad one.

It is a bit concerning that the data corruption itself wasn’t detected by anything. Only the corrupted data itself got detected by scrub. But as I wasn’t even able to trigger any PCIe AER errors for example in Linux or Windows either (I tried this using my Optane instead of the HBA), I am not sure exactly which part of the CPU was broken and if it is a TrueNAS issue or a platform (AMD) issue or perhaps a combination…
