Data corruption with 12900K

Has anybody else experienced data corruption as a result of the 12900K/Z690 platform in Linux?

I’m still using CentOS 7.7.1908 with the 3.10.0-1127 kernel (because the newer kernels cause a kernel panic due to a conflict with my InfiniBand ConnectX-4 card/system/kernel modules), and I’ve been getting bad page frame numbers fairly regularly now, albeit at different times.

It used to complain about the r8125 kernel module, but since then, I’ve disabled that and put in an Intel Gigabit CT desktop NIC instead, and I’m still getting kernel error reports of bad page frame numbers.

Now, the new issue that I’ve discovered: when processing my data (using pixz), checking the archives that my 12900K produces reports that the archive/tarball files are corrupt or have problems (but I am not getting this same problem with my AMD Ryzen 9 5950X). (Both systems use Crucial DDR4-3200 unbuffered, non-ECC RAM.)
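For anyone who wants to reproduce the check: since pixz writes standard .xz streams, `xz -t` should be able to test them, and checksums recorded at creation time can be re-verified on the other machine. A rough sketch (file names are hypothetical):

```shell
# Test the xz stream integrity of a pixz-produced archive:
xz -t data.tar.xz && echo "stream OK"

# Record checksums on the machine that created the archive...
sha256sum data.tar.xz > data.sha256
# ...then, after copying both files over, re-verify on the other machine;
# any "FAILED" line flags a corrupt copy:
sha256sum -c data.sha256
```

If the checksum matches on the 5950X but `xz -t` fails only on the 12900K, that points at the 12900K corrupting data in flight rather than the archive itself being bad.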

Just wondering if anybody else is seeing this issue under Linux.

Thanks.

Firstly, a note that this is not my area of expertise. I have a hard time imagining that faulty silicon would result only in those specific problems and not a plethora of other failures.

Secondly, I suggest you prepare a run with Memtest. Please let it finish its standard setting of 4 passes. This would be a good indication of whether your memory is faulty, or at least unstable and producing errors.

If this yields no results, I would test the underlying storage media. Maybe you could attach an additional SSD and read and write the data from and to it, so that you can make sure it is not a defect of your primary storage media. Hardware defects in storage media are unfortunately a little harder to test, since SSDs do not always report problems correctly in the corresponding SMART values.
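If it helps, here is a rough sketch of such a read-back test, assuming the spare SSD is mounted at /mnt/testdisk (a hypothetical path):

```shell
# Keep a reference copy of random data elsewhere, write it to the suspect
# disk, drop the page cache, then read it back and compare:
head -c 256M /dev/urandom > /tmp/reference.bin
cp /tmp/reference.bin /mnt/testdisk/burnin.bin
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches   # force real reads from the disk
cmp /tmp/reference.bin /mnt/testdisk/burnin.bin && echo "read-back matches"
```

A mismatch implicates the storage path (disk, cable, or controller) rather than the CPU or RAM.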

3 Likes

Yeah, so, I actually threw on Windows 10 20H2 and then upgraded it to 21H2 just to see if I would be able to use the system as a Windows system (instead of as a Linux system).

(With the intention that it was going to maybe be the system that was going to take over the virtualization management/hosting tasks as I’ve got between 8-10 VMs that I can have it host instead of using my old 6700K.)

And when I was trying to import the OVA appliances, it was failing, saying that the OVA source images (which I had created and verified as working) were corrupted.

So now I am running memtest86 and sure enough, I’m currently at around 25 minutes into the test (pass 1, test 6) and it has already recorded over 700 errors.

So, the next thing that I am going to have to do to be able to tell whether it’s a CPU problem or whether it’s a memory problem would be to take the same four sticks of Crucial 32GB DDR4-3200 RAM and put those into my AMD Ryzen 9 5950X system, and run memtest86 on the same four sticks to see if I am getting the errors again.

If I’m not, then I know that it’s the CPU that’s the problem.

re: testing the primary storage media
Not a bad idea, but I am using an HGST 1 TB SATA 6 Gbps 7200 rpm HDD, so in theory it is less prone, though not immune, to problems.

But yeah, this is going to be a “fun week” for me, having to do this hardware testing.

2 Likes

It could be the memory, the memory controller on the CPU or the mainboard, but now you have some clues what to look for.

Do a fresh install on ZFS and it’ll very quickly tell you if you have a problem with the disk / controller / cable / SATA driver. If you’re still getting corruption and ZFS is happy then it’s probably the RAM. I’ve heard multiple reports of Alder Lake systems having issues with 4 DIMMs, try disabling XMP?

1 Like

Yeah, well, that’s why I said that once the 4 sticks of memory finishes testing on the 12900K, I’m going to put the same 4 sticks into my 5950X and test it again.

If it passes on the 5950X, then it’s the CPU and/or motherboard (at this point, I am apt to think that it is more likely the CPU than the motherboard).

I’m not using XMP.

There are two possibilities:

  1. The archive was created with inherent corruption due to what is increasingly looking like an issue with the 12900K, and therefore the archive it produced was corrupted/problematic to begin with.

  2. The archive is fine, but the 12900K corrupts data while testing it: of the archives now being reported as corrupt on the 12900K, SOME of them I’ve been able to re-test on the 5950X, and they test out fine.

And to add another data point: as I mentioned, I installed Windows 10 on the 12900K, and when I tried to import a virtual machine, it said that the file was corrupted.

But when I tried to import the exact same virtual machine on my 6700K, it imported it without any issues, which is what really started to make me think that there was an issue with the system.
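As a side note, an OVA is just a tar archive whose .mf manifest lists SHA digests of the other members, so a failed import can be cross-checked by hand on both machines. A sketch (appliance file names are hypothetical, and some OVAs use SHA1 rather than SHA256 in the manifest):

```shell
# Unpack the OVA and verify each member against the manifest digests.
mkdir -p /tmp/ova && tar -xf appliance.ova -C /tmp/ova && cd /tmp/ova
# Manifest lines look like: SHA256(appliance-disk1.vmdk)= <hex digest>
while IFS= read -r line; do
  f=$(printf '%s' "$line" | sed 's/^SHA256(\(.*\))=.*$/\1/')
  want=$(printf '%s' "$line" | sed 's/^.*= *//')
  got=$(sha256sum "$f" | awk '{print $1}')
  [ "$want" = "$got" ] && echo "$f OK" || echo "$f MISMATCH"
done < appliance.mf
```

If the same OVA verifies clean on one machine and shows MISMATCH on the other, the file on disk is fine and the failing machine is corrupting the data as it reads it.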

(Again, I am running memtest first, and so far it’s about 10 hours and 50 minutes in, on pass 3 of 4, running test 4, and it has encountered 8043 errors so far.)

I would think that if the memory were the issue, with THIS many errors (and memtest isn’t even finished yet), I would have started seeing/noticing corruption issues a lot sooner. But I had always pegged the “bad page frame number” messages as an incompatibility between the older Linux kernel and the 12900K, and not necessarily an actual hardware problem like the current testing is showing.

(Sidebar: I am using my TrueNAS system to restore some of the other archives that were recently written to tape, and am running verification checks on the data from the tapes to make sure there aren’t any issues with those archives. That is also due in part to the fact that I’ve been thrashing the SAS hard drives on my cluster headnode, between preparing the data to be written to tape and now having to run these verification checks as well, resolving any corruption issues by restoring whatever I can from the last backup. LTO-8 is AWESOME for that, BTW.)

1 Like

plethora, Plethora! Pleth-o-ra!
Phew. that was bugging me.

Not specifically on Linux, but I have heard of some very odd problems cropping up with the new E-core/P-core designs in Windows 11. A couple of tech channels I watch have run into issues while they were streaming, and it appears to be linked to the mismatched cores somehow. Although that is entirely speculation, they have never had those sorts of problems with any other processors.

I suspect it could easily be due to this - basically teething issues with new tech. Not entirely unexpected.

That said - my 5950x system is working great. Not one problem so far on all sorts of different workloads.

  • Try disabling the E cores and running the system for a while without them. See if it gives you any corruption.
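If there is no convenient BIOS toggle, the E-core threads can also be taken offline at runtime via sysfs. A sketch, assuming the 12900K’s 8 E-cores enumerate as CPUs 16-23 after the 16 P-core hyperthreads (that numbering is an assumption; verify with lscpu first):

```shell
# Confirm the topology first; E-cores show CORE ids without a hyperthread sibling:
lscpu --extended
# Take the (assumed) E-core threads offline:
for cpu in $(seq 16 23); do
    echo 0 | sudo tee "/sys/devices/system/cpu/cpu${cpu}/online"
done
# Check which CPUs are still online (cpu0 has no online file and stays up):
grep -H . /sys/devices/system/cpu/cpu*/online
```

Echoing 1 back into the same files brings the cores back, so this is easy to undo without a reboot.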
1 Like

You are right, of course. Fixed!

1 Like

I concur with this statement.

I appreciate the suggestion, but I think I am going to try to RMA the processor back to Intel for a refund, because if I have to cripple the processor just to get it to work/just to keep it from corrupting my data, then from my perspective that processor is not really worth it to me, especially given the price that I paid for it.

2 Likes

For those that have been following this saga, I actually pulled the memory that was in my 5950X system (also four sticks of Crucial 32 GB DDR4-3200 unbuffered, non-ECC RAM, which passed memtest86 on said 5950X system before I pulled it).

This happened:

There is definitely something wrong with the CPU and/or the motherboard in order to produce this result.

1 Like

I’ve heard multiple reports of Alder Lake systems having issues with 4 DIMMs, try disabling XMP?

But that’s been DDR5 I believe?

1 Like

I don’t know.

I was originally going to build the system with DDR5, but I couldn’t source 128 GB of said DDR5, not even at the 4800 speeds, which is why I ended up with DDR4.

(And that worked out, because then the memory was interchangeable between my 12900K system and my 5950X system.)

1 Like

It definitely has problems with 4 DDR5 DIMMs, but many systems have problems filling all the DIMM slots with large-capacity, fast, unbuffered DIMMs and running them at fast timings :slight_smile: This is part of the reason registered DIMMs exist: to take load off of the memory controller.

I WISH the 12900K supported registered ECC memory.

On the other hand, given the issue that I am seeing with my system, I would have been concerned that if I were using registered ECC memory, the ECC would be correcting these errors and only hiding/masking the problem I’m experiencing, instead of making it clear that there is a legitimate issue that needs to be properly fixed and addressed.

(Like, there is no reason why a system should spontaneously reboot itself when I’m trying to run memtest86 on the four sticks of DDR4-3200 unbuffered, non-ECC RAM from my 5950X system. I ran memtest86 on said 5950X system first to make sure there weren’t any problems with that RAM, and it passed, before popping the sticks into my 12900K system, which is when I recorded the video of the system spontaneously rebooting itself while trying to run memtest86 on those four sticks.)

There is CLEARLY a problem here.

ECC gives you more information to make a diagnosis.

Like this from a system I have here:

# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 308 Corrected Errors

I know I should replace that DIMM, but it’ll do for now :wink:

Right, but something would have prompted you to check (or look at it) in the first place.

If ECC were, say, to mask the issue, such that you aren’t notified that there is a fault and/or that ECC is doing what it is supposed to do, then you might not know there is enough of a problem to warrant checking into it. This is what I mean by ECC masking the underlying hardware issue.

Unless the system tells you that ECC is having to do what ECC is meant to do, with the errors being fixed in the background, transparently to you, you would have no reason to check or query the system for issues (unless querying the system for issues is just a routine part of your operations because of some prior failure that was missed, which made it part of your SOP).

Sure, like a fan masks an underlying hardware issue of no thermal paste :slight_smile:

Well, many people monitor their CPU / GPU / disk temperature without any symptoms of overheating, or SMART data without any symptoms of disk failure :slight_smile: Many people do the same with ECC, especially on production servers or in data centers.

1 Like

Yes, but the difference is that consumer platforms like Z690 support thermal monitoring, whereas the 12900K/Z690 doesn’t support ECC and/or registered ECC memory.

The only SMART data I monitor is for SSDs (to see how close they are to wearing out), because with SSDs the question isn’t IF they will fail, but WHEN, given that they’re a consumable/wear product, like the brake pads on your car.
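For what it’s worth, a quick sketch of that wear check; a canned NVMe sample line stands in here for a live `smartctl -a /dev/nvme0` run:

```shell
# Extract the "Percentage Used" wear figure from smartctl NVMe output.
# The sample variable stands in for: smartctl -a /dev/nvme0
sample='Percentage Used:                    7%'
wear=$(printf '%s\n' "$sample" | awk -F: '/Percentage Used/ {gsub(/[ %]/,"",$2); print $2}')
echo "SSD wear: ${wear}% used"   # prints: SSD wear: 7% used
```

An alert threshold (say, 80%) on that number is an easy thing to drop into a cron job.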

I am not aware of anybody deploying the 12900K/Z690 in a production server environment. The only data-center deployments I can think of are cloud computing providers offering it so that people can rent time on the system without having to purchase the hardware themselves.

As a home user for this processor and this platform, checking the ECC data doesn’t apply given that neither the processor nor the platform support ECC memory.

How many home users who have deployed 12900K are checking their ECC data?

For checking whether the memory/CPU is stable, I generally prefer using something like Intel Burn Test. I have found memtest86 doesn’t stress the RAM enough to show all issues. Note that with Intel Burn Test, you sometimes have to use all of the RAM to know if it’s really stable. Another note: if the cooling system on the computer is not the best, I would not run the test.

This isn’t necessarily helpful, but just a short heads-up that the 12900K can support ECC if you have the W680 chipset.

If you look at the Ark page, the 12900K supports ECC:

Tempted to upgrade/consolidate my desktop and server at home once I can get hold of a W680 motherboard; they don’t seem to be out there yet.

2 Likes