This is a fairly long post, so Bottom Line Up Front: should I be worried about the “corrected errors” being logged, and what should I do next?
I built a new Ryzen 5900X system in early July and placed it in service on 11 July.
Four days ago, I tried the latest UEFI. While perusing the logs with journalctl, I noticed that correctable MCE errors were being logged. It seemed reasonable that the new errors were due to the new UEFI, so I rolled back to the version I had been using before. After that, no more errors showed up via journalctl.
I decided that a more reliable way of noticing such errors was called for, so I installed rasdaemon and its associated utils. Well, now I do see errors…
[sylph Corrected-Errors]# ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No ARM processor errors.
No Extlog errors.
No devlink errors.
Disk errors summary:
0:0 has 3 errors
0:1794 has 6 errors
0:66304 has 12 errors
MCE records summary:
33 Corrected error, no action required. errors
The disk errors worry me too, but moving on…
The last two errors:
32 2021-08-18 14:31:09 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=3, mcgcap=0x0000011c, status=0x9c2041000000011b, addr=0x3fdc1fcc0, misc=0xd01a000d01000000, walltime=0x611d51ed, cpuid=0x00a20f10, bank=0x00000011
33 2021-08-18 14:36:09 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=3, mcgcap=0x0000011c, status=0x9c2041000000011b, addr=0x3fdc1fcc0, misc=0xd01a000e01000000, walltime=0x611d5319, cpuid=0x00a20f10, bank=0x00000011
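For anyone who wants to read the status value in those records bit by bit, here is a rough Python sketch. The bit positions are my reading of the Linux kernel's MCI_STATUS_* definitions; the CECC, UECC and Scrub rows are AMD-specific and should be treated as assumptions to verify against AMD's documentation. The helper name decode_status is made up for this example.

```python
# Sketch: decode the flag bits of an MCA status word.
# Bit positions are assumed from the Linux kernel's MCI_STATUS_* macros;
# the CECC/UECC/Scrub entries are AMD-specific assumptions.
STATUS_BITS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (error was not corrected)",
    60: "EN (reporting enabled)",
    59: "MISCV (misc register is valid)",
    58: "ADDRV (addr register is valid)",
    57: "PCC (processor context corrupt)",
    46: "CECC (corrected by ECC)",
    45: "UECC (uncorrectable ECC error)",
    40: "Scrub (detected during a scrub operation)",
}


def decode_status(status: int) -> list[str]:
    """Return descriptions of the known flag bits set in status, highest bit first."""
    return [
        name
        for bit, name in sorted(STATUS_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]


# status value copied from the two records above
flags = decode_status(0x9C2041000000011B)
for flag in flags:
    print(flag)
```

If the assumed bit layout is right, UC and PCC are clear and CECC is set, which matches the "Corrected error, no action required" text that rasdaemon prints.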
All the errors seem to be at the same address. All the errors seem to be from the same CPU.
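To check the "same address, same CPU" observation mechanically instead of by eye, something like the following Python sketch could be run over the records. It only contains the two records quoted above; the field names in the regular expressions are taken from those lines, and the function name parse_record is made up for this example.

```python
import re
from collections import Counter

# The two records quoted above, pasted verbatim.
RECORDS = [
    "32 2021-08-18 14:31:09 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=3, mcgcap=0x0000011c, status=0x9c2041000000011b, addr=0x3fdc1fcc0, misc=0xd01a000d01000000, walltime=0x611d51ed, cpuid=0x00a20f10, bank=0x00000011",
    "33 2021-08-18 14:36:09 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=3, mcgcap=0x0000011c, status=0x9c2041000000011b, addr=0x3fdc1fcc0, misc=0xd01a000e01000000, walltime=0x611d5319, cpuid=0x00a20f10, bank=0x00000011",
]


def parse_record(line: str) -> tuple[int, int, int, int, int]:
    """Extract (cpu, memory_channel, csrow, addr, walltime) from one record."""
    cpu = int(re.search(r"CPU (\d+)", line).group(1))
    chan, csrow = map(
        int, re.search(r"memory_channel=(\d+),csrow=(\d+)", line).groups()
    )
    addr = int(re.search(r"\baddr=(0x[0-9a-f]+)", line).group(1), 16)
    wall = int(re.search(r"walltime=(0x[0-9a-f]+)", line).group(1), 16)
    return cpu, chan, csrow, addr, wall


parsed = [parse_record(line) for line in RECORDS]

# Count how many records share each (cpu, channel, csrow, addr) location.
locations = Counter(p[:4] for p in parsed)
for (cpu, chan, csrow, addr), count in locations.items():
    print(f"CPU {cpu}, channel {chan}, csrow {csrow}, addr {addr:#x}: {count} records")

# walltime is in seconds, so differences give the spacing between errors.
gaps = [later[4] - earlier[4] for earlier, later in zip(parsed, parsed[1:])]
print("seconds between consecutive records:", gaps)
```

With all 33 records pasted into RECORDS, a single entry in locations would confirm that every error comes from the same channel, chip-select row and address, and gaps would show whether the errors arrive at a regular interval.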
The cavalier me says, “Well, they’re all corrected. So no problem.”
The thoughtful me says, “Errors!? Holy smokes! Must fix errors!”
With my limited smarts, I’d try fixing the errors by buying some more memory from another vendor and swapping it in.
By the way, I ran the mprime torture test for 1.25 hours this morning – nothing, nada, zip. I don’t know what triggers the errors.
I’m happy to post the full build specs, but here’s the gist:
AMD Ryzen 5900X
ASUS B550-Plus Prime AMD AM4 ATX Motherboard
2x16GB Nemix DDR4-3200 PC4-25600 ECC Unbuffered 2Rx8 Memory