MCE Corrected Errors

This is a fairly long post, so Bottom Line Up Front: should I be worried about the “corrected errors” being logged, and what should I do next?

I built a new Ryzen 5900x system in early July and placed it in service on 11 July.

Four days ago, I tried the latest UEFI. I noticed that correctable MCE errors were being logged when I was perusing the logs using journalctl. It seemed reasonable that the new errors were due to the new UEFI. So I rolled back to the previous version I had been using. No more errors showing up via journalctl.

I decided that some more reliable system of noticing such errors was called for, so I installed rasdaemon and associated utils. Well, now I do see errors…

[sylph Corrected-Errors]# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.
Disk errors summary:
0:0 has 3 errors
0:1794 has 6 errors
0:66304 has 12 errors
MCE records summary:
33 Corrected error, no action required. errors
[sylph Corrected-Errors]#

The disk errors worry me too, but moving on…

The last two errors:

32 2021-08-18 14:31:09 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=3, mcgcap=0x0000011c, status=0x9c2041000000011b, addr=0x3fdc1fcc0, misc=0xd01a000d01000000, walltime=0x611d51ed, cpuid=0x00a20f10, bank=0x00000011
33 2021-08-18 14:36:09 -0400 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci CECC, memory_channel=0,csrow=3, mcgcap=0x0000011c, status=0x9c2041000000011b, addr=0x3fdc1fcc0, misc=0xd01a000e01000000, walltime=0x611d5319, cpuid=0x00a20f10, bank=0x00000011

All the errors seem to be at the same address. All the errors seem to be from the same CPU.
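For anyone who wants to read the `status=0x9c2041000000011b` field by hand, here is a minimal sketch that names the flag bits set in it. The bit positions are an assumption taken from my reading of the Linux kernel's x86 MCE definitions for AMD; they are not something rasdaemon printed, so check them against the kernel source or AMD's documentation before relying on them.

```python
# Assumed layout of the MCA status register on AMD, as I understand it from
# the Linux kernel (arch/x86/include/asm/mce.h and mce_amd.c). Treat this
# table as an assumption, not as an authoritative register map.
STATUS_BITS = {
    63: "Val",       # register contains a valid error
    62: "Over",      # overflow: an earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "EN",        # error reporting enabled
    59: "MiscV",     # misc register valid
    58: "AddrV",     # addr register valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt
    53: "SyndV",     # syndrome register valid
    46: "CECC",      # corrected by ECC
    45: "UECC",      # uncorrected ECC error
    44: "Deferred",  # deferred uncorrected error
    43: "Poison",    # poisoned data consumed
    40: "Scrub",     # detected during a scrub operation
}


def decode_status(status: int) -> list[str]:
    """Return the names of the assumed flag bits that are set, high bit first."""
    return [
        name
        for bit, name in sorted(STATUS_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]


flags = decode_status(0x9C2041000000011B)
print(flags)
```

If the assumed layout is right, the UC and PCC bits are clear and CECC is set, which agrees with rasdaemon's own "Corrected error, no action required" and "mci CECC" text.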

The cavalier me says, “Well, they’re all corrected. So no problem.”

The thoughtful me says, “Errors!? Holy smokes! Must fix errors!”

With my limited smarts, I’d try fixing the errors by buying some more memory from another vendor and swapping it in.

By the way, I ran mprime torture for 1.25 hours this morning – nothing, nada, zip. I don’t know what triggers the errors.

I’m happy to post the full build specs, but here’s a gist:

AMD Ryzen 5900X
ASUS B550-Plus Prime AMD AM4 ATX Motherboard
2x16GB Nemix DDR4-3200 PC4-25600 ECC Unbuffered 2Rx8 Memory

nothing to worry about if it happens infrequently. they are doing their job :)

Interestingly enough, or maybe strangely enough, the error shows up about every 5 minutes. Something happening every 5 minutes. Hmmmm.
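The five-minute spacing can be checked directly from the `walltime=` fields of the two rasdaemon records quoted earlier. This is a rough sketch that assumes `walltime` is a Unix timestamp in seconds, which is consistent with the printed dates but is still my assumption.

```python
from datetime import datetime, timedelta, timezone

# walltime values copied from records 32 and 33 in the rasdaemon output.
walltimes = [0x611D51ED, 0x611D5319]

# The records are printed with a -0400 offset, so use the same one here.
LOCAL = timezone(timedelta(hours=-4))

# Assumption: walltime is seconds since the Unix epoch.
stamps = [datetime.fromtimestamp(w, tz=LOCAL) for w in walltimes]
gap_seconds = walltimes[1] - walltimes[0]

for s in stamps:
    print(s.strftime("%Y-%m-%d %H:%M:%S %z"))
print("gap:", gap_seconds, "seconds")
```

The two values are 0x12c apart, which is exactly 300 seconds. A gap that regular looks more like something running on a timer than like a random event, though the numbers alone don't say what that something is.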

Might be a device going to sleep and waking up

I didn’t think to check this until just now, but edac-util shows no errors. So, does that imply the corrected error is someplace else, e.g. CPU cache?

[sylph ISOs]# edac-util --verbose
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
[sylph ISOs]#
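To avoid eyeballing that output every time, a small filter can flag any counter line with a non-zero count. This is only a sketch: the regular expression is written against the `edac-util --verbose` text pasted above, and the function name `nonzero_counters` is made up for illustration. It says nothing about where the corrected errors actually come from.

```python
import re

# Matches counter lines such as
# "mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors".
COUNTER = re.compile(r":\s*(\d+)\s+(Corrected|Uncorrected) Errors")

# The edac-util output from the post, minus the shell prompts.
SAMPLE = """\
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
"""


def nonzero_counters(text: str) -> list[tuple[str, int, str]]:
    """Return (line, count, kind) for each counter line with a count above zero."""
    hits = []
    for line in text.splitlines():
        match = COUNTER.search(line)
        if match and int(match.group(1)) > 0:
            hits.append((line.strip(), int(match.group(1)), match.group(2)))
    return hits


print(nonzero_counters(SAMPLE) or "all EDAC counters are zero")
```

In practice the text would come from running `edac-util --verbose` and capturing its output, rather than from a pasted string.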

yes, but probably pcie bus errors.

that’s why the default in some setups is ‘platform first error handling’

Wow. Any suggestions as to how to track this thing down? There’s not much in this system: 2 NVME drives, a SATA SSD, and the GPU.

What is “platform first error handling”? Is enabling it just hiding errors from the OS?

If so, why would someone want that?

Oh and this is just BIZARRE. The errors have stopped, at least for the moment. I wonder if running edac-utils poked some bit of machinery and cleared something? I think I’ll reboot and see what happens.

MCE is only applicable to Intel systems AFAIK.

My hot take

RedHat recommends disabling or uninstalling the mcelog.service on AMD systems since it doesn’t do anything and throws false positives.
Double check if this applies to you.

Root Cause

  • Newer AMD processors do not support the mcelog daemon. With the release of RHEL 6.3 and the update of mcelog to mcelog-1.0pre3_20110718-0.14.el6 , mcelog now properly reports an error against these newer AMD processors.

Now, I will say that I went down the rabbit hole with this one because an out-of-the-box install of Fedora Server 34 throws up this error for my first-gen Epyc CPU. So I’m certain that all this crap is a red herring.

There are Bugzilla reports still open showing that this stuff was failing as far back as Bulldozer CPUs and was never fixed.

Well, it’s 1443 local time. The last error was at 1354. It seems unlikely to me that running edac-utils worked any sort of magic across boots. However, I did write a USB image to a flash drive.

Kablooey!

I was happily typing out this message when my monitors went dark and the UPS started screaming at me. Well, the stupid UPS died. So I grabbed the one on my “bench” and put it in service on my problematic PC. And the errors are back.

So now I’ll try writing a USB flash drive and see if that changes things, since that was my sort-of plan before the computer gods decided to strike my UPS down.

Nope. Writing a flash drive didn’t change things, but it seemed worth a shot.

I stopped and disabled rasdaemon to see what would happen, and the number of events went down. Unfortunately, the journal isn’t particularly helpful in telling us what the errors were.

Aug 19 15:35:59 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 15:41:13 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 15:46:27 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 15:57:24 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 16:02:38 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 16:07:52 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 16:18:20 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 16:23:34 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 17:20:41 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 17:41:37 sylph kernel: mce: [Hardware Error]: Machine check events logged
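The journal lines at least show the timing. Here is a quick sketch that turns the timestamps above into gaps in seconds. Journal short timestamps carry no year, so the year is an assumption (2021, matching the rasdaemon records), and the `%b` month parsing assumes an English locale.

```python
from datetime import datetime

JOURNAL = """\
Aug 19 15:35:59 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 15:41:13 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 15:46:27 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 15:57:24 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 16:02:38 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 16:07:52 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 16:18:20 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 16:23:34 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 17:20:41 sylph kernel: mce: [Hardware Error]: Machine check events logged
Aug 19 17:41:37 sylph kernel: mce: [Hardware Error]: Machine check events logged
"""

ASSUMED_YEAR = 2021  # assumption: the journal's short timestamps omit the year


def event_times(text: str) -> list[datetime]:
    """Parse the leading 'Mon DD HH:MM:SS' stamp of each MCE journal line."""
    times = []
    for line in text.splitlines():
        if "Machine check events logged" not in line:
            continue
        stamp = " ".join(line.split()[:3])
        times.append(
            datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S")
        )
    return times


times = event_times(JOURNAL)
gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
print(gaps)
```

With rasdaemon stopped, five of the nine gaps are 314 seconds and the rest are longer, so the events still cluster around the same roughly five-minute rhythm, just with more of them missing.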

Well, there are some digital hypochondriacs among us who love to see errors to confirm that something was always iffy on the system.

We gotta fix the errors, they are everywhere! #verbose4life

Well, a few more pieces of info.

The errors are detected by Windows 10 as well, at what appears to be the same address: Windows says “0x4000003fdc1fcc0”, while Linux reports “addr=0x3fdc1fcc0”.
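A quick way to see how close the two addresses really are is to XOR them and list the bits that differ. This is just arithmetic on the two values quoted above; what the extra upper bit in the Windows value encodes isn't established here.

```python
windows_addr = 0x4000003FDC1FCC0  # value reported by Windows 10
linux_addr = 0x3FDC1FCC0          # addr= field reported by rasdaemon on Linux

# XOR leaves a 1 wherever the two values disagree.
diff = windows_addr ^ linux_addr
differing_bits = [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]

# Compare only the bits below the lowest differing one.
low_mask = (1 << differing_bits[0]) - 1

print("differing bits:", differing_bits)
print("low bits match:", (windows_addr & low_mask) == (linux_addr & low_mask))
```

The two values differ in a single bit, bit 58, and everything below it is identical, which supports reading them as the same physical address reported two different ways.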

I googled the Windows error and it led me to this page: https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c03282091

That page was last updated in 2014 and is for Intel Xeon processors, but it sure sounds similar.

“Frequent transitions to package C3/C6 memory self-refresh during some idle workloads may increase the number of recoverable (corrected) SMI link events observed. While a low occurrence of recoverable events is normal for high speed processor busses, the rate of recoverable events may increase during some idle conditions due to the higher number of SMI link disable-enable transitions occurring.”

Well, that’s quite a lot to parse, but I did recall some issues with C6 from the 1st Ryzens. So, since it was easy to do, I dug up “zenstates” and disabled C6. I’m sorry to say that had no effect. I didn’t fiddle with the other power states.

I also tried the 5.13 kernel and things stayed the same.

So, for now I’m putting this on the far back burner. It seems harmless and probably is, but it sure is irritating.

Thanks everyone!
