Does this mean ECC memory is doing its thing?

Hi all,

I’m running Proxmox on a Gigabyte MC62-G40 with DDR4 ECC memory and noticed the following error:

[25565.560631] mce: [Hardware Error]: Machine check events logged
[25565.560640] [Hardware Error]: Corrected error, no action required.
[25565.560664] [Hardware Error]: CPU:1 (19:8:2) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0xdc2041000000011b
[25565.560694] [Hardware Error]: Error Addr: 0x000000031bd79980
[25565.560704] [Hardware Error]: PPIN: 0x02b6a883a8b98008
[25565.560714] [Hardware Error]: IPID: 0x0000009600250f00, Syndrome: 0x45c208000a800910
[25565.560728] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
[25565.560742] EDAC MC0: 1 CE on mc#0csrow#0channel#2 (csrow:0 channel:2 page:0xcaf5e6 offset:0x680 grain:64 syndrome:0x800)
[25565.560755] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Looking online, it seems there was an error and it got corrected. Am I right? Is there anything else I should do about it?
Checking with dmesg, I don’t see any more occurrences; this seems to be the first one since I rebooted the system a couple of days ago.

Thanks

Corrected errors are fine. A large number of corrected errors (say, thousands in a short time) would indicate an upcoming issue. Uncorrected errors are bad and would indicate the need to replace the RAM ASAP.
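
If you want to double-check which kind a log line is reporting, the status word from the MC17_STATUS line can be decoded by hand. Here is a minimal Python sketch. The bit positions are my reading of the Linux kernel’s MCI_STATUS_* masks for AMD systems, so treat them as assumptions and verify against your kernel source or the AMD documentation.

```python
# Bit positions below are assumptions based on the Linux kernel's
# MCI_STATUS_* definitions for AMD (SMCA) machines; verify before relying on them.
STATUS_BITS = {
    "Over": 62,      # overflow: another error arrived while this bank held one
    "UC": 61,        # uncorrected error
    "MiscV": 59,
    "AddrV": 58,     # the reported error address is valid
    "PCC": 57,
    "SyndV": 53,     # the reported syndrome is valid
    "CECC": 46,      # corrected by ECC
    "UECC": 45,      # not correctable by ECC
    "Deferred": 44,
    "Poison": 43,
    "Scrub": 40,
}


def decode_status(status):
    """Return the set of flag names whose bit is set in the status word."""
    return {name for name, bit in STATUS_BITS.items() if (status >> bit) & 1}


# Status value copied from the log in the first post.
flags = decode_status(0xdc2041000000011b)
print(sorted(flags))
print("uncorrected" if flags & {"UC", "UECC"} else "corrected")
```

For the value in the first post this gives the same flags the kernel already printed in brackets (Over, MiscV, AddrV, SyndV, CECC, Scrub), with neither UC nor UECC set, which agrees with the "Corrected error" line.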

Ah OK, thanks. If I did see thousands of corrected errors in a short time, would that mean I need to replace the RAM as well, or something else?
I see it has happened 3 more times in the last 12 hours, so I’ll keep an eye on it.

If you can afford to and want minimum fuss and downtime, yes, replace it. Unless you’ve pushed the memory beyond its spec with tight timings, you shouldn’t be seeing regular errors, even correctable ones. (Remember, if this wasn’t ECC, this situation would be causing random crashes or file corruption right now.)
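
To see whether the errors really are regular, one option is to read the EDAC counters from sysfs instead of scrolling through dmesg. A rough Python sketch, assuming the EDAC driver is loaded and exposes `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/` (the function name is made up for illustration):

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} for each mc* directory."""
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Running it from cron or a timer and comparing the numbers over a few days shows the trend. As far as I know the counters start again from zero after a reboot, so compare readings taken within the same uptime.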

Otherwise, diagnose further by swapping the memory modules between slots and seeing whether the problem follows the module or stays with the slot/channel. I’ve had motherboards where the memory channels gradually died, independent of any fault in the CPU or the memory modules.
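
When doing the swap test it helps to tally the corrected errors per csrow/channel before and after moving the modules. A small Python sketch that does this from dmesg lines; the regular expression is written against the EDAC line shown in the first post, so it is an assumption that other kernels or drivers format the line the same way:

```python
import re
from collections import Counter

# Matches lines like:
#   EDAC MC0: 1 CE on mc#0csrow#0channel#2 (csrow:0 channel:2 page:...)
EDAC_CE = re.compile(r"EDAC (MC\d+): (\d+) CE on .*?\(csrow:(\d+) channel:(\d+)")


def count_ce_by_location(lines):
    """Sum corrected-error counts per (controller, csrow, channel)."""
    totals = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            mc, n, csrow, channel = match.groups()
            totals[(mc, int(csrow), int(channel))] += int(n)
    return totals


# Sample line copied from the first post.
sample = [
    "[25565.560742] EDAC MC0: 1 CE on mc#0csrow#0channel#2 "
    "(csrow:0 channel:2 page:0xcaf5e6 offset:0x680 grain:64 syndrome:0x800)"
]
for location, total in count_ce_by_location(sample).items():
    print(location, total)
```

If the tally stays on the same channel after the modules have been moved, that points at the slot or channel. If it moves with a module, that points at the module.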

Yes, I was reading about that yesterday. I don’t think I’m going back to non-ECC memory :)

I don’t know if it’s really related, but this started when I replaced the fans with Noctua ones and added a couple more. Yesterday I switched them around to make sure I was generating positive air pressure, and I haven’t had another error in the past 12 hours. Before that it was happening like clockwork every 3 hours, always the exact same error I posted at the beginning. So, knock on wood, this has gone away, but I’ll keep an eye on it, and it’s good to know how to proceed.
Thanks!