AM5 consumer motherboards with full (reporting and correcting) ECC?

Success!?

Don’t forget to:

sudo systemctl start rasdaemon.service
sudo systemctl enable rasdaemon.service
sudo systemctl start ras-mc-ctl.service
sudo systemctl enable ras-mc-ctl.service

The question mark is because sudo dmesg | tail -f shows almost nothing new: only this one line appeared after the stress-ng run: mce: [Hardware Error]: Machine check events logged.

Checked while the machine was still running after the corrected error event, sudo edac-util -v shows no changes. That is OK, as it’s unmaintained and replaced by ras-mc-ctl, but interestingly sudo ras-mc-ctl --error-count shows no changes either.

Fortunately there is this:

$ sudo ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.
No disk errors.
MCE records summary:
	7 Corrected error, no action required. errors

$ sudo ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.

No disk errors.

MCE events:
1 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7d41b7940, misc=0xd01a000601000000, walltime=0x6506e4a4, cpuid=0x00a60f12, bank=0x00000015
2 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7ddc36140, misc=0xd01a001301000000, walltime=0x6506e852, cpuid=0x00a60f12, bank=0x00000015
3 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7de7b7040, misc=0xd01a000601000000, walltime=0x650828b1, cpuid=0x00a60f12, bank=0x00000015
4 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7df237a40, misc=0xd01a002401000000, walltime=0x65082d99, cpuid=0x00a60f12, bank=0x00000015
5 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7d48b6140, misc=0xd01a000701000000, walltime=0x65082fbc, cpuid=0x00a60f12, bank=0x00000015
6 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7e6c37b00, misc=0xd01a000901000000, walltime=0x65083230, cpuid=0x00a60f12, bank=0x00000015
7 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7de437d00, misc=0xd01a000201000000, walltime=0x65083bd5, cpuid=0x00a60f12, bank=0x00000015
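
The walltime field in these records looks like a hexadecimal Unix timestamp. That is an assumption based on the values, not something ras-mc-ctl states. A minimal sketch for converting it, where hex_to_epoch is a helper name made up for this example and the date call assumes GNU date:

```shell
#!/bin/sh
# Hypothetical helper: convert a hex walltime value (assumed to be
# seconds since the Unix epoch) into a decimal number.
hex_to_epoch() {
    printf '%d\n' "$1"
}

# walltime of event 1 in the listing above
epoch=$(hex_to_epoch 0x6506e4a4)
echo "$epoch"    # prints 1694950564

# GNU date can render the epoch value as a UTC timestamp
date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S UTC'
```

Running it over all seven walltime values would show how the events are spread across the test runs.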

And the journal confirms that these are DRAM ECC errors:

$ journalctl | grep -i "ecc e"
*** rasdaemon[807]:            <...>-7     [000] .....     0.000031 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
*** rasdaemon[807]:            <...>-7     [000] .....     0.000126 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
*** rasdaemon[779]:            <...>-8425  [000] .....     0.000094 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
*** rasdaemon[779]:            <...>-15720 [000] .....     0.000220 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
*** rasdaemon[799]:            <...>-9     [000] .....     0.000031 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
*** rasdaemon[799]:            <...>-9     [000] .....     0.000094 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
*** rasdaemon[790]:            <...>-9     [000] .....     0.000094 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.

This looks like ECC error reporting is working.
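
As a cross-check, the status value can be decoded by hand. This is only a sketch: the bit positions are assumptions taken from the Linux kernel's MCA definitions (Val=63, Overflow=62, UC=61) and its AMD decoder (CECC=46, UECC=45), and status_bit is a helper name made up for this example. It expects a full 16-digit hex value and splits it into two 32-bit halves so the shell arithmetic cannot overflow.

```shell
#!/bin/sh
# Hypothetical helper: print bit $2 (0-63) of the 64-bit hex value $1.
# Expects exactly 16 hex digits after the optional 0x prefix.
status_bit() {
    hex=${1#0x}                  # drop the 0x prefix
    high=$((0x${hex%????????}))  # upper 32 bits (first 8 digits)
    low=$((0x${hex#????????}))   # lower 32 bits (last 8 digits)
    if [ "$2" -ge 32 ]; then
        echo $(( (high >> ($2 - 32)) & 1 ))
    else
        echo $(( (low >> $2) & 1 ))
    fi
}

# status value reported for every event above
status=0xdc2040000400011b

# Field names and bit positions are assumptions, see the note above
for field in Val:63 Overflow:62 UC:61 CECC:46 UECC:45; do
    echo "${field%%:*}=$(status_bit "$status" "${field##*:}")"
done
```

For this status value it prints Val=1, Overflow=1, UC=0, CECC=1 and UECC=0, which matches the "Error_overflow CECC" and "Corrected error" text that rasdaemon logs.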
