Success!?
Don’t forget to:
sudo systemctl start rasdaemon.service
sudo systemctl enable rasdaemon.service
sudo systemctl start ras-mc-ctl.service
sudo systemctl enable ras-mc-ctl.service
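The four commands above can probably be shortened, since systemctl enable accepts --now to start a unit while enabling it. Here is a minimal sketch of that. The function name enable_ras_units and the RUN variable are my own hypothetical helpers, not part of rasdaemon; RUN only exists so the commands can be previewed with RUN=echo before running them with sudo.

```shell
# Enable and start both RAS units in one pass.
# RUN is a hypothetical hook: it defaults to sudo, and RUN=echo only prints
# the commands instead of executing them.
enable_ras_units() {
    for unit in rasdaemon.service ras-mc-ctl.service; do
        ${RUN:-sudo} systemctl enable --now "$unit"
    done
}
```

Preview first with
RUN=echo enable_ras_units
and then run enable_ras_units on its own to apply the change.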
The question mark is because the output of
sudo dmesg | tail -f
is still unchanged, with only this one line appearing after the stress-ng run:
mce: [Hardware Error]: Machine check events logged
With the machine still running after the corrected error event occurred,
sudo edac-util -v
shows no changes. That is fine, since edac-util is unmaintained and has been replaced by ras-mc-ctl. Interestingly, though,
sudo ras-mc-ctl --error-count
shows no changes either.
Fortunately there is this:
sudo ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No ARM processor errors.
No Extlog errors.
No devlink errors.
No disk errors.
MCE records summary:
7 Corrected error, no action required. errors
sudo ras-mc-ctl --errors
No Memory errors.
No PCIe AER errors.
No ARM processor errors.
No Extlog errors.
No devlink errors.
No disk errors.
MCE events:
1 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7d41b7940, misc=0xd01a000601000000, walltime=0x6506e4a4, cpuid=0x00a60f12, bank=0x00000015
2 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7ddc36140, misc=0xd01a001301000000, walltime=0x6506e852, cpuid=0x00a60f12, bank=0x00000015
3 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7de7b7040, misc=0xd01a000601000000, walltime=0x650828b1, cpuid=0x00a60f12, bank=0x00000015
4 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7df237a40, misc=0xd01a002401000000, walltime=0x65082d99, cpuid=0x00a60f12, bank=0x00000015
5 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7d48b6140, misc=0xd01a000701000000, walltime=0x65082fbc, cpuid=0x00a60f12, bank=0x00000015
6 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7e6c37b00, misc=0xd01a000901000000, walltime=0x65083230, cpuid=0x00a60f12, bank=0x00000015
7 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7de437d00, misc=0xd01a000201000000, walltime=0x65083bd5, cpuid=0x00a60f12, bank=0x00000015
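The walltime field in these records is hex, which makes it hard to see when each error happened. Below is a small sketch that converts it to decimal, assuming walltime is Unix epoch seconds (the values here are consistent with that, but I have not confirmed it against the rasdaemon source). The function names walltime_to_epoch and list_mce_epochs are hypothetical helpers of mine, and grep -o is assumed to be available (GNU and BSD grep both have it).

```shell
# Convert one hex walltime value (e.g. 0x6506e4a4) to decimal seconds.
# printf accepts C-style hex constants for %d.
walltime_to_epoch() {
    printf '%d\n' "$1"
}

# Read MCE records on stdin and print one decimal timestamp per record,
# in the order the records appear.
list_mce_epochs() {
    grep -o 'walltime=0x[0-9a-fA-F]*' | while IFS='=' read -r key hex; do
        walltime_to_epoch "$hex"
    done
}
```

For example, walltime_to_epoch 0x6506e4a4 prints 1694950564, which is 2023-09-17 11:36:04 UTC. To convert every record at once (the -d @... syntax is GNU date):
sudo ras-mc-ctl --errors | list_mce_epochs | while read -r t; do date -u -d "@$t"; done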
And here is more information showing that this is a DRAM ECC error:
$ journalctl | grep -i "ecc e"
*** rasdaemon[807]: <...>-7 [000] ..... 0.000031 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
*** rasdaemon[807]: <...>-7 [000] ..... 0.000126 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
*** rasdaemon[779]: <...>-8425 [000] ..... 0.000094 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
*** rasdaemon[779]: <...>-15720 [000] ..... 0.000220 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
*** rasdaemon[799]: <...>-9 [000] ..... 0.000031 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
*** rasdaemon[799]: <...>-9 [000] ..... 0.000094 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
*** rasdaemon[790]: <...>-9 [000] ..... 0.000094 mce_record ** Unified Memory Controller (bank=21), status= dc2040000400011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
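The journal lines above come from several different rasdaemon PIDs, so a per-PID tally gives a rough idea of how the errors spread across daemon runs. This is only a sketch: count_ecc_by_pid is a hypothetical helper name, and reading each PID as one run of the daemon (for example one per boot) is my assumption.

```shell
# Read journal lines on stdin and print "PID COUNT" for every rasdaemon PID
# that logged at least one "DRAM ECC error" record.
count_ecc_by_pid() {
    grep 'DRAM ECC error' \
        | sed -n 's/.*rasdaemon\[\([0-9][0-9]*\)\].*/\1/p' \
        | sort -n | uniq -c | awk '{print $2, $1}'
}
```

It can be fed straight from the journal:
journalctl -u rasdaemon.service | count_ecc_by_pid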
This looks like ECC error reporting is working.