ECC is set to Enabled. After changing the RAM timings from the default 40-39-39-77 at 4800 to 32-32-32-32 at 5000, I reliably get the following mce error in dmesg within a 5-minute interval when running stress-ng --vm 16 --vm-bytes 95% --vm-method rowhammer --verify -t 1m:
$ sudo dmesg | grep -i 'edac\|mce\|hardware e\|ras'
[ 0.346504] EDAC MC: Ver: 3.0.0
[ 0.734491] RAS: Correctable Errors collector initialized.
[ 5.966483] MCE: In-kernel MCE decoding enabled.
[ 5.970316] EDAC MC0: Giving out device to module amd64_edac controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
[ 5.970319] EDAC amd64: F19h_M60h detected (node 0).
[ 5.970322] EDAC MC: UMC0 chip selects:
[ 5.970322] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 5.970324] EDAC amd64: MC: 2: 16384MB 3: 16384MB
[ 5.970326] EDAC MC: UMC1 chip selects:
[ 5.970327] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 5.970328] EDAC amd64: MC: 2: 16384MB 3: 16384MB
[ 7.069194] amdgpu 0000:0c:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 313.769390] mce: [Hardware Error]: Machine check events logged
[ 1255.902703] mce: [Hardware Error]: Machine check events logged
$
Sometimes the system crashes, but more often than not it only produces that mce: line. The line does not appear with the default RAM timings (and ECC enabled).
Unfortunately, there is no "[Hardware Error]: Corrected error, no action required." line, which people often mention.
Does that mean there is no guarantee that an ECC error was actually corrected and reported?
I ran stress-ng many times, also with --vm-method all, but that single mce line is all I have got so far. STH and a different article mention stress-ng as a good program for triggering ECC events.
Apart from the mce: line in dmesg mentioned above, nothing shows up in the following either:
$ ras-mc-ctl --error-count
Label CE UE
mc#0csrow#3channel#0 0 0
mc#0csrow#3channel#1 0 0
mc#0csrow#2channel#1 0 0
mc#0csrow#2channel#0 0 0
$ sudo edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
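As a cross-check on ras-mc-ctl and edac-util, the kernel's EDAC counters can also be read straight from sysfs. A minimal Python sketch, assuming the usual layout of /sys/devices/system/edac/mc/mc*/ with ce_count and ue_count files; the function name read_edac_counts is made up for illustration:

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (ce_count, ue_count)} for each mc* directory under root."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (ce, ue)
    return counts
```

Calling read_edac_counts() on the machine in question should return one entry per memory controller. If the counters stay at zero here as well, the tools above are at least reporting what the kernel has counted.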
With the default RAM timings but at 5200, the system crashes and then boots into a reset BIOS in which the last setting is shown as 5200. With 32-32-32-32 at 5000, the system crashes after about 45 minutes of normal use. With 32-32-32-17 at 5000, it crashes randomly within minutes of booting under normal use. And with 32-32-32-16 at 5000, it crashes a fraction of a second after entering the BIOS.
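To put these settings in absolute terms, a timing given in clock cycles can be converted to nanoseconds: DDR performs two transfers per clock, so one cycle lasts 2000 / (data rate in MT/s) ns. A small Python sketch, assuming the four numbers are CL-tRCD-tRP-tRAS in clock cycles and that 4800 and 5000 are data rates in MT/s (neither is stated explicitly above); cycles_to_ns is a hypothetical helper:

```python
def cycles_to_ns(cycles, data_rate_mts):
    """Convert a DRAM timing in clock cycles to nanoseconds.

    DDR moves two transfers per clock, so the clock runs at data_rate_mts / 2 MHz
    and one cycle lasts 2000 / data_rate_mts nanoseconds.
    """
    return cycles * 2000.0 / data_rate_mts


# Assumed order of the four values: CL, tRCD, tRP, tRAS.
for label, timings, rate in [
    ("default", (40, 39, 39, 77), 4800),
    ("tuned", (32, 32, 32, 32), 5000),
]:
    ns = ", ".join(f"{cycles_to_ns(t, rate):.2f}" for t in timings)
    print(f"{label} @ {rate}: {ns} ns")
```

Under those assumptions, the first value drops from about 16.67 ns to 12.8 ns and the fourth from about 32.08 ns to 12.8 ns, so the last timing is tightened far more aggressively than the others.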
System: TUF Gaming B650-Plus BIOS version 1654, 7800X3D, 2x KSM48E40BD8KM-32HM, Arch with latest updates (Linux 6.5.3).
Other general information:
$ sudo lsmod | grep -i amd
amdgpu 12312576 1
amdxcp 12288 1 amdgpu
drm_buddy 20480 1 amdgpu
gpu_sched 57344 1 amdgpu
amd64_edac 77824 0
edac_mce_amd 53248 1 amd64_edac
i2c_algo_bit 20480 1 amdgpu
drm_suballoc_helper 12288 1 amdgpu
kvm_amd 204800 0
drm_ttm_helper 12288 1 amdgpu
ttm 110592 2 amdgpu,drm_ttm_helper
kvm 1372160 1 kvm_amd
drm_display_helper 229376 1 amdgpu
gpio_amdpt 16384 0
gpio_generic 20480 1 gpio_amdpt
ccp 159744 1 kvm_amd
video 77824 3 asus_wmi,amdgpu,nvidia_modeset
Edit: I found this:
$ sudo cat /sys/kernel/debug/tracing/events/mce/mce_record/enable
0
$ sudo su
# echo 1 >/sys/kernel/debug/tracing/events/mce/mce_record/enable
# exit
$ sudo cat /sys/kernel/debug/tracing/events/mce/mce_record/enable
1
After running stress-ng with the same unstable timings again, there is a new kworker line in the trace, which does not appear when tracing is disabled:
$ sudo cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1 #P:16
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
kworker/0:1-9 [000] ..... 313.981082: mce_record: CPU: 0, MCGc/s: 11d/0, MC21: dc2040000400011b, IPID: 0000009600050f00, ADDR/MISC/SYND: 00000007d6637e00/d01a001701000000/c88200220a800602, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 2:a60f12, TIME: 1694946925, SOCKET: 0, APIC: 0
kworker/0:0-7 [000] ..... 1255.915547: mce_record: CPU: 0, MCGc/s: 11d/0, MC21: dc2040000400011b, IPID: 0000009600050f00, ADDR/MISC/SYND: 00000007ddc36140/d01a001301000000/c88200220a800602, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 2:a60f12, TIME: 1694951506, SOCKET: 0, APIC: 0
The question, of course, is whether this is helpful and whether it can be decoded to provide more information.
The output of sudo dmesg | grep -i 'edac\|mce\|hardware e\|ras' and sudo edac-util -v unfortunately remains unchanged and still does not show "[Hardware Error]: Corrected error, no action required."
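One way to get more out of the mce_record lines is to decode the MC21 status value by hand. The Python sketch below assumes the bit positions that the Linux kernel uses for AMD machine check status registers (Val 63, Overflow 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57, SyndV 53, CECC 46, UECC 45, Deferred 44, Poison 43, extended error code in bits 21:16, error code in bits 15:0). These are assumptions and should be checked against the AMD documentation for this CPU family; decode_mca_status is a hypothetical helper name.

```python
# Bit positions assumed from the Linux kernel's AMD MCA status handling;
# verify against the AMD documentation for the CPU family before relying on them.
STATUS_FLAGS = {
    63: "Val",
    62: "Overflow",
    61: "UC",
    60: "En",
    59: "MiscV",
    58: "AddrV",
    57: "PCC",
    53: "SyndV",
    46: "CECC",
    45: "UECC",
    44: "Deferred",
    43: "Poison",
}


def decode_mca_status(status):
    """Return the set flag names (highest bit first) and the two error code fields."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }


# MC21 value copied from both mce_record lines above.
decoded = decode_mca_status(0xDC2040000400011B)
print(decoded["flags"])
print(hex(decoded["error_code"]), decoded["ext_error_code"])
```

If the assumed bit layout is right, the value from both trace lines has Val, Overflow, En, MiscV, AddrV, SyndV and CECC set, while UC, PCC and UECC are clear. That would be consistent with a corrected ECC error having been logged, even though the "Corrected error, no action required." text never reached dmesg.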