Validating ECC RAM on AM4 Ryzen Consumer Motherboard

Preamble

Writing up the validation exercise I did on the following system:

  • CPU: Ryzen 5 5600X
  • Motherboard: Gigabyte B550 Aorus Pro AC
    • Firmware F18d (2024 Sept, AGESA 1.2.0.CC)
  • RAM: 4x16GB 2133 MT/s CL15 ECC UDIMMs pulled from another system
  • OS: MemTest86 V10.6 Free & Debian Bookworm (12.10)
    • The Debian OS install was sacrificial. Don’t do this testing on a system with data you care about.

I was trying to validate that ECC is both working and errors are reported in Debian. I installed Debian along with the edac-utils and rasdaemon apt packages. It’s possible that only the former was needed. I also installed the memtester apt package to perform stress testing within Debian.
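
For reference, the package setup boils down to something like this (a sketch; package names as of Debian 12, and since the rasdaemon service is typically enabled automatically on install, the systemctl line may be redundant):

```shell
# Install the EDAC/RAS tooling and the userspace stress tester.
sudo apt install edac-utils rasdaemon memtester
# Make sure rasdaemon is running now and starts at boot.
sudo systemctl enable --now rasdaemon
```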

Validation

Running ras-mc-ctl --status will give ras-mc-ctl: drivers not loaded. on a system without ECC RAM (or without the drivers running…try a reboot). On a system with ECC RAM, the output should be ras-mc-ctl: drivers are loaded.. Running ras-mc-ctl --error-count outputs all the recorded errors (presumably since a reboot):

Label               	CE	UE
mc#0csrow#2channel#1	0	0
mc#0csrow#2channel#0	0	0
mc#0csrow#1channel#0	0	0
mc#0csrow#0channel#0	0	0
mc#0csrow#1channel#1	0	0
mc#0csrow#0channel#1	0	0
mc#0csrow#3channel#1	0	0
mc#0csrow#3channel#0	0	0
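
If the table grows long, you can filter it down to the header plus any rows with nonzero counters. This sketch runs on a canned two-row sample so it is self-contained; on a live system, pipe ras-mc-ctl --error-count into the awk instead of the printf:

```shell
# Keep the header line plus any row whose CE or UE counter is nonzero.
# The sample rows here are illustrative, not real output.
printf 'Label\tCE\tUE\nmc#0csrow#2channel#1\t0\t0\nmc#0csrow#0channel#0\t3\t1\n' \
  | awk -F'\t' 'NR == 1 || $2 + $3 > 0'
```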

Once you get here, the system appears to have working ECC. To test it, you have some options:

  • Physically disturb the RAM (e.g. overheat it, short data pins to GND, etc). Too bold for me.
  • Inject errors via software. Not all platforms support this, and mine doesn’t appear to (Ref1, Ref2).
  • Overclock the RAM to the point of instability. Consumer motherboards tend to allow that.

I went with the last option. I fixed the following settings:

  • RAM Command Rate: 1T
  • RAM Gear Down Mode: Disabled
  • RAM Power Down: Disabled

From there I iteratively lowered the CAS Latency from Auto (15) to 12, and subsequently to 10. CL10 completed POST but produced ECC errors in MemTest86 within 15 seconds of stress testing:

The errors in orange shown above are different from those that MemTest86 spits out when an uncorrected error is detected. This suggests that at least 1-bit (i.e. correctable) errors are both corrected and reported. After this I booted into Debian to reproduce the results. Running memtester 50000 to test 50,000 MB of RAM pretty quickly led to kernel messages indicating it detected something:

Sure enough, after running the memory test a few times and running ras-mc-ctl --error-count I observed both nonzero correctable (i.e. 1-bit) and uncorrectable (i.e. 2-bit) errors:

Success! As far as I can tell, this means that ECC works like you’d expect on Debian with the right packages installed. If you disagree, I would love to hear from you!
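
For reference, the corresponding kernel-side messages can be fished out of the log with a simple grep. The sample line below is illustrative only (the exact EDAC message format varies by driver); on a live system, pipe dmesg in instead of the printf:

```shell
# Filter kernel log output for EDAC memory-controller reports.
# The first sample line is a made-up EDAC report, the second is noise.
printf '%s\n' \
  '[  123.4] EDAC MC0: 1 CE on mc#0csrow#2channel#1' \
  '[  124.0] usb 1-2: new high-speed USB device' \
  | grep -i 'edac'
```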

Postamble

To get notified when an error is observed, you could run a script like this on a schedule:

Script
#!/bin/bash
# hck_ram.sh: monitors ECC RAM error counters via rasdaemon
# arg 1: email recipient(s), comma separated
# arg 2: filename with ras-mc-ctl output (for debugging only)

if [ -n "$2" ]; then
        debugmode="TRUE"
        printf '%s\n' "debugmode is on"
        printf '%s\n' "hck_ram.sh: Health Check Starting"
else
        printf '%s\n' "$(date +%F_%H.%M.%S) hck_ram.sh: Health Check Starting"
fi

ras_status="$(ras-mc-ctl --status)"
if [ "$ras_status" != "ras-mc-ctl: drivers are loaded." ]; then
        printf '%s\n' "ERROR: rasdaemon not ready, or ECC RAM not installed"
        exit 1
fi

if [ "$debugmode" = "TRUE" ]; then
        ras_error_count="$(cat "$2")"
else
        ras_error_count="$(ras-mc-ctl --error-count | sort -u)"
fi
# Drop rows whose CE and UE counts are both zero. The header line never
# matches, so it always survives and accounts for one line below.
ras_error_count_nonzero="$(printf '%s\n' "$ras_error_count" | grep -v 0$'\t'0$)"
ras_error_count_nonzero_lines="$(printf '%s\n' "$ras_error_count_nonzero" | wc -l)"

if [ "$ras_error_count_nonzero_lines" -gt 1 ]; then
        report="hck_ram.sh Report $(date +%F_%H.%M.%S): ECC RAM ERRORS DETECTED"
        report="$report"$'\n'$'\n'"$ras_error_count"
        # The empty argument prints the blank line that separates the
        # email headers from the body.
        printf '%s\n' "Subject: $(hostname): hck_ram: ECC RAM ERRORS DETECTED" "" "$report" | msmtp "$1"
        printf '%s\n' "$(date +%F_%H.%M.%S) hck_ram.sh: Sent report"
        printf '%s\n' "$report"
else
        printf '%s\n' "$(date +%F_%H.%M.%S) hck_ram.sh: Check passed"
fi

printf '%s\n' "$(date +%F_%H.%M.%S) hck_ram.sh: Health Check Done"

The script above requires the msmtp apt package to be installed and configured, and I run it like so: bash -c "/script/hck_ram/hck_ram.sh root,[email protected],[email protected]". This is just an example; I expect there will also be useful output in dmesg and syslog that you can use instead.
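
To run it on a schedule, a cron entry along these lines would work (a sketch; the script path matches the example above, and "root" is a placeholder recipient):

```shell
# /etc/cron.d/hck_ram (sketch): run the ECC health check every morning.
# Fields: minute hour day month weekday user command [recipients]
0 8 * * * root /script/hck_ram/hck_ram.sh root
```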


You could just poll ras-mc-ctl every once in a while to see the accumulated errors in the rasdaemon database.

ras-mc-ctl --errors
Memory controller events:
1 2025-01-18 18:13:00 -0800 1 Corrected error(s): at DIMM_A2 location: 0:2:0:-1, addr 12847823424, grain 6, syndrome 2

I’m in the process of switching motherboards from an ASRock B550M-HDV to an Asus ProArt B550-CREATOR and thought it would be interesting to do this test on both the old and new motherboards. So far I’ve been having a real struggle with the old board:

  • CPU: Ryzen PRO 4650G (widely reported working ECC support)
  • Motherboard: ASRock B550M-HDV (single slot per channel!)
    • UEFI: 3.40 (AGESA 1.2.0.B)
  • RAM: 2x Mushkin 16 GiB (MPL4E320NF16G18)

Default settings for these DIMMs are 3200 MT/s, 22-22-22-22-52, 1.20 V.

This is what I’ve tried:

(“OK” here means boots and runs memtest86 v11.2 for at least 5 minutes without any errors detected. “FAIL” = fails to boot; just a black screen, no beep codes.)

19-21-21-21-46: OK
18-20-20-20-45: OK
17-20-20-20-42: OK
16-20-20-20-40: FAIL
16-20-20-20-42: FAIL
17-20-20-20-39: OK
17-19-19-19-39: OK
17-19-19-19-36: OK
17-18-18-18-36: Crash to black screen in memtest within seconds
17-18-19-19-36: Crash to black screen in memtest within seconds
17-19-18-18-36: OK (ran memtest86 for 45 minutes)
17-19-18-18-33: OK
17-19-18-18-33 @ 1.18V: OK
17-19-18-18-33 @ 1.16V: OK
17-19-18-18-33 @ 1.14V: OK
17-19-18-18-33 @ 1.12V: Fails to boot, 3 long beeps
17-19-18-18-33 @ 1.13V: Crashes within seconds in memtest86 and UEFI

OK, so a change of tactic: what I want is marginal signalling on data transfers, not marginal function of command signals, so let’s keep the default primary timings and instead lower the voltage and raise the frequency:

Timings set to Auto (so clock counts adjusted to keep actual timings constant with increasing frequency):

3200 MT/s @ 1.13 V: OK (ran memtest86 for 20 minutes)
3200 MT/s @ 1.12 V: OK
3200 MT/s @ 1.10 V: OK
(3200 MT/s @ 1.08 V: Nope, UEFI won’t let me lower voltage this far)
3400 MT/s @ 1.10 V: OK (ran for 15 minutes)
3600 MT/s @ 1.10 V: OK (ran for 25 minutes)
3800 MT/s @ 1.10 V: OK (ran for 10 minutes)
4000 MT/s @ 1.10 V: FAIL
3933 MT/s @ 1.10 V: Edit: Ran for almost three passes without any errors, then crashed. :frowning: So I think this config is very much on the verge of instability and should have produced ECC errors if the reporting was working.

All this without a single reported ECC error. ECC is enabled, and memtest86 reports ECC support and ECC polling enabled. However, there is no “Platform First Error Handling (PFEH)” setting in the UEFI. At this point I’m obviously wondering if reporting really is disabled on this motherboard.

Any thoughts?

Edit:

Here’s one of the previous (unfortunately now closed) threads on the same subject: Ryzen 5700X ECC reporting.

Notifying some of the most active members from that thread: @aBav.Normie-Pleb @brain-short @fornex

I use the same methodology to verify that ECC is actually working, not just that the platform tells the operating system it does.

Looks all good. I got some old DDR4-2400 ECC UDIMMs for testing ECC on AM4 where I increase the frequency until the system doesn’t boot anymore, then go back one DDR frequency step. That way you can be 99 % sure that the memory chips themselves are causing the errors before the CPU’s memory controller or motherboard memory traces do.

That way I get memory errors almost always within a few minutes in Passmark Memtest Pro, even on “enterprise” or “server” motherboards where you can’t overclock or decrease the default JEDEC memory voltage.

Also, with later AGESA versions ECC error injection testing as an option appears more and more often on regular motherboards, because the motherboard manufacturers seem to stop messing with AMD’s default BIOS menus (which I consider a good thing).

PS: I also consider “enterprise” or “server” motherboards to have the shittiest and buggiest BIOSes. I much prefer something like a prosumer ASUS motherboard that has had its 10th BIOS update compared to a Supermicro motherboard on its 3rd BIOS release.

So if I understand you correctly, my methodology (of raising memory frequency) is good, and if there was functioning error reporting I should have seen some errors by now? I.e. is it time to give up? :sleepy:

I’ve never played with ECC, so what I am curious about is whether your mobo is adjusting secondary and tertiary timings, because 3933 at 1.1 V error-free sounds absurd to me, but I seem to remember reading AMD APUs have killer memory controllers.

You should be happy with the ASUS ProArt B550-CREATOR, I have it for testing with a 4750G and ECC is working (128 GB ECC DDR4-3200). My only nitpicking issue with my two units of the ProArt B550 is that the USB-C ports are a bit loose compared to any other motherboard with USB-C I got, even ProArt X570 or ProArt X670E variants from the same manufacturer.

I’m not a fan of ASRock or ASRock Rack, they have messed up functioning ECC multiple times through buggy BIOSes. Also AGESA 1.2.0.B is old AF.

In my experience with AM4 and AM5 motherboards ASUS is the best manufacturer regarding timely BIOS updates, which for me is the most important thing for motherboards in general since I intend to use hardware for at least 5 years. They released the latest BIOSes with AGESA 1.2.0.E which includes important security fixes about a month ago, you still have to wait for those on many other manufacturers’ motherboards.

And two or even more things can be true at once, at the same time I can criticize ASUS about the bad customer RMA service.

That’s what corrected errors via ECC should look like.


Yes, the UEFI is set to Auto timings here; as an example memtest86 reported the timings at 3800 MT/s as 28-27-27-58. Also remember that this is a 1-slot-per-channel motherboard.

Edit: Sorry, I see you wrote secondary and tertiary timings, not primary. :oops: I don’t know, will try to find out!

Nice! :+1:

Yep, that’s their most recent non-beta UEFI. They have a 1.2.0.Cc one out but that’s marked as Beta, whatever that means… :frowning_face:

I did notice the 1.2.0.E update for the Proart B550 board and have it downloaded and ready for flashing! I just need to finish my testing of the old board first and move over PSU, CPU and RAM. :slight_smile:

You can also use USB BIOS flashback without anything installed in the motherboard except a PSU with standby power on. I always do that with any new motherboard before I turn it on for the first time.

PS: On anything with the name “ASRock” attached to it, “Beta” (or its opposite, the absence of the word “Beta”) doesn’t mean jack shit, unfortunately.


Yes, it is. Timings for 3200 MT/s:


Timings for 3933 MT/s:


As a final attempt I booted debian-12 and ran ‘dd if=/dev/urandom of=/dev/null bs=1G’ for an hour or so. It also didn’t produce any ECC warnings (and crashed and rebooted soon after).

So I think it’s relatively safe to say that ASRock B550M-HDV does not have functioning ECC error reporting. :frowning:

I’ve started testing on the ProArt B550-CREATOR (UEFI 3802) now. Its UEFI won’t let me set voltage to lower than 1.20 V. Instead I put both DIMMs on the same channel to create a more difficult load. :evil:

Running fine now (so far) at 3800 MT/s… (with auto-adjusted timings).

So… I’m at a loss here. Despite the different config (2 DPC @ 1.20 V rather than 1 DPC @ 1.10 V) things run seemingly fine at 3933 MT/s and fail to boot at 4000 MT/s, just like on the ASRock board.

I’ll let it run memtest86 for longer at 3933 MT/s I guess, but … any ideas here? Do I really need to get slower ECC RAM to be able to actually trigger an error?

BTW, at 4 GT/s setting the Auto-selected FCLK was 2 GHz. I tried manually setting this to the “standard” 1.6 GHz but the system still wouldn’t boot. Any other settings I should check while keeping the RAM at 4 GT/s?

I kept at it yesterday to no avail. I fixed the timings at 17-19-18-18-33 and the voltage at 1.20 V. This failed to boot at 3600 MT/s but passed almost 2 passes of memtest86 at 3533 MT/s:

So I now have two motherboards – ASRock B550M-HDV AGESA 1.2.0.B and Asus ProArt B550-CREATOR AGESA 1.2.0.E – with very similar behaviour of not producing ECC errors.

I was wondering if I’ve hit a bug in memtest86 11.2 Free, but I did run that test of the marginally stable ASRock system under Linux too with no mce ECC errors, so…

:anguished: ???

ProArt UEFI settings for the test above:




I’ll check a previously ECC-validated functioning ProArt B550/4750G again, maybe this time around ASUS or AMD with their AGESA messed something up.

  • Looks okay on my end, though I’m using Passmark Memtest Pro with support for ECC Error Injection.

  • With 4 x 16 GB DDR4-2400@3200, errors that were not intentionally injected also show up.

  • Can confirm that for some reason you can’t go below 1.20 V DRAM voltage.

I’ll let the default 4 test sets finish completely; I’m interested to see if ECC can be overwhelmed in that situation.


Thank you, @aBav.Normie-Pleb!

This is so weird. And annoying, lol!

Have you looked into error injection under Linux? There is supposedly a kernel interface for this, and there’s a setting for error injection in the Asus UEFI:

But booting up debian-12 there is no ACPI EINJ table (/sys/firmware/acpi/tables/EINJ), and ‘modprobe einj’ fails with “No such device”. sigh
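
The probe I ran boils down to roughly this (a sketch; the sysfs/debugfs paths follow the kernel’s ACPI APEI/EINJ documentation, and debugfs must be mounted for the last step):

```shell
# Probe for the ACPI EINJ error-injection interface (sketch).
if [ -e /sys/firmware/acpi/tables/EINJ ]; then
    modprobe einj
    # With the module loaded, debugfs lists what can be injected:
    cat /sys/kernel/debug/apei/einj/available_error_type
else
    echo "No EINJ table: firmware does not expose ACPI error injection"
fi
```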

  • Unfortunately I am utterly incompetent regarding Linux, but I doubt that Memtest Pro is hallucinating ECC error injection tests…? That would create a PR nightmare for Passmark.

  • At your previous BIOS screenshot: Can you test changing Data Poisoning from Auto to Enabled? I change that option out of habit; to my understanding that’s telling the platform’s firmware that it should tell an operating system that specific data xyz is compromised, for example if too many errors happen for ECC to successfully correct them.

  • I’ve also tried the current MemTest Free version instead of my older Pro version, it also shows corrected ECC errors in red, the only difference is the white line with ECC Inject is missing.

  • Did the default 4 complete test passes with the Pro variant; 3101 memory errors were detected and corrected by ECC, and no uncorrectable errors happened:

MemTest86-Report-20250329-015633.zip (4.5 KB)

  • My only guess is that maybe due to some unlucky or lucky circumstances your memory isn’t overclocked enough to produce errors at 1.20 V? (For reference my DDR4 UDIMM ECC testing memory sticks are about 10 years old and are operating at DDR4-2400@DDR4-3200 with the default 1.20 V)

  • Note: I’m also testing with a dGPU so the APU’s iGPU is disabled and isn’t hiding some memory areas from the memory testing program.

  • With Memtest Pro you could check this by using its error injection feature.

  • Or do 4 or more complete passes over night with Memtest Free to increase the likelihood of catching an error?

@wendell Do you have an idea what’s going on?

Summary:

  • 2 users, same motherboard, same BIOS version, same PRO APU generation (4650G and 4750G), ECC UDIMMs.

  • Overclocked ECC UDIMMs to the point of almost not booting, user 1’s Passmark Memtest shows absolutely no errors, user 2 sees corrected errors (as expected).