ECC capable/verified motherboards for Ryzen 7000

This is fascinating

Querying the memory controller

When I mentioned setting up ECC at work, Robert Mustacchi pointed me to the excellent illumos documentation about AMD’s Unified Memory Controller. I did some reading and learned that essentially, AMD processors expose a bus called the System Management Network (SMN). Among other things, this bus can be used to query and configure the AMD Unified Memory Controller (UMC).

NOTE: The information in the rest of this section is not part of the public AMD Processor Programming Reference, but can be gleaned from the source code for the open-source Linux and illumos kernels.

WARNING: Accessing the SMN directly, and especially sending write commands to it, is dangerous and can severely damage your computer. Do not write to the SMN unless you know what you’re doing.

The idea is that we can ask the UMC the question “is ECC enabled” directly, by sending a read request over the SMN to what is called the UmcCapHi register. The exact addresses involved are a little bit magical, but on illumos with a Ryzen 7000 processor, here’s how you would query the UMC over the SMN bus (channel 0 and channel 1 are the two memory channels on the system, and each channel has one of the 32GB sticks plugged into it.)

# Query the UMC at address 0x50df4, representing channel 0
$ pfexec /usr/lib/usmn -d /devices/pseudo/amdzen@0/usmn@2:usmn.0 0x50df4
0x50df4: 0x40000030

# Query the UMC at address 0x150df4, representing channel 1
$ pfexec /usr/lib/usmn -d /devices/pseudo/amdzen@0/usmn@2:usmn.0 0x150df4
0x150df4: 0x40000030

(pfexec is the illumos equivalent to sudo.)

Also, illumos comes with a really nice way to break up a hex value into bits:

$ mdb -e '0x40000030=j'
                1000000000000000000000000110000
                |                        ||
                |                        |+------ bit 4  mask 0x00000010
                |                        +------- bit 5  mask 0x00000020
                +-------------------------------- bit 30 mask 0x40000000

The bit we’re interested in here is bit 30. If it’s set, then ECC is enabled in the memory controller.

He then goes on to hack together something in linux that does the same, and then later finds that EDAC already queries the memory controller for bit 30. So If I understand this correctly EDAC is reporting what the SMU actually says it can do rather than guessing. Whereas dmidecode seems to report what the systems UEFI tells it.

6 Likes