AM5 consumer motherboards with full (reporting and correcting) ECC?

ecclex · August 20, 2023, 3:36pm

Hi,
I like the detective teamwork in (forum.level1techs .com/t/ryzen-5700x-ecc-reporting/197090).
Maybe it can be done with AM5. After the initial AM5 ECC talk about 6 months ago, it’s been quiet. With prices down, and since I’m giving my current PC to my parents so it can drive an LG 42C3 OLED TV (4:4:4 120Hz 4K 10bit) as their future monitor, I’m planning to upgrade from my 5600X to a 7800X3D.
According to AMD, ECC is supported (amd .com/en/product/12731) by the 7800X3D and it’s up to the motherboard manufacturers to implement the ECC (Yes (Requires mobo support)).

Initially I wanted to go with an MSI Tomahawk B650, but according to this (geizhals .de/?cat=mbam5&xf=4400_ATX~494_ECC-RAM+Unterst%C3%BCtzung&asuch=&bpmin=&bpmax=&v=e&hloc=de&plz=&dist=&mail=&sort=nbew&bl1_id=30#productlist), only ASUS, supposedly, supports ECC? Does ASUS mean real ECC or just that the mobo supports running real ECC RAM sticks without actually utilizing the ECC functionality? Or do they only mean on-die ECC? From their manual:

4 x DIMM, Max. 128GB, DDR5 6400+(OC)/ 6200(OC)/ 6000(OC)/
5800(OC)/ 5600(OC)/ 5400(OC)/ 5200/ 5000/ 4800 ECC/Non-ECC,
Unbuffered Memory*
*Non-ECC, un-buffered memory supports On-Die ECC function.

Not sure, 4800 ECC suggests that real ECC up to 4800 MT/s is supported?

My reason for wanting ECC is that I run OpenZFS with a mirror pool on the same gaming PC (through the zfs-dkms repo), where I back up my data. And since ECC RAM sticks are a bit slower and the 7800X3D is less dependent on RAM speed, that’s another good reason to go with this efficient gaming CPU.

Not sure if I would have asked this question if I hadn’t found this: With Linux v6.5, Linux’s amd64_edac.c (git.kernel .org/pub/scm/linux/kernel/git/torvalds/linux.git/log/drivers/edac/amd64_edac.c) gets ECC reporting for Ryzen 7000 CPUs, which is nice and helpful for testing:

EDAC/amd64: Add support for ECC on family 19h model 60h-7Fh

Ryzen 9 7950X uses model 61h. Treat it as Epyc 9004, but with 2 channels
instead of 12.

With two 32GB dual-rank DIMMs the sizes appear to be reported correctly:

[…]

ECC errors can also be detected:

[…]

According to Mario Limonciello, the same code should also work for
models 70h-7Fh (follow thread in Link).

diizzy · August 20, 2023, 4:16pm

https://www.reddit.com/r/truenas/comments/10lqofy/ecc_support_for_am5_motherboards/

Both @Susanna and I have ECC memory running on our Asus ProArt X670E-CREATOR WIFI boards. Depending on your overall workload the 7900-series might be a better choice due to the amount of cores. Can’t say much about Linux support overall, but FreeBSD 14-CURRENT runs along fine at least.

Susanna · August 20, 2023, 4:19pm

I can only speak for the X670E ProArt, as that’s what I have, but yeah, it supports UDIMM ECC, including error detection and correction. I don’t have Wendell’s fancy error injector to verify it with, but I have seen (corrected) errors logged in memtest, so I’m 90% sure it’s real.

You do have to turn ECC on in the UEFI, the default (Auto) is Disabled for some reason, even though if you set it to Enabled and run non-ECC memory it’ll just ignore the setting and boot anyway. But once you do, it’s real ECC, according to both Memtest86 Pro and dmidecode -t memory.

quilt · August 20, 2023, 6:15pm

Kingston now also has 5600 and 5200 MT/s 16 and 32 GB ECC modules. You probably won’t be limited to 4800 MT/s. In a 64 GB configuration it should be easy to hit those speeds. Perhaps even more as they have Hynix A-die chips.

Susanna · August 20, 2023, 6:25pm

I have the 4800MT/s ones, but in pairs they’ll happily go to 6800MT/s with decent timings, same as just about any Hynix M-die dual rank DIMM, so I don’t think the modules themselves are the limiting factor. I’ve gotten all four to 5200 together, and they passed memtest, but then I tried 5400, it wouldn’t train, I went back to 5200 to fiddle with terminations, and then it wouldn’t POST until I went all the way back to 4800 even with the settings that worked before, so I’m sticking with 4800 and super tight timings until AGESA 1.0.0.8 or whatever is released to improve memory controller.

ecclex · August 22, 2023, 5:30pm

Thanks, so it’s real, server-grade, end-to-end ECC, nice. If only ASUS were also clear about reporting ECC errors to the OS.

The X670E ProArt is 2.3x the price of the mentioned B650 Tomahawk, which is too pricey for me just to get ECC. At that price it’s closer to the real deal like a SUPERMICRO H13SSL-N ATX, not that I would buy one (maybe at some future point), and I wonder if gaming performance would be lower with such a server motherboard. Also, a quick read suggests that the split of the PCIe lanes isn’t that good on the X670E ProArt?

I just care about ECC next, if everything else is fine, not the X670E’s features – B650 (non-E) is enough for me, at least for now and I’m not sure if future-proofing is a thing with technology, esp. stuff like this.

In the link someone confirmed ECC working on a TUF GAMING X670E-PLUS. It’s below 300 bucks, so it’s an option.
Other cheaper choices are the Strix B650E-E and Strix B650E-F. @hhh23 says ECC is working (full support? does it report the corrected ECC errors?) in forum.level1techs .com/t/asus-rog-strix-b650e-e-gaming-wi-fi-ecc-and-intel-i225-v-lan/194753, but the Intel I225-V LAN disconnects every 24 hours or so.
Once Linux 6.5 is available on Arch (I’ll just wait, can’t be bothered to compile it myself, esp. since it’s at 6.5-rc7 already), I’m going with ASUS this time, after 3 MSI boards in a row, because they officially advertise real end-to-end ECC support. If only they would clarify the ECC error correction reporting, because I want the full support/reporting.

Depending on your overall workload the 7900-series might be a better choice due to the amount of cores.

I’m fine with 8 cores, since it’s still the fastest gaming CPU and especially efficient (thanks, TSMC). 12 cores would be an option if the architecture was the same (one 12-core CCD or one 16-core CCD), so scheduling weren’t an issue.

Kingston now also has 5600 and 5200 MT/s 16 and 32 GB ECC modules. You probably won’t be limited to 4800 MT/s. In a 64 GB configuration it should be easy to hit those speeds. Perhaps even more as they have Hynix A-die chips.

I quoted 4800 ECC because it’s the only value where ECC was added, but they probably mean all speeds and it’s just a shorter notation.
Their DDR5-5200 is CL42 and DDR5-5600 is CL46 (not the common CL32 or CL30), but I hope it matters less with the 7800X3D because of its extra cache. I’m planning to go with 2x16GB, no overclocking, but I still prefer to go with HYNIX chips (I read they’re best for stability when it comes to overclocking, in case I have to overclock because ECC RAM limits gaming performance).

edit

So far I’m gonna go with ~~ASUS Strix B650E-F~~ ASUS TUF Gaming B650-Plus, because it seems worth its price (being maybe an 8-layer PCB is nice to have too, but I need real ECC, and ASUS advertises it officially).
So, after several MSI motherboards in a row, it will be ASUS this time. Next time, MSI, maybe advertise real ECC support.

For anyone else, AFAIHaveRead: Using ZFS, or any filesystem that stores checksums of the data, is already an advantage over the likes of ext4. ZFS is just the most stable when using raidz pools; for mirror pools Btrfs is also supposedly stable (if only talking about checksums). Additionally, as long as I don’t have ECC, I reduced zfs_arc_max (so less data is in RAM) to 512 MiB (I don’t need the speed, it’s just a local backup) and added zfs_flags=0x10 (ZFS_DEBUG_MODIFY, which checks ARC buffers in memory for unexpected modification). cat /etc/modprobe.d/zfs.conf:

options zfs zfs_arc_max=536870912
options zfs zfs_flags=0x10

(536870912 = 512*1024^2)
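
The byte value above can be double-checked with a few lines of Python. This is only a sketch; the helper names mib_to_bytes and modprobe_line are made up for illustration and are not part of OpenZFS.

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024**2 bytes)."""
    return mib * 1024**2


def modprobe_line(arc_max_mib: int) -> str:
    """Build the zfs_arc_max line for /etc/modprobe.d/zfs.conf (hypothetical helper)."""
    return f"options zfs zfs_arc_max={mib_to_bytes(arc_max_mib)}"


print(modprobe_line(512))
```

Raising the limit later (e.g. once ECC is in place) only means passing a larger MiB value.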

diizzy · August 22, 2023, 6:41pm

512 MB is way too low and I’m not even sure it’ll run correctly with that little allocated.

ecclex · August 28, 2023, 4:31pm

I’m going to increase it once I have ECC.

After more digging:

ASUS ROG Strix B650E-E Gaming WIFI:
Has many people saying its INTEL I225-V has disconnecting issues.
ASUS ROG Strix B650E-F Gaming WIFI
No such reports on the same site, but it also has the INTEL I225-V, so idk.

The WIFI on both boards adds like 20-40 bucks on top of the price and I don’t need WIFI.

ASUS TUF Gaming X670E-Plus:
– No INTEL LAN (no LAN issues on my B550 Tomahawk with its REALTEK LAN; what a time, when one may want to avoid INTEL LAN, at least the I225-V, but I think some others too).
– 40 bucks cheaper than its WIFI version.
– has confirmed ECC error reporting in the link by @diizzy

So it’s going to be the TUF, even though it’s not ASUS’ top line like the ROG.

edit

I decided I don’t need the X670E-Plus and the ASUS TUF Gaming B650-Plus is enough, has confirmed ECC reporting (URL by diizzy) and worth its price.

My only question left is regarding ECC UDIMM RAM:
Since 4800CL40 vs 6000CL30 doesn’t matter that much (youtube .com/watch?v=XW2rubC5oCY) with the X3D cache (218 FPS vs 204 FPS → 6.9% (nice)) in 1080p, and the gap is going to be smaller at higher resolutions like 1440p and smaller still in 4K, what difference do MT/s and CL make between these two:

Kingston Server Premier UDIMM DDR5-5600, CL46-45-45, ECC, on-die ECC

and

Micron UDIMM DDR5-4800, CL40-39-39, ECC, on-die ECC
(MICRON doesn’t offer anything above 4800, maybe for stability reasons? see below.)

Which is better, the higher MT/s or the lower CL? Seems like neither? Also, MICRON RAM is not listed on ASUS’s memory compatibility site, but a user in the link by @diizzy has recently confirmed it working. Do I need to worry about the MICRON modules not working or being less stable?
The same user overclocked his MICRON sticks and already got ECC errors reported: “[..] to 5400MHz with CAS 38 I was able to boot to debian 12 and see corrected ecc error logs via dmesg [..]”. Seems like it’s not a good idea to overclock ECC RAM in general, so I might as well go with the MICRON RAM sticks at JEDEC specs, even though they use MICRON’s own chips and not SK HYNIX.

Once the RX 7800 XT is available (and Linux 6.5 on Arch, because of the updated EDAC), I’m going to order the motherboard, CPU and that GPU at the same time, so I have time to legally send anything back, no questions asked, in case something’s wrong.

diizzy · August 28, 2023, 10:41pm

I have no issues with Intel i225 myself on my Asus board but I guess there are edge cases. I use the Micron memory and it works great, in my Asus X670E Creator mobo.

ecclex · August 29, 2023, 10:42am

Yeah, not every mobo with the i225 seems to have disconnecting issues.

After more digging: According to MICRON’s info page on MT/s vs CAS Latency (CL) and which is more important, both parameters matter, but latency wins (it probably also depends on your application’s use case). Comparing the ECC RAM sticks’ latencies:
CL / (MT/s / 2) × 1000 → latency in nanoseconds:

Kingston Server Premier DIMM 32GB, DDR5-4800, CL40: ~16.67ns
Kingston Server Premier DIMM 32GB, DDR5-5200, CL42: ~16.15ns
Kingston Server Premier DIMM 32GB, DDR5-5600, CL46: ~16.43ns

The differences are very small. Compare this to non-ECC sticks like 6000CL32: 32/(6000/2) × 1000 → 10.67ns.
The lowest-latency one is the 5200CL42. The 5600 is too expensive (it has currently risen in price), and the only one officially supported by my future mobo is the 4800MT/s CL40 (16GB or 32GB per stick) (asus .com/motherboards-components/motherboards/tuf-gaming/tuf-gaming-x670e-plus/helpdesk_qvl_memory/?model2Name=TUF-GAMING-X670E-PLUS) and it’s also the cheapest. So I’m probably just going with 2x KSM48E40BD8KM-32HM (the HYNIX M-die chips on there are nice to have, but the deciding factor is that the sticks are officially supported; they are the only ones on the ECC RAM QVL).
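
The latency figures above can be reproduced with a short Python sketch. It assumes the usual first-word approximation (CL cycles at a memory clock of half the MT/s rate) and ignores every other timing; the function name cas_latency_ns is made up for illustration.

```python
def cas_latency_ns(cl: int, mt_per_s: int) -> float:
    """Approximate CAS latency in nanoseconds for a DDR module."""
    clock_mhz = mt_per_s / 2        # DDR: two transfers per clock cycle
    return cl / clock_mhz * 1000    # cycles / MHz gives microseconds; x1000 -> ns


# (CL, MT/s) pairs taken from the modules discussed in the post
modules = {
    "Kingston DDR5-4800 CL40 (ECC)": (40, 4800),
    "Kingston DDR5-5200 CL42 (ECC)": (42, 5200),
    "Kingston DDR5-5600 CL46 (ECC)": (46, 5600),
    "non-ECC DDR5-6000 CL32": (32, 6000),
}

for name, (cl, mts) in modules.items():
    print(f"{name}: {cas_latency_ns(cl, mts):.2f} ns")
```

This only compares the CAS part of the latency; real-world memory latency also depends on the other timings and on the memory controller.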

Initially I was sure I’d go with 2x16GB, but certain games already require 32GB (not a hard requirement I think, but stutters occur), plus OpenZFS, plus maybe one VM running. Also, 2 dual-rank sticks are supposedly faster than 2 single-rank ones, and with the relatively low current RAM prices the decision is easy.

Susanna · August 29, 2023, 11:06am

If you’re running them one DIMM per channel, my experience is that the 4800 bin will overclock just as well as any other Hynix M-die DIMM. 6800-32 was pretty easy to reach, and I could probably have gone beyond that fairly easily. ASUS QVL has some oddities around it, but as far as I can tell when they claim to support 4800 in four channels they actually mean they support two DIMM per channel. And setting the frequency manually without touching terminations does run it at that speed easily. Touching terminations I can get higher frequencies, but finding something that isn’t super finicky is tricky. I’ll probably try downgrading my BIOS to one that doesn’t limit VSOC and see if that lets me push higher, I had a far easier time going for high frequencies on that version.

bambinone · August 29, 2023, 3:16pm

Awesome news!

ecclex · August 30, 2023, 3:00pm

6800-32 was pretty easy to reach […]

If you mean that with ECC RAM then that’s pretty crazy in a positive sense (from reading I got the sense that ECC RAM is much less overclockable, but maybe that’s not the case? But the buyable ECC RAM linked above by me only shows stuff like 5600CL46, why no CL32, idk).
MICRON’s version for the same line of sticks with their own chips seems to be much less overclockable: to quote again:[..] to 5400MHz with CAS 38 I was able to boot to debian 12 and see corrected ecc error logs via dmesg [..].

ASUS QVL has some oddities around it […]

Seems like it, because in the link by @diizzy, a guy uses MICRON’s ECC RAM ( 2x 32GB DDR5-4800 ECC UDIMM 2Rx8 1.1V/(5V ext) CL40 - MTC20C2085S1EC48BR) [which is not in ASUS’s QVL] without problems. But it also was released just very recently and maybe ASUS just needs to verify and add it.

quilt · August 30, 2023, 3:21pm

Because 5600CL46 is a jedec spec. I.e. running at 1.1V. But those modules can apparently do the same speeds as the non-ecc hynix kits regardless.

Susanna · August 30, 2023, 3:40pm

The thing about overclocking ECC memory is, that while you usually can’t do it on server motherboards, and registered DIMMs are inherently limited by their buffer, ECC UDIMMs are basically the same as regular consumer non-ECC memory. The memory chips connect pretty much directly to the CPU, which handles all the memory controller stuff. All ECC does is add a few extra channels to report the parity. Or something like that, I only have a surface level understanding of this stuff. Consumers overclock memory because that’s the cheapest way to get good performance, servers are supposed to just add extra channels of slow but reliable memory instead.

My ECC memory uses Hynix M-die, which is basically top-tier for overclocking. Hynix A-die will reach higher frequencies, but zen4 doesn’t really benefit from going beyond about 6400MT/s anyway (the fabric will switch to half rate and murder your RAM & L3 performance). Micron chips, while perfectly good for JEDEC, are considerably less overclockable. Just compare some overclocking stories with non-ECC DIMMs and you’ll see a stark difference. On Intel you can get Hynix to sit stable at 8000+ fairly reliably, while Micron often struggles to hit even just 6000. Samsung is even worse.

Of course, overclocking ECC means you’re making it not really ECC any more. When Kingston qualified my memory, they did so at 4800. At that speed it’s server-grade and guaranteed to work. Pushing it beyond that is an overclock, and while in my experience it’ll still work just fine (including the ECC), Kingston can’t guarantee that it’s never going to return incorrect data. It’s my personal workstation and I back up to my NAS regularly, so in my case running these ECC DIMMs out of spec isn’t an issue, but if you’re setting your server up to host anything important, don’t overclock your memory. It could possibly defeat the purpose of paying extra for ECC in the first place. There are JEDEC 5600 certified DIMMs, and even faster than that are certainly on the way, so just buy those instead and run at a certified speed.

strauss2 · August 30, 2023, 10:32pm

What do you think about the Asrock Rack b650d4u? It has ECC support and ipmi and was in stock at $350 USD on newegg recently! I know the title says consumer stuff though.

ecclex · September 1, 2023, 4:23pm

Guess JEDEC decided 5600CL46 to be the super stable CL at the MT/s for now. Anandtech has a 5600CL34 as another JEDEC spec value, but idk whether they removed it or if it’s coming later and more maturing/testing is required.

Yeah, overclocking unfortunately is going to increase the ECC errors (or it’s how ECC errors can be generated on purpose (read, not confirmed by me): no shorting of pins, buying special kits, or overheating with a hot-air gun required).

All ECC does is add a few extra channels to report the parity. Or something like that, I only have a surface level understanding of this stuff.

Me neither. Something like (no claim for correctness): Mr. Hamming was one of those who was interested in detecting and correcting errors in data. His basic idea is still used today, aka Hamming code (examples with much fewer bits can be looked up, but it’s basically the same for 72 bits or wider). For single-rank non-ECC memory sticks it’s 64 bits and 64 lanes. For ECC memory there are an additional 8 bits and 8 lanes for the parity, 72 bits in total. The chips on a memory module can be 8 chips × 8 bits + 1 chip × 8 bits, but don’t have to be and can be e.g. 18 chips × 4 bits, but must match the 72 lanes in total. Those 8 extra lanes must be on the motherboard (and of course the BIOS must support ECC).
For the KSM48E40BD8KM-32HM, which is dual rank, x8, it’s 144 bits = 18 chips × 8 bits (= 2 × 72 for a single rank).
A modified/optimized Hamming algorithm, or sometimes still the original Hamming (72,64), is then used for the error detection and correction, aka ECC (72,64) SECDED (Single-Error Correcting and Double-Error Detecting) (based on Hamming (127,120)); different algos are by M. Y. Hsiao, by INTEL, by MARVELL, …
AfaiHaveRead, Hamming ECC algos can only detect 2-bit errors, not correct them; for multi-bit error correction, different algorithms, like BCH codes, are used. I wonder what algo those commercial multi-bit ECC servers actually use. I found ECC with BCH Algorithm on XILINX, but this question is for another topic.
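
To make the (72,64) SECDED idea concrete, here is a minimal Python sketch of a textbook extended Hamming code: 64 data bits, 7 Hamming check bits, and 1 overall parity bit. It is an assumption-laden illustration only. The bit layout is chosen for readability, and nothing here claims to match what AMD's memory controller or any real DIMM actually uses (real controllers often use Hsiao or symbol-based codes).

```python
from typing import List, Optional, Tuple

# Illustrative layout (an assumption, not a hardware layout):
# position 0 = overall parity, powers of two = Hamming check bits, rest = data.
CHECK_POS = [1, 2, 4, 8, 16, 32, 64]                        # 7 check bits
DATA_POS = [p for p in range(1, 72) if p not in CHECK_POS]  # 64 data positions


def encode(data: int) -> List[int]:
    """Encode a 64-bit integer into a 72-bit codeword (list of 0/1)."""
    if not 0 <= data < 1 << 64:
        raise ValueError("data must fit in 64 bits")
    code = [0] * 72
    for i, pos in enumerate(DATA_POS):
        code[pos] = (data >> i) & 1
    for c in CHECK_POS:
        # Each check bit makes the parity of all positions with bit c set even.
        code[c] = sum(code[p] for p in range(1, 72) if p & c) % 2
    code[0] = sum(code) % 2          # overall parity over the other 71 bits
    return code


def decode(received: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(received)
    syndrome = 0
    for p in range(1, 72):
        if code[p]:
            syndrome ^= p            # XOR of the positions of all set bits
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and syndrome < 72:
        code[syndrome] ^= 1          # single flip; syndrome 0 = the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable" # e.g. two flips: nonzero syndrome, even parity
    data = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return data, status


word = 0x0123456789ABCDEF
codeword = encode(word)
codeword[37] ^= 1                    # inject one bit flip
print(decode(codeword))
codeword[5] ^= 1                     # inject a second flip
print(decode(codeword))
```

With one injected flip the decoder recovers the original word and reports it as corrected; with two flips it can only report that the word is bad, which is the "detect 2-bit errors, not correct them" behavior described above.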

Good infos and confirmations you posted. Maybe there needs to be something like a wiki/FAQ on this forum for people like me/us who are just starting with ECC RAM.

I read that if a double-bit error is detected, a server is (or can be?) automatically halted. How do our consumer ASUS boards behave?

@strauss2 A variant of it has some bad ratings on AMAZON. Personally I would check if SUPERMICRO has something for AM5 first (though I’m not sure if it would be ~350$).
But maybe: Soon AMD is gonna release their “Siena” Zen 4 Epyc, starting with 70 Watt TDP CPUs. Maybe these 6 channel motherboards and low TDP CPUs are going to be similarly priced as AM5. Because for all Epyc generations, the cheapest CPUs are also the ones with the lowest TDP (en.wikipedia .org/wiki/Epyc):

1st gen Epyc, 120W CPU: 475 $
2nd gen Epyc, 120W CPU: 450 $
3rd gen Epyc, 155W CPU: 900 $
4th gen Epyc, 200W CPU: 1100 $

So a 70W TDP Epyc CPU could be super affordable.

edit
Ordered the parts.

edit 2
All ordered parts arrived: imgur .com/a/2cWAThN.
The CPU fan not pictured is a THERMALRIGHT Assassin King 120 SE for 23 bucks, seems like a very good price to performance.
Assembled in a new FRACTAL DESIGN North case (white, non TG). Default BIOS was version 1224 from March and not on ASUS’ page anymore, updated to latest.
Will test ECC soon.

ecclex · September 14, 2023, 12:55pm

ECC is set to Enabled. When changing the RAM timings from the default 40-39-39-77 4800 to 32-32-32-32 5000, I reliably get this mce error in dmesg, on a 5-minute interval, after running stress-ng --vm 16 --vm-bytes 95% --vm-method rowhammer --verify -t 1m:

sudo dmesg | grep -i 'edac\|mce\|hardware e\|ras'
[    0.346504] EDAC MC: Ver: 3.0.0
[    0.734491] RAS: Correctable Errors collector initialized.
[    5.966483] MCE: In-kernel MCE decoding enabled.
[    5.970316] EDAC MC0: Giving out device to module amd64_edac controller F19h_M60h: DEV 0000:00:18.3 (INTERRUPT)
[    5.970319] EDAC amd64: F19h_M60h detected (node 0).
[    5.970322] EDAC MC: UMC0 chip selects:
[    5.970322] EDAC amd64: MC: 0:     0MB 1:     0MB
[    5.970324] EDAC amd64: MC: 2: 16384MB 3: 16384MB
[    5.970326] EDAC MC: UMC1 chip selects:
[    5.970327] EDAC amd64: MC: 0:     0MB 1:     0MB
[    5.970328] EDAC amd64: MC: 2: 16384MB 3: 16384MB
[    7.069194] amdgpu 0000:0c:00.0: amdgpu: RAS: optional ras ta ucode is not available
[  313.769390] mce: [Hardware Error]: Machine check events logged
[ 1255.902703] mce: [Hardware Error]: Machine check events logged
$

Sometimes the system would crash, but more often than not it would produce that mce: line. The line does not appear with default RAM timings [and ECC Enabled].

But unfortunately there is no [Hardware Error]: Corrected error, no action required., that people often mention.
Does that mean it’s not certain that an ECC error was actually corrected and reported?

I ran stress-ng many times, also with --vm-method all, but that single mce line is all I got so far. STH and a different article mention stress-ng as a good program to cause ECC events.

Also nothing in these (except the mentioned mce: line in dmesg):

$ ras-mc-ctl --error-count
Label               	CE	UE
mc#0csrow#3channel#0	0	0
mc#0csrow#3channel#1	0	0
mc#0csrow#2channel#1	0	0
mc#0csrow#2channel#0	0	0

$ sudo edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
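
The same counters can also be read straight from the EDAC sysfs tree, which is handy for polling during a stress run. A minimal Python sketch, assuming the standard /sys/devices/system/edac/mc/mcN/ce_count and ue_count files are present (they only exist while an EDAC driver such as amd64_edac is loaded); the function name read_edac_counts is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Return {controller: {"ce_count": n, "ue_count": n}} for each mcN found."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):   # corrected / uncorrected totals
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


# On a machine without EDAC loaded this simply prints nothing.
for controller, entry in read_edac_counts().items():
    print(controller, entry)
```

If ce_count stays at 0 while the mce: line shows up in dmesg, that matches the output of ras-mc-ctl and edac-util above, since those tools report the same kernel counters.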

With default RAM timings, but trying 5200, the system crashes and boots into a reset BIOS with the last setting being set to 5200. With 32-32-32-32 5000 the system crashes after like 45min of normal use. With 32-32-32-17 5000 the system crashes randomly, minutes after booting, in normal use. And with 32-32-32-16 5000 it crashes a fraction of a second after entering the BIOS.

System: TUF Gaming B650-Plus BIOS version 1654, 7800X3D, 2x KSM48E40BD8KM-32HM, Arch with latest updates (Linux 6.5.3).

General other infos:

$ sudo lsmod | grep -i amd
amdgpu              12312576  1
amdxcp                 12288  1 amdgpu
drm_buddy              20480  1 amdgpu
gpu_sched              57344  1 amdgpu
amd64_edac             77824  0
edac_mce_amd           53248  1 amd64_edac
i2c_algo_bit           20480  1 amdgpu
drm_suballoc_helper    12288  1 amdgpu
kvm_amd               204800  0
drm_ttm_helper         12288  1 amdgpu
ttm                   110592  2 amdgpu,drm_ttm_helper
kvm                  1372160  1 kvm_amd
drm_display_helper    229376  1 amdgpu
gpio_amdpt             16384  0
gpio_generic           20480  1 gpio_amdpt
ccp                   159744  1 kvm_amd
video                  77824  3 asus_wmi,amdgpu,nvidia_modeset

edit
I found this:

$ sudo cat /sys/kernel/debug/tracing/events/mce/mce_record/enable
0
$ sudo su
# echo 1 >/sys/kernel/debug/tracing/events/mce/mce_record/enable
# exit
$ sudo cat /sys/kernel/debug/tracing/events/mce/mce_record/enable
1

After running stress-ng with the same unstable timings again, there’s a new kworker ..-line, which doesn’t appear when tracing is disabled:

sudo cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1   #P:16
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
     kworker/0:1-9       [000] .....   313.981082: mce_record: CPU: 0, MCGc/s: 11d/0, MC21: dc2040000400011b, IPID: 0000009600050f00, ADDR/MISC/SYND: 00000007d6637e00/d01a001701000000/c88200220a800602, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 2:a60f12, TIME: 1694946925, SOCKET: 0, APIC: 0
     kworker/0:0-7       [000] .....  1255.915547: mce_record: CPU: 0, MCGc/s: 11d/0, MC21: dc2040000400011b, IPID: 0000009600050f00, ADDR/MISC/SYND: 00000007ddc36140/d01a001301000000/c88200220a800602, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 2:a60f12, TIME: 1694951506, SOCKET: 0, APIC: 0

The question is of course how this is helpful or if this can be decoded to provide more infos.
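
One way to get more out of such a record is to pick apart the status word by hand. The Python sketch below decodes the MC21 status value and the IPID from the trace above. The bit positions are an assumption taken from the MCI_STATUS_* definitions in Linux's arch/x86/include/asm/mce.h as commonly documented for AMD; verify them against the AMD PPR for this CPU before relying on the result. The function names are made up for illustration.

```python
# Assumed AMD MCA_STATUS bit positions (per Linux MCI_STATUS_* macros; verify!)
STATUS_BITS = {
    "Val": 63, "Overflow": 62, "UC": 61, "En": 60, "MiscV": 59, "AddrV": 58,
    "PCC": 57, "TCC": 55, "SyndV": 53, "CECC": 46, "UECC": 45,
    "Deferred": 44, "Poison": 43,
}


def decode_status(status: int) -> dict:
    """Map each assumed flag name to True/False for a 64-bit status value."""
    return {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}


def decode_ipid(ipid: int) -> dict:
    """Split MCA_IPID into the fields Linux uses to identify the bank type."""
    return {
        "hwid": (ipid >> 32) & 0xFFF,       # assumed: bits 43:32
        "mca_type": (ipid >> 48) & 0xFFFF,  # assumed: bits 63:48
        "instance_id": ipid & 0xFFFFFFFF,
    }


status = 0xDC2040000400011B      # "MC21: dc2040000400011b" from the trace
ipid = 0x0000009600050F00        # "IPID: 0000009600050f00" from the trace

flags = decode_status(status)
print("set flags:", [name for name, is_set in flags.items() if is_set])
print("hwid: 0x%03x" % decode_ipid(ipid)["hwid"])
```

Under these assumptions the record has CECC set and UC, UECC and PCC clear, which would point to a corrected ECC error rather than an uncorrected one, and Overflow set, which would mean more errors occurred than were logged. The hardware ID 0x96 is, as far as I know, the value Linux's SMCA table associates with the unified memory controller (UMC), but treat that mapping as unconfirmed too.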

The output of sudo dmesg | grep -i 'edac\|mce\|hardware e\|ras' and sudo edac-util -v unfortunately remains unchanged and doesn’t show [Hardware Error]: Corrected error, no action required..

KeithMyers · September 14, 2023, 6:55pm

Do you have access to the IPMI interface? I see corrected MCE errors logged there in my servers.

09/07/2023, 00:17:35 | BIOS | Memory | Correctable ECC - Asserted
| | | | Rank on DIMM : 0
| | | | Socket ID : CPU1 ,Channel : A ,DIMM : 1

ecclex · September 15, 2023, 9:45am

Unfortunately no IPMI on consumer TUF Gaming B650-Plus.