Ryzen, motherboards, and the state of proper ECC support

Okay, first off, a shoutout to Micro Center’s amazing return policy. Without it this would have been very risky and potentially expensive to figure out, and I probably would never have tried. I did not abuse their policy or anything for this, but it sure gave me peace of mind while getting to the bottom of it.

First I need to address ASRock and call them out here. I gathered from these forums, as well as from emailing support directly, that the platform simply cannot support ECC reporting… supposedly.

See below for a snippet from ASRock support.

Q1:
If the ECC ram is correcting errors will that be reported to the Windows operating system with ASRock motherboards? (posts on the internet say it should for proper ECC support)
A:
No, it need to manually to check the ECC function by type command.

First off, they completely dodged my question (the one above is one they wrote (???)).

Second of all, this is a complete lie. I have tested the X570 Taichi and you get WHEA errors in the event log! Fantastic! However, oddly enough, my X470 Taichi does not report anything with the same memory and CPU.

What they are referring to by command is opening a command prompt and typing

wmic memphysical get memoryerrorcorrection

This returns 6 for multi-bit ECC (or at least it does if the hardware claims ECC is working).
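For anyone who wants to make sense of the other numbers that command can print, here is a small Python sketch. It assumes the wmic `memphysical` alias is reading the `MemoryErrorCorrection` property of `Win32_PhysicalMemoryArray`, and the function name `describe_ecc_code` is just something made up for illustration. Keep in mind this only tells you what the firmware claims, not whether errors actually get reported.

```python
# Meanings of the MemoryErrorCorrection value printed by
# `wmic memphysical get memoryerrorcorrection` (assumed to come from
# Win32_PhysicalMemoryArray.MemoryErrorCorrection).
MEMORY_ERROR_CORRECTION = {
    0: "Reserved",
    1: "Other",
    2: "Unknown",
    3: "None",
    4: "Parity",
    5: "Single-bit ECC",
    6: "Multi-bit ECC",
    7: "CRC",
}


def describe_ecc_code(raw):
    """Translate the number printed by wmic into a readable label."""
    code = int(str(raw).strip())
    return MEMORY_ERROR_CORRECTION.get(code, "Unrecognized value %d" % code)


print(describe_ecc_code("6"))
```

A value of 3 would mean the firmware does not claim ECC at all, which is a different problem from the missing reporting discussed here.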

Furthermore, it appears ASRock Rack’s X470D4U board also lacks support for proper reporting (per some thread on here I cannot find right now).

So at this point you would think: okay, X470 just does not work and X570 does, right? Not exactly. For example, my Asus X570 Gaming TUF does not report in Windows (but does appear to in memtest and kind of reports in Linux). More on this later.

So here I will sum up what I KNOW

With the latest BIOSes:
ASRock X470 Taichi - No working reporting, claims multi-bit ECC in Windows
ASUS X570 Gaming TUF - No working reporting, claims multi-bit ECC in Windows
GIGABYTE B550 Vision D - Working reporting, claims multi-bit ECC in Windows
ASRock X570 Taichi - Working reporting, claims multi-bit ECC in Windows

So now a word on the X570 Gaming TUF and what I mean by “it kind of works” in Linux. For some reason, using

edac-util -v

I never see any corrected errors. However, using

journalctl

and grep to pick out memory-related errors, it seems to report something, but obviously not in a way that Windows sees or recognizes, and maybe in a way that even confuses whatever mechanism reports it in Linux. I don’t know how this works on the Linux side at all, so perhaps someone else can enlighten us on why this could happen (HELP!)
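In case it helps anyone repeat the grep step, here is a rough Python sketch of the same filtering. The two patterns (`EDAC` and `[Hardware Error]`) are an assumption based on what the kernel messages quoted in this thread look like, so other kernels or drivers may word things differently, and `memory_error_lines` is a made-up name.

```python
import re

# Markers seen in kernel ECC/machine-check messages quoted in this thread.
# This list is an assumption, not an exhaustive set of what the kernel can log.
MEMORY_ERROR_PATTERN = re.compile(r"\bEDAC\b|\[Hardware Error\]")


def memory_error_lines(log_text):
    """Return only the log lines that look like EDAC or machine-check reports."""
    return [
        line
        for line in log_text.splitlines()
        if MEMORY_ERROR_PATTERN.search(line)
    ]
```

You would feed it saved kernel log text, for example the output of `journalctl -k` written to a file and read back in.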

Furthermore, in bootable memtest86 I can get it to report an ECC error; however, it does not specify whether it is multi-bit or single-bit.

Lastly, on the systems where reporting does not work, Windows never blue-screened. It would always hard-reset, whereas the other systems would blue-screen. Obviously the boards with working reporting fail more gracefully, or at the very least more informatively.

The method I am using to cause memory errors is overclocking my memory to the tightest primary timings I can run stable at 3600 MHz and 1.35 V, then dropping the voltage to where I start to easily see errors without crashing (1.29 V for me).

In conclusion, I am almost done venting. However, I think it’s absolutely unacceptable that on server hardware like ASRock Rack they are trying to wriggle out of providing proper ECC support. If it is an X470 limitation they should say so; otherwise they should swallow whatever it costs and make it work. On a regular board I could be more forgiving, but this is not acceptable.

As a side note, Asus support was only marginally better, claiming my board was broken (when it obviously works fine) and telling me to RMA it. I have a second identical board I will use to verify the behavior, but I seriously doubt the board is at fault, as it is rock stable and I have no issues with it other than this.

I have no ill will toward either company’s support reps, but engineering needs to step up and own this or explain why it doesn’t work.

This is the thread

ECC support works, reporting to the OS works - both single and dual bit errors. If you have issues under Windows then that’s Windows’ fault. ECC reporting is just a bunch of registers/interrupts - much lower in the HW stack than the OS itself. The OS just needs to be able to read/interpret what the CPU is saying.

What doesn’t work is the reporting by the IPMI on those boards.

My theory is that ASRock is under an embargo/mandate from AMD that prevents them from declaring full ECC support on non-EPYC CPUs. Or AMD simply does not provide ASRock with everything they need for full support validation. (market segmentation)


What could cause edac-util not to report any corrected errors though? It seems strange that I would not see the errors there. Also would you know how to tell if memtest is seeing single or multi-bit errors?

EDIT: Thank you for the response, if it works I would be thrilled! Even if it’s not in IPMI I can work with that. No reporting at all, though, was a nightmare scenario.

I don’t know, it works on my board (X470D4U)
With a known-bad configuration that causes semi-frequent ECC errors and a bit of load on the host OS:

edac-util -v:

mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 1 Corrected Errors
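If you want to check output like the above without eyeballing every line, here is a minimal Python sketch that pulls out only the non-zero counters. The regex is written against the line layout shown in this post, so treat it as an assumption that other edac-util versions print the same shape; `nonzero_counts` is a made-up name.

```python
import re

# Matches lines such as
#   "mc0: csrow3: mc#0csrow#3channel#1: 1 Corrected Errors"
#   "mc0: 0 Uncorrected Errors with no DIMM info"
COUNT_LINE = re.compile(
    r"^(?P<location>.+): (?P<count>\d+) (?P<kind>Corrected|Uncorrected) Errors"
)


def nonzero_counts(edac_output):
    """Return (location, kind, count) for every counter above zero."""
    hits = []
    for line in edac_output.splitlines():
        match = COUNT_LINE.match(line.strip())
        if match and int(match.group("count")) > 0:
            hits.append(
                (match.group("location"), match.group("kind"), int(match.group("count")))
            )
    return hits
```

Run against the output above, it should leave just the one corrected error on csrow 3, channel 1.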

dmesg/journalctl:

[  305.000689] mce: [Hardware Error]: Machine check events logged
[  305.006525] [Hardware Error]: Corrected error, no action required.
[  305.012716] [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
[  305.023419] [Hardware Error]: Error Addr: 0x00000000002c9d00
[  305.029076] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
[  305.036826] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  305.045216] EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x593 offset:0xa00 grain:64 syndrome:0xa40)
[  305.055905] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
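For reading the MC16_STATUS value by hand, here is a rough Python sketch that decodes the flag bits. The bit positions are an assumption taken from how the Linux kernel's AMD machine-check decoder labels them, and they line up with the bracketed flags printed in the log above; the names `decode_status` and `is_corrected_ecc` are made up for illustration. Note this separates corrected from uncorrected ECC errors, which is not necessarily the same thing as single-bit versus multi-bit.

```python
# (bit, label) pairs, highest bit first. Positions are assumed from the
# Linux kernel's AMD MCE decoder; only a subset of the register is covered.
STATUS_FLAGS = [
    (63, "Val"),       # register contents are valid
    (62, "Over"),      # an earlier error was overwritten
    (61, "UC"),        # uncorrected error
    (60, "En"),        # error reporting enabled
    (59, "MiscV"),     # MISC register valid
    (58, "AddrV"),     # error address valid
    (57, "PCC"),       # processor context corrupt
    (55, "TCC"),       # task context corrupt
    (53, "SyndV"),     # syndrome register valid
    (46, "CECC"),      # corrected ECC error
    (45, "UECC"),      # uncorrected ECC error
    (44, "Deferred"),  # deferred error
    (43, "Poison"),    # poisoned data consumed
]


def decode_status(status):
    """Return the labels of all known flag bits set in an MCA status value."""
    return [label for bit, label in STATUS_FLAGS if (status >> bit) & 1]


def is_corrected_ecc(status):
    """True if the status reports a corrected ECC error and no uncorrected one."""
    flags = decode_status(status)
    return "CECC" in flags and "UC" not in flags and "UECC" not in flags


print(decode_status(0xDC2040000000011B))
```

For the value in the log above this gives Val, Over, En, MiscV, AddrV, SyndV and CECC, which agrees with the kernel's own "Corrected error, no action required" line.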

AFAIK there is no distinction. memtest simply checks for errors and will report whichever happens, be it a single- or double-bit error (if it happens at all).
