
Asrock X570 Taichi Razer Edition - Missing ECC option!

Hi!

I have seen positive reports about the X570 Taichi. And screenshots from its BIOS confirmed that it has the ECC memory options:


Original author for those screenshots: techpowerup.com via Grundlagenartikel - X570-Boards mit ECC-Unterstützung Seite 2 | igor´sLAB Community

I have now ordered the "Razer edition" of this board, because it has better chipset fan placement and some slight updates (2.5 Gb networking, which I don’t use).

However:
Looking at the fully updated BIOS P1.40 as of February 2021 it seems to be missing the menu “AMD CBS \ UMC Common Options \ Common RAS” for ECC settings. I can only see the “DRAM Memory Mapping” menu but not the “Common RAS” menu!


Did Asrock actually remove those options for the Razer edition???

Memtest86 v9 claims that ECC memory is present and enabled:

Using only 1 DIMM, the Samsung 2666 ECC (M391A4G43MB1-CTD) memory has extreme overclocking potential: in my tests it was stable up to 3666 without even raising the voltage above the default of 1.20V.

However, even when trying different overclocking and undervolting combinations for multiple hours, the computer either reboots during Memtest86 testing or completes multiple test runs without error! I never get a corrected ECC report in Memtest86. This makes me doubt whether ECC actually works.

Did Asrock remove ECC support from the X570 Taichi Razer edition board variant?

Appendix:
version screenshots:


Triggering ECC error reports with OC is hard. The margin between “not enough oc = no ecc errors” and “too much oc = no post/straight reboots” is thin.

If you don’t see ECC errors then you have neither confirmed nor ruled out anything. The board can still have functioning ECC reporting - you just might not have hit the OC sweet spot. At the same time it is also possible that ECC isn’t functioning.

Although I would bet on ECC being functional - engineers are lazy creatures, the underlying code probably didn’t change and ECC is set to auto by default. The menu entry could have been deleted to simplify things for the “gaming/rgb” crowd.

Please refer to my previous post: Does A520 support ECC memory? - #8 by Remisc

And the original with the meat (his post contains successful “OC to trigger ECC errors”):

Also: congrats on hitting 3666 on ECC kit, I did not think ECC memory could OC so well. ECC dram chips are usually bottom bins so it’s nice to see that there is potential.

@wendell mentions in his review that it works on this board, so I’d be surprised if it was removed.

They might just have moved the option, happens a lot. Maybe under Security?


+1, I suppose that the ECC options might have been moved under a different category
compared to the regular Taichi board.
This is one of the annoying things about Asrock BIOSes in general,
because they are sometimes pretty inconsistent about where they put things
or how they name certain options.

But Asrock is improving on that.

I looked everywhere, I can’t find that option anywhere else in the BIOS. It’s not under security. There’s a small chance I might still have missed it.

I also want to believe that the ECC functionality is just active and working. But some user reported in this German forum for his Gigabyte board that when he left the ECC setting on Auto it was actually disabled and he had to manually force it to Enabled for it to work: https://www.igorslab.de/community/threads/x570-boards-mit-ecc-unterstützung.1554/post-118853 So I would rather verify than trust.

That’s what I was afraid of too: that the margin between not enough OC to trigger errors and too much OC to boot is too thin to verify that it’s working. I have now done almost 2 days of torture testing with the highest overclock that doesn’t constantly trigger reboots, and either no 1-bit corrected errors occurred or they were not reported. I wonder what causes the reboots if I go any higher: would this be from resets caused by detected multibit errors? Or would this just be plain instability? It’s important to note: I only get reboots, never freezes in Memtest86! Is there any way to determine the cause of the reset/reboot?

Maybe we need one of those:


source from Passmark/memtest employee on Twitter
Quote: “Should be ready in a couple of months.” :heart_eyes:
It doesn’t look hugely complicated, but it is ingenious: it’s a small switch and some resistors on a RAM riser to inject errors. Not sure where to get those risers though.

That doesn’t really surprise me that much.
And it might also be the case with Asus boards as well maybe.
An “auto” setting within the bios does not always do what you expect it to do.
Sometimes it actually doesn’t really do anything.
And you actually have to manually enable a said feature for it to work properly. :slight_smile:

That’s why I really don’t like the menu to be missing. There is nothing I can change from “Auto” to “Enabled” if I can’t see it :wink:.

As I showed in my screenshot: Memtest86 reports: “ECC Enabled: Yes (ECC correction)”, if I put in an ECC DIMM, but I don’t know if that means it’s actually working/correcting/reporting.

Well it’s a pretty new board.
So it could be that the feature will be added to the bios,
with future bios updates maybe.
Or the feature is present but hidden deeply somewhere…
I don’t own the said board so i cannot poke at it unfortunately.

Maybe @wendell could shine a light on this.
Because he recently reviewed the said board.
Or maybe he knows a proper way to actually test ECC functionality.
That way you guys could test it.
Maybe the board just auto-detects at this point.


Oh, that is a very nice find. I was actually considering building something like this myself from a DDR4 riser: https://aliexpress.com/i/32958843576.html

I didn’t think it was already a reality in the wild

Also: about the missing errors: this board can also be experiencing “Platform first error handling” shenanigans besides just ECC setting.
Refer for example to this recent post:

And already mentioned post:

(Search for more PFEH / Platform first error handling in that thread)


I am no memtest engineer but I think I have some basic understanding, if anyone has more authoritative information please correct me.

The problem is that there is a period when RAM needs to function before we start the memtest.

First the memory-training must pass.
And memory training means talking to ddr chips.

Too unstable settings won’t post.

Then after POST it’s not like UEFI doesn’t use any RAM. It does; it’s not much, but errors there can mean a reboot.

And after that, if we were lucky enough that memory training passed and we got through the UEFI, what is left is loading memtest.
That also uses some RAM. The amount is minimal but at that point RAM must also function correctly.

After all those parts without major errors the testing begins and now we expect errors to start happening.

@Remisc What is peculiar is that at a certain overclock the Asrock x570 Taichi Razer reliably posts and boots Memtest86, but then also reliably reboots after a few seconds of testing. If it had corrupted an area that is essential to Memtest86 working I would expect a crash instead of a reboot.
I am just making an educated guess here, but I think the board does not properly report 1 bit errors and then multibit errors trigger a reset generated by the memory controller.

This is a documentation for an unrelated Oracle server, but this is what I think could be happening:

For all operating systems, the behavior is the same for uncorrectable errors (UCEs):

  1. When a UCE occurs, the memory controller causes an immediate reboot of the system.

  2. During reboot, the BIOS checks the Machine Check registers and determines that the previous reboot was due to a UCE.

The uncorrectable ECC error is displayed in the service processor’s system event log (SEL) as shown here:

Memory | Uncorrectable ECC | Asserted | DIMM A0

Source: Troubleshooting DIMM Problems

It may also be related to the “PFEH / Platform first error handling” issue that you mentioned: ASRock Rack has created the first AM4 socket server boards, X470D4U, X470D4U2-2T - #1201 by Mastakilla

Since this board does not have IPMI, nor a PFEH setting that I know of - is there any way to check the content of the Machine Check registers to determine reboot cause?


MC registers are the very thing that is used to relay ECC errors to the OS.
this OS can be linux, windows, memtests, etc.

After UEFI has finished the handover, memtest should be in control of what happens after ECC errors are detected.
For example, multi-bit errors on linux don’t mean an instant restart. A panic/reboot only happens when the address in question is in the kernel memory range. If it’s in user memory then that process is simply killed and the OS continues to function.
In the case of memtest those MC registers are interpreted and printed as your typical messages about detected errors.

I think that what you are describing with restarts after few seconds of memtest can mean 2 things:

  • PFEH - in that case the decision to reboot is made before the errors have a chance to be seen by the OS (memtest)
  • UCE you mentioned (extreme instability causing memory controller to give up when hit with a load)

Edit:

In theory something like msr-tools could probably be used - you can read the corresponding MSR registers.
Register guide for 17h ryzens: Developer Guides, Manuals & ISA Documents - AMD
and BKDG for the 15h family (2.13.1.6 Handling Machine Check Exceptions)

But that is assuming that the register was not cleared. And that is a longshot. If I remember correctly this is configurable and can be a part of PFEH.

Edit2:
Relevant excerpts from the BKDG:

Uncorrectable errors detected by this mechanism will set both UECC=1 and CECC=1 in MSR0000_0411.

  • Determine type of startup using D18F0x6C[ColdRstDet]:
      • If this is a cold reset then BIOS must clear MSR0000_0411 [NB Machine Check Status (MC4_STATUS)].
      • If this is a warm reset then BIOS may check for valid MCA errors and if present save the status for later use. See 2.13.1.6 [Handling Machine Check Exceptions].

note the “may”
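For anyone who wants to try the register read from Linux, here is a minimal sketch of what msr-tools does under the hood. It assumes the `msr` kernel module is loaded and the script runs as root. The address `0x411` and the CECC/UECC bit positions (46 and 45) are taken from the family 15h documentation and the Linux mce code; whether they still point at the right bank on a 17h Ryzen is an assumption that nobody in this thread has verified. The function names are made up for illustration.

```python
import os
import struct

# Address quoted from the family 15h BKDG (MC4_STATUS).
# ASSUMPTION: this may not be the relevant bank on family 17h (Zen).
MC4_STATUS = 0x411

# Bit positions of AMD MCi_STATUS fields as used for family 15h.
# ASSUMPTION: treated here as unchanged on 17h.
STATUS_BITS = {
    "val": 63,       # register contains a valid error
    "overflow": 62,  # an error was lost because the register was occupied
    "uc": 61,        # error was not corrected by hardware
    "en": 60,        # error reporting was enabled for this error
    "pcc": 57,       # processor context may be corrupt
    "cecc": 46,      # correctable ECC error
    "uecc": 45,      # uncorrectable ECC error
}


def read_msr(address, cpu=0):
    """Read one 64-bit MSR through /dev/cpu/<cpu>/msr (root + msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr driver uses the file offset as the register address.
        raw = os.pread(fd, 8, address)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def decode_mc_status(value):
    """Split a raw status value into the named flag bits above."""
    return {name: bool((value >> bit) & 1) for name, bit in STATUS_BITS.items()}
```

As root, `decode_mc_status(read_msr(MC4_STATUS))` would then show whether anything survived the reboot. If `val` is false, the register was either never set or already cleared by the firmware or the kernel, so an empty result proves nothing either way.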


Wow, that is a 1000-page document plus referenced documents. But a crash/panic is not a cold reset. It should be a warm reset, so the register is not cleared. It says that if it’s not a cold reset then the BIOS may save the error register to a log. But after it has saved the error, is it implied that the register is cleared before it boots the OS?
In other words: assuming that the reboot happens before the errors have a chance to be seen by Memtest86 and that the UEFI does not clear the register immediately at reboot: can I read that MSR register from a user process in Windows/Linux after the reboot? (msr-tools that you mentioned: Reading CPU model specific registers on linux - Super User ) Or is it already too late once Linux/Windows has booted? I could also compile a freestanding C program that reads the register and boot it off a USB drive without an OS (Bare Bones - OSDev Wiki) after a memtest panic.

Any ideas?

Please note that public BKDG documents are only available for 15h family (Bulldozer / Piledriver / Steamroller / Excavator).
A BKDG for Zen seems conspicuously absent (even though a few years have passed). AFAIK drafts are distributed under an NDA but I wouldn’t expect much to change when it comes to the MSR registers between 15h and 17h.
Just keep this in mind and don’t take that document for gospel.

Going back to your question:
You should be able to read it with msr-tools. But depending on how linux handles it you may have to disable the edac driver. It handles ECC errors and may be clearing/checking those registers at startup.

There is also a possibility that msr-tools is redundant and the edac driver can already read/dump it to syslog.
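If the edac driver does load for the memory controller, its corrected/uncorrected counters are exposed in sysfs, which is an easier first check than raw MSR reads. A minimal sketch, assuming the usual `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` layout; the function name is made up for illustration:

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce": n, "ue": m}, ...} for each memory controller.

    An empty dict means no EDAC memory controller is registered, which
    usually means the driver is not loaded or ECC was not detected.
    """
    root = Path(edac_root)
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip entries that are missing or unreadable.
            continue
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts
```

Calling `read_edac_counts()` before and after a stress run and comparing the `ce` values shows whether any corrected errors were logged. Counters that stay at zero only tell you that nothing was reported, not that nothing happened.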
I am not that familiar with the edac/mca internals but if you are interested then those are the links that should contain most of the meat:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/edac
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/kernel/cpu/mce
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/include/asm/mce.h
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/include/uapi/asm/mce.h

+ maybe the related OSdev wiki pages for reference:
https://wiki.osdev.org/Interrupt
https://wiki.osdev.org/Exceptions#Machine_Check


I have zero experience/knowledge on windows apart from expecting something in the event logs in the administration tools section of control panel.


edit:

Writing a freestanding program is possible, but I would bet on it being too much work for little reward; the same should be achievable with just a careful linux driver modification. Unless you are in the market for a small self-development project: it would probably be a good learning experience (and there are fewer moving parts than in linux).

And instead of trying to figure it out you may also just try contacting someone who might already know how to check properly: for example the kernel mailing list or an email to one of the edac driver authors.


@Remisc Very well explained and great links. Thank you!

I am currently working on two other side projects and time is limited, but I will link here if I have some code/blog that I can share or if I continue the discussion somewhere else, like on the Linux kernel list.

In the meantime @wendell also hinted at a video on ECC on linux with AM4, so maybe he will share some more insights.

I just received a similar DDR4 riser. I wonder what resistor values the Passmark team used in their prototype? :slight_smile:

I think the color code on their twitter image is blue-brown-black-silver-brown, so 6.1 ohm. But a simple short also worked in the source I linked previously.
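To double-check that reading of the bands, here is a small sketch of a 5-band decoder using the standard resistor color code (three digits, multiplier, tolerance). The band colors are only my guess from the photo, and the function name is made up for illustration.

```python
from fractions import Fraction

# Standard resistor color code: digit values per color.
DIGITS = {
    "black": 0, "brown": 1, "red": 2, "orange": 3, "yellow": 4,
    "green": 5, "blue": 6, "violet": 7, "grey": 8, "white": 9,
}
# Multiplier band as a power of ten; gold and silver scale the value down.
MULTIPLIER_EXP = {**DIGITS, "gold": -1, "silver": -2}
# Tolerance band in percent.
TOLERANCE = {
    "brown": 1.0, "red": 2.0, "green": 0.5, "blue": 0.25,
    "violet": 0.1, "grey": 0.05, "gold": 5.0, "silver": 10.0,
}


def decode_5band(bands):
    """Return (resistance in ohm as a Fraction, tolerance in percent)."""
    d1, d2, d3, mult, tol = bands
    digits = DIGITS[d1] * 100 + DIGITS[d2] * 10 + DIGITS[d3]
    # Fractions keep the result exact (no 6.1000000000000005 artifacts).
    ohms = Fraction(digits) * Fraction(10) ** MULTIPLIER_EXP[mult]
    return ohms, TOLERANCE[tol]


ohms, tolerance = decode_5band(["blue", "brown", "black", "silver", "brown"])
print(f"{float(ohms)} ohm, +/-{tolerance}%")
```

With those five bands this gives 610 x 0.01 = 6.1 ohm with a 1% tolerance band, which matches the value above, provided the colors were read correctly from the image.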
