Asrock X570 Taichi Razer Edition - Missing ECC option!

Hi!

I have seen positive reports about the X570 Taichi. And screenshots from its BIOS confirmed that it has the ECC memory options:


Original author for those screenshots: techpowerup.com via Grundlagenartikel - X570-Boards mit ECC-Unterstützung Seite 2 | igor´sLAB Community

Now I ordered the "Razer edition" of this board, because it has a better chipset fan placement and also some slight updates (2.5 Gb network, which I don’t use).

However:
Looking at the fully updated BIOS P1.40 as of February 2021 it seems to be missing the menu “AMD CBS \ UMC Common Options \ Common RAS” for ECC settings. I can only see the “DRAM Memory Mapping” menu but not the “Common RAS” menu!


Did Asrock actually remove those options for the Razer edition???

Memtest86 v9 claims that ECC memory is present and enabled:

Using only 1 DIMM, the Samsung 2666 ECC (M391A4G43MB1-CTD) memory has extreme overclocking potential: in my tests it was stable up to 3666 without even raising the voltage above the default of 1.20V.

However, even when trying different overclocking and undervolting combinations for multiple hours, the computer either reboots during Memtest86 testing or it completes multiple test runs without error! I never get a corrected ECC report with Memtest86. This makes me doubt whether ECC actually works.

Did Asrock remove ECC support from the X570 Taichi Razer edition board variant?

Appendix:
version screenshots:


Triggering ECC error reports with OC is hard. The margin between “not enough oc = no ecc errors” and “too much oc = no post/straight reboots” is thin.

If you don’t see ECC errors then you have neither confirmed nor denied anything. The board can still have functioning ECC reporting - you just might not have hit the OC sweet spot. At the same time it is also possible that ECC isn’t functioning.

Although I would bet on ECC being functional - engineers are lazy creatures, the underlying code probably didn’t change and ECC is set to auto by default. The menu entry could have been deleted to simplify things for the “gaming/rgb” crowd.

Please refer to my previous post: Does A520 support ECC memory? - #8 by Remisc

And the original with the meat (his post contains successful “OC to trigger ECC errors”):

Also: congrats on hitting 3666 on an ECC kit, I did not think ECC memory could OC so well. ECC DRAM chips are usually bottom bins so it’s nice to see that there is potential.

@wendell mentions in his review that it works on this board, so I’d be surprised if it was removed.

They might just have moved the option, happens a lot. Maybe under Security?


+1, I suppose the ECC options might have been moved under a different category
compared to the regular Taichi board.
This is one of the annoying things about Asrock BIOSes in general,
because they are sometimes pretty inconsistent about where they put things
or how they name certain options.

But Asrock is improving on that.

I looked everywhere, I can’t find that option anywhere else in the BIOS. It’s not under security. There’s a small chance I might still have missed it.

I also want to believe that the ECC functionality is just active and working. But some user reported in this German forum for his Gigabyte board that when he left the ECC setting on Auto it was actually disabled and he had to manually force it to Enabled for it to work: https://www.igorslab.de/community/threads/x570-boards-mit-ecc-unterstützung.1554/post-118853 So I would rather verify than trust.

That’s what I was afraid of too: that the margin between not enough OC to trigger errors and too much OC to boot is too thin to verify it’s working. I have now done almost 2 days of torture testing with the highest overclock that doesn’t constantly trigger reboots, and I could not trigger any 1-bit corrected errors, or they were not reported. I wonder what causes the reboots if I go any higher: are these resets caused by detected multi-bit errors? Or is this just plain instability? It’s important to note: I only get reboots, never freezes in Memtest86! Is there any way to determine the cause of the reset/reboot?

Maybe we need one of those:


source from Passmark/memtest employee on Twitter
Quote: “Should be ready in a couple of months.” :heart_eyes:
It doesn’t look hugely complicated, but it is ingenious: it’s a small switch and some resistors on a RAM riser to inject errors. Not sure where to get those risers though.

That doesn’t really surprise me that much.
And it might be the case with Asus boards as well.
An “auto” setting in the BIOS does not always do what you expect it to do.
Sometimes it doesn’t really do anything,
and you have to manually enable the feature for it to work properly. :slight_smile:

That’s why I really don’t like the menu to be missing. There is nothing I can change from “Auto” to “Enabled” if I can’t see it :wink:.

As I showed in my screenshot: Memtest86 reports: “ECC Enabled: Yes (ECC correction)”, if I put in an ECC DIMM, but I don’t know if that means it’s actually working/correcting/reporting.

Well it’s a pretty new board.
So it could be that the feature will be added
with future BIOS updates.
Or the feature is present but hidden deeply somewhere…
I don’t own the board so I cannot poke at it unfortunately.

Maybe @wendell could shine a light on this,
because he recently reviewed the board.
Or maybe he knows a proper way to actually test ECC functionality.
That way you guys could test it.
Maybe the board just auto-detects at this point.


Oh, that is a very nice find. I was actually considering doing something like this with a DDR4 riser myself: https://aliexpress.com/i/32958843576.html

I didn’t think it was already a reality in the wild.

Also, about the missing errors: besides the ECC setting itself, this board could also be affected by “Platform first error handling” shenanigans.
Refer for example to this recent post:

And already mentioned post:

(Search for more PFEH / Platform first error handling in that thread)


I am no memtest engineer but I think I have some basic understanding, if anyone has more authoritative information please correct me.

The problem is that there is a period when RAM needs to function before we start the memtest.

First the memory-training must pass.
And memory training means talking to ddr chips.

Too unstable settings won’t post.

Then after POST it’s not as if UEFI doesn’t use any RAM. It does; it’s not much, but errors there can mean a reboot.

And after that, if we were lucky enough that memory training passed and we got through UEFI, what is left is loading memtest.
That also uses some RAM. The amount is minimal, but at that point RAM must also function correctly.

After all those parts without major errors the testing begins and now we expect errors to start happening.

@Remisc What is peculiar is that at a certain overclock the Asrock x570 Taichi Razer reliably posts and boots Memtest86, but then also reliably reboots after a few seconds of testing. If it had corrupted an area that is essential to Memtest86 working I would expect a crash instead of a reboot.
I am just making an educated guess here, but I think the board does not properly report 1 bit errors and then multibit errors trigger a reset generated by the memory controller.

This is documentation for an unrelated Oracle server, but this is what I think could be happening:

For all operating systems, the behavior is the same for uncorrectable errors (UCEs):

  1. When a UCE occurs, the memory controller causes an immediate reboot of the system.

  2. During reboot, the BIOS checks the Machine Check registers and determines that the previous reboot was due to a UCE.

The uncorrectable ECC error is displayed in the service processor’s system event log (SEL) as shown here:

Memory | Uncorrectable ECC | Asserted | DIMM A0

Source: Troubleshooting DIMM Problems

May also be related to the “PFEH / Platform first error handling” problem that you mentioned: ASRock Rack has created the first AM4 socket server boards, X470D4U, X470D4U2-2T - #1201 by Mastakilla

Since this board does not have IPMI, nor a PFEH setting that I know of - is there any way to check the content of the Machine Check registers to determine reboot cause?


MC registers are the very thing that is used to relay ECC errors to the OS.
This OS can be Linux, Windows, memtest, etc.

After UEFI has finished the handover, memtest should be in control of what happens after ECC errors are detected.
For example, multi-bit errors on Linux don’t mean an instant restart. Panic/reboot only happens when the address in question is in the kernel memory range. If it’s in user memory then that process is simply killed and the OS continues to function.
In the case of memtest, those MC registers are interpreted and printed as your typical messages about detected errors.

I think that what you are describing, restarts after a few seconds of memtest, can mean 2 things:

  • PFEH - in that case the decision to reboot is made before the errors have a chance to be seen by the OS (memtest)
  • the UCE you mentioned (extreme instability causing the memory controller to give up when hit with a load)

Edit:

In theory something like msr-tools could probably be used - you can read the corresponding MSR registers.
Register guide for 17h Ryzens: Developer Guides, Manuals & ISA Documents - AMD
and BKDG for the 15h family (2.13.1.6 Handling Machine Check Exceptions)

But that is assuming that the register was not cleared. And that is a longshot. If I remember correctly this is configurable and can be a part of PFEH.
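
For anyone who wants to try this without msr-tools: a minimal Python sketch of what `rdmsr` does, assuming Linux with the `msr` kernel module loaded (`modprobe msr`) and root rights. The driver exposes each register in `/dev/cpu/N/msr` at a file offset equal to the MSR address. The address 0x411 is MSR0000_0411 from the 15h BKDG; whether that is the right register to look at on 17h is an assumption on my part. `read_msr` and its `path` parameter are made-up names for illustration.

```python
import os
import struct

# MSR0000_0411 (NB Machine Check Status) as named in the family 15h BKDG.
# Assumption: the same address is still meaningful on 17h parts.
MC4_STATUS = 0x411


def read_msr(msr, cpu=0, path=None):
    """Read one 64-bit MSR through the Linux msr driver.

    `path` is only there so the function can be pointed at a
    regular file instead of the real device node.
    """
    path = path or f"/dev/cpu/{cpu}/msr"
    fd = os.open(path, os.O_RDONLY)
    try:
        # The msr driver uses the file offset as the register address.
        raw = os.pread(fd, 8, msr)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read from {path} at {msr:#x}")
    return struct.unpack("<Q", raw)[0]


if __name__ == "__main__":
    try:
        print(f"MSR {MC4_STATUS:#x} = {read_msr(MC4_STATUS):#018x}")
    except OSError as exc:
        # Typical causes: msr module not loaded, or not running as root.
        print(f"could not read MSR: {exc}")
```

A non-zero value after a warm reset would only be a hint, since the firmware or the kernel may already have cleared or rewritten the register by then.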

Edit2:
Relevant excerpts from the BKDG:

Uncorrectable errors detected by this mechanism will set both UECC=1 and CECC=1 in
MSR0000_0411.

• Determine type of startup using D18F0x6C[ColdRstDet]:
• If this is a cold reset then BIOS must clear MSR0000_0411 [NB Machine Check Status
(MC4_STATUS)].
• If this is a warm reset then BIOS may check for valid MCA errors and if present save the status for later
use. See 2.13.1.6 [Handling Machine Check Exceptions].

note the “may”


Wow, that is a 1000-page document plus referenced documents. But a crash/panic is not a cold reset. It should be a warm reset, so the register should not be cleared. It says that if it’s not a cold reset then it may save the error register to a log. But after it has possibly saved the error to a log, is it implied that the register is cleared before it boots the OS?
In other words: assuming that the reboot happens before the errors have a chance to be seen by Memtest86 and that the UEFI does not clear the register immediately at reboot: can I read that MSR register from a user process in Windows/Linux after the reboot? (With the msr-tools that you mentioned: Reading CPU model specific registers on linux - Super User.) Or is it already too late once Linux/Windows has booted? I could also compile a freestanding C program that reads the register and boot it off a USB drive without an OS (Bare Bones - OSDev Wiki) after a memtest panic.

Any ideas?

Please note that public BKDG documents are only available for the 15h family (Bulldozer / Piledriver / Steamroller / Excavator).
A BKDG for Zen seems conspicuously absent (even though a few years have passed). AFAIK drafts are distributed under an NDA, but I wouldn’t expect much to change when it comes to the MSR registers between 15h and 17h.
Just keep this in mind and don’t take that document for gospel.

Going back to your question:
You should be able to read it with msr-tools. But depending on how Linux handles it you may have to disable the edac driver. It handles ECC errors and may be clearing/checking those registers at startup.

There is also a possibility that msr-tools is redundant and the edac driver can already read/dump it to syslog.
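
Besides syslog, the edac driver also keeps running counters in sysfs, so a quick check does not need any register reading at all. A minimal sketch, assuming an EDAC driver is loaded and has registered at least one memory controller under `/sys/devices/system/edac/mc`. The helper name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path

# Standard EDAC sysfs location; it only exists when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            # Skip controllers with missing or unreadable counters.
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

An empty result does not prove that ECC is off, only that no EDAC driver registered a controller. These counters start at zero on every boot, so they cannot explain a reboot that already happened.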
I am not that familiar with the edac/mca internals but if you are interested then those are the links that should contain most of the meat:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/edac
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/kernel/cpu/mce
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/include/asm/mce.h
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/include/uapi/asm/mce.h

+ maybe the related OSdev wiki pages for reference:
https://wiki.osdev.org/Interrupt
https://wiki.osdev.org/Exceptions#Machine_Check


I have zero experience/knowledge on windows apart from expecting something in the event logs in the administration tools section of control panel.


edit:

While possible, I would bet on the freestanding program being too much work for little reward; the same should be achievable with just careful Linux driver modification. Unless you are in the market for a small self-development project: it would probably be a good learning experience (plus there are fewer moving parts than in Linux).

And instead of trying to figure it out yourself, you may also just try contacting someone who might already know how to check properly: for example the kernel mailing list, or an email to one of the edac driver authors.


@Remisc Very well explained and great links. Thank you!

I am currently already working on two other side projects and time is limited, but I will link here if I have some code/blog that I can share or if I continue the discussion somewhere else, like on the Linux kernel list.

In the meantime @wendell also hinted at a video on ECC on Linux with AM4, so maybe he will share some more insights.

I just received a similar DDR4 riser. I wonder what resistor values the Passmark team used in their prototype? :slight_smile:

I think the color code on their Twitter image is blue-brown-black-silver-brown, so 6.1 ohm. But a simple short also worked in the source I linked previously.
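
The arithmetic behind that reading, as a small Python sketch. It assumes a standard 5-band resistor (three digits, multiplier, tolerance) and that the colors were read correctly off the photo. `decode_five_band` is just an illustrative helper.

```python
# Standard resistor color code: digit value per color.
DIGITS = {
    "black": 0, "brown": 1, "red": 2, "orange": 3, "yellow": 4,
    "green": 5, "blue": 6, "violet": 7, "grey": 8, "white": 9,
}
# Multiplier band as a power of ten; gold and silver are the sub-unity ones.
EXPONENTS = {**DIGITS, "gold": -1, "silver": -2}
# Tolerance band in percent (common values only).
TOLERANCES = {
    "brown": 1.0, "red": 2.0, "green": 0.5, "blue": 0.25,
    "violet": 0.1, "gold": 5.0, "silver": 10.0,
}


def decode_five_band(bands):
    """Return (ohms, tolerance_percent) for a 5-band color code."""
    d1, d2, d3, mult, tol = (b.lower() for b in bands)
    value = 100 * DIGITS[d1] + 10 * DIGITS[d2] + DIGITS[d3]
    exp = EXPONENTS[mult]
    # Divide for negative exponents so 610 / 100 gives a clean 6.1.
    ohms = value * 10 ** exp if exp >= 0 else value / 10 ** (-exp)
    return ohms, TOLERANCES[tol]


if __name__ == "__main__":
    bands = ["blue", "brown", "black", "silver", "brown"]
    ohms, tol = decode_five_band(bands)
    print(f"{'-'.join(bands)}: {ohms} ohm, {tol}% tolerance")
```

Blue-brown-black gives the digits 6, 1, 0, the silver band multiplies by 0.01, and the last brown band is the 1% tolerance, which is where the 6.1 ohm comes from.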


How did you get on with that RAM riser? I am thinking of ordering one too!

To resurrect this thread a little in the context of ECC error injection:
I also finally got myself those risers and soldered 2 wires to the 4th and 5th pins, which according to these docs represent Vss (ground) and DQ0 (data pin):

Shorting the wires/pins got me these edac errors:

[  182.800367] mce: [Hardware Error]: Machine check events logged
[  182.806350] [Hardware Error]: Deferred error, no action required.
[  182.812647] [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[  182.824044] [Hardware Error]: Error Addr: 0x0000000122de4b80
[  182.829804] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000036410b404002
[  182.839057] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  182.848933] EDAC MC0: 1 UE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x255bc9 offset:0x780 grain:64)
[  182.860114] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[  182.876950] mce: [Hardware Error]: Machine check events logged
[  182.884294] [Hardware Error]: Deferred error, no action required.
[  182.891937] [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[  182.904666] [Hardware Error]: Error Addr: 0x000000000edd8680
[  182.911848] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000036410b404002
[  182.921163] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  182.931119] EDAC MC0: 1 UE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x1dbb0 offset:0xc80 grain:64)
[  182.942213] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[  182.950302] [Hardware Error]: Deferred error, no action required.
[  182.957888] [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[  182.970694] [Hardware Error]: Error Addr: 0x0000000124286840
[  182.977881] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000036410b404002
[  182.987143] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  182.997021] EDAC MC0: 1 UE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x25850d offset:0x140 grain:64)
[  183.008139] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[  183.025391] [Hardware Error]: Deferred error, no action required.
[  183.032894] [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[  183.045615] [Hardware Error]: Error Addr: 0x000000002726fb40
[  183.052715] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000036410b404002
[  183.061932] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  183.071734] EDAC MC0: 1 UE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x4e4df offset:0x640 grain:64)
[  183.082685] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[  183.090620] [Hardware Error]: Deferred error, no action required.
[  183.098084] [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[  183.110746] [Hardware Error]: Error Addr: 0x000000013e967540
[  183.117790] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000036410b404002
[  183.126879] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  183.136633] EDAC MC0: 1 UE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x28d2ce offset:0xb40 grain:64)
[  183.147617] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[  183.155459] [Hardware Error]: Deferred error, no action required.
[  183.162858] [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[  183.175456] [Hardware Error]: Error Addr: 0x000000013e967540
[  183.182474] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000036410b404002
[  183.191547] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  183.201061] EDAC MC0: 1 UE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x28d2ce offset:0xb40 grain:64)
[  183.211877] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[  183.219633] [Hardware Error]: Deferred error, no action required.
[  183.226845] [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[  183.239196] [Hardware Error]: Error Addr: 0x0000000140f53980
[  183.245987] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000036410b404002
[  183.254886] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  183.264477] EDAC MC0: 1 UE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x291ea7 offset:0x380 grain:64)
[  183.275340] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[  183.283031] [Hardware Error]: Deferred error, no action required.
[  183.290292] [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[  183.302753] [Hardware Error]: Error Addr: 0x0000000141abca80
[  183.309539] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000036410b404002
[  183.318427] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  183.328047] EDAC MC0: 1 UE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x293579 offset:0x580 grain:64)
[  183.338917] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[  183.757058] Disabling lock debugging due to kernel taint
[  183.757323] [Hardware Error]: Deferred error, no action required.
[  183.764110] mce: Uncorrected hardware memory error in user-access at 1b8b2d40
[  183.770318] [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0xdc2030000000011b
[  183.770327] [Hardware Error]: Error Addr: 0x000000000dc59640
[  183.770329] [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x0000e0420b404002
[  183.770330] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  183.770347] EDAC MC0: 1 UE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x1b8b2 offset:0xd40 grain:64)
[  183.770350] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
[  183.770354] [Hardware Error]: Uncorrected, software restartable error.
[  183.770355] [Hardware Error]: CPU:3 (17:8:2) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
[  183.770360] [Hardware Error]: Error Addr: 0x000000001b8b2d40
[  183.867228] [Hardware Error]: IPID: 0x000000b000000000
[  183.873054] [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
[  183.883818] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
[  183.891009] Memory failure: 0x1b8b2: Sending SIGBUS to memtester:507 due to hardware memory corruption
[  183.901049] Memory failure: 0x1b8b2: recovery action for dirty LRU page: Recovered

This was tested on X470D4U + Ryzen 2600

I plan to test various pins, resistor values, kernels, BIOS settings like PFEH, etc.
and make a separate thread with more details, but for now: spending just a little time already gave me some interesting things to expand upon:

  • Wires that were too long prevented the board from posting - I am fairly sure my soldering was good enough, so I guess additional big antennas aren’t ideal.
  • When testing an older LTS kernel (I think it was 5.4) I didn’t get edac logs. Instead I only got “Machine check events logged” in dmesg. Probably just a too-old edac driver, but at that point Ryzen should already have been supported. Weird - need to check.
  • Even though I shorted only one data pin, it eventually resulted in an uncorrected error - maybe there is a limit within a short period???
  • I got “Deferred errors” but not the “Corrected errors” that I normally see - also not sure why.
  • I confirmed that an uncorrected error in a userspace process results in just a SIGBUS signal being sent to it.
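
For reading the status words in the log above by hand, here is a minimal Python sketch that splits an MCx_STATUS value into its flags. The bit positions are an assumption on my side: they follow the MCI_STATUS_* definitions in the Linux kernel's mce.h, not anything stated in this thread, and `decode_status` is a made-up helper name.

```python
# Bit positions as defined by the Linux kernel's MCI_STATUS_* macros
# (assumption: these match the CPU in the log above).
FLAGS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57,
    "TCC": 55, "SyndV": 53,
    "CECC": 46, "UECC": 45, "Deferred": 44, "Poison": 43, "Scrub": 40,
}


def decode_status(status):
    """Return the names of all flags set in a 64-bit MCx_STATUS value."""
    return [name for name, bit in FLAGS.items() if (status >> bit) & 1]


if __name__ == "__main__":
    # The two distinct status values that appear in the dmesg output.
    for label, value in [
        ("MC16_STATUS", 0xDC2030000000011B),
        ("MC0_STATUS", 0xBC002800000C0135),
    ]:
        print(f"{label} {value:#018x}: {' '.join(decode_status(value))}")
```

Decoded this way, the MC16 entries have UECC and Deferred set but not UC, while the final MC0 entry has UC, UECC and Poison set. That matches the flag lists the kernel printed, and would fit a reading where the memory controller only marks the bad data and the error becomes a real machine check once memtester consumes it.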