ASRock Rack has created the first AM4-socket server boards: the X470D4U and X470D4U2-2T

Did anyone manage to get ESXi working on the X470D4U2-2T?
I keep getting stuck on “Initializing storage stack” when CSM is enabled, or no drives are detected when it is disabled.

The BMC is on version 1.7 and the BIOS is 3.31.

Specs:
2 x 2 TB HP NVMe drives
AMD Ryzen 9 3950X
128 GB DDR4

Thanks!

I can’t say anything about ESXi specifically, but I did struggle with missing storage devices and CSM weirdness. For example, I had a problem with USB install sticks not being recognized by the BIOS no matter which boot options I used (the OS saw all devices without issues).

The issue was with the VMedia that the IPMI provides.
I had to go to the IPMI web interface > ‘Settings’ > ‘Media Redirection Settings’ > ‘VMedia Instance Settings’ and set all the device instances to zero. Only then would my USB stick become visible (and bootable).

You may also want to fiddle with the other IPMI settings.

@Tenrag Thank you for that. I just got mine together tonight; at first it saw my USB stick, and then it was giving me all of these virtual drives. I probably wouldn’t have figured that out tonight without your comment.

I just noticed: you are on BMC version 1.7 and BIOS 3.31. Those are both beta versions.
As @badbui mentioned, those can be flaky.

It depends on the version of ESXi you’re trying to load. At some point the NVMe driver was updated and the requirements for a drive to be recognized as a datastore became more stringent. Here’s an article stating some things about it. I recommend trying ESXi 6.5 instead of 6.7U1 or above.
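Not board-specific, but once you can reach the ESXi shell (on the console or via SSH, if the shell is enabled), a quick generic sanity check for whether ESXi sees the NVMe controllers and devices at all is:

esxcli storage core adapter list   # the NVMe controllers should show up as vmhba adapters
esxcli storage core device list    # the NVMe namespaces should show up as devices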

On the FreeNAS forum, Diversity managed to trigger “ECC Errors” on a Ryzen 3950X by shorting data pins to ground pins of the memory module.
https://www.ixsystems.com/community/threads/the-usefulness-of-ecc-if-we-cant-assess-its-working.83580/post-581515

Although I’m very happy to hear that he managed to see ECC error reporting on the ASRock Rack X470D4U when using a Ryzen 3950X (Zen 2), I’m also confused and puzzled about what this means for my own failed attempts to achieve the same…

Causing memory errors by shorting pins of memory modules

First, I did some research on this “triggering memory errors using ‘needles’ or wires” approach, which Diversity used. I hadn’t looked into this method before, as it seemed too risky and I didn’t want to damage or degrade any of my hardware.

I found a good (but complicated) description in the following paper:
https://download.vusec.net/papers/eccploit_sp19.pdf

So in short, you’re connecting a data pin to a ground pin, so that the current on the data pin “flows away” into the ground pin and this “flips a bit”. When using the correct pins and not accidentally shorting anything wrong, this should actually be “reasonably safe” to do, I think… (please correct me if I’m wrong :slight_smile: )

Why does this cause single-bit errors and not multi-bit errors? I THINK it’s because on every “clock tick” data is pulled from each data pin of the memory module. So if you change only the “result” of one pin, you get at most one flipped bit per “clock tick”, which ECC can then correct. (Not sure though.)

The paper is already a bit older and used DDR3. The ‘AMD-1’ configuration they talk about (where interleaving makes things complicated) is an ‘AMD Opteron 6376 (Bulldozer (15h))’.

As far as I understand, the extra “interleaving” complexity that occurs on the Opteron system only applies to Ryzen when using dual-rank memory. As Diversity was using a single 8 GB module, I suppose he only has single-rank memory, so he wasn’t confronted with the interleaving complexity. However, if I were to try this with my 16 GB modules, I would be confronted with this extra complexity, because ?all? 16 GB modules are dual rank…
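Side note: if you want to verify the rank of a module from a running Linux system, dmidecode can usually show it (assuming the BIOS populates that field):

sudo dmidecode --type memory | grep -E 'Locator:|Size:|Rank:'
# "Rank: 1" = single rank, "Rank: 2" = dual rank; the Locator lines identify the slot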

I concluded the above from this article (but I could be misinterpreting things!):

As the paper was about DDR3, you of course need to find the pin layout of DDR4 before you can apply this to Ryzen. The following datasheet has a pretty clear pin layout of an unregistered ECC DDR4 module:

http://www.supertalent.com/datasheets/SuperTalent%20Datasheet-DDR4-ECCUDIMM-F24EA8GS.pdf

On pages 6, 7 and 8 you can see the description per pin, and on page 17 you can see a picture of where those pin numbers are on the memory module. I suppose all VSS pins are ground pins and the DQ<number> pins are data pins. So if we follow the example of the paper and short DQ0 with a VSS, that corresponds to shorting pin 4 (VSS) with pin 5 (DQ0). But I guess shorting pin 2 (VSS) with pin 3 (DQ4) could work equally well.

This should help us “understand” a bit better what Diversity has done and how to “safely” reproduce it.

The results from Diversity

I was in contact with Diversity and have some more details on his testing (all pictures in this “chapter” are from Diversity himself). Diversity followed this video: https://www.youtube.com/watch?v=_npcmCxH2Ig. Instead of needles and tweezers, he used a thin wire, as in the picture below.

He was able to trigger “ECC errors” in Memtest86 and “Corrected errors” in Linux (Proxmox).
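For anyone who wants to check this on their own Linux box: assuming the EDAC driver is loaded (it apparently was on Diversity’s Proxmox install, given that the errors showed up), the error counters are also exposed in sysfs, so you don’t have to rely on catching the dmesg lines:

# corrected / uncorrected error counts per memory controller
grep . /sys/devices/system/edac/mc/mc*/ce_count
grep . /sys/devices/system/edac/mc/mc*/ue_count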



My testing / experiences

As you know, I also tried to trigger reporting of corrected memory errors. I tried this by overclocking / undervolting the memory to the point where it is on the edge of stability. This edge is very “thin” and can be hard to reach, and in my understanding it can result in the following scenarios:

  1. Not unstable enough, so no errors at all

  2. Only just unstable enough, so that a single-bit error occurs only sometimes when stressing the memory enough. These will then be corrected by ECC and will not cause faults or crashes.

  3. A little more unstable, so that single-bit errors occur a bit more often and less stress is required on the memory to achieve this. But also (uncorrected) multi-bit errors can occur sometimes, which could cause faults / crashes.

  4. Even a little bit more unstable, so that mostly multi-bit errors occur when stressing the memory and single-bit errors might be rare. This also makes the system more prone to faults and crashes.

  5. Even more unstable, so that multi-bit errors occur even when hardly stressing the memory at all. This makes the system very unstable and it probably will not be able to boot into the OS every time.

  6. Too unstable, so that it doesn’t boot at all.

Both scenarios 2) and 3) are “good enough” for testing the reporting of corrected memory errors. Perhaps even scenario 4), if you’re lucky…
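For reference, any memory stress tool should do to push an “edge of stability” configuration into producing errors; a minimal sketch (assuming memtester and stress-ng are installed; these exact invocations are just an illustration, not what’s required):

# lock and hammer ~12 GB of RAM for 5 loops
sudo memtester 12G 5

# or: keep 4 workers dirtying 75% of free memory for an hour
stress-ng --vm 4 --vm-bytes 75% --timeout 60m

# meanwhile, watch for corrected errors being reported
watch -n 5 'dmesg | grep -i "hardware error" | tail'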

During all my testing I tried 100+ possible memory settings, using all kinds of frequencies, timings and voltages, of which 10-30 were potentially in scenario 2) or 3). I “PARTLY” kept track of all this testing in the below (incomplete) Excel sheet:

This convinced me that I must have been in scenario 2) or 3) at least once, where I should have seen corrected errors (but didn’t). That is why I concluded that it didn’t work, and I contacted ASRock Rack and AMD to work on this.

Conclusions / Questions / Concerns

Now what does all of this mean? Does it mean that I never reached scenario 2) or 3)? Does it mean scenarios 2) and 3) are almost impossible to reach using the methods I tried? Or does it mean that Diversity perhaps triggered a different kind of memory error? I’m not sure and I hope someone can clarify…

I know error correction and reporting happen at many layers in a modern computer. As far as I know, these are:

  1. Inside the memory module itself (only when you have ECC modules). The memory module then has an extra chip to store checksums. I think this works similarly to RAID 5 for HDDs, so that a memory error is detected and corrected in the module itself, even before it exits the memory module.

  2. On the way from the memory module to the memory controller on the CPU (the data bus). Error detection / correction / reporting for these kinds of errors is handled by the memory controller in the CPU, so ECC memory isn’t even required to make this work.

  3. Inside the CPU, where data is transferred between the L1/L2/L3 caches and the CPU cores. There too, error detection / correction / reporting is possible, I think.

All of these might look confusingly similar when reported to the OS, but I do think they are often reported in a slightly different manner. I’ve seen reports where the CPU cache (L1/L2/L3) was clearly mentioned when reporting a corrected error for example, but I’m not sure what the exact difference between reports of 1) and 2) would be.

In the Proxmox screenshots I do read things like “Unified Memory Controller…” and “Cache level: L3/GEN…”, but again, I’m not entirely sure whether these mean that the errors are in 2) or 3) instead of 1)…
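Side note: besides dmesg, rasdaemon (if it is installed and running) logs and partially decodes these events, which can make it easier to see which component reported the error:

sudo ras-mc-ctl --summary   # per-DIMM counts of corrected / uncorrected errors
sudo ras-mc-ctl --errors    # detailed list of the logged memory (and other MCE) errors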

Diversity is draining the current from a data pin “outside” of the memory module, but I still see two possibilities for what is happening:

1) The drain on the data pin also pulls the current away inside the memory module itself, where the memory module detects, corrects and reports the error.

2) The drain on the data pin only takes effect after the memory module has done its checks, and the error is only detected on the data bus and corrected / reported there (so not by the ECC functionality of the memory module).

To know which scenario is happening, we could:

  • try to find someone who knows exactly how each type of error is reported and who can identify exactly what is being reported
  • try the same with a non-ECC single-rank module; if that also reports the same kind of ECC errors, it would prove it is happening on the data bus
  • perhaps someone with more electrical engineering knowledge (I don’t have any, actually) can say something more meaningful than I can?

Diversity did mention that his IPMI log was still empty after getting those errors. So there, the motherboard / IPMI is certainly still missing some required functionality.

Edit:
As Tenrag correctly commented, the memory modules don’t do “on-module error detection and correction”, because they lack the components to do so. So even when using ECC memory, it is the memory controller on the CPU, using the extra parity chip on the memory modules, that detects / corrects / reports the memory errors.
Also according to Tenrag, the EDAC output does point to actual memory errors being detected (-> DRAM ECC error) and not data bus (Infinity Fabric) errors (-> link error).

Finally, MasterPhi correctly asked whether I had disabled PFEH during my testing, and I actually had not, since I only discovered this BIOS setting very late in my testing (only my final tests were done with PFEH disabled).

So I may need to do some more testing with PFEH disabled…


3.33 really hosed my machine.
I’ve actually had to go back to 3.20.
At least flashing the BIOS is very convenient with IPMI.

3.33 would be a beta release.

3.30 is on their site, why did you go with 3.20?

I initially wanted to try out 3.33 for the AGESA version that fixes PCIe passthrough. I couldn’t keep my machine up for more than 5 minutes.
I then tried flashing back to 3.30 and had instability where it had been functioning perfectly for the past 3 months. So as a last ditch effort I flashed 3.20 and it’s been running OK for the past 9 hours. I’m running CentOS 8 / Fedora Server 31.

@Mastakilla Thank you, great work. I remember reading the article about DDR3 and was wondering how long it would be before someone would do this for DDR4.

Now I wonder how long it will be before someone starts selling ‘DIMM risers’ with an ‘inject error’ button…

AFAIK nothing is checked on the memory module itself. Both the data and the parity are transferred to the memory controller and checked there.
If it were done on the memory DIMM, then we would see one of the following two things on the PCB:

  1. 8 DRAM chips with increased capacity/integrated error correcting logic inside of each.
  2. 9 DRAM chips + a separate chip for the error correction

If you take a look at your ECC UDIMMs you will find neither. There are only 9 regular DRAM chips and a bunch of passive SMD components.


About the screen captures: if you go to the IPMI and launch a KVM session, you can take screenshots of the output, so you do not have to take pictures of the screen.

And you can also take advantage of the two integrated serial ports for easily copyable output:

  1. there is a traditional RS-232 connector
  2. (more convenient) Serial-over-LAN.

You can enable both in the BIOS in the ‘Super IO Configuration’ menu.
And you can also redirect the boot process itself, in the ‘Serial Port Console Redirection’ BIOS menu (you will be able to change BIOS options from the command line on the remote machine without launching the full KVM).


I experienced the same with my ‘seemingly faulty’ board

Yes, or there are multiple error reporting protocols. And Ryzen simply does not implement all of them. But that’s just me speculating.


About the ECC ‘cache’ errors in dmesg:
The word ‘cache’ doesn’t mean much in this context.

Looking at the EDAC driver code it is apparent that the cache-related line is always printed:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/edac/mce_amd.c?h=v5.6.7#n1009
And the amd_decode_err_code() function itself is called regardless of the error type.

So if I understand correctly, something like
[Hardware Error]: cache level: L3/GEN
just means that the error was detected while loading the data into the L3 cache, which is expected since memory is usually loaded into the highest-level cache first.
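In other words, when checking dmesg it helps to look at the whole decoded block rather than only the cache-level line, e.g. something like:

dmesg | grep -i -B2 -A4 'hardware error'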


Hi, when you did these tests, had you already disabled PFEH?


Thanks, good point about the memory modules missing the components to handle the error detection themselves! That makes total sense…

About the screenshots: I know… They are screenshots by Diversity. I haven’t tried Serial-over-LAN yet, though…

I’m not sure what you mean by “error reporting protocols” and Ryzen not implementing them all. If an OS like Linux can detect “Hardware Errors”, then the IPMI should be able to do the same, right? It seems only logical to me to also expect this from the IPMI, no?

Hehe… C code… It has been a while for me… I’ll try to look into this file a bit more…
Were you able to find anything in there that confirms that Diversity’s errors are definitely errors from the memory modules’ ECC logic (and not from the Infinity Fabric or CPU cache)?

Well, that is a good point!

All my testing was done with PFEH enabled, except for my very last couple of tests (after discovering this BIOS setting)…

I’ll check with Diversity whether he had PFEH disabled during his tests. If he had, then I may need to do some more testing with PFEH disabled, and then maybe most of my previous testing is “invalid”…

I had PFEH = disabled. I never tried manually injecting errors with PFEH = enabled.

Whether PFEH was disabled (default) or enabled made no difference for memtest pro error injection. It never worked ;(

Not sure if this is of consequence, but the ASUS Prime X570-P I had success on does not have a PFEH setting in its BIOS.

Yes, that would be logical, but I do not know exactly how those errors are supposed to be detected/delivered; maybe the CPU reports those errors in multiple ways.
For example, take a look at the IPMI specification:

In chapter 16.1:

The figure shows a BMC with a shared system messaging interface where Event Messages can be delivered from either BIOS, SMS (system management software / OS), or an SMI Handler, and an IPMB interface and through which it can receive Event Messages from the Intelligent Platform Management bus. The BMC can also generate ‘internal’ Event Messages.

TL;DR: there are multiple ways that events can get into the IPMI log.

I am trying to say that, without some more detailed knowledge, it can be misleading to suggest that the IPMI is seeing errors but not reporting them.

Conspiracy theory: maybe it is the IPMI; maybe AMD forced ASRock to remove some features to force market segmentation. Maybe that is why, when we ask ASRock about ECC, we always get the same copy-paste answer saying that ‘AM4 does not support ECC error reporting function’.
For now we just can’t be sure where exactly the missing part is.


From what I can tell after reading the source: Seeing this in dmesg is a confirmation for me:
[Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.

Those strings would be different if it were the Infinity Fabric or the cache itself.

The bank type breakdown is the structure here:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/kernel/cpu/mce/amd.c?h=v5.6.7#n134

bank names:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/kernel/cpu/mce/amd.c?h=v5.6.7#n79

The bank type > string array struct is here:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/edac/mce_amd.c?h=v5.6.7#n406

and the strings themselves start here:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/edac/mce_amd.c?h=v5.6.7#n154

Example 1:
Memory controller errors are of the ‘SMCA_UMC’ bank type, which maps to the “Unified Memory Controller” name and to the ‘smca_umc_mce_desc’ string array:

static const char * const smca_umc_mce_desc[] = {
	"DRAM ECC error",
	"Data poison error",
	"SDP parity error",
	"Advanced peripheral bus error",
	"Address/Command parity error",
	"Write data CRC error",
	"DCQ SRAM ECC error",
	"AES SRAM ECC error",
};

So for memory controller ECC errors we see for example:

Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.

Example 2:
If this were an Infinity Fabric error, then the bank type would be one of these three:

/* Data Fabric MCA types */
{ SMCA_CS, HWID_MCATYPE(0x2E, 0x0), 0x1FF },
{ SMCA_PIE, HWID_MCATYPE(0x2E, 0x1), 0x1F },
{ SMCA_CS_V2, HWID_MCATYPE(0x2E, 0x2), 0x3FFF },

So instead of “Unified Memory Controller” we would see “Coherent Slave” or “Power, Interrupts, etc.”.

And we would see one of the following description strings:

static const char * const smca_cs_mce_desc[] = {
	"Illegal Request",
	"Address Violation",
	"Security Violation",
	"Illegal Response",
	"Unexpected Response",
	"Request or Probe Parity Error",
	"Read Response Parity Error",
	"Atomic Request Parity Error",
	"Probe Filter ECC Error",
};

static const char * const smca_cs2_mce_desc[] = {
	"Illegal Request",
	"Address Violation",
	"Security Violation",
	"Illegal Response",
	"Unexpected Response",
	"Request or Probe Parity Error",
	"Read Response Parity Error",
	"Atomic Request Parity Error",
	"SDP read response had no match in the CS queue",
	"Probe Filter Protocol Error",
	"Probe Filter ECC Error",
	"SDP read response had an unexpected RETRY error",
	"Counter overflow error",
	"Counter underflow error",
};

static const char * const smca_pie_mce_desc[] = {
	"Hardware Assert",
	"Register security violation",
	"Link Error",
	"Poison data consumption",
	"A deferred error was detected in the DF"
};

And this matches an example of an Infinity Fabric error reported here:

[Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2, Link Error.


Also: it is possible to get test ECC errors in the IPMI log by using the test event injection with ipmitool:

for example:

ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin event 3

use ‘event help’ instead of ‘event 3’ for more details

After this I can see those test logs

Be aware that this does not emulate a real ECC error.

From the manual: https://linux.die.net/man/1/ipmitool

NOTE : These pre-defined events will likely not produce “accurate” SEL records for a particular system because they will not be correctly tied to a valid sensor number, but they are sufficient to verify correct operation of the SEL.

It is possible to inject errors for ‘real’ sensors using the ‘event’ command, but I can’t find a way to inject an ECC error. I was able to inject the voltage and temperature events (on the screenshot above).
I also checked the ‘event’ command on my other Intel-based Supermicro board and the result is the same: I can’t find a way to inject ECC errors. So I do not think that my inability to inject ECC errors here means anything in relation to Ryzen.


A small cheat-sheet of ipmitool commands that were helpful for me:

Push the power button:
ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin chassis power on

Reset:
ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin chassis power reset

Start the Serial-Over-Lan console (after enabling it in the BIOS):
ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin sol activate

End the Serial-Over-Lan session (from another terminal):
ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin sol deactivate

print IPMI event list:
ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin sel list
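A couple more generic ones that may come in handy (standard ipmitool subcommands, not specific to this board):

Print all sensor readings (temperatures, voltages, fans):
ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin sdr list

Clear the event log:
ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin sel clear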


Hey All,
I think I did a dumb and need a sanity check if anyone is willing.

I purchased the X470D4U as well as a 3950X for it. I’m moving off of an old eBay’d system that I was running unRAID on. I got the new components assembled and tested out. Put them into my system today and am running into some issues.

I have 3 PCIe devices I’m reusing from the old system:
Nvidia Quadro P2000
Mellanox ConnectX-2 dual SFP+ card
LSI SAS9211-8i

I’m able to get the system booted up with the HBA and the network adapter, but when I add the Quadro into the mix it’s a no-go. The system goes into a loop: it shows the SAS controller and ConnectX-2 information screens, but then dumps back to the ASRock Rack logo and starts over again. I’ve not gotten it to proceed to the point where the system tries to boot into unRAID and, oddly enough, even though it registers my F2 press to get into the BIOS, it hasn’t wanted to load into it for me.

I’m hoping I didn’t make some egregious error when planning things out and miscount the number of PCIe lanes I had available, or something.

I’ve tried keeping the Quadro in PCIe 6 and swapping the HBA and NIC between slots 5 and 4 (as I type this, I’m pretty certain I haven’t tried putting the GPU in slot 4, but it’s been a long day…).

If my notes serve me well: I also found that if I remove the GPU and leave the HBA and NIC occupying slots 6 and 5 (in either order), both of those devices are recognized and work correctly. However, if I put either of the cards in slot PCIe 4, it is recognized in the system inventory of the management UI, but when booting into unRAID it complains about not seeing either the network interface or any of my drives (big oof).

All PCIe cards were in working condition when they were pulled from the old system; I take precautions with regard to static discharge, etc.

So I guess at the end of all this I’m wondering if something is wrong with the board or if quarantine brain got me doing stupid things.

Appreciate any input

On which BIOS?

Set the PCIe to 2x8.