ASRock Rack has created the first AM4 socket server boards, X470D4U, X470D4U2-2T

Tenrag · April 24, 2020, 4:00pm

About the ECC ‘cache’ errors in dmesg:
The word ‘cache’ doesn’t mean much in this context.

Looking at the EDAC driver code it is apparent that the cache-related line is always printed:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/edac/mce_amd.c?h=v5.6.7#n1009
And the amd_decode_err_code() function itself is called regardless of the error type.

So if I understand correctly, something like
[Hardware Error]: cache level: L3/GEN
just means that the error was detected while loading the data into the L3 cache, which is expected since memory is usually loaded into the highest-level cache first.

MasterPhi · April 24, 2020, 5:01pm

Hi, when you did these tests, had you already disabled PFEH?

Mastakilla · April 25, 2020, 11:42pm

Thanks, good point about the memory modules missing components to handle the error detection themselves! That makes total sense…

About the screenshots: I know… They are screenshots by Diversity. I didn’t try serial-over-lan yet though…

Not sure what you mean by “error reporting protocols” and Ryzen not implementing them all. If an OS like Linux can detect “Hardware Errors”, then the IPMI should be able to do the same, right? And it seems only logical to me to also expect this from the IPMI, no?

Mastakilla · April 25, 2020, 11:47pm

Hehe… C code… It has been a while for me… Will try to look into this file a bit better…
Were you able to find anything in there that confirms that Diversity’s errors are certainly errors from the memory modules’ ECC logic (and not from the infinity fabric or CPU cache)?

Mastakilla · April 25, 2020, 11:50pm

Well, that is a good point!

All my testing was done with PFEH enabled, except for my very last couple tests… (after discovering this BIOS setting)

I’ll check with Diversity if he had PFEH disabled during his tests. If he had, then I may need to do some more testing with PFEH handling and then maybe most of my previous testing is “invalid”…

diversity · April 26, 2020, 9:31am

I had PFEH = disabled. I never tried manually injecting errors with PFEH = enabled.

Whether PFEH was disabled (default) or enabled made no difference for memtest pro error injection. It never worked ;(

Not sure if this is of consequence but the ASUS prime x570-P I had success on does not have a PFEH setting in BIOS

Tenrag · April 26, 2020, 12:46pm

Yes, that would be logical, but I do not know how exactly those errors are supposed to be detected and delivered. Maybe the CPU reports those errors in multiple ways.
For example take a look at the IPMI specification:

In chapter 16.1:

The figure shows a BMC with a shared system messaging interface where Event Messages can be delivered from
either BIOS, SMS (system management software / OS), or an SMI Handler, and an IPMB interface and through
which it can receive Event Messages from the Intelligent Platform Management bus. The BMC can also generate
‘internal’ Event Messages.

TL;DR: there are multiple ways that events can get into the IPMI log.

I am trying to say that without some more detailed knowledge it can be misleading to suggest that IPMI is seeing errors but does not report them.

Conspiracy theory: Maybe it is IPMI, maybe AMD forced ASRock to remove some features to force market segmentation. Maybe that is why when we ask ASRock about ECC we always get the same copy-paste answer saying that ‘AM4 does not support ECC error reporting function’.
For now we just can’t be sure where exactly the missing part is.

From what I can tell after reading the source, seeing this line in dmesg is confirmation for me that the error comes from DRAM ECC:
[Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.

Those strings would be different if the error came from the infinity fabric or from the cache itself.

The bank type breakdown is the structure here:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/kernel/cpu/mce/amd.c?h=v5.6.7#n134

bank names:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/kernel/cpu/mce/amd.c?h=v5.6.7#n79

The struct that maps each bank type to its string array is here:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/edac/mce_amd.c?h=v5.6.7#n406

and the strings themselves start here:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/edac/mce_amd.c?h=v5.6.7#n154

Example 1:
Memory controller errors are reported under the ‘SMCA_UMC’ bank type, which maps to the “Unified Memory Controller” name and to the ‘smca_umc_mce_desc’ string array:

static const char * const smca_umc_mce_desc[] = {
    "DRAM ECC error",
    "Data poison error",
    "SDP parity error",
    "Advanced peripheral bus error",
    "Address/Command parity error",
    "Write data CRC error",
    "DCQ SRAM ECC error",
    "AES SRAM ECC error",
};

So for memory controller ECC errors we see for example:

Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
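
In other words, the extended error code acts as an index into the description array. A minimal Python sketch of that lookup, assuming the list is copied from the array quoted above; the `describe_umc_error` helper and its fallback string are hypothetical and are not part of the kernel:

```python
# Descriptions copied from smca_umc_mce_desc as quoted above
# (drivers/edac/mce_amd.c, v5.6.7).
SMCA_UMC_MCE_DESC = [
    "DRAM ECC error",
    "Data poison error",
    "SDP parity error",
    "Advanced peripheral bus error",
    "Address/Command parity error",
    "Write data CRC error",
    "DCQ SRAM ECC error",
    "AES SRAM ECC error",
]


def describe_umc_error(xec: int) -> str:
    """Map an extended error code to its description (hypothetical helper)."""
    if 0 <= xec < len(SMCA_UMC_MCE_DESC):
        return SMCA_UMC_MCE_DESC[xec]
    # The fallback text is an assumption made for this sketch.
    return "unknown extended error code"


print(describe_umc_error(0))
```

With code 0 this gives the same description that appears in the dmesg line above.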

Example 2:
If this were an infinity fabric error, then the bank type would be one of these three:

/* Data Fabric MCA types */
{ SMCA_CS, HWID_MCATYPE(0x2E, 0x0), 0x1FF },
{ SMCA_PIE, HWID_MCATYPE(0x2E, 0x1), 0x1F },
{ SMCA_CS_V2, HWID_MCATYPE(0x2E, 0x2), 0x3FFF },

So instead of “Unified Memory Controller” we would see “Coherent Slave” or “Power, Interrupts, etc.”.

And we would see one of the following description strings:

static const char * const smca_cs_mce_desc[] = {
    "Illegal Request",
    "Address Violation",
    "Security Violation",
    "Illegal Response",
    "Unexpected Response",
    "Request or Probe Parity Error",
    "Read Response Parity Error",
    "Atomic Request Parity Error",
    "Probe Filter ECC Error",
};

static const char * const smca_cs2_mce_desc[] = {
    "Illegal Request",
    "Address Violation",
    "Security Violation",
    "Illegal Response",
    "Unexpected Response",
    "Request or Probe Parity Error",
    "Read Response Parity Error",
    "Atomic Request Parity Error",
    "SDP read response had no match in the CS queue",
    "Probe Filter Protocol Error",
    "Probe Filter ECC Error",
    "SDP read response had an unexpected RETRY error",
    "Counter overflow error",
    "Counter underflow error",
};

static const char * const smca_pie_mce_desc[] = {
    "Hardware Assert",
    "Register security violation",
    "Link Error",
    "Poison data consumption",
    "A deferred error was detected in the DF"
};

And this matches an example of an infinity fabric error reported here:

[Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2, Link Error.
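
To tell the two cases apart in a long dmesg output, the bank name in front of "Ext. Error Code" is the part to look at. Here is a rough Python sketch, assuming the line format of the two examples in this post. The bank names are the ones quoted above; the `classify` helper, the regular expression and the category labels are my own and hypothetical:

```python
import re

# Bank names as quoted in this post; the category labels are my own grouping.
BANK_CATEGORY = {
    "Unified Memory Controller": "memory controller",
    "Coherent Slave": "data fabric",
    "Power, Interrupts, etc.": "data fabric",
}

# Assumed line shape: "[Hardware Error]: <bank> Ext. Error Code: <n>, <description>."
LINE_RE = re.compile(
    r"\[Hardware Error\]: (?P<bank>.+?) Ext\. Error Code: (?P<code>\d+), (?P<desc>.+?)\.?$"
)


def classify(line: str):
    """Return (category, bank, code, description), or None if the line does not match."""
    match = LINE_RE.search(line.strip())
    if match is None:
        return None
    bank = match.group("bank")
    category = BANK_CATEGORY.get(bank, "other")
    return category, bank, int(match.group("code")), match.group("desc")


samples = [
    "[Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.",
    "[Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2, Link Error.",
]
for sample in samples:
    print(classify(sample))
```

Lines that do not carry an "Ext. Error Code" field (such as the "cache level" line) simply return None, so they are not mistaken for either category.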

Also, it is possible to get test ECC errors into the IPMI log by using the test event injection in ipmitool, for example:

ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin event 3

Use ‘event help’ instead of ‘event 3’ for more details.

After this I can see the test entries in the log.

Be aware that this does not emulate a real ECC error.

From the manual: https://linux.die.net/man/1/ipmitool

NOTE : These pre-defined events will likely not produce “accurate” SEL records for a particular system because they will not be correctly tied to a valid sensor number, but they are sufficient to verify correct operation of the SEL.

It is possible to inject events into ‘real’ sensors using the ‘event’ command, but I can’t find a way to inject an ECC error. I was able to inject the voltage and temp events (on the screenshot above).
I also checked the ‘event’ command on my other Intel-based Supermicro board and the result is the same: I can’t find a way to inject ECC errors. So I do not think that my inability to inject ECC errors here means anything in relation to Ryzen.

Tenrag · April 26, 2020, 1:13pm

A small cheat-sheet of ipmitool commands that were helpful for me:

Push the power button:
ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin chassis power on

Reset:
ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin chassis power reset

Start the Serial-Over-Lan console (after enabling it in the BIOS):
ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin sol activate

End the Serial-Over-Lan session (from another terminal):
ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin sol deactivate

Print the IPMI event list:
ipmitool -I lanplus -H <IP of the IPMI interface> -U admin -P admin sel list
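
Since all of these commands share the same connection options, they can be wrapped in a small helper. This is only a Python sketch: it uses nothing beyond the flags and subcommands listed above, while the `ipmitool_argv` and `run` helpers, the `COMMANDS` table and the placeholder address 192.0.2.10 are hypothetical names made up for the example:

```python
import subprocess

# Subcommands taken from the cheat-sheet above; the short keys are made up.
COMMANDS = {
    "power_on": ("chassis", "power", "on"),
    "reset": ("chassis", "power", "reset"),
    "sol_start": ("sol", "activate"),
    "sol_stop": ("sol", "deactivate"),
    "sel_list": ("sel", "list"),
}


def ipmitool_argv(host, user, password, *subcommand):
    """Build the argument list for one ipmitool call over the lanplus interface."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password, *subcommand]


def run(host, user, password, name):
    """Run one of the named commands and return the CompletedProcess (needs ipmitool installed)."""
    argv = ipmitool_argv(host, user, password, *COMMANDS[name])
    return subprocess.run(argv, capture_output=True, text=True, check=False)


# Only prints the command line; 192.0.2.10 is a placeholder address.
print(" ".join(ipmitool_argv("192.0.2.10", "admin", "admin", *COMMANDS["sel_list"])))
```

Calling `run(...)` with "sol_start" is not very useful from a script because the console is interactive; the other entries return their output in `stdout`.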

fiore00713 · April 27, 2020, 9:48pm

Hey All,
I think I did a dumb and need a sanity check if anyone is willing.

I purchased the X470D4U as well as a 3950X for it. I’m moving off of an old eBay’d system that I was running unRAID on. I got the new components assembled and tested out. Put them into my system today and am running into some issues.

I have 3 PCIe devices I’m reusing from the old system:
Nvidia Quadro P2000
Mellanox ConnectX-2 dual SFP+ card
LSI SAS9211-8i

I’m able to get the system booted up with the HBA and network adapter, but when I add the Quadro into the mix it’s a no-go. The system goes into a loop of giving me the SAS controller and ConnectX-2 information screens but then dumps back to the ASRockRack logo and starts over again. I’ve not gotten it to proceed to the point where the system tries to boot into unRAID. Oddly enough, even when I catch the F2 prompt to get into the BIOS, it hasn’t wanted to load into it for me.

I’m hoping I didn’t make some egregious error when planning things out and mis-count the number of PCIe lanes I had available or something.

I’ve tried with keeping the Quadro in PCIe 6 and swapping the HBA and NIC between 5 and 4 (as I type this I’m pretty certain I haven’t tried putting the GPU in slot 4 but it’s been a long day…)

If my notes serve me well, I also found that if I remove the GPU and leave the HBA and NIC occupying slots 6 and 5 (the order doesn’t matter), both of those devices are recognized and work correctly. However, if I put either of the cards in slot PCIe 4, it is recognized in the system inventory of the management UI, but when booting into unRAID it complains of either not seeing the network interface or any of my drives (big oof).

All PCIe cards were in working condition when they were pulled from the old system, and I take precautions with regard to static discharge, etc.

So I guess at the end of all this I’m wondering if something is wrong with the board or if quarantine brain got me doing stupid things.

Appreciate any input

nx2l · April 27, 2020, 10:32pm

On which BIOS?

Set the pcie to 2x8

fiore00713 · April 27, 2020, 10:44pm

P3.30

I’ll have to go digging through the menu, I must have overlooked it

cybrnook · April 27, 2020, 11:07pm

+1, this

JMono · April 28, 2020, 2:37am

Hey all, I’m planning to replace my home server that has just died, and this board seems to be the best combination of what I’m looking for as a replacement. Reading through the posts so far, it seems like there are some varyingly serious problems with it, so I have a couple of questions if you all wouldn’t mind helping me out.

When it comes to GPU pass through in an OS like unraid what is the state of that in the latest public bios (3.30) and BMC (01.90.00)? I’m considering a RX 5700 or second hand GTX 1070 if that has an impact on it.

With the ECC memory support, it sounds like it’s not really known if it is actually functioning and that the reporting is essentially non-existent. How serious of an issue is this, and do any of you think it would rule out this board for a server that I plan to use for years to come?

If either or both of these were serious enough to not consider this board, do any of you have suggestions for other AMD-based systems in the <600 USD range for CPU and MB?

Any help answering these questions would be greatly appreciated

JJJ65_Jones · April 28, 2020, 3:25am

@fiore00713

You also need to update the IPMI to be able to see the USB to boot Unraid. I just got mine up and running. Tenrag’s post helped me.

fiore00713 · April 28, 2020, 4:34pm

Yup, I actually updated the IPMI as well before swapping hardware around. So far I’ve got no issues actually booting from the USB drive, it’s just when I get all my PCIe cards installed. But appreciate the info!

fiore00713 · April 28, 2020, 5:34pm

Many thanks @nx2l and @cybrnook - adjusting this setting within the BIOS allowed my system to boot and everything to be recognized on the system.

I’m subsequently running into an issue with unRAID being able to see the GPU for passing to my PLEX docker but I’m working on sorting that out.

Big hurdle passed though, my thanks again

Tenrag · April 28, 2020, 5:37pm

@fiore00713 glad to hear it

I too am getting into a boot-loop when I put in 3 cards at once.
The switch to 2x8 should be automatic, but I think it is flaky.
It works when I first put cards into the top and bottom slots, boot, shut down and then put the 3rd card into the middle slot (all without the manual 2x8 option in the BIOS).

fiore00713 · April 28, 2020, 5:50pm

@Tenrag pretty much the same issue I was running into

I think it is a matter of the switching to 2x8 being flaky. The only thing I had to do was set it manually within the BIOS. I tested the 3rd slot which was not working properly prior to the change and the system booted successfully. After testing with this I added my GPU into the mix and everything is proceeding forward (at least so far)

nx2l · April 28, 2020, 6:17pm

it is when you have 2nd gen ryzen cpu

spiderben25 · April 29, 2020, 12:58pm

Hello,
New here. I’ve read this thread and learned a lot about this motherboard and its few caveats. I think it’s the one best suited to my needs (I’m building a Ryzen-based Unraid home server and I need an M-ATX board with IPMI).
Unfortunately I can’t find a retailer in Europe who has some stock… The only one in my country (LDLC) has a shipment incoming in a few weeks which is quite a long time. Do you guys have any sources to buy this in Europe?
Thanks!