ASRock Rack has created the first AM4 socket server boards, X470D4U, X470D4U2-2T

Yeah, tried an empty board and sadly no change there :frowning:

Of course, you can and should contact ASRock Rack’s support about your experiences. At least I hope that the more people report this bug, the more focus will shift to fixing it.

On a side note: Are your DIMM/SPD readings from within Windows also not showing?

Yeah, they’re absent in CPU-Z. Also not sure about MHz values. When I run a stress test it shows all cores going to 4GHz. Sure, there’s turbo boost but surely not if all cores run at 100%? Core Temp shows 82°C in my setup when stress testing. Passmark score is 24500 which I’m happy with. I wonder if I’m OCing without noticing. :slight_smile: I didn’t make any changes in the BIOS.

Anyway, here’s my setup if anyone’s interested:

I’m using a Supermicro 1U case since I need to save space and avoid filling up the whole rack in case we need a lot more of these. We’re in a colo so a second rack would cost double.

I had to remove half of a heat-spreader to make the last RAM module fit beside the Noctua cooler. Not much fun getting that thing off. Heating it up using a blow dryer helped somewhat.

Of course, the Noctua will never fit and even if it did, you need some space above the cooler to suck in enough air. So I ended up with the above abomination.

I’ve tried passively cooling with the Supermicro heatsink for LGA1151 (which has optimized fins facing the fans at the front) and had to stop when the CPU reached 92°C in no time. No idea what a plastic shroud could accomplish to channel the air from the front but I very much doubt it’ll beat the Noctua whatever I do.

I’m planning to always put two of these in the rack with the tops facing each other and leaving 0.66U in between for air intake (you can do that since there are 3 screw holes on the side that make up 1U).
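For anyone wondering how much room that actually is, here is a quick sketch of the arithmetic. It assumes the standard rack unit of 1.75 inches (44.45 mm) and that the 0.66U gap means skipping two of the three hole positions; the function name is just for illustration.

```python
RACK_UNIT_MM = 44.45   # 1U = 1.75 in (standard rack unit)
HOLES_PER_UNIT = 3     # three mounting holes per 1U, as noted above

def gap_mm(skipped_holes):
    """Vertical gap in millimetres when skipping this many hole positions."""
    return skipped_holes * RACK_UNIT_MM / HOLES_PER_UNIT

if __name__ == "__main__":
    # Two hole positions are 2/3 U, i.e. the "0.66U" gap.
    print(f"{gap_mm(2):.1f} mm")
```

So the intake gap between the two cases would be a bit under 30 mm.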

What I find fascinating is that you can get so much raw processing power for such a cheap price per unit. 8C is pretty much perfect for MS licensing as well. I wonder if anyone could do better on a tight budget. Not with the old Intel platforms, I bet. Using an i9 9900 (without K) might be an alternative, once it becomes more widely available.

PS: Now I need to figure out if I trust these boards enough to buy another 10 or even 15. Not sure I do.

PPS: Anyone else getting similar temperature values when stress testing? Or is my system just running hotter for some reason? Man, I wish I could have passively cooled this. Never mind that no such cooler fitting a 1U case exists AFAIK.


Can’t contribute much since I don’t have 1U build experience, but if you can build a well-fitting air duct without many “holes” in it from the front to the DIMM and VRM heatsinks, I think that could work if you don’t care much about noise.
You might even be able to upgrade to 16C in the near future if you can build a custom cooling solution…

Also I’ve read that Core Temp doesn’t handle Ryzen 3000 well. I’m using HWiNFO for most monitoring of test setups.


I was having the same “Updating FRU System device” hang. I thought I had changed something in the BMC settings while experimenting. I reset the BMC to factory settings, and didn’t have an issue after that. But I did get locked out of the IPMI and had to use ipmicfg to reset the password.


Hey there :slight_smile:

We’ve also bought an ASRock X470D4U which should replace one of our Intel servers.

Hardware:
ASRock X470D4U
Ryzen 7 2700X
KSM26ED8/16ME (QVL)
2x2TB Kingston SSD DC500M
450 Watt Seasonic Focus Modular 80+ Gold

However, we’re getting massive DRAM ECC errors when booting Ubuntu 19.04 Server with Kernel 5.2.8 and also with the latest Debian Buster with Kernel 4.19:

Bank: Unified Memory Controller (UMC)
Error: DRAM ECC error. An ECC error on a DRAM read (DramEccErr 0x0)


The errors appear roughly every 10 seconds. When setting the memory speed from Auto to a fixed 2400, the errors appear less often, and when setting the speed to 2133 they’re pretty rare (so about every 5 minutes?).
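To compare error rates between speed settings more precisely than by eyeballing the log, something like the following could be used. This is only a rough sketch: it assumes the kernel's EDAC driver is loaded and exposes its counters under the usual sysfs location (`/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count`); whether that is the case on this board and these kernels is an assumption, and the function names are made up for illustration.

```python
from pathlib import Path

EDAC_ROOT = "/sys/devices/system/edac/mc"  # usual EDAC sysfs location (assumed present)

def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} read from EDAC sysfs."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # EDAC driver not loaded or path differs
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts

def errors_per_hour(before, after, seconds):
    """Corrected-error rate per hour for each controller between two snapshots."""
    return {
        name: (after[name][0] - before[name][0]) * 3600 / seconds
        for name in after
        if name in before
    }

if __name__ == "__main__":
    # Print the current snapshot; take a second one later and feed both
    # into errors_per_hour() to get a rate for the current memory speed.
    print(read_edac_counts())
```

Taking one snapshot per memory speed setting and comparing the rates would show whether the errors really scale with clock speed.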

Maybe a quick test that anybody can run on their boards, which would show if our board might be dead:
What happens when you put one of your ECC modules into a different slot than A1?

In our case:
In two slots the board doesn’t POST at all and shows A8 or 0B on Dr. Debug (the manual doesn’t state anything for these codes). In the third one the board POSTs absolutely fine, but after selecting “Ubuntu” from the GRUB screen the system freezes after 3 normal kernel messages and the display turns off (KVM also shows NO SIGNAL). CTRL+ALT+DELETE doesn’t work anymore to restart the system. I don’t think this behaviour is right, is it? :slight_smile:

This is already our second board with issues. The first board was sent directly back to ASRock by our hardware provider, which assembled the mainboard, processor and memory and was supposed to run some tests. They said that after installing the newest 3.10 firmware they can’t get the board to work anymore (yes, very detailed, I know :)).

Does anybody know what might cause this issue? Or has anybody even had this issue before? Maybe it is also related to the memory or the CPU, which the hardware provider didn’t change when he replaced the first board which died.

P.S. We don’t get any of the CPU_PROCHOT errors yet, but we haven’t run our system for a long time.

I am running a r5 2600, 32gb ram(2x16 crucial non ecc QVL), nvme 860evo boot.
Fedora Server 30.

I had lots of issues with random freezes until I disabled all related C6 settings in the BIOS and enabled the Typical idle power setting.
I’ve never seen CPU_PROCHOT errors running my spare overkill Corsair AX760.

On page 20 of the manual, you can see the placement and recommended JEDEC speeds. A1 is definitely the right slot and then B1 for the second module. No point in putting the RAM into any other slots as those are not supposed to work.

Recommended memory speeds are as low as 1866MHz in some configurations, depending on how many modules you use, but as I understand it, nobody adheres to those recommendations. :slight_smile: So basically all RAM today is overclocked and it is not always guaranteed you reach the advertised speeds.

ECC memory might need more time anyway due to the additional tasks it performs. KSM26ED8/16ME seems to use 19-19-19-32 at 2666MHz. That’s probably for one module as they might not be sold in kits. CL19 would be considered poor for non-ECC RAM but that is probably the price to pay for getting ECC.
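To put the CL19 figure into perspective, here is a small sketch of the usual CAS latency arithmetic: cycles divided by the memory clock, where the clock is half the DDR transfer rate. The CL16 at 3200 comparison value is an assumption picked for illustration and does not come from this thread.

```python
def cas_latency_ns(cl_cycles, transfer_rate_mts):
    """CAS latency in nanoseconds for a DDR module.

    transfer_rate_mts is the advertised speed (e.g. 2666); the actual
    memory clock is half of that because DDR transfers twice per cycle.
    """
    clock_mhz = transfer_rate_mts / 2
    return cl_cycles / clock_mhz * 1000

if __name__ == "__main__":
    print(f"CL19 @ 2666: {cas_latency_ns(19, 2666):.2f} ns")
    # Hypothetical non-ECC kit for comparison (not from the thread)
    print(f"CL16 @ 3200: {cas_latency_ns(16, 3200):.2f} ns")
```

In absolute terms the difference is a few nanoseconds per access, which matters less for a server than getting the modules to run error-free.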

I would try running them at the lowest possible speed (800MHz with 20-20-20-38 timings) just to see if they work in principle and then increase speeds (and maybe timings later) from there.

Not sure how the latest MemTest86 reacts to ECC memory. Might be worth a try to rule out any problems with different Linux versions.

I’ve read somewhere else that 1 ECC error per week is an acceptable rate and anything more should be cause for concern.

PS: I’m no expert on RAM. Never used ECC myself. If anyone can shed more light on this or correct anything I said, please go ahead.


I also had the updating fru issue. I reflashed the bios and bmc and it went away for now

Do you remember if the first board that cannot boot anymore with 3.10 also had “faintly” flashing management NIC status LEDs and no green LED lighting up next to the BMC circuits?

If so then my dead one might have also been killed by the firmware update 3.10/1.60 after a few reboots for some reason.

The BIOS flash chip is socketed, where does the BMC store its image? Might be worth trying to use the flash chip from a not yet updated other board to test this.

I can’t say, because it was our hardware provider that had this issue, but I can ask them what exactly the issue was and let you know.

We’re now running BIOS 3.10 and BMC 1.50 and haven’t updated to BMC 1.60 yet.

Do you think the board gets killed by the BMC update 1.60?

@roffel
However, I think it’s strange that nobody has encountered the ECC issue before with this board even though they haven’t reduced their clock speed. I would expect that, for a board where ASRock claims ECC works at 2400, they’ve tested it (QVL) and it should work fine.

Does anybody run this board without any issues? After reading this thread it looks like the board is fine for testing or development purposes, but it has too many issues to run in a production environment.

Honestly I don’t know, but since ASRock Rack has gone radio silent for over two weeks, not even reacting to the missing DIMM/SPD readings and then publishing such a BIOS version, I wouldn’t be surprised if an update bricks the board at some point.

I’ll try to test 4 x 32 GB UDIMM ECC with my remaining working unit regarding ECC issues.


The BMC has its image stored in the larger sop16 socket, a 16 pin package, with mio spi ROM chip inside.
A version sticker is put onto showing the version it came with from the Manufacturer.


Ahh yes I see it now, a cable was running over it (lower left corner).

Is there a safe way to remove the BMC and BIOS flash chips or do you need special tools?

That would be very helpful!

Could you also check if your board is posting and booting just fine when you put one memory to a different slot than A1? That would be a very good hint if our board might be broken. :slight_smile:

I’ve had no issues with mine.

Been running it a week as my virtualization server.

On bmc 1.60 and bios 3.10

Tested two different types of ECC memory modules:

  1. Samsung 16 GB DDR4-2400 ECC UDIMM

  2. Samsung 32 GB DDR4-2666 ECC UDIMM

Both work fine in every slot even when only using a single module.

Nice bonus, so far 128 GB work fine at DDR4-3200 with loose Auto-Timings:


Thanks for letting me know about the “Updating FRU System device” problem being connected to the BMC.

I’ve reflashed and reset both BMC and BIOS. Also tried BMC 1.5 and then went back to 1.6 again. It consistently hangs when the M.2 device is the only one connected, no matter if M.2_1 or M.2_2. If I add a SATA or USB device, or if I press F11 and manually select a device, or if I reset CMOS and it warns me about the clock being reset and continues after the timeout, it just works.

I think it’s a bug in the BIOS. It goes into an endless loop and that’s why the cooler spins like crazy when it hangs. But it happens while it is writing info into the BMC’s memory.

So I disabled some BMC options in the BIOS and sure enough, it’s solved.

Setting “Server Mgmt - Inventory support” to “Disabled” fixes it.

Thanks everyone. I’d say the board doesn’t feel ready for production yet but maybe I’ll buy another one and see if I get some consistency into it.


Well, no special tools needed really.

The SOP8(?) BIOS chip has the risk of bending pins, but as long as you don’t rip it out of the socket, they shouldn’t break, and bending them back is easy. Not like pins in a 115X LGA…

The BMC is easy as pie, just push on the sides so that the socket opens like a clamshell. The chip is surface-mountable, so a nice package and easy to take out.


Hello everyone. I got this board yesterday to upgrade my Ryzen 7 1800X system to a proper server, and little did I know what I was in for. It has been incredibly frustrating.

Here is the hardware:

  • ASRock X470D4U
  • Ryzen 7 1800X
  • RAM: 4x Kingston KSM26ED8/16ME (QVL - 2666 MHz) for a total of 64 GB
  • PSU: Rosewill VALENS-500 (500W 80+ Gold)

The full parts list is here: https://pcpartpicker.com/list/gGR6vn

Problem 1
I was reusing the Noctua NH-U12S cooler, and had the same problem as others with the mounting hardware nearly touching the closest RAM module. I wrapped a bit of electrical tape around the CPU cooler mount to prevent any wacky electrical short-circuiting. Probably unnecessary, but better safe than sorry. If this RAM had heatspreaders, it wouldn’t have fit.

Problem 2
It immediately became apparent that this motherboard does not support XMP profiles, and the RAM defaulted to 1866 MHz. I balked at the enormous list of RAM timing fields which I don’t know the meanings of, so I booted into Windows and tried to use software (Thaiphoon Burner, etc) to read the XMP profile of the RAM so I could input appropriate information manually in the BIOS. No software was able to do this, thanks to the screwy motherboard.

I then updated the BMC to 1.60 and BIOS to 3.10 (both done through IPMI which is really handy), and in order to minimize update issues, I let it clear the configuration during these updates.

After the updates, Windows software was still unable to read XMP profiles. I ended up pulling out one UDIMM and inserting it into a different PC which was able to read the XMP profiles just fine. Here is the Thaiphoon Burner output for Kingston KSM26ED8/16ME:

Next, I converted these values to hex so I could input them properly into the BIOS:

(don’t waste your time putting these in)
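For reference, the decimal-to-hex step itself is trivial to script. A minimal sketch, using the 19-19-19-32 timings quoted earlier in the thread for this module as sample input; the field names are only labels for illustration and not necessarily what the BIOS calls them:

```python
def timings_to_hex(timings):
    """Convert decimal timing values to two-digit hex strings (e.g. 19 -> '13h')."""
    return {name: f"{value:02X}h" for name, value in timings.items()}

if __name__ == "__main__":
    # Sample values: 19-19-19-32 as mentioned earlier for KSM26ED8/16ME.
    sample = {"tCL": 19, "tRCD": 19, "tRP": 19, "tRAS": 32}
    for name, hex_value in timings_to_hex(sample).items():
        print(f"{name}: {sample[name]} -> {hex_value}")
```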

I could not boot into Windows with the memory configured like that. Upon trying, it just crashed with no error report whatsoever.

Next, I tried leaving more of these settings on Auto:

(don’t waste your time putting these in either)

It still couldn’t boot into Windows.

Finally I decided to just set the Memory Clock Speed to 1333MHz and leave everything below set to defaults (Auto). This worked, and allowed me to boot into Windows.

HWINFO showed what I am guessing are the actual timings the motherboard went with:

HWINFO RAM Data

It is close enough that I just don’t care anymore. Interestingly, HWINFO didn’t have timing data for the 4th slot. I’ll chalk that up to the screwy motherboard again.

Problem 3

After BMC/BIOS updates, and getting the memory speed set and Windows bootable again, I tried using Handbrake to transcode a 2 hour H.264 1080p clip to H.264 (medium preset). It purred along fine achieving about 25 FPS transcode rate which is about what I’m used to seeing from this CPU. An hour in, Handbrake crashed.

I didn’t mention this earlier, but back before I updated the BMC/BIOS, I noticed Task Manager reported the CPU was running at 0.54 GHz, Memory at 1866 MHz, and CoreTemp reported the CPU temperature locked at 30°C (looking back, I think this may just be where it settles when CPU_PROCHOT is asserted).

After BMC/BIOS updates, Task Manager showed CPU speed bouncing all around the normal range, but RAM speed was missing, and CoreTemp shows the correct CPU temperature (the same value which HWINFO reports as “tdie”). Interestingly, the IPMI web interface seems to get its CPU temperature from “tctl” which is consistently 20°C higher than “tdie”. While running Handbrake, the CPU temperature reached 70-74°C, which meant the board saw it as 90-94°C. Looking at my IPMI log, it was constantly logging temperature changes and CPU_PROCHOT would briefly become asserted before being deasserted seconds later. See full log here: https://pastebin.com/U3qmd1Px

All the while, CPU temperature was only low to mid 70s.
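To make the tctl/tdie relationship concrete, here is a tiny sketch of the conversion. The fixed 20°C offset is simply what I observed on this 1800X; other CPUs may use a different offset or none, so treat the constant as an assumption.

```python
TCTL_OFFSET_C = 20  # offset observed on this Ryzen 7 1800X (assumption for other CPUs)

def tdie_from_tctl(tctl_c, offset_c=TCTL_OFFSET_C):
    """Estimate the die temperature from the control temperature the BMC reports."""
    return tctl_c - offset_c

if __name__ == "__main__":
    # The IPMI readings seen under Handbrake load, converted back to tdie.
    for tctl in (90, 94):
        print(f"tctl {tctl}°C -> tdie {tdie_from_tctl(tctl)}°C")
```

If the board’s throttling logic really works off tctl without subtracting the offset, that would explain CPU_PROCHOT tripping while the die is only in the 70s.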

Unfortunately, I also have the bug where CPU_PROCHOT gets stuck enabled for no apparent reason. It happened before and after the BMC/BIOS update. As I type this, CPU_PROCHOT has State Asserted and Task Manager shows 0.54 GHz unwavering. But for at least an hour last night, it was fine and running normally, and I have no idea why.

Problem 4

After the Handbrake crash, I wanted to run MemTest86+. But at some point, all 12 IPMI virtual media devices appeared in the boot list and pushed the real USB devices out of the list. I even tried booting from a MemTest86+ ISO mounted through the KVM web interface. It didn’t work. I tried Ultimate Boot CD too. Didn’t work. MemTest86 from PassMark. Didn’t work. This had me chasing my tail for two hours before I tried turning off most of the virtual media devices. That got USB devices showing again, but I still couldn’t boot into MemTest86+ or Ultimate Boot CD or MemTest86 from PassMark. Finally I turned off CSM entirely, and this allowed MemTest86 from PassMark to boot (the only one of the three that supports UEFI). I let this run overnight and it only got through about 2 and a half passes, but no errors.

Problem 5

This morning, after closing MemTest86, I wanted to boot into unRAID, which was the whole purpose of this server build. So I did that, got signed in, and began writing this post. I looked back a bit later and the system had crashed. Nothing to indicate a problem in IPMI logs. KVM showed no signal. A physical monitor attached to the machine showed no signal. IPMI said it was still powered on. The motherboard’s POST code display was blinking 00 on and off, and I noticed 4 small red LEDs next to the fan headers blinking red (maybe due to CPU_PROCHOT?).

I rebooted and looked for logs, but IPMI still showed no evidence of what happened, and it turns out unRAID doesn’t persist its logs between boots.

Conclusion thus far

I just ran a CPU MARK test (in PassMark PerformanceTest 9.0) with the CPU_PROCHOT thing asserted and the clock speeds limited, and got a whopping 2833 score. Yippee.

So yeah, stability and huge performance problems on a system specifically designed for greater stability. ASRock had better fix this.


Thanks, tried it out.

The sure thing seems to be that the BMC of my motherboard #1 killed itself - not sure how.

  • BMC chip #1 in known-working motherboard #2: System broken
  • Known-working BMC chip #2 in motherboard #1: The green LED next to the BMC chipset is lighting up for a second then goes out again - System broken
  • BMC chip #2 back in motherboard #2: System as fine as it can currently be

Strange :confused:
