I built a dual-Epyc 7773X server ! It's FAST, and now in V3.0 (updated 2, with pics)

Same here !

Mine did too, as did the Gigabyte board. The ASRock ROME2D16-2T doesn’t…

Do you have the ROME2D16-2T? The ROMED8 is a single-socket Epyc board.

ROMED8-2T. Single-CPU board, running a 7713. I just wanted to point out that not all ASRock Rack boards report VRM temps.

Maybe only the dual socket boards do.

Because the ROMED8-2T is a uni-processor board, there will not be any issue with VRM overheating. So it makes sense for ASRock not to report the data.

The ASRock ROME2D16-2T has the same design flaw as the Supermicro H12DSi. The CPU VRMs are sandwiched between the processor sockets. Under extreme loads with 64-core processors, the heat output from the front CPU will overheat the VRM.

While this is basically correct, the three boards that I tried are not equal at all in this regard.

The Supermicro’s VRM heatsinks are tiny. I never managed to run 2x64 cores without overheating.
The Gigabyte’s are markedly bigger, and with a bit of care, I managed to have it run at full load.
The ASRock has an even (marginally) larger heatsink, and I’m pretty sure that the offset between both CPUs allows a bit more air to reach the heatsink. But since I’m only starting to test it, I’ll reserve my judgment until a bit later…

You are right. Now that I have seen a photo of the ROME2D16 shot from an angle, the VRM heatsinks are indeed taller and bigger than on the Supermicro board. Let us know if overheating no longer occurs on this board.

This is the experience of a friend with a Supermicro H12DSi board with two CPUs. The VRM heatsink sandwiched between the two sockets gets bathed in the hot exhaust from the front socket.

He solved the issue with a custom water-cooled monoblock that cools both sockets and the VRMs. No throttling and no overtemps anymore.

FWIW, I omitted the exhaust fan on the first Arctic cooler in the airflow chain when I built gkh's machine, and that created enough turbulence that the VRM got enough airflow to keep his dual 7763s cool.

So here’s the long story…

Premises:

  • There has been a version 1.0 of this machine with a Supermicro H12DSi MB, but as I wrote just above, I couldn’t ever get it to full speed because of the tiny VRM heatsink and the resulting VRM overheating. V2.0, 2.1, and 2.2 were with the Gigabyte board, and different cooling solutions.
  • After the machine crashed, I was genuinely convinced the Gigabyte board was toast, and couldn’t get my hands on an identical replacement, so I ordered an ASRock Rack ROME2D16-2T.
  • The experience of the Gigabyte MB showed that : the cooling is just as good (or not worse ?) with the (somewhat) quiet bottom-to-top cooling and the Noctua coolers as with the (impossibly) loud front-to-back cooling and the Arctic coolers. So version 3.0 would use the Noctua coolers.

The build:

Pictures are worth a thousand words, but let me just say that the layout of the ASRock board allows for much cleaner cabling of the machine.

One thing struck me as unusual : there are no dedicated CPU fan connectors. Just fans 1 to 8. Of course it makes sense in a server (i.e. rack) board, where you just push as much air as you need through the whole machine.

I love the dual on-board M.2 connectors, as they allowed me to get rid of the PCIe adapter card (I run my system SSDs in RAID 1). That was actually the reason why I initially wanted to buy the ASRock MB instead of the Gigabyte for V2.0, but it was out of stock everywhere at the time. And Wendell had made a great video of the Gigabyte with 128 cores :wink:

So here’s the populated board in the case before installing the coolers:

With the coolers, and the fan for the VRMs on the custom 3D-printed bracket:

The startup:

That’s where things started to go sideways…

First startup : IPMI startup, no BIOS startup… Hard reset.

Second startup : it goes to the BIOS (huge relief). Everything shows up correctly in the various hardware enumerations. I make the few settings I want (280W cTDP, “performance” profile, “memory throughput optimized” CPU setting), save and reboot… I get to the Debian startup menu, wait 5 seconds, Linux starts booting… and the machine hangs and reboots at the exact point where it did on the Gigabyte board ! Arghhh ! For some reason, the crash takes a few fractions of a second longer than it did before, and I get the time to read the message : “core perfctr but no constraints : unknown hardware”. WTF ? I’m pretty sure it always did that, though…

Third startup (that is power down, unplug at the wall, restart): Same thing, but I get one step further, with the same error message displayed at full screen resolution after being shown in VGA. Hang, reboot loop…

Fourth startup (yeah I know, I’m just naive…): Same error message… and the machine boots. I can’t believe it ! I even reboot it from there, three times, just to be sure. It’s alive !!!

The tests

Everything is back where it should be, I have lost no data, no config. IPMI works as well, but there is no VRM temp sensor… How am I ever going to know if the temp’s alright or not ?

However, I waste no time, and run a duplicate of a previous calculation to stress the machine a bit and benchmark it. It’s a mild stress since I’m only using one half of the cores.

Oh not again…

It runs 1491 iterations out of 1500… 30% faster than the Gigabyte… and crashes again. I’m back to square one. Infinite boot loop after the Linux startup screen, and all my tries change nothing (naive, I’m telling you). S**t ! Have I burnt this one as well ?

So I do some research on this error message. All the instances I find online show this message appearing during the installation of Linux. Not after the machine has already worked.

It could be a CPU problem or maybe a RAM problem as well. I make a Memtest86+ memory stick, and test the machine… everything’s fine (I didn’t run every test, as it’s really long on 512GB, but it ran enough of them that I’m convinced there’s no obvious problem there).

It could be an OS version problem. Debian 10 is installed on the machine. So I make a Debian 11 Live memory stick and boot from it. Surprise, it works perfectly ! OK… Let’s go experimental, and make a Debian 10 Live stick. Shouldn’t work, right ? Right ? Well, it does, and I start thinking it could be one of the boot SSDs !!! But since it does work, let’s try one last time without the memory stick. The machine reboots, I still get the error message, as before, but Debian starts and the machine is perfectly functional again !!! I have changed NOTHING !!! I reboot several times, it does indeed work again !

I really don’t get it… I’d be happy to hear the opinion of the Linux specialists here…

Anyway, the first thing I do is update all Debian 10 packages to their latest stable version. Then I set the fans to 100% all the time in the IPMI for safety.

Conclusion

The machine has been working since (yesterday), even tested on 128 cores for an hour, and on fewer cores for longer. It is definitely 30% faster than with the Gigabyte board, and I could pinpoint that the reason is not in the actual calculations (i.e. CPU operations), so it has to be in everything else.

Maybe the Gigabyte board isn’t dead after all. Maybe it was just a software problem. But it did crash in the middle of a calculation. And I had to reflash the BIOS from the IPMI, because it wouldn’t even get there… Anyway, for the moment, I’d say I’m glad I tested this ASRock board !

So that’s the current status of my emotional rollercoaster… :grin:
Any thoughts, remarks, advice, questions ?

Cheers,
David

I have noticed that the contact tubes inside power supply connectors can loosen over time, likely due to heat from the increased current under higher loads. Using a thin screwdriver pushed between the plastic cover and each contact tube, one by one, with a slight and careful push towards the middle, I managed to partially squeeze each tube, resulting in a tighter “grip” on the motherboard pins and more reliable conduction. I needed to apply firmer force when re-attaching the connector to the mobo, so I supported the mobo from below. This technique helped substantially decrease crashes when there was no other option, and many times it eliminated them. It also frequently made “aging” power units usable again. Not sure if this helps; I see that you have had quite a journey with your system. Good luck!

If you have to run all fans at 100% speed, that means the CPU VRM is still overheating. I have the same Noctua NH-U14S TR3 coolers. They have no trouble cooling Epyc 7713X processors under extreme loads in standard fan mode.

Here is my suggestion. Remove the front CPU and its cooler (CPU2). Then boot the system and change the IPMI fan mode to normal speed. Finally run your codes again at full load. I don’t think the system will crash.

BTW, you should remove the kernel boot parameter ‘quiet’ from the GRUB menu. This way you will see all the OS messages during boot. It’s very useful for debugging OS boot issues.
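For anyone who wants to make that change permanent rather than editing the boot entry by hand each time, here is a minimal sketch in Python. It assumes a Debian-style GRUB defaults file in which the parameter sits in the `GRUB_CMDLINE_LINUX_DEFAULT="..."` line; the function name `drop_quiet` is made up for this illustration, and the sample text in the usage lines is invented.

```python
import re

def drop_quiet(grub_defaults: str) -> str:
    """Return the text of a GRUB defaults file with 'quiet' removed from
    GRUB_CMDLINE_LINUX_DEFAULT. All other lines are left untouched."""
    def fix(match):
        # Keep every kernel parameter except the literal word 'quiet'.
        params = [p for p in match.group(2).split() if p != "quiet"]
        return match.group(1) + '"' + " ".join(params) + '"'

    # Assumption: the value is wrapped in double quotes, as in Debian's default file.
    pattern = r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"([^"]*)"'
    return re.sub(pattern, fix, grub_defaults, flags=re.MULTILINE)

# Invented sample content, only to show the transformation.
sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(drop_quiet(sample))
```

On Debian the file is normally /etc/default/grub, and the change only takes effect after running update-grub as root. For a one-off test, pressing `e` on the entry in the GRUB menu and deleting `quiet` from the `linux` line does the same thing for a single boot.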

Oh, I don’t know if I have to run them at 100%. But since there is no VRM temp sensor, I’m just not taking any chances at this stage ! In these conditions, the temperature of the CPUs at full load (64 cores per CPU @ 3250MHz !) remains under 67 “units” (probably closer to 60 °C) at all times…

Thanks for the tip !

It may be worth trying to make a shroud to duct air from the front intake under the right CPU-cooler to mimic server-style airflow.

The ROME2D16 board does have VRM temp sensors. When I got the board, they showed up as Vcore1 MOS Temp, P0 DDR ABCD Temp (VRM temp for each four-slot group of memory banks), etc. This was on BMC 1.00.00. Once I upgraded to BMC v1.16.00, I seem to have lost those specific sensor readings.

Better complain to William at ASRock Rack.
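A quick way to check whether a given BMC version still exposes those readings is to filter the sensor list by name. The sketch below is in Python and makes some assumptions: that the text comes from `ipmitool sensor`, that its columns are separated by `|` with name, value and unit first, and that VRM-related sensors can be recognised by "MOS" or "DDR" in the name, as in the two names quoted above. The sample lines and their values are invented for illustration, and both function names are made up.

```python
def parse_sensors(text):
    """Map sensor name -> (value, unit) for every line with a numeric reading.
    Assumes '|'-separated columns with name, value and unit first."""
    readings = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3:
            continue
        name, value, unit = fields[:3]
        try:
            readings[name] = (float(value), unit)
        except ValueError:
            continue  # 'na' or a discrete sensor: no usable number
    return readings

def vrm_temps(readings, keywords=("MOS", "DDR")):
    """Keep temperature sensors whose name hints at a VRM (assumed keywords)."""
    return {name: r for name, r in readings.items()
            if "Temp" in name and any(k in name for k in keywords)}

# Invented sample output; real names and values depend on board and BMC version.
SAMPLE = """\
CPU1 Temp        | 60.000   | degrees C | ok | na | na | na | na | na | na
Vcore1 MOS Temp  | 71.000   | degrees C | ok | na | na | na | na | na | na
P0 DDR ABCD Temp | na       | degrees C | na | na | na | na | na | na | na
FAN1             | 2400.000 | RPM       | ok | na | na | na | na | na | na
"""
print(vrm_temps(parse_sensors(SAMPLE)))
```

On a live system the text could come from something like `subprocess.run(["ipmitool", "sensor"], capture_output=True, text=True).stdout`. An empty result after a BMC upgrade would match the missing readings described above.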

I figured the bug may have been introduced in

1.04.00 (11/22/2021, BMC, Megarac SP GUI, 25.89MB): Modify the threshold of power sensors.

I just rolled back to 1.00.00 as I don’t really need the functionality of the newer patches.
