I built a dual-Epyc 7773X server! It's FAST, and now at V3.0 (update 2, with pics)

Hi everyone! First post on this forum, which seems to be where I'm most likely to find the help I need… I hope I posted in the right category…

So here goes…

I'm running very big CFD (computational fluid dynamics) calculations on a daily basis. A single calculation sometimes runs for several weeks, and rarely less than a day. As the first step of an upgrade from my previous Dell dual-Xeon 6248R machine, I got my hands on a pair of Epyc 7773X CPUs. Yay!

I built my server (running Debian 10) with the following parts:
Motherboard: Gigabyte MZ72-HB0 (after trying a Supermicro H12DSi-NT6 that absolutely couldn't handle the VRM dissipation)
CPU coolers: 2x Noctua NH-U14S TR4-SP3 with additional 150mm fans
RAM: 16x Samsung 32GB DDR4-3200 ECC (512GB total)
System SSDs: 2x Samsung 980 PRO 500GB M.2 in RAID 1 on a StarTech PCIe 4.0 x8 → dual-M.2 adapter
Calculation SSD: Samsung PM1735 3.2TB PCIe 4.0
Storage HDDs: 2x WD Ultrastar HC520 12TB in RAID 1
Power supply: Corsair AX1600i
Case: Fractal Define large tower with as many Noctua fans as I could install! :wink:

The machine works. And it’s fast.

For those of you familiar with this kind of application: yes, depending on the size of the problem I'm trying to solve, it is sometimes faster to run a case on 64 cores instead of 128 because of memory bandwidth. But large problems are faster on 128 cores.

The problem I have is that this server isn't completely stable.
It will run for days and then crash. When it does, I believe it's a hardware crash: there is no trace of anything in the OS logs or the IPMI logs.

These days, it crashes basically every night (not at a fixed hour). Of course, I lose part of my calculation every time!

It has a tendency to crash under stress (and heat). There are occasional “CPU throttling” messages in the IPMI, although the CPU cooling is very efficient and the temp sensors never rise above 75%. The VRM temp is often around 97°C under full load (the alarm temp is 115°C according to Gigabyte).

But it will also sometimes reboot three times in an hour with no load at all.

I haven’t figured out what’s causing this, and more importantly, how to solve it.

I have moved the PCIe cards (system and calculation SSDs) up to slots 5 and 6 to shorten the electrical links. I still suspect that the StarTech M.2 adapter isn't as “professional” as the rest of the components, but the MB only has one M.2 slot…

If I don't manage to solve this, my next move will be to install the machine in a Fractal Torrent or a 4-5U rack server case with Arctic Freezer 4U SP3 coolers (thanks, Level1Techs YT!), but that's an expensive experiment if heat isn't the main factor causing the issue…

Any idea or help would be greatly appreciated!

Have a great day,

David

PS : Pics in the thread below (post 20)


These are 3D V-Cache CPUs; rumor has it they are more likely to have issues with temps and voltage regulation, so keeping the CPUs and VRMs under control and avoiding temp spikes is key.

Line power coming in is also going to play a part in all this. The AX1600i is one of the best power supplies you can get in that range, but have you checked your house to see if any other large item is on that same circuit? Something else kicking on and off on the same circuit can cause a major power ripple that could be giving you power-related instability on that machine.

If there are no other devices on that circuit, I would reseat and repaste the CPUs, then loop something like MemTest from a bootable USB for a day and see if the problem happens during that time. If there are no issues, move on to looping something like a pi calculation, and individually test each subsystem under load until you find what it is. It could be as simple as reseating an NVMe device, or as major as one of the V-Cache CPUs having an issue under a specific load.
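Since the crashes leave nothing in the logs, one way to follow this advice systematically is a burn-in loop that cycles through per-subsystem stress tests and logs each start with a flush, so that after a hard crash the last log line tells you which subsystem was under load. The sketch below is only illustrative; the commands (`memtester`, `stress-ng`, `fio`) and their arguments are assumptions and would need to be installed and tuned for this machine.

```python
# Burn-in loop sketch: rotate through subsystem stress tests, logging each
# start so a hard crash leaves a record of what was running at the time.
# The test commands below are assumptions, not a prescription.
import subprocess
import time

TESTS = {
    "ram": ["memtester", "4G", "1"],
    "cpu": ["stress-ng", "--cpu", "128", "--timeout", "30m"],
    "nvme": ["fio", "--name=burn", "--filename=/scratch/burn.bin",
             "--size=8G", "--rw=randrw", "--runtime=30m", "--time_based=1"],
}

def next_test(cycle):
    """Rotate through the subsystem names in a fixed (sorted) order."""
    names = sorted(TESTS)
    return names[cycle % len(names)]

def burn_in(cycles):
    for i in range(cycles):
        name = next_test(i)
        # The flushed log line is the whole point: it survives a crash.
        print(f"{time.strftime('%F %T')} starting {name} test", flush=True)
        subprocess.run(TESTS[name], check=False)

# Usage (run overnight, redirect stdout to a file on stable storage):
# burn_in(cycles=9)  # three full rotations through ram/cpu/nvme
```

Running each subsystem in isolation like this narrows the fault down faster than a combined load, at the cost of longer total test time.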


Yeah, why not post some pictures? Seems like a nice system with a fascinating application.

Regarding your problem, have you tried running your system with the side panel removed? And optionally blowing an electric fan at the case?

@Zedicus:
The machine is plugged into an Eaton UPS, so I would expect the power to be clean…
Thanks for the hints! I guess more testing is in order…

@vic:
Opening the side panel does not seem to improve things.

But writing all this down, and thinking about the successive events in this machine's life, made me realize that it seemed to run more smoothly when I had a little fan (a 60mm Noctua) blowing directly on the VRM radiator.

It was mounted on a custom 3D-printed bracket… which completely warped when I moved the machine into our server room. That move raised the ambient temperature of the room from 26°C (with just the Synology and the network gear) to 30°C with the Xeon machine running, and to 40°C when running the 128 cores!!! The ambient inside the case was frequently around 50°C!

The machine has been back in my office since, but it can't stay there forever. It makes too much noise with the fans at 100% all the time!

I’ll post pictures ASAP (I don’t have any with the Gigabyte MB…).

Eaton UPSes are banned here. They have been the CAUSE of more of my power issues than they have ever solved, and I have used every range of UPS you can imagine, from room-sized units that power entire buildings all the way down to tiny under-desk things.


Interesting… :grimacing:
I haven't had too many issues with mine. The small under-desk one has been flawless. The larger, data-center-class rack-mounted one in the server room had to be replaced just two weeks after I bought it, but the replacement has caused no problem since…
What brand do you recommend?

Currently we are using mostly CyberPower. We have several fairly large 5000VA units, all the way down to 1500VA rack units and 500VA desk units.

We still have a number of APC units installed too. They work fine, but the cost of new APC units and batteries made us move to CyberPower. The CyberPower units have all been in place for around two years and have been great.

I have used Tripp Lites also. Again, they work fine, but are sometimes cost-prohibitive. Basically, I buy whichever of those three I can get the best deal on at the time of replacement.

The Eatons that I have tried over the years all did weird things here. Most would not phase-sync to our power, so they would charge, and then any time they got kicked into discharge mode they would never move back to charge mode without being physically reset with the button. We had Eaton come out to do service and support; they viewed the issue and recommended we contact the power company to get the power phased differently. That is just one of the weird Eaton-only issues I have had.


I'm doing large computational magnetohydrodynamics simulations. I try my best to mesh the problems to use less than 256GB of solver memory, and I'm seeing solve times in the 250-hour range.

In my use case the memory gets heavily stressed. With only 8 sticks of slower DDR4 memory, it will pull north of 150W on the memory alone, so RAM cooling is paramount. I'd imagine your memory power consumption would crest over 300W if you're hitting GMRES-class solvers as hard as I am.
Do you have any way to measure your memory temperatures?


@Zedicus: Maybe living in a country with 230V & 50Hz helped me…


It sounds like the VRM is near its runaway temperature (the FETs' TJmax averages around 125°C).

Is there any way you can increase airflow over those heatsinks?

While I never recommend janky hacks… the center heatsink is what you want somewhat direct airflow on. I am unsure of your heatsink setup, but if the CPU heatsink can be rotated, I would face the open side of its fins (where airflow exits) towards the VRM heatsink, so that when air flows through those fins at least some of it is directed at the VRM heatsink. Warm air is better than no air.

Also look at the memory VRMs. I've seen people place generic heatsinks on those and use high-quality thermal pads to bridge the gap. Some say it increased their stability, but first just try to get more airflow on that center heatsink.

As for CPU heat, the answer very likely is to find the beefiest heatsink that fits the application and go with that.

If all that still isn't enough, you can try to improve the interface between the VRM heatsink and the FETs. In that case you would see if the heatsink can be removed, measure the thickness of the thermal pads, and order the same thickness from Fujipoly… I did a similar mod on a graphics card and it did wonders.


I'd be hesitant to recommend firmware (EFI) based thermal-curve limitations, as AMD's stock handling of that is actually very good. Incredible, actually. So try airflow first: try getting good airflow on that heatsink in the case.

The issue here is that server mobos and heatsinks are usually designed for the Sanyo or Delta fans that push gobs of air (and are loud) through a rack chassis optimized for a certain direction of airflow. The Fractal Design case is not made with this in mind.

I hope this helps. Adapting something designed for a different purpose into a standard ATX case often comes down to optimizing thermal conditions the manufacturer did not account for, and from what you describe this definitely sounds like heat-related throttling of the VRM. The Epycs pull a lot of energy; it's not easy to cool.

@PhaseLockedLoop: Yep, 100% agree! On your previous post as well. Thanks!
I’ll try to print a new support for my little VRM fan (see my reply to vic above), and see if that makes things better !


I just started a large calculation to check… Stay tuned while it heats up! :wink:


So, remember that the nominal speed of the CPUs is 2200MHz and the max speed is 3500MHz?
My calculation started on all 128 cores… at 3250MHz!!! I'm always amazed! I guess it also shows that my CPU cooling is quite efficient. (The BIOS is tuned for “CPU-intensive” loads with a max TDP of 280W per socket.)

Used RAM is 295GB.

Some sensor readings:
CPU0 temp 73 (it shows °C, but I think they actually are % of the max for Epycs, right?)
CPU1 temp 73 (same)
DIMM temp 62°C
MB temp 48°C
VR_DIMM temp 75°C
VR_CPU temp currently 91°C, peaks at 95°C
VR_CPU Iout currently 150A, peaks at 215A!!!

The room temp is still quite cool, 20°C or so; it will increase as the machine runs hot.
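Since the crashes leave no trace in the logs, it can help to poll the IPMI sensors continuously and append them to a file, so a hard crash still leaves a thermal trail right up to the moment it died. Below is a minimal sketch; the sensor labels (`VR_CPU`, `VR_DIMM`, …) are assumed from the readings above, and the `ipmitool sensor` pipe-separated output format is parsed naively, so check the actual labels and format on the machine first.

```python
# Sensor-logger sketch: poll `ipmitool sensor` once a minute and append the
# VRM/DIMM readings to a file, so a hard crash still leaves a thermal trail.
# Sensor name prefixes are assumptions based on the readings quoted above.
import subprocess
import time

WATCHED = ("VR_CPU", "VR_DIMM", "DIMM", "MB")

def parse_sensor_line(line):
    """Parse one `ipmitool sensor` row: 'name | value | unit | status | …'."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 2:
        return None
    try:
        return fields[0], float(fields[1])
    except ValueError:
        return None  # 'na' readings and discrete sensors

def log_forever(path, interval_s=60):
    with open(path, "a", buffering=1) as log:  # line-buffered: survives a crash
        while True:
            out = subprocess.run(["ipmitool", "sensor"],
                                 capture_output=True, text=True).stdout
            for line in out.splitlines():
                parsed = parse_sensor_line(line)
                if parsed and parsed[0].startswith(WATCHED):
                    log.write(f"{time.strftime('%F %T')} "
                              f"{parsed[0]} {parsed[1]}\n")
            time.sleep(interval_s)

# Usage (needs ipmitool installed and BMC access):
# log_forever("/var/log/vrm-trail.log")
```

After the next crash, the tail of the log shows whether the VRM temperature or current spiked just before the machine went down, or whether it died at idle readings.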

VRM ~97°C, ambient inside case ~50°C: these temperatures are surely not on the comfortable side IMO, and they tend to create stability issues. It seems like your server room is not well ventilated?! That's an improvement you could work on independently, regardless of whether heat is the source of your instability issue.

Interesting detail. So I assume your instability issues were observed inside the server room. However, can you also observe all the issues inside your office now? Such as the following:

Most definitely! In my case not an easy task, though… Probably the subject for another post at another time :wink:

Well… yes, I think so, most of them anyway. It actually crashed 20 minutes ago, and that's not enough time for the room to get even a bit hot.

Anyway, a new VRM fan bracket is being printed as we… write!

No problem man. Hopefully it works

That said, I do not recommend small VRM fans. The extreme heat often wears the bearings quickly and evaporates the quite lightweight oil in them. It will last a couple of years, but it's not validated for a 24/7-ops type environment. I would strongly consider using your main case fans and heatsink orientation to do the cooling if you can.

Now I want pics!!! Love me some sexy HPC hardware p*rn


There’s a reason these bad boys exist

So guys, I reinstalled my “VRM fan” this morning. A test is in progress. The temperature of the VRM is reduced by 3-4°C; everything else is unchanged. We'll see if that's enough margin to bring back stability…

Here are a couple of pictures of the machine… In the first one, you can clearly see the additional fan on its 3D-printed bracket, with its 3D-printed “diffuser” that sends the stream of air directly onto the VRM radiator.


I'm also thinking about solutions for the future:
One solution could be a rack case and Arctic Freezer 4U SP3 coolers. But which case: the Sliger CX4200a or the Silverstone RM51 (as soon as it becomes available in Europe…)?
Otherwise, there is a (very expensive) integrated watercooling block for this MB by Comino. If I went that way, I'd make the radiator external and put it inside the office in the winter and outside in the summer!!! I still have to refine that thought… But is watercooling a real 24/7/365 solution?
