I built a dual-Epyc 7773X server! It's FAST, and now in V3.0 (update 2, with pics)

I have no technical advice, but someone who might be worth talking to is Don Kinghorn, who works at Puget Systems. He is their scientific computing advisor and messes with this kind of stuff every day. Puget Systems are very open to communication even if you aren't a customer; they are big proponents of publicly available information.

5 Likes

Nice build, that's a BIG case! I've been running a 240mm AIO water cooler on my 240-watt Epyc 7713P for two years straight now with no issues whatsoever, so I'd say it definitely can be a 24/7/365 solution. Temperatures under sustained multi-day all-core workloads never surpass 65C on the hottest CCD.

Boards like these are usually built with a high-airflow, rack-mount-style chassis in mind, and I would bet this is the source of your problems. Additionally, I moved to 64GB DIMMs and found they got pretty hot under memory-intensive workloads.

Although I didn't encounter stability issues, my solution was to mount a small 40mm Noctua fan blowing down onto each bank of memory modules, plus a 60mm Noctua blowing down onto the VRMs. Temperatures on the modules dropped by almost 20C!

Going with a server chassis would be an easy way to test if it’s an airflow issue, but sometimes it’s nice to have a quiet desktop-able machine. Plus, with all those Noctua fans you’ve clearly put a lot of money into this case setup :sweat_smile:

2 Likes

Does it crash with just one CPU? If so, then I would suspect a power or motherboard issue.

The VRM temperature seems suspiciously high, but I'm not familiar enough with that Gigabyte board to know whether it's normal. Hopefully someone with a similar config will chime in. On my own EPYC 74F3 machine (only one socket), the VRM heatsink peaks at 45C (measured with a FLIR camera and a thermocouple) after an hour of running the mprime stress test, and no temperature ever goes above 65C.

I’m guessing that is from the k10temp-pci driver for the on-die sensor? Do you have IPMI sensors for each socket? Usually there are thermistors under each socket.

This is from the IPMI's "CPUx_temp" sensors. The "CPUx_DTS" sensors read only 30°C!

BTW: I just tried opening the side panel again… The CPU temp immediately dropped by 4-5°C, and the VRM temp immediately increased by 4-5°C!!!
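
If anyone wants to watch the BMC and on-die readings side by side during this kind of experiment, here's a rough logging sketch (assuming `ipmitool` is installed and the k10temp driver is loaded; the names in `WATCHED` are placeholders, so check what `ipmitool sensor` actually lists on your board first):

```python
#!/usr/bin/env python3
"""Log IPMI CPU/VRM temps next to the k10temp on-die readings."""
import glob
import subprocess
import time

# Sensor names vary per board/BMC; run `ipmitool sensor` once and put
# whatever your BMC actually calls these sensors in here.
WATCHED = ("CPU0_TEMP", "CPU1_TEMP", "VRM")

def ipmi_temps():
    """Parse `ipmitool sensor` output into {name: degrees C}."""
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True,
                         text=True, check=True).stdout
    temps = {}
    for line in out.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 3 and fields[2] == "degrees C" \
                and any(w.lower() in fields[0].lower() for w in WATCHED):
            try:
                temps[fields[0]] = float(fields[1])
            except ValueError:
                pass  # sensor reads "na"
    return temps

def k10temp_temps():
    """Read Tctl from every socket's k10temp hwmon node."""
    temps = {}
    for name_file in glob.glob("/sys/class/hwmon/hwmon*/name"):
        with open(name_file) as f:
            if f.read().strip() != "k10temp":
                continue
        node = name_file.rsplit("/", 1)[0]
        with open(node + "/temp1_input") as f:
            temps[node] = int(f.read()) / 1000.0  # reported in millidegrees
    return temps

while True:
    print(time.strftime("%H:%M:%S"), ipmi_temps(), k10temp_temps())
    time.sleep(10)
```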

That's an excessive number of fans, IMHO. The airflow inside must be as chaotic as a hurricane. Geez… build a CFD model to sort out your airflow first… I can't believe you're a CFD guy :joy:

1 Like

:rofl:

Well, the machine has been stable at full load for 26 hours now…
I can't believe I was so distracted that I didn't try ventilating the VRMs directly again, as I had before…

So, as many of you guys implied (or said :wink:), my next course of action is:

  • Remove the top case fans
  • Install the solid top panel of the case
  • Replace the CPU coolers with Arctic Freezer 4U SP3s
  • Replace the front fans and the bottom fan with Noctua NF-A14 industrialPPC-3000 PWM units

This should give me solid front-to-back airflow, which I hope will be enough to keep a steady stream of air moving over the VRM heatsink.
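
As a sanity check on that plan (rough numbers of my own, nothing from a spec sheet), the airflow needed to carry the heat out follows from Q = ṁ·cp·ΔT:

```python
# Back-of-envelope airflow estimate: V = P / (rho * cp * dT).
# All inputs here are assumptions, not measurements from my build.
P = 800.0     # W, rough total: 2x 280 W CPUs plus VRM/RAM/drives
dT = 10.0     # C, allowed intake-to-exhaust air temperature rise
rho = 1.2     # kg/m^3, air density near sea level at ~20 C
cp = 1005.0   # J/(kg*K), specific heat of air

flow = P / (rho * cp * dT)      # m^3/s
print(f"{flow:.3f} m^3/s ~= {flow * 2118.88:.0f} CFM")
# -> ~0.066 m^3/s, about 140 CFM; the industrialPPC-3000s are rated
#    well above that on paper, but only if the flow really does go
#    straight through the case.
```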

I might do that as well…

@PhaseLockedLoop: Any specific Fujipoly product you'd recommend for this application? Is the XR-m material one can find on Amaz*n any good? Its thermal conductivity figure is excellent…

1 Like

I resurrected an X99 workstation board that would crash in the BIOS from overheated VRMs and couldn't support 4 GPUs because the PLX chips overheated.

I replaced the stock pads on the VRMs and the PLX heatsink with FujiPoly Xe pads (11.0 W/mK), and that workstation is still running rock-solid 5 years later.

FujiPoly is good stuff.
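
For anyone comparing pads, the conductivity spec converts to thermal resistance as R = t / (k·A). A quick back-of-envelope with assumed (not measured) dimensions shows why the jump to 11 W/mK matters:

```python
# Pad thermal resistance: R = thickness / (conductivity * area).
# A 1 mm pad over a 20 mm x 20 mm contact patch -- assumed dimensions.
t = 1.0e-3              # m
area = 0.020 * 0.020    # m^2

for name, k in [("generic 3 W/mK pad", 3.0), ("FujiPoly-class 11 W/mK pad", 11.0)]:
    R = t / (k * area)  # K/W
    print(f"{name}: {R:.2f} K/W -> {15 * R:.1f} C rise at 15 W through the pad")
```

Roughly 12.5C versus 3.4C across the pad for the same 15 W, which is the kind of margin that keeps a marginal VRM stage out of trouble.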

2 Likes

Their Extreme lineup is pretty nice; it's on all my graphics cards.

Get a thermal camera, like a Seek or a FLIR; they plug into your smartphone.

When I get back home, I will look up my bookmark for a macro lens for the IR camera, so you can determine which component is overheating.

Besides that, add a two-hose portable air conditioner to the room with the computer, and run both hoses out of the room.

You can also increase the air humidity to about 60% to increase the amount of thermal transfer.

After that, get little dice-sized copper (not aluminum) heatsinks and some thermal pads, and add them to everything that runs hot.

I have some thermal epoxy that I periodically use.

Love this thread, as I am trying to build a similarly specced machine, but with 7763s.

I was looking at the ASRock Rack board, since they, in contrast to the other vendors, designate it in the manual as server/workstation. It also supposedly has more fan headers, and if the pictures aren't deceiving, they seem to have put more thought into the VRM heatsinks.

But I can't find the board anywhere these days.

Adding more thermal mass to the VRMs is something I would look at, unless you want to replace the heatsinks completely. Fujipoly-connect them to a waaaaay larger heatsink?

I was looking at the ASRock Rack MB too… especially because it has 2 M.2 slots. But I couldn't find it anywhere either (as of last summer).

I've been looking at barebone servers as well… but AFAIK, none of them guarantees it can dissipate the full 560W!

Increasing the size of the heatsinks isn’t easy at all, because of the likely interference with the CPU coolers.

Then do water cooling. I think the OP just needs more air conditioning.

Noctua fans are great at being silent but a little overhyped, especially under 120mm.

You need a 3k RPM fan on that.

1 Like

This is a good point that buyers sometimes forget. No amount of beige plastic or fancy fan blades is going to make up for a 9,000 RPM difference in airflow between a Noctua and a Sanyo :laughing:

But, you wouldn’t want the latter sitting in your office!

Yeah, I would also aim for front-to-back airflow. I built a Threadripper Pro system in a Meshify 2 XL case, and at first I also had the same Noctua setup as you. The problem wasn't the CPU or VRM temps, but I installed some PCIe cards that ran very hot (SAS HBA, Mellanox card). I built a second fan wall right next to the motherboard to help with front-to-back airflow, since I have quite a few 3.5-inch HDDs at the front of the case that restrict it. I also switched to the Arctic Freezer 4U SP3.

I got some Sanyo Denki and Delta fans (and a high-current fan power board) that I run at 1-5% of rated speed. It's not silent, but it's tolerable. :slight_smile:
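
If you'd rather do that throttling in software than with a power board, here's a minimal sketch via the kernel's hwmon sysfs interface (run as root; the hwmon index and pwm channel are assumptions and differ per Super I/O driver, so poke around /sys/class/hwmon first):

```python
# Pin a fan header to a fixed low duty cycle through hwmon sysfs.
# hwmon2/pwm1 is a placeholder -- check the "name" file in each
# /sys/class/hwmon/hwmonN to find your fan controller.
HWMON = "/sys/class/hwmon/hwmon2"
CHANNEL = "pwm1"

def set_duty(percent: float) -> None:
    """Force manual PWM control and set the duty cycle (0-100%)."""
    raw = max(0, min(255, round(percent / 100 * 255)))
    with open(f"{HWMON}/{CHANNEL}_enable", "w") as f:
        f.write("1")    # 1 = manual PWM control
    with open(f"{HWMON}/{CHANNEL}", "w") as f:
        f.write(str(raw))

set_duty(5)  # ~5% duty; many fans stall below ~10-20%, so verify the
             # matching fanN_input tach still reports a nonzero RPM
```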

How about water cooling:

1 Like

Just a tip from someone who has done it: don't use a non-SP3-specific water cooler (like the MSI) on an SP3-socket chip. The cold plate is not large enough, because it was designed with small consumer chips in mind, and it will not cover the whole socket. Temps will be terrible, and you risk damaging the outermost CCDs that get no coverage.

If you look at an AIO made for TR4/SP3, the difference in cold plate size is immediately visible.

2 Likes

It's an extra-bad idea. The water blocks aren't designed to accommodate the chiplet layout on EPYC at all.

Also, it's a server; it must be validated for 24/7 ops. The odds of a pump failing are higher than those of a fan, especially with the low-quality Asetek clones. They might seem like quality parts, and they get the job done on consumer hardware, but the type of system he has here is a different breed.

Some AIOs work, but they're few and far between.

1 Like

Go with a full custom loop for the CPU, or with one of the custom-loop makers that also produce an AIO product. You stand a better chance with those than with consumer-level Asetek-based AIOs, even if the cold plate is properly sized.

There are also custom dual-socket CPU/VRM monoblocks that work very well. A teammate has one on a Supermicro H11 sporting dual 7742 CPUs. He had throttling issues under high VRM temps with standard server air coolers, because of inadequate airflow to the VRM heatsink, which is shaded by the front socket and bathed in its exhaust. Water cooling solved the problem quite nicely.

2 Likes

I strongly recommend against water cooling a server. Look, I get it, Linus did that, but loops are a pain. Are we forgetting what a server is? A reliable, practical, set-and-forget machine.

Cooling loops take a ton of maintenance. How do I know? I've built a few really good ones, and even the best setup you can put together demands a good bit of maintenance and monitoring. It's also a massive amount of money for very little benefit in a server use case. And if you're not maintaining them, you should be: they have many failure points, and they can spring leaks while you're away. If a server is actually going to store data that matters to you, then, as you said, it needs more than the average attempt at cooling, and even those fall short. You seriously DO NOT want your loop leaking all over everything while you're out. This has happened to many of my friends, and even to me. That could be disastrous for a server holding essential data.

Air may not be the best of the best, but it just works, and it works reliably. Also, a full custom loop only exacerbates his problem, which, if we all look at his specific motherboard again, is that the VRMs aren't getting cooled. Unless he gets a fan pushing air directly down onto those VRMs, water cooling isn't going to solve it, or at the very least isn't the whole solution. Air is the most reliable and consistent option.

2 Likes