I built a dual-Epyc 7773X server! It's FAST, and now in V3.0 (update 2, with pics)

A 3-4°C drop is too little to help. It just proves the point that your VRM temperature under load is really in the vicinity of derating. Once your machine is back in the server room, where the ambient temperature is >20°C higher, it's going to crash again.

Also note that VRM overheating doesn't explain random reboots at idle, so your system may have more than one source of instability.

2 Likes

More and more servers are water cooled! These days it's proven reliable enough for the mainstream.

Even with AIOs, I've had good luck with CoolIT and Enermax. Both models I own have more than two years' worth of hours on them.

Oh yeah, don't get me wrong. I know servers can be water cooled, and that's not the issue here. The issue is that we're making recommendations without even asking whether the OP has the budget or capability to deal with water cooling. Recommending an AIO is plainly irresponsible, since those have a very high failure rate versus a custom loop. Some of the other recommendations went for a custom-loop water-cooling system, and that's great. I don't have a problem with that, as long as the OP understands it's going to take a lot more maintenance than an air-cooled system and will require some monitoring. Ideally, if he's got some Linux chops, he can have the system shut down if the pump fails, which is probably a very good safeguard to build into the operating system.
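Something like this minimal sketch is the kind of safeguard I mean (the hwmon path and RPM threshold are assumptions; check `sensors` to see where your pump tach actually shows up, and run it as root so it can halt the machine):

```python
#!/usr/bin/env python3
"""Pump-failure watchdog sketch: halt the machine if the pump tach dies."""
import subprocess
import time

PUMP_TACH = "/sys/class/hwmon/hwmon2/fan1_input"  # assumed path; varies by board
MIN_RPM = 500        # below this, treat the pump as failed (assumed threshold)
GRACE_POLLS = 3      # require several consecutive bad readings before acting
POLL_SECONDS = 5

def pump_rpm() -> int:
    with open(PUMP_TACH) as f:
        return int(f.read().strip())

bad = 0
while True:
    try:
        rpm = pump_rpm()
    except OSError:
        rpm = 0  # a vanished sensor counts as a failure too
    bad = bad + 1 if rpm < MIN_RPM else 0
    if bad >= GRACE_POLLS:
        subprocess.run(["shutdown", "-h", "now", "pump failure"])
        break
    time.sleep(POLL_SECONDS)
```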

Water-cooling systems take a lot of maintenance; you've got to clean them regularly. If he's willing to do that, then that's fine, but I think that for multiple thousands of dollars of server parts, we should do a little more due diligence here than just blindly recommending one.

That, and given he dropped a lot of money on these parts, has anyone asked if he has the budget for that? All good questions to ask first.

My approach to overheating components has usually been to add more cool airflow.

I think he has enough airflow to the known hot components, and needs to look into cooling his air. 13°C air in your server room will work much better for cooling your server than 40°C air.

Most cooling components are specified in terms of delta-T: how much cooling a solution can deliver for a given air temperature, measured as degrees C over ambient. The OP has fine components; he just needs to lower the ambient air temperature so that the maximum component temperature stays below the component failure temperature.
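To make that concrete, a quick back-of-envelope (the 45°C delta-T is an assumed figure for illustration, not a measurement from the OP's board):

```python
# Component temperature is roughly ambient + delta-T at a given load.
DELTA_T = 45  # °C over ambient under full load, assumed for illustration

for ambient in (13, 26, 40):
    print(f"ambient {ambient:2d} °C -> component ~{ambient + DELTA_T} °C")

# ambient 13 °C -> component ~58 °C
# ambient 26 °C -> component ~71 °C
# ambient 40 °C -> component ~85 °C
```

Same hardware, same delta-T: only the ambient temperature decides whether you sit comfortably below the component limit or right at it.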

2 Likes

Oh yeah, I never bothered to ask what the ambient temperature was in his server room. It's tough; these CPUs really do boost to their limit. It's pretty crazy.

Temp in the server room was 40°C; temp in the case was 50°C.

1 Like

That's a healthy delta. I wish server motherboards did a better job with VRM and PLX heat dissipation. I understand they are validated to operate at those temps, but it's like, come on, manufacturers… you can do better.

1 Like

You missed the part in the post where I mentioned the cooling solution was a monoblock. Monoblock = cools both sockets + VRMs.

Though the motherboard is a server motherboard, the system is not deployed in a traditional datacenter role.

It is a dedicated CPU/GPU distributed-processing platform running BOINC projects, where it maintains first place worldwide in many projects.

4 Likes

That's pretty epic… Look, all I'm saying is you probably knew what you were getting into; I'm assuming you already had experience with full-custom-loop water cooling. If it's a first server and a first time water cooling, that might not be the best combination, in my opinion. Ultimately, I leave it up to the end user.

I was just offering one possible solution for the OP's platform, which uses high-wattage CPUs and is running into stability issues almost certainly caused by VRM cooling and throttling.

I didn't have any prior experience with custom water cooling, but I transitioned very easily and quickly from AIO solutions for my BOINC server workstations. Two Epyc and three 32-thread Ryzen workstations, all custom looped.

Crazy. What do you have strapped on for storage and other tasks, if you don't mind me asking?

Storage is very minimal, as not much is needed. Just M.2 drives for the boot OS, and leftover SATA SSDs or rust spinners for bulk storage and backups…

I use mostly 256GB or 512GB M.2 drives. Most BOINC projects don't really need that much storage, except for one of my biomedical projects that runs Python machine-learning applications. That project chews up 25GB or so.

2 Likes

How about installing an air conditioner or a heat pump in the server room?

40°C in your server room? Dear god.

@KeithMyers I'm very interested in the monoblock solution. How was that done, considering the offset CPUs and the RAM around them and all? Do you have a link or further info? Can you ask your teammate?

1 Like

Hi guys,

I’ll try replying to all the topics raised over the last 2 days in just one post…

First, a quick report on the current situation of the server: with the fans blowing on the VRMs, and the machine back in my office (where it has raised the temperature to 26°C!), it has run for the last 72 hours without a glitch. Occasional throttling of the CPUs (for 6-7 seconds at a time, a few hours apart…) shows that we're still close to the limit. But all in all, I think VRM cooling is definitely the main culprit. If there is a second factor, it hasn't shown up again so far.
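For anyone who wants to catch those short throttle episodes, polling the standard Linux cpufreq sysfs files works; a minimal sketch (the 2.0 GHz cutoff is just an assumed threshold, adjust it to your base clock):

```python
#!/usr/bin/env python3
"""Log moments when the slowest core drops below a frequency cutoff."""
import glob
import time

THRESHOLD_KHZ = 2_000_000  # assumed cutoff: flag anything under 2.0 GHz

while True:
    freqs = []
    for path in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq"):
        with open(path) as f:
            freqs.append(int(f.read()))
    if not freqs:
        raise SystemExit("no cpufreq sysfs files found")
    slowest = min(freqs)
    if slowest < THRESHOLD_KHZ:
        print(f"{time.strftime('%F %T')} possible throttle: {slowest // 1000} MHz")
    time.sleep(2)
```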

As a first counter-measure, I have ordered the parts to convert the build to a strong front-to-back-only airflow within the current case (plus high-quality thermal pads for the VRM heatsink). This should be done next week. Of course, I'll let you know how it goes, and whether the results are good or not.

If I end up water-cooling this machine (which I consider a last resort), it will be using this monoblock. The rest of the water-cooling system is more of a mystery to me. I have assembled many machines over the years, but never with water cooling… I guess I'll come back for advice when (if) the time comes…

The thing is, as some of you noted, that while water cooling can help make the machine work in a hot environment, it will not change the problem of the ambient temperature in the server room. This small room was originally meant to host just the network hardware. I added a rack-mounted NAS, and there was still no problem. But with both HPC servers, it just doesn't work anymore. And I'm pretty sure the temperature isn't ideal for everything else in the room, either…

So I need to work on ventilating the room (I'd like to avoid the CO2/energy impact of an actual A/C system…). Installing the water-cooling radiator outside the room, where it makes sense for saving energy, is also a possibility I'm looking into…
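For sizing the ventilation, the usual back-of-envelope is airflow = heat load / (air heat capacity × allowed temperature rise). The wattage below is an assumed round number for illustration, not a measurement of my machines:

```python
# Required airflow to hold the room a given rise above the intake air.
# Q = m_dot * cp * dT, with air density ~1.2 kg/m^3 and cp ~1005 J/(kg·K).
HEAT_LOAD_W = 1500   # assumed combined load: two servers + NAS + network gear
ALLOWED_RISE_C = 5   # how much warmer than the intake air the room may get
RHO, CP = 1.2, 1005

m_dot = HEAT_LOAD_W / (CP * ALLOWED_RISE_C)   # kg of air per second
m3_per_hour = m_dot / RHO * 3600
print(f"~{m3_per_hour:.0f} m³/h (~{m3_per_hour * 0.589:.0f} CFM)")
# ~895 m³/h (~527 CFM)
```

So a duct fan in the few-hundred-CFM class could plausibly do it, as long as the intake air is cool to begin with.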

Thank you again for your help and suggestions, and I wish you a great week!

5 Likes

Glad you have isolated the cause! There are always Delta E-series fans if you don't care about anyone's hearing :slight_smile:

A request: could you see if you can find a part number/code on the VRM chips if you remove the heatsink/pads? It would be interesting to see what they are, and whether other EPYC boards have different ones. I'm kinda surprised how cool my ASRockRack board's VRMs are at full load - I need to find out what model they are too :slight_smile:

I have to wear a winter coat in our server room.

Anandtech has a full review of this board.

The CPU VRMs are six TDA21472 power stages driven by an IR35201 PWM controller (one set per socket).
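Rough math on the per-stage load (the 70A figure is the TDA21472's rated output as I recall from Infineon's datasheet, and the core voltage is an assumed average):

```python
# Back-of-envelope VRM load per power stage, per socket.
TDP_W = 280          # Epyc 7773X TDP
VCORE = 1.1          # assumed average core voltage under load
STAGES = 6           # TDA21472 power stages per socket
STAGE_RATING_A = 70  # per-stage rating, as I recall from the datasheet

current = TDP_W / VCORE
per_stage = current / STAGES
print(f"~{current:.0f} A total, ~{per_stage:.0f} A per stage "
      f"({per_stage / STAGE_RATING_A:.0%} of rating)")
# ~255 A total, ~42 A per stage (61% of rating)
```

Even at ~60% of rating, each stage dissipates a few watts, which is why the heatsink and the airflow over it matter so much.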

1 Like

Sure. The motherboard is an H11DSi, and this is the link to the company that makes the monoblock for it.
H11DSi/H11DSi-NT Monoblock Rev. 1
Image of installation
The radiator is a Watercool MO-RA3 420 mounted in an outside window.

1 Like