Proxmox - Dual EPYC 7B13 CPUs shutting down under certain loads

As the title says: as soon as I put the system under load, specifically when I boot into HiveOS to mine Monero, the system shuts down after about 10-15 minutes.
I was having problems with the VRMs overheating before, which is why I installed the 3D-printed duct that pulls fresh air from the top down onto the VRMs.
After this I let Prime95 run for about 2 hours and it seemed stable, with the maximum VRM temperature reaching 94 °C.
When the system shuts down in HiveOS the VRMs are around 82 °C, so I don't think they are the cause anymore.
I don't think the CPUs are the problem either, since they reach 65 °C at most thanks to the two powerful coolers.

Another thing that doesn't make sense to me right now (I am new to Proxmox and probably have not configured everything I need to yet) is the low CPU and RAM usage I am seeing in the Proxmox dashboard.
I have set up two VMs with 64 cores/128 GB RAM and 48 cores/96 GB RAM, and would have expected around 90% CPU and RAM utilization while mining, given that the server has 128 cores/256 GB RAM in total. I have read online that you can manually assign cores to a VM, but I have not managed to do that yet since I hoped to solve the shutdown problem first. The VMs are configured with "max" as the CPU type. I don't really know what that means or whether it creates problems, but I followed a tutorial to get HiveOS up and running in Proxmox and the author chose that type.
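From what I have read so far, these settings can also be inspected and changed from the Proxmox shell with `qm`; here is a sketch of what I found, with VM ID 100 as a placeholder (I have not tried this myself yet):

```shell
# Show the CPU-related settings of a VM (VM ID 100 is a placeholder;
# real IDs are listed by "qm list")
qm config 100 | grep -E '^(cores|sockets|cpu|vcpus)'

# Example: 2 sockets x 64 cores = 128 vCPUs, and "host" passthrough
# instead of "max" so the guest sees the real CPU feature flags
qm set 100 --sockets 2 --cores 64 --cpu host
```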

The complete specs of the system are:
Motherboard: Supermicro H12DSI-N6
CPUs: 2x EPYC Milan 7B13 64-core with Arctic SP3-4U coolers
RAM: 16x 16 GB Samsung ECC 2933
SSDs on a 4x4x4x4 RAID card:
  Boot drive: 2x 128 GB NVMe Gen3 in RAID 1
  Storage drive: 1 TB NVMe Gen4
Intake fans: 3x 140 mm on the front, 3x 120 mm on the back and 1x 140 mm on the top under the duct
Exhaust fans: 1x 140 mm on the back, 2x 140 mm on the top
PSU: be quiet! 1200 W Platinum

I have tried manually limiting the cTDP and package TDP (if I remember the name correctly) under the "North Bridge" settings in the BIOS to 250 W, and the system was able to mine the whole night without shutting down. But since the hashrate I was getting was literally a third of what I got when HiveOS was the only OS installed on the boot drive itself, I tried increasing it to 280 W (the TDP of the CPUs) and it started shutting down again. I am currently testing whether 260 W is stable, but that needs more time.
I was mining on an open bench before and wasn’t using any manual cTDP limit.

I am still very new to server-grade hardware and home servers in general, so thank you for any kind of help.
I would really like to ensure system stability and increase the very low output I am getting since switching over to Proxmox.
I was getting 88-90 KH/s before and am seeing 16 KH/s at the moment on moneroocean.stream.
I don't know if this is accurate though, since HiveOS itself is reporting 41 KH/s and 33 KH/s on the 64C and 48C workers respectively.

The IPMI screenshot below was taken after about 25 minutes of mining with a cTDP limit of 260 W set in the BIOS.

Thank you very much in advance!


Just wanted to add a picture of the system for visual clarity.

Can you see how many watts it's pulling at the wall socket, to be sure your circuits can handle this device and that your CPUs aren't overloading the 1200 W PSU? (I would imagine it's adequate; maybe it's working OK but hits a thermal limit. How is the ventilation for that PSU?)

K3n.

Thank you very much for your reply.

It was pulling 780 W at the wall a couple of months ago when only HiveOS was installed, with the same hardware. My smart plug is not configured at the moment due to network changes, so I'm not 100% sure right now, but the server is in a room with adequate power delivery, so I don't think that's the problem.

Do you think I should set the plug up again and monitor the power draw at the wall?

The PSU should be adequately ventilated, since it has an intake fan on the bottom; it was used in my personal PC until 3 months ago and never gave me any problems.

I have increased the assigned cores of the VMs to 128 and 96 (since I discovered that "Cores" in Proxmox are virtual cores, i.e. threads), and this resolved the CPU utilization problem. I am testing cTDP and package limits now, but it still keeps shutting down at 260 W.
I will keep decreasing it until I get it stable, but I do think I'll be losing quite a lot of performance by doing so…
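For anyone else confused by the "Cores" field, the math as I now understand it, assuming SMT is on (2 threads per core), works out like this:

```shell
# vCPU budget for a dual 64-core EPYC box with SMT enabled
SOCKETS=2; CORES_PER_SOCKET=64; THREADS_PER_CORE=2
HOST_THREADS=$((SOCKETS * CORES_PER_SOCKET * THREADS_PER_CORE))

# Proxmox "Cores" are vCPUs, i.e. host threads, not physical cores
ASSIGNED=$((128 + 96))   # the two VMs after the fix

echo "host threads:   $HOST_THREADS"                  # 256
echo "assigned vCPUs: $ASSIGNED"                      # 224
echo "left for host:  $((HOST_THREADS - ASSIGNED))"   # 32
```

So with 128 + 96 vCPUs assigned, 32 threads are still left over for the Proxmox host itself.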

I really hope to get it managed eventually.

On my R9 7900X I had limited the power limit in the BIOS to reduce heat generation in the room, and it was stable. But when I started Folding@home, as is the winter tradition, it would hard-crash without logs. Lifting the power limits resolved the issue.

It may be that the workload is just hitting a specific part of the CPUs harder than expected and you are suffering something similar. If you lift the power limits, does the issue persist?

First,
you did a beautiful flow vane/3D print for the VRMs.
Quite impressed.
Really, really impressed.

While I do not have that specific board in my personal PC, I do have extensive experience running those boards as workstations rather than in a rack.

I will suggest a few things; your first response may be that you have already done them, but please look at the details and ascertain which parts apply.

BIOS, heatsinks, substrate, power.

In the BIOS: don't limit performance. Open it up and don't focus on limiting any power, cycles, cache, XGMI or any of those things.

Heatsinks:
The CPU needs a heatsink that covers the whole metal lid of the CPU, edge to edge minimum. The heatsinks in your build are "not the best"; I've tried them and they were at best borderline at absorbing the hotspots on the CPU.

VRMs:
The VRM MOSFETs etc. need a substrate TIM such as Gelid Ultimate and full contact with a much larger heatsink than the factory one that came with the motherboard.

That MB expects a rackmount case's worth of continuous airflow over the entire board, and it has other critical hotspots beyond just the CPUs and VRMs, such as the gigabit LAN chip, which gets 70 °C toasty and will cause a shutdown/lockup.

Consider larger VRM heatsinks, and heatsink any of the smaller heat sources on the board.

Power: beyond the 24-pin, power supplies are sometimes flaky on the dual 8-pin connectors and inconsistent in delivery. Consider using single-rail power to mitigate any potential drops.

I have (for my wife) a loaded dual 7773X on a Gigabyte MZ72-HB0, and mine is a dual 9684X on a Gigabyte MZ73-LM0.

Please review the thread:
the ZEN of Air-cooling, and see if some of the methods listed may help you.

That is where I had to cool and run, whisper quiet, dual 9684Xs, dual 4090s, 24 DIMMs of DDR5 and very hot server NVMe drives.

Please review the additional heatsinks used in a similar setup; look at all the chip/heat-source mitigation needed on this board in order to run it as a home workstation, not in a rack.

Please look at the VRM and MOSFET mitigation:

Also, please consider the LAN chip heatsink mitigation in the thread linked above.

I’m interested in helping you, please let me work with you to solve your issue.

First: Thank you so much for your reply. I really appreciate it.
Second: That build of yours is incredibly well done and I really like it.

For the moment I ended up limiting "cTDP" and "Package Control Limit" both to 230 W, which I really (as you said) don't think is ideal. But it has been stable since Friday this way, and the reported hashrate is equal to when I was running the machine at full tilt in an open case without any power limits, with all 256 threads mining: around 90 KH/s.

In the end the low CPU utilization was due to me setting up the VMs with 64 and 48 cores in Proxmox, when I should have selected 128 and 96, since Proxmox counts "Cores" as virtual cores, i.e. working threads.

Were those VRM heatsinks the stock ones of your motherboard, or did you buy/build them afterwards? I have done some research but was not able to find bigger ones specifically compatible with my motherboard.

Regarding the CPU coolers, I honestly think they should be more than up to par with my CPUs. I had previously installed two "be quiet! Dark Rock Pro TR4" and ended up returning them because they were only rated up to 250 W and were indeed not efficient enough at dissipating the heat. I picked these coolers because of Wendell's very positive review of them, and the temperatures I was seeing in Prime95 never exceeded 65 °C on any CPU-related sensor in HWiNFO.

I honestly don't know what you mean by single-rail power, but I will be doing some research on the subject to learn more.

Regarding those other components on the motherboard I need to cool: do you know any vendor/website that sells "motherboard heatsink kits" or single-component heatsinks? I would be more than ready to pull the trigger and buy VRM/IPMI and "whatever other component needs one" heatsinks, but as I said, I don't know where to get them.

For the moment I am happy with the stability I have achieved, but obviously I would prefer not to have to limit the power of my machine to get there.

So, thank you again for your help.
I am still new to server hardware in general and you seem pretty knowledgeable, so I am grateful for your tips!

@NikVince, glad you reverse-engineered and solved your issue.

I will try to answer some of your questions:

Single-rail is when the PSU can deliver its whole power curve to all outputs without any per-connector limitation. Multi-rail is when each plug has a limited draw/trip per section.

Multi-rail can prevent issues such as an overdraw, but it also limits how much power can be delivered to any one connection.

The VRM heatsinks for my board were found on Alibaba; they are much bigger and more effective than the stock ones and fit perfectly.

I also upgraded the TIM putty to Gelid Ultimate (15 W/mK); this keeps everything from getting any higher than body temperature.

As for the various hotspots on the board, I used generic Amazon-bought heatsinks (similar to the ones that come in Raspberry Pi kits).

My CPUs, in my case the 9684X and for the other board the 7773X, can get warm when running all-core at max, so I had to take extra steps.

Out of everything I had to do, the one thing I could not find out there was a better heatsink for the LAN chip, which gets toasty; some internet research suggests this is a widespread problem. So I made my own:

More fins, chemically etched for more porosity and surface area, all copper.

Original:

Process:





Final product:

It went from 65-70 °C to body temperature, and up to 50 °C under load: a difference of 20-30 °C compared to the flimsy, meaningless aluminum wafer that came with the board from the OEM.

@JayVenturi, I had another question, which I think anyone reading this thread would have. I assume what @JayVenturi means by TIM putty is the stuff you apply between your CPU and the CPU heatsink. Am I right?

Well… specifically: the TIM putty is clay-like, and that is what one would use on the VRMs and MOSFETs, and sometimes on the memory chips of GPUs.

TIM paste/grease (in most cases; there are some exceptions) is the stuff you apply between a CPU and a heatsink. It has greater motility than the putty and spreads more evenly. It is way too "runny" to be used on VRMs and MOSFETs.

Example putty:

Example paste/grease:

You can get the pad/putty in various thicknesses such as 3 mm or 0.5 mm. Here is what the pad/putty looks like when applied correctly (cut to size):

I was chasing something like this on both Milan and Genoa class CPUs; it turns out I think it's a kernel bug? It happens with Proxmox VM > Ubuntu > Docker nested virtualization under heavy load. The system hard-locks or just resets.

Rolling back to kernel 6.8 seems to resolve it.
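If anyone wants to try the same thing, here is a rough sketch of how one might check and pin the kernel on a Proxmox host (the version string below is just an example; use one that actually shows up in your list):

```shell
# Show the running kernel and the kernels Proxmox has installed
uname -r
proxmox-boot-tool kernel list

# Pin a specific installed 6.8 kernel so it stays the default across
# reboots and updates (replace the version with one from the list above)
proxmox-boot-tool kernel pin 6.8.12-3-pve

# Undo the pin later if desired
proxmox-boot-tool kernel unpin
```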

Good job on the custom heatsink!! Impressive.
I will be looking more on AliExpress and Alibaba then, to see if I can find something compatible. Thanks again!

Stupid question first:
Are you actually the real Wendell? I am new to this forum and honestly wasn't expecting a reply from him, but it would be really cool xD

Real question:
Could you maybe give me some instructions/guides to follow for this?

I am quite afraid of copy-pasting commands from random articles on the internet or generated by AI, because I'm still pretty new to Linux as a whole and I don't want to brick the installation I just got working after hours of trial and error…
A link to a guide or any kind of trusted source would be a godsend at this point.

I am unsure whether I need to somehow revert the kernel version or install it from scratch, given that I reinstalled Proxmox from zero a couple of weeks ago and thus reckon there may be no previous kernel versions installed besides the one that is currently active(?).
I am on version “proxmox-kernel-6.8.12-3-pve”.

In any case, thank you very much for the reply.
I will go further down this rabbit hole, because I think this could really solve the issue!