Dual EPYC 7B13 overheating issue

Let me start with the background:

I ordered from the seller Tugm4770 on eBay, since I had read great things about them.

My order:
Motherboard - SuperMicro H12DSI-N6
CPUs - 2x EPYC Milan 7B13
RAM - 256 GB DDR4 ECC Samsung

The CPUs came preinstalled in the motherboard as shipped, so I immediately installed two BeQuiet! Dark Rock TR4s and started the machine.
I installed HiveOS to mine Monero and the machine immediately sat at 80 degrees Celsius at idle; under load it hit 105 and shut down after 3-4 minutes.
I took off the coolers and remounted them without cleaning off the MX-6 thermal paste I had already applied, and added another bit on top (yeah I know, stupid me). It then booted straight to 100 degrees at idle.
Rinse and repeat: I took the coolers off again, cleaned off the thermal paste, put everything back together, and we were back at 80 degrees at idle.
I subsequently ordered two Arctic Freezer SP3 4U-M coolers after deciding that the BeQuiet! coolers weren’t strong enough, and after installing the new coolers I immediately noticed that a lot more warm air was coming out of the back of them.
The motherboard is sitting on an open test bench at the moment.
The idle temperature displayed in HiveOS still stayed at 80 degrees and I don’t understand what’s happening.
I contacted the seller and he suggested getting replacements for the CPUs, but given that both are sitting at pretty much the exact same temperatures, I think something else is off; it isn’t very likely that both CPUs are faulty in the exact same way.
I asked him for a screwdriver and will reseat both CPUs when it arrives (odd that I didn’t receive one in the first place, but ok).

Is there anything I could try in the meantime?

Big Side Note:

I cannot see the temperatures in the BMC/IPMI dashboard because the sensors say N/A, so even though I have temperatures displayed in HiveOS, I am not sure that they are right.
I even tried installing Windows 10, and the CPU Package temperature is the same 80 degrees even though the CCDs are sitting at much lower temperatures, around 40 at idle.

Conclusion:

I have built various consumer PCs in the last years, but I have never worked with enterprise gear before, so if I’m missing something really obvious or you have any kind of suggestion I would really appreciate your input.


Are you looking at tCTL for temp? You won’t get an accurate gauge of actual processor temperature if you use it.

I am sorry but I don’t really understand your question.
I was reading the temperatures from the web interface of HiveOS and from CPUID HWMonitor in Windows 10.

tCTL is the name of a specific temperature sensor output of the CPU that is known to give “wrong” results. If software reads from this sensor’s raw value it’ll look like the processor is overheating, when in reality it isn’t.

I say “wrong” but if I recall correctly it’s actually the right temperature, just with an offset value applied to it, and different processor SKUs will have different offsets from each other.
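For anyone on Linux (HiveOS included) who wants to see the individual sensors rather than a single "CPU temp" number, here is a minimal Python sketch. It assumes the kernel's k10temp driver is loaded and exposes its readings under /sys/class/hwmon with label files such as Tctl and Tccd1; whether every label shows up for a 7B13 is an assumption I haven't verified. The function name read_k10temp is made up for this example.

```python
from pathlib import Path


def read_k10temp(hwmon_root="/sys/class/hwmon"):
    """Return {chip_dir: {label: degrees_C}} for every k10temp chip found.

    Assumption: the k10temp driver is loaded and provides tempN_label /
    tempN_input pairs. Chips without label files yield an empty dict.
    """
    readings = {}
    for chip in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip / "name"
        if not name_file.is_file():
            continue
        if name_file.read_text().strip() != "k10temp":
            continue  # skip NVMe, NIC and other unrelated sensors
        temps = {}
        for label_file in sorted(chip.glob("temp*_label")):
            input_file = chip / label_file.name.replace("_label", "_input")
            if not input_file.is_file():
                continue
            # hwmon reports temperatures in millidegrees Celsius
            millideg = int(input_file.read_text().strip())
            temps[label_file.read_text().strip()] = millideg / 1000.0
        readings[chip.name] = temps
    return readings
```

Calling read_k10temp() and printing the result should list Tctl next to the per-CCD values for each socket, which makes it easy to tell whether only the Tctl number looks alarming.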

Just in case this reply will be useful to someone…

I also have a Supermicro H12DSI paired with a couple of EPYC 7B13 CPUs, and if I went by HWiNFO or similar CPU temp reporting utilities, my CPUs would be fried by now.

CPU package temperature seems to constantly be above 100c no matter what one does but imho that is being incorrectly reported.

If I check individual CCX temperatures they resemble temperatures I’d expect of this build - mainly around 60c when doing full on rendering.

Also the IPMI sensor logging corresponds to those CCX temperatures as well.

I have no idea why the OP is getting an N/A report, but overall it seems to me like the 7B13s are technically supported by the motherboard while there are odd bits and pieces that might not be 100% working well together. Just an impression I got; apart from needing additional cooling for the VRMs, I find the build really cool.

Oh and if it is useful to anyone, with a full rendering load the CPUs tend to max out at 60c (or slightly below) with most fans in my Fractal Design Torrent case set to about 80%. The CPUs are being cooled with Noctua U14s coolers. The VRMs can get out of control hot if the office space heats up above 28c in which case they tend to hover around 95c (100c is the upper limit).

Oh, is that the 20 C offset that AMD uses to protect its CPUs?

Basically yes; I think the offset value varies a little bit from SKU to SKU though.
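Since nobody here has quoted an exact offset for the 7B13, a rough way to see how far the headline number sits from the dies is to subtract the hottest CCD reading from Tctl. A small Python sketch follows; the function name tctl_gap is made up, and treating the "CPU Package" value from the original post as Tctl is an assumption. The result is only an observed gap, not AMD's official offset.

```python
def tctl_gap(temps):
    """Return Tctl minus the hottest Tccd reading, or None if data is missing.

    temps maps sensor labels (e.g. "Tctl", "Tccd1") to degrees Celsius.
    """
    ccds = [value for label, value in temps.items() if label.startswith("Tccd")]
    if "Tctl" not in temps or not ccds:
        return None
    return temps["Tctl"] - max(ccds)


# Idle figures from the original post: package ~80 C, CCDs ~40 C.
# Mapping "CPU Package" to Tctl is an assumption made for illustration.
print(tctl_gap({"Tctl": 80.0, "Tccd1": 40.0, "Tccd2": 40.0}))  # prints 40.0
```

A gap that stays roughly constant between idle and load would point to a reporting offset rather than a cooling problem; a gap that grows under load would be worth a closer look.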
