Proxmox system under load randomly reboots

I have a system with the following configuration running Proxmox 6.2:

Threadripper 1950x
ASRock X399 Taichi
128 GB of HyperX Predator RAM
3 NVMe drives (256-512 GB)
5 Samsung SSDs (one 512 GB and four 1 TB)
1 10 TB mechanical drive
2 2080 Ti FE GPUs (compute only)
1 GT 710B (main display)
1 10 GbE Intel 82599ES network controller

The system is water cooled and keeps good temps. The 2080 Tis are both passed through to VMs managed by Proxmox and run compute tasks 24/7. A number of other VMs are running and doing various things, but the CPU load and RAM consumption are actually pretty low most of the time, while the GPU load is high almost all the time.

This thing can pull a lot of power and consistently hovers between 600 and 700 W.

I’ve been having a problem the last few months where the computer will just all of a sudden power cycle. It just sort of “blinks” off - running fine one second and then in a reboot cycle the next. The whole system comes right back online, spins up the VMs, and starts crunching again like nothing happened. There are no messages in the logs indicating that anything software related caused the reboot - it just power cycles. This happens about 1-3 times in a 24-36 hour period.

I have tried disabling compute tasks on one GPU at a time, which of course decreases power draw, and the system then runs stable for days. I tried this with each GPU in turn to see whether one of them was the problem, but both ran fine on their own.

I’ve tried a variety of things to narrow down the problem and now I’m pretty sure the issue is the PSU.

The PSU that was initially in this system was a Corsair HX1200 Platinum. I thought 1200 W would be plenty, and given that the average draw is pretty near the 50% range, I expected to be getting near peak efficiency from the PSU. I’ve tried replacing all the cables and running it in a variety of configurations, but the problem persisted.

A few days ago, I replaced the HX1200 with a Corsair AX1600i Titanium and the system has been running like a champ with no issues ever since. I’ve been up for over 48 hours now with everything running and things are happily chugging along.
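For anyone who wants the headroom numbers, here is a rough sketch of the arithmetic in Python. It assumes the 600-700 W figure can be compared directly against each PSU's rated output (if that figure is draw at the wall, the real load on the PSU is somewhat lower), and it only covers average draw, not short spikes. The function name `load_fraction` is just something made up for this example.

```python
def load_fraction(draw_watts: float, rated_watts: float) -> float:
    """Return the share of a PSU's rated output used by a given draw."""
    if rated_watts <= 0:
        raise ValueError("rated_watts must be positive")
    return draw_watts / rated_watts


# Rated outputs of the two PSUs mentioned in this post.
psus = {"HX1200": 1200, "AX1600i": 1600}

# Observed average draw range from this post (assumed comparable to rated output).
draw_low, draw_high = 600, 700

for name, rated in psus.items():
    low = load_fraction(draw_low, rated)
    high = load_fraction(draw_high, rated)
    print(f"{name}: {low:.1%} to {high:.1%} of rated output")
```

On paper, neither unit is anywhere near its rating at this average load, which is part of why the reboots are puzzling.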

The HX1200 is still under warranty, so I’m planning to reach out to Corsair, but I’m curious whether anybody with more PSU experience has thoughts on what might be going on. The HX1200 passes tests with a Dr. Power II and the fan seems to be functioning, but I’m guessing the reboots were being triggered by one of the PSU’s protections: OVP (over-voltage), OCP (over-current), OTP (over-temperature), or SCP (short-circuit). With the exception of OTP, I would expect the same issues with the new PSU if one of the components in the system were actually at fault, so my current guess is that the HX1200 is hitting a thermal set point and power cycling (OTP), but I’m not even sure that is how it would manifest. Would OTP trigger a power cycle? Any idea which protection would be the culprit, or how I might test it?

I would love any ideas people might have, including whether there is something wrong with the way I’m using this HX1200 to begin with.
