How do I know if my computer suddenly shutting down is the result of a PSU being overloaded?
My computer has two 1600 watt power supplies (non-redundant). Three GPUs are on one power supply; the motherboard and three more GPUs are on the other.
When I watch amd-smi during a training run, I see these 330 watt GPUs pulling as much as 400 watts. So let’s say each GPU can peak at 400 watts. That’s 1200 watts per PSU if they all peak at once.
Hmm… now that I’ve written that out, that’s a pretty lopsided situation. My CPU is 155 W TDP. I’m not sure how to estimate how many watts are consumed by the motherboard and everything else besides the GPUs. Could that be more than 400 watts?
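Putting rough numbers on that worry (the platform figure below is a pure guess on my part, not a measurement):

```python
# Back-of-the-envelope load on the PSU that also carries the motherboard.
GPU_PEAK_W = 400        # worst per-GPU peak seen in amd-smi
GPUS_ON_THIS_PSU = 3
CPU_TDP_W = 155         # TDP; boost can exceed this
PLATFORM_W = 150        # guess: board, RAM, drives, fans, conversion losses
PSU_RATED_W = 1600

total_w = GPU_PEAK_W * GPUS_ON_THIS_PSU + CPU_TDP_W + PLATFORM_W
print(f"estimated peak load: {total_w} W, headroom: {PSU_RATED_W - total_w} W")
```

Even with that guess, the headroom on the motherboard-side PSU is uncomfortably thin.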
If you have one available, I’d use a Kill A Watt style meter, which gives you actual readings of how much power a device is drawing.
A secondary question: are both power supplies plugged into the same electrical circuit/breaker? A single 1600 watt PSU is about all a 15 amp (120 V) breaker is good for.
Good questions. Indeed, I have a Kill A Watt. However, because of the 15 amp breaker problem… I had a 240 V, 30 amp circuit put in. So my outlets and plugs are all weird now (C19/C20).
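Doing the math on the two circuits (assuming the usual 80% continuous-load derating; treat the numbers as illustrative):

```python
# Usable continuous wattage of a branch circuit, using the common
# 80% derating rule for continuous loads.
def continuous_capacity_w(volts: float, amps: float, derate: float = 0.8) -> float:
    return volts * amps * derate

old_circuit_w = continuous_capacity_w(120, 15)   # the original 15 A circuit
new_circuit_w = continuous_capacity_w(240, 30)   # the new 240 V, 30 A circuit
# Even one fully loaded 1600 W PSU exceeds the old circuit's budget.
print(old_circuit_w, new_circuit_w)
```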
I run Linux. I don’t see anything in the logs, but I’m not sure what I should be looking for. It just goes instantly off. I can’t imagine it has a chance to log the event.
I sacrificed a power cable so I could put a clamp meter on it. However, when I rebooted, two of the GPUs were undetected. (Sigh.) So now I’m distracted by that side-quest.
Good to know, I remember the big transient spike issues that started with the RTX3000s. I just want to note that I was referring to sustained loads when talking about OCP.
How do the three 7900 XTX scale?
I have two 7900 XTX on an X670e with 2x PCIe4 x8 and see a maximum of 60% compute load if the model fits the VRAM of both cards.
I’m now planning to use my Epyc system to run the cards with PCIe4 x16 and hope to achieve 75% load.
What are your results with “./rocm_bandwidth_test -A”?
Edit:
Please post your results if you solve the problem with your setup.
Unfortunately the claim made there about excursion handling is high by an order of magnitude for supplies over 450 W and wrong for lower rated supplies. See Table 3-3 in the ATX 3 design guide for correct values.
As a diagnostic idea, consider capping the power to each GPU: start at, say, 300 W each and re-test. If it passes, increase the wattage cap until the symptom recurs.
So if I have a 600 W GPU, it is allowed to consume 3 × 600 W = 1800 W?
And is it only that the ATX specification allows that, not that the PSU is necessarily able to supply it?
So in theory I could have an 800 W PSU that can endure a 1600 W peak,
but since its 600 W 12V-2×6 cable allows for 600 W × 3 = 1800 W,
this could trip my PSU?
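To spell out the arithmetic I’m asking about (the 3× and 2× multipliers are how I read the excursion discussion; the real spec values depend on duration, per Table 3-3, so treat this as illustrative):

```python
# Hypothetical scenario: a 600 W GPU on an 800 W ATX 3.x PSU.
gpu_rated_w = 600
psu_rated_w = 800

gpu_excursion_w = gpu_rated_w * 3   # excursion the card side may demand
psu_excursion_w = psu_rated_w * 2   # peak the PSU is expected to ride out

# If the card's allowed excursion exceeds what the PSU must tolerate,
# a spec-legal spike could still trip the PSU's protection.
print(gpu_excursion_w, psu_excursion_w, gpu_excursion_w > psu_excursion_w)
```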
If your PSU has the option to configure the rails, try tweaking that. You may have overloaded a single rail in a multi-rail configuration; I know my Corsair unit can be configured as either single or multi rail.
In the multi-rail config, mine would occasionally hit OCP and power off under Linux because the Vega driver allowed large transients; when switched to single rail, it didn’t.
Something similar could be happening to you. The symptom is that the box just powers off when you trip overcurrent protection on one of the rails in a multi-rail configuration.
I’ve read somewhere that individual rails in a multi-rail config may be limited to around 480 watts per rail. If you know you’re drawing 400 watts per GPU, it’s quite possible (and AMD are shit for this) that they are spiking significantly beyond that for very short periods and thus tripping your PSU’s OCP, the result of which is a power-off.
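A quick sketch of why per-rail OCP can bite even when the total draw is fine (the 480 W rail limit is the figure I read, not a universal value, and the spike factor is an assumption; check your PSU’s manual):

```python
# Per-rail OCP: each rail has its own limit, regardless of total PSU capacity.
rail_ocp_limit_w = 480      # quoted per-rail limit (assumption; varies by PSU)
gpu_sustained_w = 400       # what amd-smi reports as sustained draw
transient_factor = 1.3      # assumed short spike above sustained draw

transient_w = gpu_sustained_w * transient_factor
trips_ocp = transient_w > rail_ocp_limit_w
print(transient_w, trips_ocp)   # a brief ~520 W spike exceeds a 480 W rail
```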
If you set it to single rail, you can draw up to the PSU’s entire capacity from a single device/connector(?), if I’m not mistaken.
The side-quest: two GPUs disappeared from the rocm-smi and amd-smi listings. I haven’t found the cause yet, but after a full power cycle and reseating the riser cables, they’re back.
Where I’m at right now… I have five of the six GPUs plugged in.
The PSU with the motherboard now has just two GPUs. Stability seems to have improved, but that’s kind of subjective.
unrelated:
OH! As a result of watching that AMD BIOS video, I set the workload profile to the one that looked like it was for GPU compute. That was the ticket to enabling PCIe P2P.