Is 3200 watts enough?

How do I know if my computer suddenly shutting down is the result of a PSU being overloaded?

My computer has two 1600 watt power supplies (non-redundant). Three GPUs are on one power supply. The motherboard, and three more GPUs are on the other power supply.

When I watch amd-smi during a training run, I see it showing these 330 watt GPUs pulling as much as 400 watts. So let’s say that each GPU can peak at 400 watts. That’s 1200 watts per PSU, if they all peak at once.

Here’s my setup:

PSU1:
ASRock ROMED8-2T
EPYC 7313P (16/32) 155w TDP
256 GB DDR4 - 3200
Crucial NVMe (not a fast one)
AMD 7900 XTX x3
Fans x10

PSU2:
AMD 7900 XTX x3

Hmm… now that I’ve written that out, that’s a pretty lopsided situation. My CPU is 155w TDP. I’m not sure how to guess how many watts are consumed by the motherboard, and everything else, besides the GPUs. Could that be more than 400 watts?
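To put numbers on that question, here is a minimal sketch of the worst-case budget per supply. It uses only the figures quoted above (400 W observed GPU peak, 155 W CPU TDP, 1600 W rating); the function names are made up for illustration, and TDP is treated as a stand-in for CPU draw, which is an assumption rather than a measurement.

```python
# Rough per-PSU power budget using the figures quoted in the post.
GPU_PEAK_W = 400      # observed peak per GPU in amd-smi (from the post)
CPU_TDP_W = 155       # EPYC 7313P TDP (from the post); real draw may differ
PSU_RATING_W = 1600   # rating of each supply (from the post)


def psu_load(gpus: int, cpu_w: float = 0.0, other_w: float = 0.0) -> float:
    """Worst-case load in watts if every GPU peaks at the same moment."""
    return gpus * GPU_PEAK_W + cpu_w + other_w


def headroom(gpus: int, cpu_w: float = 0.0) -> float:
    """Watts left for everything else before the load reaches the rating."""
    return PSU_RATING_W - psu_load(gpus, cpu_w)


print("PSU1 (3 GPUs + CPU):", psu_load(3, CPU_TDP_W), "W")
print("PSU2 (3 GPUs only): ", psu_load(3), "W")
print("PSU1 headroom for board, RAM, NVMe, fans:", headroom(3, CPU_TDP_W), "W")
```

Under those assumptions, the motherboard, RAM, NVMe, and fans would not need to reach 400 W to push PSU1 past its rating; the remaining margin at simultaneous GPU peaks is smaller than that.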

If you have one available, I’d use a Kill A Watt meter, which will give you actual readings of how much power a device is using.

Something along this line:

A secondary question: are both power supplies plugged into the same electrical circuit on the same breaker? A 15 amp breaker is usually only good for about 1600 watts.

Good questions. Indeed, I have a Kill A Watt. However, because of the 15 amp breaker problem, I had a 240V 30 amp circuit put in. So my outlets and plugs are all weird now (C19/C20).

1600W should be enough for that load.
Maybe a PSU defect or some other problem?

What OS do you run?
Do you see anything in the logs (besides the unexpected shutdown)?

I run Linux. I don’t see anything in the logs, but I’m not sure what I should be looking for. It just goes off instantly. I can’t imagine it has a chance to log the event.

I sacrificed a power cable so I could put a clamp meter on it. However when I rebooted, two of the GPUs were undetected. (sigh) So now I’m distracted by that side-quest.

I’ll take another look later in the week.

Most PSUs don’t trip until the current hits 120-140% of their max rating. So a 1600W unit shouldn’t trip OCP until close to 2000W or even well over that.
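As a quick check of that range, the sketch below turns the 120-140% rule of thumb into watts for a 1600 W supply. The percentages come from the post itself and are not a spec value; actual trip points vary by model.

```python
def ocp_window(rating_w: int, low_pct: int = 120, high_pct: int = 140) -> tuple[float, float]:
    """Return the (low, high) sustained load in watts at which OCP might trip,
    given the rule-of-thumb percentages quoted in the post."""
    return rating_w * low_pct / 100, rating_w * high_pct / 100


low, high = ocp_window(1600)
print(f"A 1600 W supply would trip somewhere between {low:.0f} W and {high:.0f} W")
```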

ATX 3.0 PSUs don’t trip at double their rating for a few milliseconds. So his PSU can sustain 3200W for a very short time.

That is why buying a big PSU to handle GPU spikes is no longer necessary.

Finish that side quest first and you probably will also solve your main quest by doing so :grin:

Good to know, I remember the big transient spike issues that started with the RTX3000s. I just want to note that I was referring to sustained loads when talking about OCP.

How do the three 7900 XTX scale?
I have two 7900 XTX on an X670e with 2x PCIe4 x8 and see a maximum of 60% compute load if the model fits the VRAM of both cards.
I’m now planning to use my Epyc system to run the cards with PCIe4 x16 and hope to achieve 75% load.

What are your results with “./rocm_bandwidth_test -A”?

Edit:
Please post your results if you have solved your problem with the setup

Unfortunately the claim made there about excursion handling is high by an order of magnitude for supplies over 450 W and wrong for lower rated supplies. See Table 3-3 in the ATX 3 design guide for correct values.

As a diagnostic idea, consider capping the power to each GPU. Maybe start at 300W each and retry. Then, if it passes, increase the wattage cap until the symptom occurs.

Example: sudo nvidia-smi -pl 300 (that is the NVIDIA tool; on AMD cards the cap would be set through rocm-smi or amd-smi instead)
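The step-up procedure can be sketched as a small script that only prints the commands to run, one cap per test pass. The 400 W ceiling (the observed peak mentioned earlier in the thread) and the 25 W step are illustrative assumptions, and the command is the NVIDIA one quoted above; AMD cards would need the corresponding rocm-smi or amd-smi setting, which is not shown in this thread.

```python
def cap_schedule(start_w: int = 300, stop_w: int = 400, step_w: int = 25) -> list[int]:
    """Power caps to try in order, from start_w up to and including stop_w."""
    return list(range(start_w, stop_w + 1, step_w))


def cap_command(watts: int) -> list[str]:
    """Build the power-limit command quoted in the post (NVIDIA syntax).
    Nothing is executed here; the command is only assembled for display."""
    return ["sudo", "nvidia-smi", "-pl", str(watts)]


for watts in cap_schedule():
    # Apply the cap, rerun the training job, and note whether the box powers off.
    print(" ".join(cap_command(watts)))
```

The first cap at which the shutdown reappears gives a rough idea of how much per-GPU draw the supply tolerates.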

I stand corrected, cheers for the source.
BTW this is the table:

although I am not sure how this goes together with this table from the same doc:

So if I have a 600W GPU, it is allowed to consume 3*600W = 1800W?

Or is it only that the ATX specification allows it, not that the PSU is necessarily able to deliver it?

So in theory I could have an 800W PSU that can endure a 1600W peak.
But since its 600W 12V-2×6 cable allows for 600W * 3 = 1800W,
could this trip my PSU?

If your PSU has the option to configure the rails, try tweaking that. You may have overloaded a single rail in a multi-rail configuration. I know my Corsair unit can be configured as either single or multi rail.

In the multi-rail config, it would occasionally hit OCP and power off under Linux, because the Vega driver under Linux allows large transients. When switched to single rail, it didn’t.

Something similar could be happening to you. The symptom is that the box just powers off when you trip overcurrent protection on one of the rails in a multi-rail configuration.

I’ve read somewhere that individual rails in a multi-rail config may be up to 480 watts per rail. If you know you’re drawing 400 watts per GPU, it’s quite possible (and AMD are shit for this) that they are spiking significantly beyond that for very short periods and thus tripping your PSU’s OCP (and the result of that is a power-off).

If you set it to single rail, you can draw up to the entire capacity out of a single device/connector(?), if I’m not mistaken.

The table concerns add-in cards: what sort of spikes they are allowed to pull before the mainboard or PSU should assume a fault.

For the >75W cards, a 0.2 s (200,000 µs) spike would allow for a 35% overload:

4 - 0.2171 x ln(200,000) ≈ 1.35
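The formula is easier to sanity-check in code. This sketch assumes the duration is expressed in microseconds and that the formula applies between 100 µs and 1 s; both are inferred from the numbers in this thread (a 3x allowance at 0.1 ms), not copied from the table itself. With that reading, the 200,000 in the calculation above corresponds to 200 ms.

```python
import math


def excursion_multiplier(t_us: float) -> float:
    """Allowed peak power as a multiple of rated card power for an excursion
    lasting t_us microseconds, using the formula quoted in the thread.
    The valid range (100 us to 1 s) is an assumption inferred from the thread."""
    if not 100 <= t_us <= 1_000_000:
        raise ValueError("formula assumed valid only for 100 us <= t <= 1 s")
    return 4 - 0.2171 * math.log(t_us)


for t_us in (100, 1_000, 10_000, 200_000, 1_000_000):
    print(f"{t_us:>9} us -> {excursion_multiplier(t_us):.2f}x rated power")
```

At the short end the multiplier is about 3, which matches the 3 * 600W figure discussed above, and it falls back to about 1 for a full second.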

Yes, but for less than 0.1 milliseconds.

And Note 4 is also to be observed at all times:

The side-quest was that two GPUs disappeared from the listing in rocm-smi and amd-smi. I haven’t found the cause yet, but after a full power cycle and reseating the riser cables, they’re back.

Where I’m at right now… I have five of the six GPUs plugged in.
The PSU with the motherboard now has just two GPUs. Stability seems to have improved, but that’s kind of subjective.

unrelated:
OH! as a result of watching that AMD BIOS video, I set the workload profile to whatever looked like it was the one for GPU compute. That was the ticket to enabling PCIe p2p.

RocmBandwidthTest Version: 2.6.0
Launch Command is: /opt/rocm/bin/rocm-bandwidth-test -A
Device: 0,  AMD EPYC 7313P 16-Core Processor
Device: 1,  Radeon RX 7900 XTX,  GPU-0baca002,  c3:0.0
Device: 2,  Radeon RX 7900 XTX,  GPU-497b46b0,  c6:0.0
Device: 3,  Radeon RX 7900 XTX,  GPU-6cc9e211,  83:0.0
Device: 4,  Radeon RX 7900 XTX,  GPU-488264eb,  86:0.0
Device: 5,  Radeon RX 7900 XTX,  GPU-c39535a1,  48:0.0

Inter-Device Access
D/D       0         1         2         3         4         5
0         1         1         1         1         1         1
1         1         1         1         1         1         1
2         1         1         1         1         1         1
3         1         1         1         1         1         1
4         1         1         1         1         1         1
5         1         1         1         1         1         1

Inter-Device Numa Distance
D/D       0         1         2         3         4         5
0         0         20        20        20        20        20
1         20        0         40        40        40        40
2         20        40        0         40        40        40
3         20        40        40        0         40        40
4         20        40        40        40        0         40
5         20        40        40        40        40        0

Bidirectional copy peak bandwidth GB/s
D/D       0           1           2           3           4           5
0         N/A         29.536      30.298      31.532      31.394      30.728
1         29.536      N/A         47.985      47.994      47.992      47.989
2         30.298      47.985      N/A         47.994      47.989      47.967
3         31.532      47.994      47.994      N/A         47.990      47.991
4         31.394      47.992      47.989      47.990      N/A         47.991
5         30.728      47.989      47.967      47.991      47.991      N/A
$ ./all_reduce_perf -b 128M -e 2G -f 2 -g 5
# nThread 1 nGpus 5 minBytes 134217728 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:5b27b96
# Using devices
#  Rank  0 Group  0 Pid 3706379 on       wide device  0 [0000:c3:00] Radeon RX 7900 XTX
#  Rank  1 Group  0 Pid 3706379 on       wide device  1 [0000:c6:00] Radeon RX 7900 XTX
#  Rank  2 Group  0 Pid 3706379 on       wide device  2 [0000:83:00] Radeon RX 7900 XTX
#  Rank  3 Group  0 Pid 3706379 on       wide device  3 [0000:86:00] Radeon RX 7900 XTX
#  Rank  4 Group  0 Pid 3706379 on       wide device  4 [0000:48:00] Radeon RX 7900 XTX
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   134217728      33554432     float     sum      -1    10145   13.23   21.17      0   8165.4   16.44   26.30      0
   268435456      67108864     float     sum      -1    16277   16.49   26.39      0    17108   15.69   25.10      0
   536870912     134217728     float     sum      -1    35023   15.33   24.53      0    34194   15.70   25.12      0
  1073741824     268435456     float     sum      -1    70317   15.27   24.43      0    70291   15.28   24.44      0
  2147483648     536870912     float     sum      -1   147666   14.54   23.27      0   147669   14.54   23.27      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 24.4018
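For anyone comparing these numbers with their own runs, here is a small sketch of how the algbw and busbw columns relate. It uses the all-reduce convention documented for nccl-tests (busbw = algbw * 2(n-1)/n), which rccl-tests is assumed to share; the helper names are made up for illustration.

```python
def algbw_gbs(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes per microsecond is MB/s, so divide by 1e3."""
    return size_bytes / time_us / 1e3


def allreduce_busbw_gbs(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw scaled by 2 * (n - 1) / n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


# First row of the out-of-place results above: 134217728 bytes in 10145 us on 5 GPUs.
alg = algbw_gbs(134217728, 10145)
bus = allreduce_busbw_gbs(alg, 5)
print(f"algbw = {alg:.2f} GB/s, busbw = {bus:.2f} GB/s")
```

With five ranks the scaling factor is 1.6, which is why busbw sits well above algbw in every row.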