Building my first AI server with 6 x RTX3090

Hi Guys!

I am going to build my first AI server based on 6 x RTX 3090 GPUs. I previously built one, but only for mining, where the GPUs were connected via PCIe x1 risers. Now I want to use the full power of those GPUs over PCIe 4.0 x16, so the build looks slightly different. I have a few questions about this:

  1. Which motherboard should I use? My choice is probably between the ASRock ROMED8-2T and the ASUS Pro WS WRX80E-SAGE SE. I think the latter is better; it has a stronger power section and the brand is better too :slight_smile:
  2. Do those motherboards have a power section strong enough to supply the PCIe slots? According to the spec, one PCIe slot can draw up to 75 W, so six slots add up to 450 W (see the rough power-budget sketch after this list). I am afraid that the ROMED8-2T is not enough.
  3. I want to power the server with one ATX PSU for the motherboard, CPU and other small stuff, and one Dell 2000 W server PSU with a PCIe breakout board for the GPUs only. The two PSUs will be daisy chained. Is it OK for this setup that the GPUs will be powered via the PCIe slot from the ATX PSU and via the PCIe power connectors from the server PSU?
  4. I want to use these risers: Thermaltake Riser Tt Premium PCI-E 4.0 600mm. I think they should be OK for this setup.
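
For reference, here is my rough power-budget maths as a small sketch (the 350 W board power and the 300 W for CPU and other components are just my assumptions, not measured values):

```python
# Rough power budget for the planned build.
GPU_COUNT = 6
SLOT_POWER_W = 75          # PCIe spec maximum that a card can pull from the slot
GPU_BOARD_POWER_W = 350    # assumed RTX 3090 board power; lower if power-limited
CPU_AND_MISC_W = 300       # assumed CPU, fans, drives, NICs

slot_total = GPU_COUNT * SLOT_POWER_W                             # fed by the ATX PSU
connector_total = GPU_COUNT * (GPU_BOARD_POWER_W - SLOT_POWER_W)  # fed by the server PSU

print(f"ATX PSU side   : {slot_total + CPU_AND_MISC_W} W (slots + CPU/misc)")
print(f"Server PSU side: {connector_total} W (8-pin connectors)")
print(f"Whole system   : {GPU_COUNT * GPU_BOARD_POWER_W + CPU_AND_MISC_W} W before transient spikes")
```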

I will be grateful for your help.

Debatable but both would work well.

Are you going to waterblock them or use risers to get them into the slots?

IDK if that would be an issue or not, but I don't think they would build a server board that couldn't support the 75 W per slot.

Not an issue to daisy chain PSUs if you join them correctly.

I mean it will probably work, but QC could be an issue, and pushing PCIe 4.0 makes it harder.

I have sent an email to ASRock regarding the ROMED8 and its power section. They have released a new motherboard, the ASRock WRX90 WS EVO, which looks very good in my opinion, but it is still unavailable: it is listed on Newegg but out of stock. The price is similar to the WRX80 from ASUS.

QC could be an issue

What is QC? Could you give me more details?

Hi

Interesting build, that’s a lot of compute. And energy cost!

As you no doubt know, a multi-GPU setup can be a complicated build and may not always work as required, or at all, which is why many people go to a system integrator or a big brand. But some of us like a challenge, right? :slight_smile:

It can be worth looking at the specs of some of the system integrator builds to see what they think works reliably. The likes of Lambda Labs, Bizon Tech and Scan are good. PCPartPicker can also highlight potential issues.

I had a read of the manual at https://dlcdnets.asus.com/pub/ASUS/mb/SocketTRX4/Pro_WS_WRX80E-SAGE_SE_WIFI/E19401_Pro_WS_WRX80E-SAGE_SE_WIFI_UM_V2_WEB.pdf?model=Pro%20WS%20WRX80E-SAGE%20SE%20WIFI

  • They recommend a minimum of a 1500 W PSU for system stability. But 6 x 3090 all running would be around 1800 W alone!
  • I believe the recommendation when running dual PSUs is that they should be identical models and connected in parallel as described in the manual. I'm no electrical engineer, but I suspect a serial connection (daisy chain) could cause issues, e.g. earth loops or cable/connector power overload … unless, as mutation666 commented, the connection is done correctly. Beyond my skill to say.
  • The manual only covers support for 4 GPUs. Even if 6 will work, as mutation666 said, you will need to water block them to fit them in the single-slot spacings, or put them on PCIe risers and mount them elsewhere, or just hang them off the case like Tim Dettmers' build! Maybe you are building in a mining rack, so the physical mount is less of a problem with risers?
  • As the manual only shows support for 4 GPUs, it's unclear how PCIe bifurcation would work or what the PCIe connection speed would be. 4 GPUs is x16 at PCIe 3 or 4. Maybe have a read of the BIOS manual too.

My gut feel is that 4 GPUs is probably the max you will get in a standard workstation/server setup. Anything beyond that always seems to need a lot of custom cabling and hardware, or a switch to SXM GPUs. Of course the mining community has other ideas, but their hardware demands and software are potentially very different from deep learning/AI.

4 x 3090 is a lot of compute and VRAM: it will get through some big training jobs, is more than enough for most transfer learning, and should have enough memory to load most large language models of a decent parameter count.
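
As a rough sanity check on that last point, weight memory is approximately parameter count times bytes per parameter (a sketch with invented example sizes, ignoring activations, optimiser state and any KV cache):

```python
# Rough sketch: VRAM needed just for the weights of a model.
def weight_memory_gb(n_params_billion: float, bytes_per_param: float) -> float:
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

for billions in (7, 13, 70):                      # example sizes, not specific models
    fp16 = weight_memory_gb(billions, 2)          # 16-bit weights
    int8 = weight_memory_gb(billions, 1)          # 8-bit quantised weights
    print(f"{billions}B params: ~{fp16:.0f} GB in fp16, ~{int8:.0f} GB at 8-bit")
```

So the 96 GB across four cards goes a long way, provided the model is sharded across the GPUs.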

You could always run 4 GPUs in a single workstation/server, and if you need more compute or model memory, many of the deep learning toolchains support multi-node (multi-computer) setups, with comms and data sent between nodes via MPI/NVLink etc., which adds additional high-speed networking complications and costs …
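
For example, something like this is roughly what a two-node job looks like (a minimal, untested sketch assuming PyTorch's DistributedDataParallel with the NCCL backend; the script name, tensor sizes and node count are invented):

```python
# Launch on each node with something like:
#   torchrun --nnodes=2 --nproc_per_node=4 \
#            --rdzv_backend=c10d --rdzv_endpoint=<head-node-ip>:29500 train_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # rank/world size come from torchrun env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(4096, 4096).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(32, 4096, device=local_rank)
    loss = model(x).square().mean()
    loss.backward()                                # gradients are all-reduced over NCCL here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```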

I’m intrigued, keep us posted on progress. I love this kind of stuff :slight_smile:


Quality Control

OK guys, after a few weeks I have finally built my AI server. Now I am testing it and the first problems have appeared.

For now I can't use PCIe 4.0; it is too unstable with the Thermaltake Riser Tt Premium PCI-E 4.0 600mm risers. I will try to turn off ASPM and change some other settings, but I am afraid that I won't get rid of the AER errors at PCIe 4.0.
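
In case it is useful to anyone else debugging this, a small script like the one below (a sketch using the pynvml bindings) shows what link generation and width each card has actually negotiated. Note that the link can drop to a lower generation when a GPU is idle, so check it under load.

```python
# Print the negotiated vs. maximum PCIe link generation and width per GPU.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(h)
    if isinstance(name, bytes):        # older pynvml versions return bytes
        name = name.decode()
    cur_gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
    max_gen = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(h)
    cur_w = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
    max_w = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
    print(f"GPU {i} {name}: PCIe gen {cur_gen}/{max_gen}, width x{cur_w}/x{max_w}")
pynvml.nvmlShutdown()
```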

The second problem, or maybe it is not a problem at all… I am using a Cooler Master MWE 750 Gold ATX PSU for the motherboard and 3 x Dell DPS-1200MB-1 for the GPUs. I use the 5 V line from the ATX PSU to turn on the Dell PSUs via an SSR, so there is a negligible delay between the 5 V line coming up and the power-on signal for the Dell PSUs.

Generally everything works: I get a video signal when I connect a monitor to a GPU, the bandwidth test from the CUDA samples is OK, and I have tested mining with a 300 W load per GPU and they work. But the LEDs on the GPUs do not work at all… I have Zotac RTX 3090s and they have that fancy LED backlight on the Zotac logo that normally blinks when the GPU starts up. It does blink when the card is powered from the ATX PSU only, or when I start the Dell PSU manually before the ATX PSU. I thought that this LED is like a status LED that must be lit when the GPU has started correctly, but apparently it isn't. Should I be worried about this?
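
By the way, for anyone repeating the load test: something like this PyTorch sketch (illustrative only; for the real testing I used the CUDA samples bandwidth test and mining) puts a short burst of work on every card at once, which you can watch in nvidia-smi:

```python
# Run a few big matmuls on every visible GPU so each card draws real power briefly.
import torch

for i in range(torch.cuda.device_count()):
    with torch.cuda.device(i):
        a = torch.randn(8192, 8192, device="cuda")
        b = torch.randn(8192, 8192, device="cuda")
        for _ in range(50):
            c = a @ b                  # keep the GPU busy
        torch.cuda.synchronize()       # wait for the work on this device to finish
        print(f"GPU {i} ({torch.cuda.get_device_name(i)}): OK")
```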

People say good things about their riser cables: https://c-payne.com/

Oh, thank you. They look like a small company building these risers.


They will do the job for sure, but they are quite expensive. You know, I am the owner of a small custom electronics company, so we are able to design stuff like this ourselves :slight_smile: but for one machine it's better to buy them…

I really need to save every website I visit; this blog took far too long to find.
They may not have solved the PCIe 4.0 issue though :crying_cat_face:
https://nonint.com/2022/05/30/my-deep-learning-rig/


A crazy person built a system with 10 x 3090s,
using C-Payne PCIe adapters.
https://www.reddit.com/r/LocalLLaMA/comments/1c9l181/10x3090_rig_romed82tepyc_7502p_finally_complete/