New account, first post - I need some help, please. I’ve got five 4070 Ti GPUs that I want to put to use mostly for science applications. I learned the hard way that you need a high-end CPU and mobo for this, as my first build attempt with all of the GPUs was with a Threadripper 1920X and an Asus Prime X399-A. Not enough PCIe lanes, I discovered, among other things.
Now I’ve got the following:
CPU: used Threadripper Pro 5995WX
Mobo: ASRock WRX80 Creator (5 PCIe x16 slots + 2 x8)
RAM: 256GB OWC 3200MHz DDR4 RDIMM (4 x 64 GB)
PSU: 2 x 1200 Watt Thermaltake
AIO: Silverstone Icegem 360
SSD: Kingston NV2 1TB M.2
Everything is mounted on a mining rig frame.
The PSUs are connected with a Silverstone dual 24 pin adapter.
One PSU powers the two 12V CPU connectors, the ATX connector, and two GPUs. The other powers three GPUs and the additional 6-pin connector on the mobo.
I initially installed Ubuntu 22. The SSD hadn’t been wiped from the previous build attempt, which I think led to my initial difficulties. I completely wiped it, installed Ubuntu 24, and was able to boot with one GPU.
I installed the latest NVIDIA drivers (560) and CUDA toolkit, which are apparently very unstable with my components. I downgraded to 535 and everything seemed smooth, but nvidia-smi showed “Err!” in the fan column for every visible GPU. Ultimately I settled on the 550 driver, which got rid of the Err! message.
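In case the exact driver dance matters to anyone, this is roughly how I did the downgrade (from memory, so treat it as a sketch; package names assume Ubuntu's own repo rather than NVIDIA's CUDA repo):

```bash
# Roughly what I ran (from memory) to get from 560 down to 550.
sudo apt purge 'nvidia-*' 'libnvidia-*'   # clear out the 560 packages
sudo apt autoremove
sudo apt install nvidia-driver-550        # install the 550 branch from the Ubuntu repo
sudo apt-mark hold nvidia-driver-550      # stop apt from pulling me back up to 560
sudo reboot
```

After the reboot, nvidia-smi reports the 550 driver and the fan column reads normally (with a single GPU, anyway).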
Multiple GPUs led to problems. After messing around with drivers, deleting some X config file, and trying each GPU one at a time and each of the PCIe x16 slots individually, I started adding one GPU at a time and somehow made it to booting the system with all five GPUs recognized under nvidia-smi (this was after almost two days of struggle). I’ll add that other times when I’ve booted with two or more GPUs, one or more does not appear in the nvidia-smi output, but those invisible ones do appear in the lspci output.
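For reference, this is how I've been comparing what the bus sees with what the driver actually brings up, plus where I look for messages about the missing cards (standard commands, nothing exotic):

```bash
# Cards enumerated on the PCIe bus at boot
lspci | grep -i nvidia

# Cards the NVIDIA driver actually initialized
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv

# Kernel/driver complaints (Xid errors, link training failures) about the "invisible" cards
sudo dmesg | grep -iE 'nvrm|xid|pcie'
```

The "invisible" GPUs show up in the first command but not the second.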
After trying to optimize the distribution of the power cables, nothing worked again, and when I returned to the cable configuration that had worked that one time, still nothing. Now I can’t get anything to work with more than one GPU. I also cannot boot after entering the BIOS: I have to clear CMOS any time I change any BIOS setting in order to POST again.
I can’t find any obvious pattern to why this isn’t working. My thinking now is that it’s the PCIe 3.0 riser cables; three of them are 20 cm and two are 30 cm. I’ve tried setting the PCIe slots to “Gen 3” in the BIOS, but then I can’t POST and get a “71” code on the mobo. The other times I can’t POST it reads “42” or “94” (if I remember correctly). At this point I’m thinking I need to try Gen 4 riser cables.
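If I can ever get more than one card visible again, my plan is to check what link each riser actually negotiates before buying anything, along these lines (10de is just NVIDIA's PCI vendor ID):

```bash
# Negotiated vs. maximum PCIe generation and width per GPU
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current --format=csv

# Same thing from the kernel side: LnkSta shows the speed/width each slot actually trained to
sudo lspci -vv -d 10de: | grep -E 'LnkCap:|LnkSta:'
```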
If anybody has any solutions, you will lead a fulfilling life. I did see this interesting thread (Help with WRX80E-Sage SE Render server - #12 by Nefastor) about PCIe redriver settings, but I’d rather not have to experiment with that.