Hi there! I figured this is the community most likely to be able to help me!
I purchased a GPU server to use as a render farm / Ollama server.
Here is the machine: (can't post links, google this: inca-ep_10-gpu-compute-server)
It is this monster, with 10x Sapphire GPRO X070 (RX 5700 XT) 8 GB GPUs inside.
Preface:
I understand that RDNA1 is not the best choice for ROCm, as AMD abandoned it down the road for unknown reasons… But I had great success with a single RX 5700 XT on another server, using ROCm 5.7.x and some gfx1030-to-gfx1010 bindings.
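For readers unfamiliar with this kind of workaround: the exact "bindings" are not spelled out above, but one widely used mechanism for making a gfx1010 card be treated as gfx1030 is the `HSA_OVERRIDE_GFX_VERSION` environment variable read by the ROCm runtime. The sketch below assumes that mechanism; the helper name `env_with_gfx_override` is hypothetical, and whether the override is stable on this particular hardware is untested.

```python
import os


def env_with_gfx_override(version="10.3.0", base=None):
    """Return a copy of the environment with the ROCm gfx override set.

    Assumption: "10.3.0" asks the runtime to treat the GPU as gfx1030.
    The variable must be set before the ROCm/HIP runtime initializes,
    so pass this environment to the process that will use the GPU.
    """
    env = dict(os.environ if base is None else base)
    env["HSA_OVERRIDE_GFX_VERSION"] = version
    return env


# Example (not run here): launch a GPU workload with the override applied.
# import subprocess
# subprocess.run(["rocminfo"], env=env_with_gfx_override())
```

The helper copies the environment instead of mutating it, so the override only applies to the child process it is passed to.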
This device came with Cudo-MiningOS… From what I can tell, it is mostly stock Ubuntu 20.04 with cudo-mining specific packages… It had older OpenCL drivers installed.
Issues:
I tried installing fresh Linux distros on it (Ubuntu 24.04, Debian 12 and Pop!_OS 22.04) and they all failed at some point.
The issue seems to be with the handling of the GPUs and PCIe switches… The PC boots correctly, but as soon as I try to poke at the GPUs, it hangs and sometimes crashes (for example, running rocm-smi works sometimes, but with a major delay).
With the stock OS, responses from the GPUs are instantaneous. It is just that the packages are too outdated for me to use as-is.
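To put numbers on "instantaneous" versus "major delay", the same command can be timed on the stock OS and on each fresh install. This is only a rough sketch: the helper name `time_command` and the 30-second timeout are arbitrary choices, and `rocm-smi` is assumed to be on the PATH of the system under test.

```python
import subprocess
import sys
import time

# Harmless command for trying the helper on a machine without ROCm.
SELF_CHECK = [sys.executable, "-c", "pass"]


def time_command(cmd, timeout_s=30.0):
    """Run cmd and return (elapsed_seconds, return_code).

    return_code is None if the command did not finish within timeout_s.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        code = result.returncode
    except subprocess.TimeoutExpired:
        code = None
    return time.monotonic() - start, code


# Example (not run here): compare this figure between the stock OS and a fresh install.
# elapsed, code = time_command(["rocm-smi"])
# print(f"rocm-smi: {elapsed:.2f}s, exit code {code}")
```

One caveat: if the process is stuck inside the kernel driver, killing it on timeout may itself block, so a call that never returns is also a data point.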
Can someone here help me sort this out? I have spent at least 20-30 hours on this already.
Here is the PCIe layout from lspci -tv:
-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
           +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
           +-01.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-01.3-[01-08]----00.0-[02-08]--+-00.0-[03-05]----00.0-[04-05]----00.0-[05]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                               |                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           |                               \-08.0-[06-08]----00.0-[07-08]----00.0-[08]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                                                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           +-01.5-[09]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           +-01.6-[0a]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           +-02.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-03.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-03.1-[0b-12]----00.0-[0c-12]--+-00.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                               |                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           |                               \-08.0-[10-12]----00.0-[11-12]----00.0-[12]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                                                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           +-03.2-[13-1a]----00.0-[14-1a]--+-00.0-[15-17]----00.0-[16-17]----00.0-[17]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                               |                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           |                               \-08.0-[18-1a]----00.0-[19-1a]----00.0-[1a]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                                                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           +-03.3-[1b-22]----00.0-[1c-22]--+-00.0-[1d-1f]----00.0-[1e-1f]----00.0-[1f]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                               |                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           |                               \-08.0-[20-22]----00.0-[21-22]----00.0-[22]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                                                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           +-03.4-[23-2a]----00.0-[24-2a]--+-00.0-[25-27]----00.0-[26-27]----00.0-[27]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                               |                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           |                               \-08.0-[28-2a]----00.0-[29-2a]----00.0-[2a]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                                                                            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           +-04.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-07.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-07.1-[2b]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
           |            +-00.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
           |            \-00.3 Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
           +-08.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-08.1-[2c]--+-00.0 Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
           |            +-00.1 Advanced Micro Devices, Inc. [AMD] Zeppelin Cryptographic Coprocessor NTBCCP
           |            \-00.2 Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
           +-14.0 Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3 Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
           +-18.1 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
           +-18.2 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
           +-18.3 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
           +-18.4 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
           +-18.5 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
           +-18.6 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
           \-18.7 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
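In the listing above, the ten GPUs sit in pairs behind five root ports (01.3, 03.1, 03.2, 03.3 and 03.4), each pair behind its own chain of bridges. To check that a fresh install enumerates the same layout, a small parser over the `lspci -tv` text can count GPUs per root port. This is a sketch under assumptions: the function name `gpus_per_root_port` is made up, GPUs are recognized by the "Navi 10 [Radeon" string seen above, and nested lines are assumed to start with `|` (so children of the very last top-level entry would be misattributed).

```python
import re
from collections import Counter

# A top-level entry looks like "+-01.3..." or "\-18.7..." once indentation is stripped.
TOP_LEVEL = re.compile(r"^[+\\]-([0-9a-f]{2}\.[0-7])")
# Matches the GPU function itself, not its "Navi 10 HDMI Audio" sibling.
GPU_MARK = "Navi 10 [Radeon"


def gpus_per_root_port(lspci_tv_text):
    """Map each top-level device.function to the number of GPUs listed under it."""
    counts = Counter()
    current = None
    for raw in lspci_tv_text.splitlines():
        line = raw.strip()
        if line.startswith("-["):
            # First line: drop the "-[0000:00]-" domain/bus prefix.
            line = line.partition("]-")[2]
        match = TOP_LEVEL.match(line)
        if match:
            current = match.group(1)
        if GPU_MARK in line and current is not None:
            counts[current] += 1
    return dict(counts)


# Example (not run here): feed it live output from the machine.
# import subprocess
# text = subprocess.run(["lspci", "-tv"], capture_output=True, text=True).stdout
# print(gpus_per_root_port(text))
```

If a fresh install reports fewer than two GPUs under any of those ports, that would point at enumeration behind the switches and not at ROCm itself.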
By the way, I tried reaching out to Sapphiretech, but they are clueless! It is as if this product never existed for them!