Need help, ROCm on a 10x AMD GPU server - Sapphire INCA EP-10

Hi there! I figured this community is the one most likely to be able to help me!

I purchased a GPU server to use as a Render farm / Ollama-server machine.
Here is the machine: (can't post links, google this: inca-ep_10-gpu-compute-server)
It is this monster, with 10x AMD GPRO X070 (RX-5700xt) 8GB GPUs inside.

Preface:
I understand that RDNA1 is not the best for ROCm, as AMD abandoned it down the road for unknown reasons… But I had great success with a single RX-5700xt on another server, using ROCm 5.7.x and some gfx1030 to gfx1010 bindings.
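
For anyone unfamiliar with the trick: one common way to have a gfx1010 card treated as gfx1030 is the ROCm runtime's HSA_OVERRIDE_GFX_VERSION environment variable. Whether that is exactly what "bindings" means above is an assumption, and the helper name below is made up for illustration. A minimal sketch of building such an environment for a child process:

```python
import os

def rocm_override_env(base_env=None, gfx_version="10.3.0"):
    """Return a copy of the environment with the gfx override set.

    "10.3.0" corresponds to gfx1030. Assumption: the override is applied
    through HSA_OVERRIDE_GFX_VERSION rather than a patched ROCm build.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = gfx_version
    return env

# Usage idea (not executed here): pass the result as env= to subprocess.run
# when launching a ROCm program, e.g.
#   subprocess.run(["rocminfo"], env=rocm_override_env())
```

The override only changes which code objects the runtime selects; it does not make the hardware itself any different, so results can vary per workload.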

This device came with Cudo-MiningOS… It is mostly stock Ubuntu 20.04 with cudo-mining specific packages from what I can tell… It had older OpenCL drivers installed

Issues:
I tried installing fresh Linux distros on it (Ubuntu 24.04, Debian 12 and Pop!_OS 22.04) and they all failed at some point.
The issue seems to be with the handling of the GPUs and PCIe switches… The machine boots correctly, but as soon as I try to poke at the GPUs, it hangs and sometimes crashes (for example, running rocm-smi works sometimes, but with a major delay).

With the stock OS, responses from the GPUs are instantaneous. It is just that the packages are too outdated for me to use as-is.

Can someone here help me sort this out? I have spent at least 20-30 hours on this already.

Here is the PCIe Layout with lspci -tv

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
           +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
           +-01.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-01.3-[01-08]----00.0-[02-08]--+-00.0-[03-05]----00.0-[04-05]----00.0-[05]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                               |                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           |                               \-08.0-[06-08]----00.0-[07-08]----00.0-[08]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                                                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           +-01.5-[09]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           +-01.6-[0a]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           +-02.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-03.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-03.1-[0b-12]----00.0-[0c-12]--+-00.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                               |                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           |                               \-08.0-[10-12]----00.0-[11-12]----00.0-[12]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                                                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           +-03.2-[13-1a]----00.0-[14-1a]--+-00.0-[15-17]----00.0-[16-17]----00.0-[17]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                               |                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           |                               \-08.0-[18-1a]----00.0-[19-1a]----00.0-[1a]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                                                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           +-03.3-[1b-22]----00.0-[1c-22]--+-00.0-[1d-1f]----00.0-[1e-1f]----00.0-[1f]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                               |                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           |                               \-08.0-[20-22]----00.0-[21-22]----00.0-[22]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                                                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           +-03.4-[23-2a]----00.0-[24-2a]--+-00.0-[25-27]----00.0-[26-27]----00.0-[27]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                               |                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           |                               \-08.0-[28-2a]----00.0-[29-2a]----00.0-[2a]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                                                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
           +-04.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-07.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-07.1-[2b]--+-00.0  Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
           |            \-00.3  Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
           +-08.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-08.1-[2c]--+-00.0  Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
           |            +-00.1  Advanced Micro Devices, Inc. [AMD] Zeppelin Cryptographic Coprocessor NTBCCP
           |            \-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
           +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
           +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
           +-18.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
           +-18.1  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
           +-18.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
           +-18.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
           +-18.4  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
           +-18.5  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
           +-18.6  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
           \-18.7  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
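
To compare the stock OS against a fresh install, it may help to dump the negotiated PCIe link of every GPU behind those switches. The sketch below reads the standard sysfs attributes (vendor, class, current_link_speed, current_link_width); the function name is made up for illustration, and it only reports what the kernel sees without diagnosing the hang itself.

```python
from pathlib import Path

AMD_VENDOR_ID = "0x1002"       # PCI vendor ID for AMD/ATI
DISPLAY_CLASS_PREFIX = "0x03"  # PCI base class 03h = display controller

def _read(path):
    """Read a sysfs attribute, returning None if it is absent or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None

def list_amd_gpus(sysfs_root="/sys/bus/pci/devices"):
    """Return one dict per AMD display device with its current link state."""
    gpus = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return gpus
    for dev in sorted(root.iterdir()):
        vendor = _read(dev / "vendor")
        pci_class = _read(dev / "class") or ""
        if vendor != AMD_VENDOR_ID or not pci_class.startswith(DISPLAY_CLASS_PREFIX):
            continue  # skips the HDMI audio functions, bridges, NICs, etc.
        gpus.append({
            "address": dev.name,
            "link_speed": _read(dev / "current_link_speed"),
            "link_width": _read(dev / "current_link_width"),
        })
    return gpus

if __name__ == "__main__":
    for gpu in list_amd_gpus():
        print(gpu)
```

Running it on both the stock OS and a fresh install and diffing the output would show whether any GPU trains at a different speed or width, or goes missing entirely.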

By the way, I tried reaching out to Sapphire Tech, but they are clueless! It is as if this product never existed for them!

This is an image of the beast pulled from Sapphire’s website, in case anyone is interested…


Link to your thread on LTT:
https://linustechtips.com/topic/1592590-just-purchased-2-super-rare-gpu-compute-servers-sapphire-inca-ep-10-what-do-you-think/

Have you tried installing Linux with all GPU’s removed?

Do live images boot?

I am assuming the PCIe switcher driver is missing from the Linux Kernel


Haha! I originally posted on LTT, but this community is more “generalist” and I wanted to poke the L1 community because there are more Linux experts here!

Have you tried installing Linux with all GPU’s removed?

I tried each GPU on a secondary machine to ensure they were not faulty, but I did not try without GPUs. My next test was to test with a single GPU and then add them one-by-one.

Do live images boot?

Yes, without issue (as long as I connect the monitor to the 9th GPU).
The machine will even reboot to the desktop, but will then either crash after some time or drop me to a tty filled with GPU-related errors.

I am assuming the PCIe switcher driver is missing from the Linux Kernel

That is also my assumption, but this is a blind spot for me. I am well versed in Linux, but the PCIe and driver side of things is not something I know well (my guess is that by the end of this process I will have learned a lot on the subject).


I made some progress today with the help of my dear friend Claude! :rofl:
This model has a lot of knowledge of Ollama, ROCm, GPU bindings, etc… Impressive!

So far, I was able to install all the components of ROCm and HIP. I can poke the hardware without issue with rocm[…] commands. All gpus are responding normally to commands from ROCm.

I am loading the kernel with a gazillion optimizations for the GPUs.

Now, when I run Ollama serve, no error… As soon as I try to load a model, it hangs forever and the logs show a permission error.
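
Since the log itself is not shown, this is only a guess at the cause: a frequent source of ROCm permission errors is that the user running the process cannot read and write /dev/kfd and the /dev/dri/renderD* nodes, which are normally owned by the render or video group. A minimal sketch for checking that as the same user that runs Ollama (the function name is made up for illustration):

```python
import glob
import grp
import os

def check_gpu_device_access(paths=None):
    """Return (path, status, owning_group) for each ROCm device node.

    status is "ok", "denied" or "missing". By default checks /dev/kfd
    and every /dev/dri/renderD* node for read+write access.
    """
    if paths is None:
        paths = ["/dev/kfd"] + sorted(glob.glob("/dev/dri/renderD*"))
    report = []
    for path in paths:
        if not os.path.exists(path):
            report.append((path, "missing", None))
            continue
        gid = os.stat(path).st_gid
        try:
            group = grp.getgrgid(gid).gr_name
        except KeyError:
            group = str(gid)  # group has no name on this system
        ok = os.access(path, os.R_OK | os.W_OK)
        report.append((path, "ok" if ok else "denied", group))
    return report

if __name__ == "__main__":
    for entry in check_gpu_device_access():
        print(entry)
```

If any node comes back "denied", adding the service user to the group shown in the third column is the usual fix. Run as root, everything reports "ok", which would fit a system that was only ever meant to run as root.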

My next step will be to reinstall a complete OS, now that I have got the hang of what the kernel needs to function with this array of GPUs…
The OS I was experimenting on was still the one that came with the server (cudo-mining), which I messed with for several days to get to this point…
My assumption right now is that there are still bits of cudo-mining in there and some settings are optimized for it. Namely, this server was never meant to run as anything other than the root user… I had to add bindings for many things to patch it to where I am…

Hope this will make a difference! I will report shortly™


Here is my progress so far…
I poked around in the system for days, with the help of a “custom AI model” on another machine, connected to Anthropic Claude Sonnet…
I created a knowledge base of all of the settings and parameters of the machine and my previous attempts.
I built ROCm and all the libraries from source, using a nice repo that adds native support for the RX5700 to ROCm
(lamikr—>rocm_sdk_builder on github - sorry, can’t post links)

The issue seems to simply be that the BIOS on this board is too ancient (although the machine is from 2021 or 2022, the BIOS is from 2020) and does not support resizable BAR. I am trying to find ways around this, but I am way past my level of comprehension!
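
One way to see what the BIOS actually assigned is to read the BAR sizes straight from sysfs. The sketch below parses a device's resource file, where the first six lines are BAR0 to BAR5 with hex start, end and flags. The function names are made up for illustration. As an assumption to verify rather than a certainty: without resizable BAR, an AMD GPU's VRAM aperture is typically 256 MiB, and with it the aperture is about the size of the VRAM.

```python
from pathlib import Path

def parse_bar_sizes(resource_text, max_bars=6):
    """Return the size in bytes of each BAR listed in a sysfs 'resource' file.

    Each line holds three hex fields: start, end, flags. An all-zero
    start/end pair means the BAR is unused and is reported as size 0.
    """
    sizes = []
    for line in resource_text.splitlines()[:max_bars]:
        fields = line.split()
        if len(fields) < 2:
            continue
        start, end = int(fields[0], 16), int(fields[1], 16)
        sizes.append(0 if start == 0 and end == 0 else end - start + 1)
    return sizes

def gpu_bar_sizes_mib(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Read one device's BAR sizes from sysfs and convert them to MiB."""
    text = (Path(sysfs_root) / pci_address / "resource").read_text()
    return [size / (1024 * 1024) for size in parse_bar_sizes(text)]

# Example, using an address from the lspci tree earlier in the thread:
#   print(gpu_bar_sizes_mib("0000:05:00.0"))
```

Comparing the numbers on the stock OS and on a fresh install would also show whether the two kernels end up with the same BAR assignment.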

I reached out to Sapphire Tech, but so far they have not been really helpful… Let’s hope!

If someone has a clue for me, I am all ears! (or eyes should I say!)
