We are in need of some help. Some 3 months ago we started building our servers.
Our setup:
romed8-nl motherboard, 7 PCIe slots
512 GB ECC RAM
AMD EPYC 7502 32-Core Processor
7x RTX 3090
M.2 1 TB
We are currently running 2 versions of Ubuntu:
Ubuntu 18.04
Ubuntu 20.04
All 7 GPUs are connected through PCIe 3.0 extenders. Previously we also used a 4.0 extender in the PCIe port that is shared with the M.2, and we later suspected this was the cause of the error.
About 3 months and many experiments later, we have tried:
Using 6 GPUs and leaving the shared M.2 port as is
Changing the BIOS setting for that port from x16 to x8x8
Updating GRUB with pcie-aspm=off pcie=nommconf
But after all this we are still struggling with the same issue.
It happens at random and we can’t seem to figure out what is causing it.
Is there any expert in the field who is willing to take a look and help us out?
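For anyone retracing the GRUB step above: the kernel documentation spells these two parameters pcie_aspm=off and pci=nommconf, so it may be worth confirming what the running kernel actually received. Below is a minimal Python sketch, assuming the parameters end up on the kernel command line (readable from /proc/cmdline on Linux). The function name missing_params is made up for illustration.

```python
# Spellings as given in the Linux kernel parameter documentation.
# The post above writes them slightly differently.
EXPECTED = ("pcie_aspm=off", "pci=nommconf")


def missing_params(cmdline, expected=EXPECTED):
    """Return the expected parameters that are absent from a kernel command line.

    Hypothetical helper: 'cmdline' is the text of /proc/cmdline.
    Comma-separated options such as 'pci=noaer,nommconf' are split
    so that each option is checked on its own.
    """
    seen = set()
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep:
            seen.add(key)
            continue
        for option in value.split(","):
            seen.add(f"{key}={option}")
    return [p for p in expected if p not in seen]


# Usage on a live system (not run here):
#   from pathlib import Path
#   print(missing_params(Path("/proc/cmdline").read_text()))
```

An empty list means both parameters reached the kernel. Anything returned was either not applied or spelled in a way the kernel does not recognise.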
I am unsure if I can directly help you with that motherboard.
When I had that issue it was directly related to the hardware. The link training wasn’t completing correctly, or the link was dropping out during use, and that would drop the GPU from the bus.
Maybe you could expand on this a little: have you drilled down to a specific GPU and motherboard port?
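One way to drill down per GPU is to compare the negotiated link width with the maximum the device advertises, since a link that trained narrower than expected can point at a marginal slot or riser. Below is a rough Python sketch, assuming a Linux host that exposes the current_link_width and max_link_width attributes under /sys/bus/pci/devices. The helper name degraded_links is hypothetical, and a slot deliberately bifurcated to x8 in the BIOS would presumably show up here as well.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID used by NVIDIA devices


def _read(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def degraded_links(sysfs_root="/sys/bus/pci/devices"):
    """Map PCI address -> (current_width, max_width) for NVIDIA display
    devices whose link is narrower than the advertised maximum.

    Hypothetical helper for illustration only.
    """
    result = {}
    for dev in sorted(Path(sysfs_root).iterdir()):
        if _read(dev / "vendor") != NVIDIA_VENDOR_ID:
            continue
        # PCI class 0x03xxxx is a display controller; this skips the
        # GPU's audio function, which shares the same vendor ID.
        if not (_read(dev / "class") or "").startswith("0x03"):
            continue
        current = _read(dev / "current_link_width")
        maximum = _read(dev / "max_link_width")
        if current is None or maximum is None:
            continue
        if int(current) < int(maximum):
            result[dev.name] = (int(current), int(maximum))
    return result
```

Running it a few times while the GPUs are under load, and again after a drop, would show whether the same addresses keep appearing.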
That’s exactly what’s happening, “GPU had fallen from the bus”.
There is no specific port that gets the error. It is completely at random.
So far we have one test server running the same specs. The only difference is that there are no PCIe 3.0 extension cables involved: it has 3 3090s plugged straight into the ports on the board.
This one is stable.
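To check whether the drops really are spread across all ports, the kernel log can be tallied per PCI address. Below is a small Python sketch, assuming the NVIDIA driver reports the event as Xid 79 in lines shaped like "NVRM: Xid (PCI:<address>): 79, ...". The exact wording can differ between driver versions, and tally_xid is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed line shape: "... NVRM: Xid (PCI:<address>): <code>, ..."
# Adjust the pattern if your driver version logs it differently.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def tally_xid(lines, xid=79):
    """Count how often one Xid code appears per PCI address.

    'lines' is any iterable of kernel log lines, for example the
    saved output of dmesg. Xid 79 is the code NVIDIA documents for
    a GPU that has fallen off the bus.
    """
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match and int(match.group(2)) == xid:
            counts[match.group(1)] += 1
    return counts
```

If the counts come out roughly even across addresses, that supports the "completely at random" observation. A single address dominating would point at one slot, riser, or card.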
My advice is to switch the server with the riser issues to PCIe Gen 3 in the BIOS. For me this stabilized the link over the greater transmission distance that risers add. PCIe 4.0 is very sensitive to this signal loss.
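After changing the BIOS setting, the negotiated speed can be read back from the current_link_speed attribute in sysfs to confirm the cap took effect. Below is a minimal Python sketch that maps such a string to a PCIe generation. The helper names are hypothetical, and the string format (for example "8.0 GT/s PCIe" or "8 GT/s") is an assumption that varies by kernel version.

```python
import re

# Per-lane transfer rates of the PCIe generations, in GT/s.
GEN_BY_GTS = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def pcie_generation(speed_text):
    """Map a link-speed string such as '8.0 GT/s PCIe' to a generation.

    Returns None when the text does not start with a known rate.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", speed_text)
    if not match:
        return None
    return GEN_BY_GTS.get(float(match.group(1)))


def exceeds_gen3(speed_text):
    """True if the reported speed is faster than PCIe Gen 3."""
    gen = pcie_generation(speed_text)
    return gen is not None and gen > 3


# Usage on a live system (the address is a placeholder, not run here):
#   from pathlib import Path
#   text = Path("/sys/bus/pci/devices/<address>/current_link_speed").read_text()
#   print(pcie_generation(text), exceeds_gen3(text))
```

As far as I know GPUs tend to lower their link speed when idle, so a reading taken under load is the more telling one.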