Everything should be capable of PCIe 4.0.
I tried a few risers, some marked as “4.0” compatible; the one I have now was more expensive and claims to support PCIe 5.0: https://files.catbox.moe/6zu9ak.jpg
The BIOS doesn’t give me an option to change the PCIe version (the slot is clearly 4.0), only to enable bifurcation.
The PCIe slot does not matter. I have several 3090s and all of them have the same issue. I even tried a PCIe 3.0 riser to see what would happen; same thing. Enabling ASPM did not change anything.
Screenshots from the bios:
Since nvidia-smi shows they’re all idle, they’re likely in a power-saving state where it’s normal for the PCIe link to downclock to save power. What does lspci show when you run one up to full load?
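If it helps, this is roughly what I mean; the bus address below is a placeholder, substitute your own:

```shell
ADDR=01:00.0   # placeholder bus address; find yours with: lspci | grep -i nvidia
# LnkCap = the maximum the link supports; LnkSta = what is negotiated right now.
sudo lspci -vv -s "$ADDR" 2>/dev/null | grep -E 'LnkCap:|LnkSta:' \
  || echo "no readable device at $ADDR"
```

At idle you’d expect LnkSta to show 2.5GT/s (Gen1); under load it should read 16GT/s (Gen4).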
The GPUs work but they constantly have issues.
Randomly, they will “freeze”: the fans suddenly spin up to 100% (and won’t spin down again), every application using the GPUs freezes and stops responding, and tools like nvidia-smi simply never return; they just hang without output. Nvtop will also freeze if already open, or do the same as nvidia-smi if I try to launch it.
The only solution is to reboot the entire system.
The system is more or less unusable as it is. It will work for a few hours and then do this, or even do this after a reboot as soon as I try to use the GPUs. It’s a crapshoot.
Rebooting doesn’t even work in these situations without a hard reset; it just hangs here
The performance was lower than what I got from the same GPUs in the cloud. So there definitely is something wrong.
At the moment I cannot get them to work again like yesterday because it keeps crashing like I described.
Interestingly, two of them seem to get the full, non-downgraded bandwidth just randomly?
Keep in mind at least one is unavailable through nvidia-smi and CUDA so it could be the one that isn’t functioning.
Actually, none of them are working properly. Everything I do with them is hanging.
So, yeah, it seems it’s not working at all anymore today despite mostly working yesterday.
I just set up the entire system from scratch; it’s Arch Linux with all the latest drivers from the repositories. As for sanity-testing the GPUs, I wanted to, but the case doesn’t fit any of them without risers. I can say that at least some of the GPUs were taken from my own working systems, so they can’t all be broken.
I’m not having issues with other PCIe devices like the 10G NIC or the Samsung SSD.
The motherboard was sold with the CPU already mounted, as Epyc boards often are, so I assume the mounting pressure is correct. I bought a similar system in the past and had no issues, though I also didn’t try doing this.
I might try to disassemble the entire thing to connect the GPUs directly, without the case.
You were correct. It reports a downgraded link even without risers. It took me a while to disassemble everything, and I had to remove the brackets from the GPU to attach it to the mobo outside the case, but I was able to connect a GPU directly.
First command is before loading a model into the GPU, second is after.
So it upgrades from PCIe 1.0 to PCIe 4.0 speeds under load.
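For reference, something along these lines shows the same thing per GPU; these are standard nvidia-smi query fields, though the exact output depends on your driver:

```shell
# Current vs. maximum PCIe generation and link width, as the driver reports it.
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current \
  --format=csv 2>/dev/null || echo "nvidia-smi not available"
```

Run it once at idle and once while a model is loaded and generating; pcie.link.gen.current should jump from 1 to 4.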
This doesn’t explain:
Why some of these cards were downgraded to x8 (that could be the riser’s fault)
The constant crashing and freezing.
I wasn’t able to test this with risers even once yesterday since the system would not stop crashing when I tried.
Strange, since it did work the day before without any hardware changes.
But the issues aren’t consistent.
I must’ve spent almost $400 on risers.
I linked this pic in the OP
But this is just one of the models I tried. And of course, if the only remaining issue is the crashing, it could be a single riser or a single GPU causing the trouble.
LINKUP was one of the brands whose “PCIe 4.0” OCuLink cables could not actually do PCIe 4.0 without AER reports cluttering my logs, even with a redriver. Only their 25-centimeter cable worked, and even then the connected device would drop to PCIe 3.0 speeds once in a blue moon.
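For anyone else hunting marginal risers or cables: those AER reports land in the kernel log, so a grep along these lines surfaces them (exact message wording varies by kernel version):

```shell
# Corrected PCIe errors (BadTLP/BadDLLP) mean the link is retrying transfers;
# a steady stream of them usually points at a marginal riser or cable.
sudo dmesg 2>/dev/null | grep -iE 'AER|PCIe Bus Error|BadTLP|BadDLLP' \
  || echo "no AER messages (or dmesg not readable)"
```

Corrected errors are recoverable, but a link that logs them constantly will often also downgrade its speed.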
I bought the motherboard with the CPU already seated because I wanted to avoid that problem, and I don’t currently have the tools to do it anyway.
In any case, after many hours of testing different combinations, I have found that one GPU consistently causes issues with POSTing. Sure enough, the bracket looks somewhat bent. It must’ve been damaged in the past.
It could have physical damage that is causing the freezes and crashes.
I’m getting this issue where the system just won’t turn on at all sometimes.
I mean no fans spin up, no lights, nothing. Then I try to turn it on a few times, pull the plug, wait a few minutes, connect the power again, sometimes I even have to remove the CMOS battery, and finally it’ll turn on. Without any hardware changes. Why?
IPMI doesn’t really report anything in its logs.
The replacement for the broken GPU arrived today.
Now that I know the PCIe version downgrade goes away when the GPUs are in use, that issue is no longer a problem.
With the new GPU, I am not experiencing any crashes anymore for now.
It’s safe to say the GPU was probably the cause.
As for the system not turning on, it hasn’t happened the past couple times I tried to turn it on.
The only remaining issue is one particular GPU being downgraded to x8 PCIe, but I can’t imagine that affects performance in an appreciable way given it’s PCIe 4.0, and everything works fine.
Until something bad happens, I think I can consider this issue resolved. Big thanks to @Mariuspersen for pointing out that the PCIe version downgrade is temporary.
I checked the PCIe speeds with a manual command, and the link always drops back to PCIe 1.0 a second after the GPU goes idle, even with an open CUDA context, so I simply never saw it training back up to 4.0 whenever the AI model received a prompt (especially since I wasn’t using the server much, and instead spent all my time trying to fix this nonexistent issue).
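For anyone curious, the check looks roughly like this; I use a short loop here instead of `watch` so it exits on its own, but interactively `watch -n 1` with the same query is more convenient:

```shell
# Sample the link generation once a second for a few seconds; it sits at
# Gen1 when idle and trains up to Gen4 only while a kernel is actually running.
for _ in 1 2 3; do
  nvidia-smi --query-gpu=index,pcie.link.gen.current --format=csv,noheader \
    2>/dev/null || echo "nvidia-smi not available"
  sleep 1
done
```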
The crashing, which I thought was related, turned out to be a problem with a likely physically broken GPU. With 5 GPUs, it took a while to even suspect it was a problem with that particular one and testing it was very cumbersome.
I don’t know why it sometimes refused to turn on at all. It’s not doing that currently, but it does struggle to turn off: I can turn it off via IPMI, but it doesn’t seem to fully shut down when I do it from the OS (the halt command). That’s the only weird thing left at the moment.
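One possible explanation for the shutdown part, assuming standard Linux semantics: plain `halt` stops the kernel but doesn’t necessarily request ACPI power-off, which would look exactly like a machine that “doesn’t fully shut down”:

```shell
# `halt` alone stops the kernel but may leave the board powered.
# To actually power down, use one of (needs root):
#   poweroff
#   shutdown -h now
#   halt -p          # halt *and* power off
echo "use 'poweroff' (or 'halt -p') instead of bare 'halt'"
```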