3090 "Downgraded" to PCIe 1.0 speeds, tried multiple different risers

As the title says, my problem is that my 3090, connected through a riser, is being downgraded to PCIe 1.0 speeds (2.5GT/s).

lspci -vv -s 81:00.0 | grep Lnk
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
LnkSta: Speed 2.5GT/s (downgraded), Width x16
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
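
For reference, the same link status should also be readable straight from sysfs (same device address as above):

cat /sys/bus/pci/devices/0000:81:00.0/current_link_speed
cat /sys/bus/pci/devices/0000:81:00.0/current_link_width
cat /sys/bus/pci/devices/0000:81:00.0/max_link_speed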

Motherboard: Supermicro H12SSL-i
CPU: Epyc 7282
GPU: Nvidia RTX 3090 FE

Everything should be capable of PCIe 4.0.
I tried a few risers, some marked as “4.0” compatible; now I’ve got a more expensive one that claims to work with PCIe 5.0: https://files.catbox.moe/6zu9ak.jpg

The BIOS doesn’t give me an option to change the PCIe version, which is clearly 4.0, only to enable bifurcation.
The PCIe slot does not matter. I have several 3090s and all of them have the same issue. I even tried a PCIe 3.0 riser to see what would happen. Same thing. Enabling ASPM did not change anything.
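
For what it’s worth, the ASPM state on the Linux side can be checked independently of the BIOS toggle (standard paths, assuming a reasonably recent kernel):

cat /sys/module/pcie_aspm/parameters/policy
lspci -vv -s 81:00.0 | grep ASPM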
Screenshots from the BIOS:




I connected all the GPUs, and two of them are even worse: downgraded to x8 for some reason.
Those two use different risers, so maybe that’s why.

Since nvidia-smi shows they’re all idle, they’re probably in a power-saving state, where it’s normal for the link to drop to a lower speed to save power. What does lspci show when you run one up to full load?
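
A rough sketch of that test, assuming PyTorch with CUDA is installed (you’re running models anyway) and reusing the 81:00.0 address from your output; the host-to-device copies keep the link busy:

python - <<'EOF' &
# load generator: repeated host->device copies produce sustained PCIe traffic
import torch
host = torch.randn(4096, 4096).pin_memory()    # pinned host buffer
dev = torch.empty(4096, 4096, device='cuda')   # device-side buffer
while True:
    dev.copy_(host)            # host->device transfer over PCIe
    torch.cuda.synchronize()   # wait so copies don't just queue up
EOF
watch -n 1 "lspci -vv -s 81:00.0 | grep LnkSta"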


In picture one, the bifurcation entries should be your PCIe options. What is set there instead of “Auto”?

The only thing it lets me set is the bifurcation: x4x4, etc.

Put a load on the GPU and see if it changes. It won’t use its bandwidth if it doesn’t need to.

That’s the behavior I’ve observed, at least.

I have a 3090 with a riser in my main rig. Currently at work, but I will look at the behavior a little closer when I get home.

The GPUs work, but they constantly have issues.
Randomly, they will “freeze”: the fans suddenly spin up to 100% (and won’t spin down again), all applications using the GPUs freeze and stop responding, and tools like nvidia-smi simply don’t return if called. It’ll just hang without output. Nvtop will also freeze if already open, or do the same as nvidia-smi if I try to launch it.

The only solution is to reboot the entire system.
The system is more or less unusable as it is. It will work for a few hours and then do this, or even do this after a reboot as soon as I try to use the GPUs. It’s a crapshoot.
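
Next time it hangs, I’ll try to pull the kernel log over SSH before resetting; I’d expect the driver to leave Xid or AER traces there if it’s a hardware fault (the grep terms are just my guess at what would show up):

journalctl -k -b | grep -iE 'xid|nvrm|aer'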

Reboot doesn’t even work in these situations without a hard reset; it blocks here:


After rebooting again, one of the GPUs was no longer available and I got this:

The performance was lower than what I got from the same GPUs in the cloud. So there definitely is something wrong.
At the moment I cannot get them to work again like yesterday because it keeps crashing like I described.

Interestingly, it seems two of them got the non-downgraded bandwidth just randomly?


Keep in mind at least one is unavailable through nvidia-smi and CUDA, so it could be the one that isn’t functioning.
Actually, none of them are working properly. Everything I do with them is hanging.
So, yeah, it seems it’s not working at all anymore today despite mostly working yesterday.

Hmm, this could be a lot of things.

Did you sanity test this without risers?

It could also be a driver issue; did you upgrade the driver recently?

Does it work under Windows, etc, etc.

I don’t know what the failure mode for a faulty riser would be, but I have a hunch there is something else causing this issue.

Heck, maybe even the mounting pressure on the CPU is causing issues.

A lot of variables here.


I just set up the entire system from scratch; it’s Arch Linux with all the latest drivers from the repositories. As for sanity testing without risers, I wanted to, but the case doesn’t fit any of the GPUs without them. I can say that at least some of the GPUs were taken from my own working systems, so they can’t all be broken.

I’m not having issues with other PCIe devices like the 10G NIC or the Samsung SSD.
The motherboard was sold with the CPU already mounted, as Epycs often are, so I assume the mounting pressure is correct. I bought a similar system in the past and had no issues, though I also didn’t try doing this.

I might try to disassemble the entire thing to connect the GPUs directly without the case.


What risers specifically are you using? Typical AliExpress slop won’t cut it for 4.0.
Most branded risers won’t even do that.

Also try having the riser closer to the top slots; less distance for the signal to travel from the CPU.


You were correct. It shows the downgraded speed even without risers. It took me a while to disassemble everything, and I had to remove the brackets from the GPU to attach it to the mobo outside the case, but I was able to connect a GPU directly.


The first command is from before loading a model onto the GPU, the second from after.
So the link upgrades from PCIe 1.0 to PCIe 4.0 speeds under load.
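
To catch the transition as it happens, polling sysfs works too (same 81:00.0 device as before):

watch -n 0.5 cat /sys/bus/pci/devices/0000:81:00.0/current_link_speed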

This doesn’t explain:

  • Why some of these cards were downgraded to x8 (that could be the riser’s fault)
  • The constant crashing and freezing.
    I wasn’t able to test this with risers even once yesterday since the system would not stop crashing when I tried.

Strange, since it did work the day before without anything changing hardware-wise.
But the issues aren’t consistent.

I must’ve spent almost $400 on risers.
I linked this pic in the OP.

But this is just one of the models I tried. Then of course, if the only remaining issue is the crashing, it could be just one riser or one GPU causing trouble.


LINKUP was one of the brands whose “PCIe 4.0” OCuLink cables could not actually do PCIe 4.0 without AER reports cluttering my logs. That was with a redriver even. Only their 25-centimeter cable worked, and even then the connected device would drop to PCIe 3.0 speeds once in a blue moon.
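
If you want to check whether your risers are throwing correctable errors too, AER messages land in the kernel log, and newer kernels also expose per-device counters (the path below just reuses your card’s address as an example):

journalctl -k | grep -i aer
cat /sys/bus/pci/devices/0000:81:00.0/aer_dev_correctable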

Have you tried reseating the CPU? Those large chips are really picky.

I bought the motherboard with the CPU already seated because I wanted to avoid that problem, and I don’t currently have the tools to do it anyway.

In any case, after many hours of testing different combinations, I have found that one GPU consistently causes issues with POSTing. Sure enough, the bracket looks somewhat bent. It must’ve been damaged in the past.
It could have physical damage that is causing the freezes and crashes.


I’m getting this issue where the system just won’t turn on at all sometimes.
I mean no fans spin up, no lights, nothing. Then I try to turn it on a few times, pull the plug, wait a few minutes, and connect the power again; sometimes I even have to remove the CMOS battery before it finally turns on. All without any hardware changes. Why?

Sounds like socket problems to me, sorry pal.
Maybe it’s not bent pins; hopefully it just needs reseating.

Epyc and Threadripper all use the same torque spec, so maybe you can find a torque driver cheap.

Oh boy, I might be leading you down the wrong path, but those symptoms sound vaguely similar to OCP or some sort of protection triggering in the PSU.

Though I think you are experiencing a compound of issues. Have you tried connecting to the IPMI and seeing what it says about the board?
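
Something like this should dump the BMC’s event log (a sketch; the address and credentials are placeholders):

ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel elist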

IPMI doesn’t really report anything in its logs.

The replacement for the broken GPU arrived today.
Now that I know the PCIe version downgrade goes away when the GPUs are in use, that issue is no longer a problem.

With the new GPU, I am not experiencing any crashes anymore for now.
It’s safe to say the GPU was probably the cause.

As for the system not turning on, it hasn’t happened the past couple times I tried to turn it on.

The only remaining issue is one particular GPU being downgraded to x8, but I can’t imagine this really affects performance in an appreciable way given it’s PCIe 4.0, and everything works fine.

Until something bad happens, I think I can consider this issue resolved. Big thanks to @Mariuspersen for pointing out that the PCIe version downgrade is temporary.
I checked the PCIe speeds with a manual command, and the link always drops back to PCIe 1.0 a second after the GPU stops being actively used and goes idle, even with a CUDA context open. So I just never saw it upgrade to 4.0 whenever the AI model received a prompt (especially since I wasn’t using the server much, and instead spent all my time trying to fix this nonexistent issue).
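
For anyone who wants to watch that behavior, a small polling loop (device address assumed, as earlier) timestamps each link-speed change; the downshift shows up about a second after the GPU goes idle:

prev=""
while true; do
  cur=$(cat /sys/bus/pci/devices/0000:81:00.0/current_link_speed)
  if [ "$cur" != "$prev" ]; then echo "$(date +%T) $cur"; prev="$cur"; fi
  sleep 0.5
done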

The crashing, which I thought was related, turned out to be a problem with a likely physically damaged GPU. With 5 GPUs, it took a while to even suspect that particular one, and testing it was very cumbersome.

I don’t know why it sometimes refused to turn on at all. It’s not doing it currently, but it does struggle to turn off: I can turn it off via IPMI, but it seems like it doesn’t fully shut down if I try to turn it off via the OS (the halt command). That’s the only weird thing left at the moment.
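
For the halt part, this might actually be expected behavior: on a systemd setup like Arch, halt stops the OS without requesting the ACPI power-off, while poweroff does both. Worth trying before blaming the board:

systemctl halt       # stops the system, may leave the board powered
systemctl poweroff   # stops the system and cuts power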


This topic was automatically closed 273 days after the last reply. New replies are no longer allowed.