Odd issue with VFIO / PCIe link width, errors, and apparent culprit *Updated conclusion*

UPDATE: My conclusions were wrong; this was evidently not caused by the power supply itself. See the follow-up message below.

Original post below:


This is more of a report on some initial findings from a rather perplexing issue I’ve been experiencing, and it may be of some value to other people.

I have a fairly old system - a first-gen Threadripper 1950X in a Zenith Extreme motherboard - and I’ve been running it in a VFIO configuration for quite some time, with one GPU for a Linux VM and a second GPU for a Windows VM. While there were some problems initially, the system has generally been fairly robust.

Lately I’ve been planning a major system change, and with the release of the RTX 4090 in October I decided to buy one as the first major upgrade, even though it’s severely bottlenecked until I upgrade the rest of my system.

Things seemed to be fine, initially. Then I made some changes to the system, including replacing the power supply, updating the BIOS, playing with some new USB controllers, etc. At the same time I’ve been doing further experimentation with PCIe bifurcation and an M.2 eGPU (ADT-Link), so there are quite a few variables at play here.

Afterwards I started noticing some odd behavior.

  1. I could no longer reboot my Windows 10 VM without encountering what appeared to be a reset bug. The whole GPU (RTX 4090) would essentially ‘fall off’ the bus and no longer report correctly in lspci (I don’t recall the error code; I’ll take a screencap once the opportunity arises again). Forcing the VM to shut off and attempting to boot it again would yield “Unknown PCI header type ‘127’”, and only a full host reboot would bring it back up. My other GPU, a Quadro M2000 on my Linux VM, continued to function fine. The RTX 4090 sits in the bottom slot of the board due to its ridiculous size; since I’m using nearly every other slot in the system, I can’t obstruct them. This limits the card to a maximum of PCIe 3.0 x8, unfortunately.

  2. An increased count of PCIe link errors, all corrected, but only on one PCIe link. This was occurring on the link to the eGPU I’d been messing around with, so I figured I just had a connection issue or was abusing the maximum signaling length far too much. (It’s a ridiculous setup - I have an M.2 extender plus the eGPU’s M.2 cable, and the full run is quite long, so this didn’t come as a major shock to me.) This link also came from the primary PCIe x16 slot, which I had bifurcated with a Hyper M.2 card (lol).

  3. Yesterday (it’s been weeks since I started encountering these issues, but I just haven’t had time to deal with them due to working very long hours) I finally decided to try to address some of them.
    I took another look at lspci for the RTX 4090. This time I noticed it had only negotiated PCIe x4. I rebooted into the BIOS, and there, too, it was listed as x4. This slot can be configured to operate at x4, and I figured it might be some issue with the BIOS not respecting my preference for x8 (which disables the U.2 port), so I decided to revert to an older BIOS. No dice, same problem. Clearing the BIOS settings didn’t help, either. (The commands I’ve been using to check the link from the host are just after this list.)
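For reference, these are the sorts of checks I’ve been running on the host. The bus address below is just a placeholder for my card’s (substitute your own from lspci), and the per-device AER counters in sysfs only exist on reasonably recent kernels:

```bash
# Compare what the link is capable of vs. what it actually negotiated.
# Replace 0a:00.0 with the GPU's address from lspci.
sudo lspci -vv -s 0a:00.0 | grep -E 'LnkCap|LnkSta'

# Per-device corrected / non-fatal AER error counters, if the kernel
# exposes them as sysfs attributes.
cat /sys/bus/pci/devices/0000:0a:00.0/aer_dev_correctable
cat /sys/bus/pci/devices/0000:0a:00.0/aer_dev_nonfatal

# Corrected link errors also show up in the kernel log as AER messages.
sudo dmesg | grep -i aer
```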

The apparent culprit.

As I’m sure many of you know, when the RTX 4090 came out there were a lot of concerns about the 12VHPWR cable. I was concerned as well, but my bigger concern was that I couldn’t close the side panel on my case without pressing down on the cable quite a bit, so I had the system running with the side panel off. Additionally, I’ve been running a Rosewill Hercules 1600W power supply for about 8 years now; it’s quite old, and there’s presumably zero chance of getting a better 12VHPWR cable for it. I found out that Seasonic was shipping free 12VHPWR cables to owners of Seasonic units, so when I spotted a Seasonic GX-1300 on Newegg for a very decent price, I decided to order one.

Eventually I got the power supply and the 12VHPWR cable and was able to retire the old Rosewill Hercules. Although it didn’t seem clear at the time (because of the many other changes I’d made to my system), this appears to be when my problems began.

Last night I ad-hoc connected my old 1600W unit to my system, 12VHPWR adapter and all. The system appears to be completely stable. I’m not getting any link errors on the eGPU despite the ridiculous cable length, the RTX 4090 is linking at x8 width, and the GPU reset problem has vanished.

I haven’t yet tried swapping back to the Seasonic GX-1300 to ensure this is 100% the cause, but I’m fairly certain it is.

Has anyone here experienced anything like this before?

TL;DR:

Switching from a Rosewill Hercules 1600W power supply to a Seasonic GX-1300 appears to be responsible for a GPU failing to negotiate full link width, PCIe errors on an external GPU, and a GPU reset failure when rebooting a VM.

UPDATE:

All of my conclusions were wrong; see the follow-up message below.

Conclusion: (I’ll update this as I go)

If the graphics card is not properly seated, it may no longer reset correctly when the VM is rebooted, even though it otherwise appears to work fine. Notably, this is different from a card that is properly seated but explicitly set to operate at a reduced link width.

Example:
Scenario 1: Slot configured to x8 in BIOS, card improperly seated running at x4, fails to reset on VM reboot.

Scenario 2: Slot configured to x4 in BIOS, card properly seated running at x4, succeeds in reset on VM reboot.
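If anyone wants to poke at the reset behaviour from the host without going through a full VM boot cycle each time, the PCI sysfs reset hook is a quick way to do it. Treat this as a sketch rather than a definitive test - the address is a placeholder for my card’s, and the card shouldn’t be in active use by a guest while you do it:

```bash
# Ask the kernel to reset the function. The 'reset' attribute only
# exists when the kernel has at least one usable reset method for
# the device. Replace the address with your GPU's.
echo 1 | sudo tee /sys/bus/pci/devices/0000:0a:00.0/reset

# Afterwards, check that the device still responds sanely.
sudo lspci -vv -s 0a:00.0 | grep -E 'LnkSta|Status'
```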

Now I have more questions…

I’m not sure if my problem is exactly the same, but you gave me the idea of power transients on the PCIe bus causing issues; I hadn’t really considered that before.

My setup is a Quadro P2200 passed through to a Dell M630 via the VRTX PCIe fabric switches and then DDA’d to a Windows VM, which is accessed through RDP. Whenever I substantially stress the graphics I get a host crash, and the graphics card won’t reinitialize until I power the machine off and back on; my initial thoughts were that somehow the PCIe link to the card being small (PCIe v2 x8) was causing the problem.

I ended up being completely wrong - it’s not the power supply, at least not with respect to the GPU reset issue and the link width issue. The PCIe errors on the eGPU seem like a red herring; they’re a lot more intermittent, so I couldn’t be 100% sure from the swap regardless. But even my conclusion that the power supply was the probable culprit was wrong.

I’m now running the Seasonic with the RTX 4090 linking at x8, and the GPU reset problem is gone.

Steps taken since OP:

[Running Rosewill power supply]

  1. Run a benchmark in Windows to ensure the PCIe link really is x8. Results confirm this - I’m getting the appropriate bandwidth. Rebooted the host multiple times, rebooted the VM multiple times; no issues.
  2. Swap the Seasonic power supply back in.
  3. Enter the BIOS. No longer linking at x8, back at x4. Load the VM; the GPU reset issue is present once again. (I’m not seeing any link errors on the eGPU, though - at this point I really think that was a red herring and that I jumped the gun on something that can be so intermittent.) I had also re-connected the M.2 connector, so that’s another confounding variable.
  4. Swap the Rosewill back in. This time it’s only linking at x4. Boot the VM; the GPU reset issue is present, and I’m forced to reboot the entire host.

When this happens, this is what it looks like:

lspci -vv: [screenshot of the GPU’s lspci -vv output]

dmesg: [screenshot of the corresponding kernel log]
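For anyone wanting to confirm the ‘fallen off the bus’ state without wading through the full lspci output, reading the header-type byte directly is a quick check. The address below is again just a placeholder for my card’s:

```bash
# Read the PCI header-type byte (config space offset 0x0e).
# Replace 0a:00.0 with the GPU's address from lspci.
sudo setpci -s 0a:00.0 0e.b

# A healthy function returns 00 (or 80 if the multifunction bit is
# set); when the card has dropped off the bus, config reads come back
# as all ones and this prints ff. QEMU masks off the multifunction
# bit, which is where the "Unknown PCI header type '127'" (0x7f)
# message comes from.
```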

OK, so at this point it seemed clear to me that it’s not the power supply, and I was of course thinking it must be related to the process of swapping the hardware itself - and it seems like that’s exactly what it was.


The connectors were secure, so it wasn’t that. But then I realized that with the Rosewill power supply I use a lot more force to remove/install the 12VHPWR connector than I do with the Seasonic one in order to get it to ‘click’ into place.

So I think that when I’m installing the Nvidia-provided cable, I’m also pushing the card into the board with a fair bit of force in the process. And when I remove it, even though the card is fairly well secured (and an absolute pain to install in this chassis), it must be slightly pulling the card out of the slot, even if that’s not perceptible.

I then re-installed the Seasonic unit and re-connected the cabling, but this time really pushed the card in to make sure it was truly seated in the slot. The system now links at x8 and the GPU reset problem has vanished.

On the surface this seems really stupid, but I promise I tried to re-seat the card a bit when I started troubleshooting. :sweat_smile:


This does raise further questions, though. If the GPU is able to negotiate at x4 but not x8, why is GPU reset failing?

Are (some? all?) Nvidia GPUs incapable of a proper reset when they link at less than x8 width? This is something I’m really interested in, because I’d like to be able to use eGPUs with x4 PCIe connectivity, and if reset simply isn’t functional at that link width on certain hardware, then that’s a big concern for me. EDIT: See Update #2.

TL;DR:

Evidently it’s not the power supply; rather, the GPU was not fully seated after the Seasonic unit was installed. The Nvidia adapter that has to be used with the Rosewill requires a lot more pressure to insert than the 12VHPWR cable supplied by Seasonic, which pushes the card more securely into the slot. The same applies when removing it - the extra force can slightly dislodge the card from the slot.

UPDATE #2: Answering my own question here regarding whether or not my GPU can properly reset despite being limited to x4 link width. When the card is not fully seated, apparently not (but why?). When the slot is explicitly set to operate at x4, the GPU appears to reset correctly.

Evidently I need to read up on the PCIe spec to see why this is, because apparently a card can ‘function’ at a reduced link width yet lose the ability to properly reset when it’s not correctly seated…!
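When I do dig into this, I’ll probably start by checking which reset mechanisms the kernel and the device actually advertise. A sketch of what I have in mind (placeholder address again; the reset_method attribute only exists on fairly new kernels, 5.15+ if I remember right):

```bash
# Which reset methods the kernel is willing to use for this device
# (e.g. flr, pm, bus).
cat /sys/bus/pci/devices/0000:0a:00.0/reset_method

# Whether the device itself advertises Function Level Reset in DevCap.
sudo lspci -vv -s 0a:00.0 | grep -i flreset
```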

> my initial thoughts were that somehow the pcie link to the card being small (pcie v2 x8) was causing the problem.

Well, one conclusion from my experience that seems somewhat robust (even though my original conclusion was wrong) is that the link width could very well have something to do with, at the very least, the ability to properly reset the GPU in a VM *. Requires further investigation. :sweat_smile:

> whenever I substantially stress the graphics I get a host crash and the graphics card won’t reinitialize until I power off the machine and then power it back on

Stressing the hardware to the point of a crash could be power related, though it might also be related to uncorrected PCIe link errors.

Years and years ago (this was like 2007) I had a power supply that would cause a graphics card to shut off after a few minutes of load while the rest of the machine remained on. Using a molex → PCIe converter worked as a temp work-around. I think it was a dual rail PSU and one of them was not responding well under load, but I’m not certain. Replacing the power supply solved it.

EDIT:
* Tested by setting link to x4 in BIOS, appears to be able to reset properly (multiple times).

hahaha the 12VHPWR connector strikes again! This time in a novel way not related to bursting into flames.

I’m not getting any PCIe errors logged in Event Viewer (Windows), other than after the crash reboots the computer and the PCIe device won’t initialize; the strange thing is that I’m not seeing anything out of the ordinary leading up to the crash in Event Viewer, and I can’t even get a memory dump when it happens.

I can’t get GPU-Z in the VM to give me the link width of the graphics card passed through to it, even when running the benchmark. It runs the benchmark just fine and the power ramps up, so perhaps it’s not a PCIe bus power issue. I do get a “Rendering might not work over a remote session.” message, though - I can’t even find a mention of anyone else on the internet getting that - but clicking ignore starts the test and it runs fine.

I think the PCIe link width not being reported is probably where I should start digging; I know I have a really weird configuration with how the PCIe traffic is forwarded between a bunch of different PCIe switches.

EDIT: I’m running a really old Nvidia driver on purpose; it seemed to be the most stable.


Interesting!

While our setups are very different - I’m using VFIO instead of DDA, and my PCIe devices are all directly connected (without any sort of PCIe switch fabric) - I do have a Win 10 VM with a Tesla M40 I was messing around with, and I get proper PCIe link reporting in it with GPU-Z. I do also get the same ‘Rendering might not work over remote session’ dialogue that you get, though.

On my virtualized desktop with the RTX 4090, on the other hand, I don’t get proper reporting; I just get ‘PCI’ in the Bus Interface field of GPU-Z, which is a little odd.

I have a tool I downloaded from… somewhere a long time ago called ‘concBandwidthTest’ - a simple utility someone wrote to transfer buffers to/from the GPU - and I’ve found it useful as a quick sanity check for this kind of thing.

What does Nvidia control panel show under System Information for your link speed? I get ‘PCI Express x8 Gen4’ on my RTX 4090. Correct link width but obviously incorrect link speed.
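Something else that might work regardless of what GPU-Z shows is querying the link state with nvidia-smi inside the guest. I’m assuming here that your (older) driver still ships nvidia-smi and supports these query fields:

```bash
# Current vs. maximum PCIe generation and link width as seen by the
# driver inside the VM (the same command works on Windows and Linux).
nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv
```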


I almost want to go Linux for my host to see if there are fewer problems, but I’m slightly more competent in Windows than I am in Linux.

I need to read up on how the Teslas work in VMs; I was under the impression you needed some kind of special vGPU license from Nvidia for them to work, but the number of people running them would seem to indicate otherwise.

Thanks for the tip! I downloaded it and I’m getting ~2800 MB/s DtoH, which aligns with PCIe 2.0 x8 as expected.
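Back-of-envelope math for anyone following along - my numbers, assuming the usual encoding-overhead figures:

```bash
# PCIe 2.0 runs at 5 GT/s per lane with 8b/10b encoding, i.e. roughly
# 500 MB/s of usable bandwidth per lane, so an x8 link tops out around:
echo $((500 * 8))   # 4000 MB/s theoretical
# ~2800 MB/s device-to-host through a couple of switch hops is in the
# right ballpark for a Gen2 x8 link rather than something degraded.
```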

I get this when I try launching it. I thought about getting a displayport dummy plug (or I suppose I could just plug in a monitor) to see if it’ll let me through or change behavior.
