VFIO/Passthrough in 2023 - Call to Arms

I’m happy to share that I have finally managed to pass the AMD 7000 series/Raphael/RDNA2 iGPU through to a Windows 10 guest.
Detailed instructions are provided in this Reddit post:
https://www.reddit.com/r/VFIO/comments/16mrk6j/amd_7000_seriesraphaelrdna2_igpu_passthrough/


After getting everything working and going through a Lost Ark fever dream for the past few days, only going back to Linux to work, it got me thinking… why not isolate everything? Has anyone tried that? Have a super stable system as the main host, a Linux box for work, and a Windows box for Windows stuff. Initially I was thinking of only using the VM for EAC or otherwise difficult-to-work-with titles like GTA, Lost Ark, or Star Citizen, but why bother when you can just have a barebones Windows install?

Currently my Linux box has a billion scripts and workarounds, which makes it a bit more fragile than I would like. But if it were in a VM… well… now that’s just magic. I was thinking of a very clean Arch system that serves just as a host to boot everything else, but if anyone has a better suggestion I’m willing to give it a whirl.

I don’t suppose this gives any hope of helping fix the AMD Reset Bug? I’m guessing not, but…

Making GPU Resets Less Painful on Linux - André Almeida, Igalia

I just wish desktop machines didn’t lack PCIe lanes. I had hoped for a new-gen Threadripper. I guess PCIe 5.0 x8 did double the bandwidth for us. But still!

Only if you can manage to find multiple PCIe 5.0 cards. Bifurcating a slot means everything in that slot will run at the lowest common denominator, which for a lot of NICs and other cards we might want to plug in is going to be PCIe 3.0.

Is this a PCIe 5.0-specific thing?

I’ve got a bunch of mixed stuff: my RTX 2000 negotiates a 4.0 link (16 GT/s in lspci) while other stuff (an AX210 and a Renesas USB card) negotiates lower link speeds (5 GT/s) on the same bifurcated slot.
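The negotiated speed and width show up in the `LnkSta` line of `lspci -vv`. A minimal sketch of pulling the speed out of that line; the sample string and the `01:00.0` device address are placeholders, substitute your own card’s address:

```shell
# A sample LnkSta line as printed by `lspci -vv` (your output will differ):
sample='LnkSta: Speed 16GT/s (downgraded), Width x8 (downgraded)'
# Extract just the negotiated link speed:
echo "$sample" | sed -n 's/.*Speed \([0-9.]*GT\/s\).*/\1/p'
# On a live system:  sudo lspci -vv -s 01:00.0 | grep LnkSta
```

The “(downgraded)” marker is lspci telling you the device negotiated below its capability, which is exactly what you’d look for on a bifurcated slot.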

You can do this with a hypervisor of your choice. Personally I use Proxmox, which is essentially Debian plus various virtualization services and a web GUI. The disadvantage is that you need at least another GPU and USB controller to pass through if you want to run your Linux VM alongside.

Other advantages: snapshots, clones, whole-system backups, VM templates.
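For reference, GPU passthrough in Proxmox mostly boils down to a couple of lines in the VM config. A hedged sketch, assuming VM ID 100 and a GPU at host address `01:00` (both placeholders):

```
# /etc/pve/qemu-server/100.conf  (VM ID and PCI address are assumptions)
machine: q35
bios: ovmf
hostpci0: 0000:01:00,pcie=1,x-vga=1   # whole device, all functions (video + audio)
```

Omitting the function number (`.0`) on `hostpci0` passes through all functions of the device at once, which is usually what you want for a GPU with an HDMI audio function.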

Not at all. This is about reporting which application was running when it crashed the GPU, and providing a standardised way for the OS to find out that the GPU internally did some kind of reset, so it can re-initialise and get things working again. Note this is not a full, true GPU reset, just a recovery of the GPU to a working state; it is very different from a GPU reset that brings the GPU back to a pre-boot state.

This does not fix the fact that, for the most part, AMD GPUs do not respond properly to the PCI bus reset or function-level reset signals defined by the PCI standard.

It’s an absolute joke that this is still a problem today. It’s such a simple function that your PC performs every time you turn it on; we just want to be able to trigger this reset so the GPU can POST inside the guest VM. Not really a big ask.

None of the complexity of the GPU matters. If the GPU did a full reset, as if the PC had just been powered on again (like NVIDIA GPUs do), it would be fine. We don’t care about recovering the GPU, just resetting it.


@wendell being able to control both my Linux host and VFIO guest with one mouse and keyboard would be an awesome use-case for this: 4-Port KM Switch with USB 3.2 Gen 1 Mouse Roaming Function — Level1Techs Store

But the store page doesn’t indicate any support for Linux.

This switch is pure hardware design, built-in 4 preset screen layout, no software or driver required, compatible with Windows and Mac system.

That is really weirdly written for L1. Were it any other store, I would assume Linux was simply forgotten and that it really is a pure hardware solution that would work just fine with it. But I think it’s very unlikely this shop in particular would do that.

Anyone else having this weird ‘issue’ with looking glass?

I had an issue with KeePassXC (my password manager of choice), where it did not seem to be launching in my Windows VM. It turns out it was visible on my physical monitor but just not in the LG client. I guess it’s a security feature of KeePassXC, so it doesn’t get captured (accidentally while streaming/sharing your screen, or by malware)?

Not sure if it is a bug or a feature…


Applications that enable content protection can not be captured, this is by design.


So for AVIC, APIC, x2APIC, and x2AVIC… what’s the ideal setup on AMD? I’ve been out of the loop for a bit, but it seems like x2AVIC plus VMs doesn’t always work the way I expect.

Has anyone done a deep dive that can get me up to speed quickly?? Or does someone have a summary from the LKML?

… lol, I should build an LLM and feed it the LKML so I can ask it questions hahaha


The configuration of the sound card presents it as if it were in its own PCIe slot.
IIRC the address part of the sound card should be

<address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x1"/>

Note the bus and function values.
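For context, the usual libvirt pattern is to give both functions of the card matching guest addresses, with the video function at `function="0x0"` marked `multifunction="on"` and the audio at `function="0x1"` on the same bus and slot. A sketch, reusing guest bus `0x06` from the snippet above; the host address `0x03:00` is an assumption:

```xml
<hostdev mode="subsystem" type="pci" managed="yes">
  <!-- GPU video function on the host (address is an example) -->
  <source>
    <address domain="0x0000" bus="0x03" slot="0x00" function="0x0"/>
  </source>
  <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0" multifunction="on"/>
</hostdev>
<hostdev mode="subsystem" type="pci" managed="yes">
  <!-- GPU audio function: same guest bus/slot, function 0x1 -->
  <source>
    <address domain="0x0000" bus="0x03" slot="0x00" function="0x1"/>
  </source>
  <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x1"/>
</hostdev>
```

That way the guest sees one multifunction device, like on bare metal, rather than two unrelated cards.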

I have x2apic disabled in my CPU configuration, because W11 won’t boot if I enable it.
X670E, with 7950X3D
No secure boot, no TPM
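For reference, in libvirt that disable lives under the `<cpu>` element as a feature policy. A minimal sketch; the `host-passthrough` mode is an assumption:

```xml
<cpu mode="host-passthrough" check="none">
  <feature policy="disable" name="x2apic"/>
</cpu>
```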


I got the itch to try this again and it was another “Oops all Rakes!” experience. I set up my system with Proxmox and managed to pass through my GeForce 3080 Ti with little trouble. But no matter what I did, performance was horrible.

I had wanted to stream the games over the network using Sunshine + Moonlight. In Quake 1 it worked fine. In Quake 2 RTX I could get 40 fps max through the stream, but the stream was smooth, and locally (as in using a monitor attached to the VM) it ran fine. In the RE2 remake performance was horrible: stuttering and hitching like crazy, both over the network and locally. I tried every trick I could find. I maxed the VM’s memory and CPU power. I used host CPU passthrough. I tried pinning the VM to the high-performance cores. I even reduced the game’s settings so that it looked like an N64 game to try to figure out the issue. Even with almost no GPU utilization it didn’t run any better. Video encoding also seemed to perform much worse than native. I played with it for hours trying to figure out the issue and just couldn’t solve it.
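For anyone retracing these steps: the pinning and host-passthrough tweaks mentioned above can all be expressed in the Proxmox VM config. A hedged sketch, assuming VM ID 100 and cores 0–7 (both placeholders; `affinity` needs Proxmox VE 7.3 or newer):

```
# /etc/pve/qemu-server/100.conf  (VM ID and core list are assumptions)
cores: 8
cpu: host          # host CPU passthrough
affinity: 0-7      # pin vCPUs to these host cores (PVE 7.3+)
hugepages: 1024    # back guest RAM with 1 GiB hugepages
```

Whether any of this fixes the stutter described above is another matter; it only documents the knobs being turned.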

I set it aside and tried to set up another VM with my Intel GPU so that I could run a couple of containers inside it for Jellyfin and an ffmpeg web UI I cooked up myself. No matter what guidance or kernel-argument sorcery I followed or employed, it just didn’t work. Worse, every time I tried to start the VM and didn’t get it right, the whole node would basically fall over, with QEMU refusing to start/stop/kill any VMs unless I restarted the system. Trying to kill them directly would just hard-lock the host. Supposedly this is possible, but for me it didn’t even get close to working. I managed to find Reddit posts from people using my exact motherboard who said they had it working, but they were using a different hypervisor, so their assurance basically meant nothing.

I just can’t imagine actually relying on this. It’s such a cool capability, and it would solve so many things about the systems in my house and let me save a bunch of electricity if it worked, but it just doesn’t work. Not reliably. Even if I found the right collection of spells and cantrips, I can’t imagine exposing the result to another person who would then rely on it. I can’t set up a gaming VM for my wife and then have to lean into the game room and say “hey, I bungled a container and the server is having a fit, I need to restart it, sorry”, or have her ask “hey, why is this game running like crap?” and, even after an afternoon of troubleshooting, have no answer. I wish it worked. I’m sure I’ll get the desire to check in on it in another year or so and hope I have better luck then.

That said, Proxmox seems pretty nice and I’ll probably use it instead of Unraid when I rebuild my NAS. I didn’t run into any trouble doing non-GPU VFIO and was able to jockey around USB controllers and network interfaces without any drama. At least none I could detect; there could have been performance bottlenecks I didn’t run up against.


I’ve gotten GPU passthrough working on Manjaro Uranos running kernel 6.5.5 with a Sapphire Pulse 7800 XT. I even got Looking Glass working. Thank you to gnif and everyone else who contributed and continues to contribute to the project. Unfortunately, it seems the 7800 XT suffers from the reset bug: immediately after the Windows 10 guest is shut down, the host locks up, and about 10 seconds later it auto-reboots. My previous GPU, an EVGA GTX 1070, worked flawlessly.

I have a 7800 XT (Sapphire Pulse) on an ASRock Taichi X670E.
VFIO and Looking Glass work perfectly, but after shutdown the machine hangs and reboots after several seconds.
This is maddening. A 6950 XT works perfectly.
Any ideas?

try the d3hot thing?
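For anyone who hasn’t seen it: “the d3hot thing” usually refers to forcing the card into the D3hot power state around guest shutdown by writing `PowerState = 3` into the PM control register (PMCSR, 4 bytes into the Power Management capability) with setpci. A sketch; the device address is hypothetical, and the commands are echoed rather than executed so it is safe to run anywhere:

```shell
dev=0000:03:00.0   # hypothetical address; substitute your GPU's (see lspci)
# Print, rather than run, the two setpci invocations:
echo "setpci -s $dev CAP_PM+4.b=3"   # put the GPU into D3hot
echo "setpci -s $dev CAP_PM+4.b=0"   # return it to D0 before the next VM start
```

On a live system you would drop the `echo`s and run the first command before shutting the guest down; no guarantee it helps on a 7800 XT, it is just the workaround people mean by that name.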

Since this thread has become so long, I have missed many replies.
Has anybody come up with a solution to the Kernel 6.x issue where you boot to a black screen if your passthrough card’s DP port has a cable connected?

I have been trying all suggested solutions, but none worked for me.

The correct driver is loaded, but X11 still does not display anything…
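One frequently suggested workaround for the 6.x black-screen symptom is to keep the kernel’s generic framebuffer from attaching to the passthrough card at all, via the kernel command line. A sketch, assuming a GRUB setup; the existing options are elided and whether it helps depends on the exact board/GPU combination:

```
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... initcall_blacklist=sysfb_init"
# (older alternative on pre-5.15 kernels: video=efifb:off)
# then regenerate:  update-grub  (or  grub-mkconfig -o /boot/grub/grub.cfg)
```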

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD103 [GeForce RTX 4080] [10de:2704] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Gigabyte Technology Co., Ltd AD103 [GeForce RTX 4080] [1458:40bc]
        Flags: fast devsel, IRQ 57, IOMMU group 13
        Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Memory at fcc0000000 (64-bit, prefetchable) [size=256M]
        Memory at fcd0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at f000 [size=128]
        Expansion ROM at fb000000 [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau

01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22bb] (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:40bc]
        Flags: fast devsel, IRQ 58, IOMMU group 14
        Memory at fb080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel