GPU Passthrough with Threadripper and Vega working but

OK, first the good: I got GPU passthrough to work on my Threadripper system. Had to cannibalize my old system to get a GPU for the host OS because the GPU I initially ordered for this purpose was DOA (well, at least it was probably the cheapest part in the whole computer…). But aside from that the whole process was remarkably painless.

I’m doing this under Debian BTW, though it’s the testing branch. Debian stable is just too out of date. But this was all pretty easy to set up if you use testing.

Because I know people will be curious, the motherboard I’m running is the ASRock X399 Taichi and I haven’t seen anyone talk about its IOMMU groups before. They seem basically perfect to me. Not only did I not have any conflicts, I don’t really see any potential conflicts. Both GPUs are in their own groups (and they’re in the 16x slots). I have two M.2 slots occupied, one of which is in group 0 along with some random stuff like USB and SATA controllers and ethernet. But the other is completely on its own, so unless for some reason you want to pass through all your NVMe storage there shouldn’t be a problem. So it’s definitely a good board to consider for this kind of stuff.
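For anyone who wants to check the groups on their own board, here is a minimal Python sketch that walks the sysfs tree Linux exposes for IOMMU groups. The function name `list_iommu_groups` is made up for this example, and the `root` parameter exists only so the function can be pointed at a test directory; it assumes the IOMMU is enabled in firmware and on the kernel command line.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [pci_address, ...]} read from sysfs.

    An empty dict means the directory is missing, which usually
    indicates the IOMMU is disabled or this is not a Linux host.
    """
    groups = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return groups
    for group_dir in root_path.iterdir():
        devices_dir = group_dir / "devices"
        # Group directories are named by their number; skip anything else.
        if not group_dir.name.isdigit() or not devices_dir.is_dir():
            continue
        groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return dict(sorted(groups.items()))


if __name__ == "__main__":
    for group, devices in list_iommu_groups().items():
        for address in devices:
            print(f"IOMMU Group {group} {address}")
```

The addresses printed can be fed to `lspci -nns <address>` to get the human-readable device names.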

The bad: I have the same problem everyone does with Vega where you have to actually power down the whole system to do anything with your Vega card after shutting down a VM that’s using it. That seems a little ridiculous to me. It practically defeats the purpose of running a VM instead of just dual booting, unless you just leave the VM running all the time…

Also my performance under the VM is awful, though I suspect that might be related to the NPT bug I’ve heard people talk about. I haven’t actually done anything to bypass that yet and I’ve read conflicting claims about what you should do. I feel like this point really needs some clarification.

An interesting aside: Since no driver is loaded for Vega initially, the card seems to default to running its fan at full speed all the time, even though the card is not even being used for anything. This is really noisy (it’s a reference blower card after all)…and seems like a waste of power and a good way to wear out the fan too. However, as soon as you start up a VM that uses it, it shuts up.

Anyway, that’s the current state of things. Since this is a hot topic right now I thought people might be interested.


Great to hear about IOMMU groups looking reasonable. A good way to tell whether your game performance issues have to do with the NPT bug is to run the Unigine Heaven benchmark both with npt on and off via the kernel (or modprobe) command line, and check whether you get differences in performance similar to what is reported in this bug report.
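A small Python sketch of how the npt switch could be checked and set, in case it helps. It relies on `npt` being a parameter of the `kvm_amd` module, so the running value shows up under `/sys/module/kvm_amd/parameters/npt`. The helper names are invented for this example, and the modprobe.d file you put the `options` line into is up to you.

```python
from pathlib import Path


def read_npt_setting(param_path="/sys/module/kvm_amd/parameters/npt"):
    """Return True/False for the running npt setting, or None if unknown.

    None means the parameter file is missing, e.g. kvm_amd is not loaded.
    """
    path = Path(param_path)
    if not path.is_file():
        return None
    # Depending on kernel version the value prints as 1/0 or Y/N.
    return path.read_text().strip() in ("1", "Y", "y")


def npt_config_lines(enabled):
    """Return (modprobe.d line, kernel command line argument) for a setting."""
    value = 1 if enabled else 0
    return (f"options kvm_amd npt={value}", f"kvm_amd.npt={value}")


if __name__ == "__main__":
    print("npt currently:", read_npt_setting())
    modprobe_line, cmdline_arg = npt_config_lines(False)
    print("modprobe.d line:", modprobe_line)
    print("kernel cmdline: ", cmdline_arg)
```

Either form should only take effect once the module is reloaded (with no VMs running) or the host is rebooted, so run the benchmark once per setting.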

Re the GPU reset problem: I don’t know if it helps, but in the past there was a similar reset bug with some AMD videocards which could be worked around by ejecting the GPU via device manager in the windows guest. This triggered re-initialisation of the card which usually made it work. Obviously you need a second display configured, like vnc or spice, in order to perform this function. But I have no idea whether it still works or whether it is the same bug that is affecting Vega.

That’s an interesting idea but I’m not quite sure how to go about it. If I load the Windows VM with the card passed through and the drivers installed it will always try to use my second card as a display device (and then fail if it’s the second time I’m booting it). But if I don’t pass it through obviously I can’t eject it in the device manager because Windows won’t even see it in the first place. How do I force it to use Spice? Sorry if there’s something obvious I’m missing here, I’m still fairly new to this virtualization stuff.

Unrelated but I figured I would try out the Xen hypervisor because it’s not supposed to suffer from this NPT issue at all (it’s a KVM-specific bug). Unfortunately when I choose my Xen kernel in GRUB it just gives me a blank screen after the “Loading initial RAMdisk” message. Fortunately the non-Xen kernel still works. I tried looking through the kernel log to see what the issue was but I don’t see anything obviously wrong. It says it’s detecting the correct card as a display device and there aren’t any obvious errors. But something clearly isn’t working…

Has anyone else tried Xen with Threadripper yet? Any luck there?

With NPT enabled, how many cores is windows using? Because for me (arch/x370/ryzen 1700) with NPT enabled, Windows only uses 1 core, 1 thread. And no amount of pinning or manual tweaking of threads/cores in virt-manager changes that. The only way I can get it to use more cores is by changing the CPU type to “penryn”

I believe that you could just add a virtual device type “vga” if you use virt-manager or similar, and you will get output in virt-manager’s window for the machine. (actually qemu will make a vnc-server and virt-manager connects to this, you can do it manually too, but I always used virt-manager)

It’s a long time since I used an AMD card this way, but back then I was using Seabios and passed through an AMD 7790 (and before that a 6850) as a secondary card, which means that the emulated vga was initialised by Seabios and the discrete GPU was added as any other PCI device later in the boot process. Then I could configure within windows so that output came on the discrete GPU when it was working, and otherwise on the emulated vga. After each VM reboot I logged on the VM via vnc through virt-manager, ejected the gpu, and then the emulated vga got black and all output was on the GPU.

If you pass the card as primary with OVMF it will be different and I have not tried the reset procedure under this circumstance. It might be that you cannot eject it if it is the primary display, I have not tried though. But I don’t see why secondary passthrough would not work for you as well, as AMD (as opposed to nVidia) tends to support it.

Edit: I think you should at least try your current config with npt=0 and see whether it is usable for you. Problems with having npt disabled seem to be mainly with disk I/O, and the more direct access the VM has to the disk the less you are affected. I predict, but I cannot verify, that if you manage to give your VM an M.2 drive of its own via PCI passthrough it will not be much affected by having npt switched off. For the record, the rig I used before was K10-based and I kept npt off because of the bug. I do remember that disk I/O was a bit slow, but it was never a dealbreaker for me. I/O problems more or less disappeared when I started using a raw SSD device for the Windows guest passed through as a virtio-scsi device.

I also used Xen before that (around 2013), and switched to KVM when vfio arrived. But that was so long ago that I don’t think I can give you much advice on getting it working, unfortunately.

I did try with npt=0 and performance was massively improved. Quite usable except for occasional audio skipping. I’m not passing through the audio device (because I still want to use it on the host) so I’m still looking into ways to tweak that. This post is interesting.

One thing I did not expect was Synergy to completely crap out on me. Occasionally the game I was playing would freeze up for a second and then after a bit of delay my mouse cursor would reappear on my host machine. And yes, I did have the “Use relative mouse moves” setting enabled. It would be perfectly fine most of the time but then would occasionally just completely wig out on me. I later looked at the task manager in the Windows VM and Synergy was somehow using 1.4 GB of memory…

Yes, gigabytes. For friggin Synergy. I have no idea what was going on there…

I have a dedicated audio card passed through and I still get significant stuttering/skipping so I’m not sure where the skipping is coming from or how to fix it. It’s a little frustrating. For USB, I passed through a USB controller and have an extra keyboard and mouse just for windows 10.

Is there a reason you use relative mouse movement? I have learned as a best practice to use an emulated tablet device ( -usbdevice tablet) if you use an emulated mouse, in order to get absolute movements. But I might be wrong about which use cases this applies to - I only used it for mouse integration when running a VM in a window, otherwise I have passed through an entire USB controller for keyboard and mouse just as @hondaman. So I’m no authority on this, and I couldn’t find a good source with some quick googling.

I also only used USB audio through the same controller, so I don’t know much about mixing guest and host audio, unfortunately. It is something which might well be affected negatively by npt=0, also from a theoretical perspective (as it is about emulating IO).

Relative mouse movement is mostly for games and stuff. If you’re doing point-and-click, menu-driven stuff it shouldn’t matter, but if you’re using the mouse to control something like the camera in a 3D game then you can get some very odd behavior if it isn’t on. It also only seems to be turned on when the pointer is locked to a screen, though that makes sense. The way Synergy normally works you can freely move the cursor from screen to screen as if it were a dual monitor setup on a single computer. But then you can also hit scroll lock to make the cursor confined to a single machine, and if you have “use relative mouse moves” enabled on the server it also goes into relative mode when the lock is on. Because you don’t want to be moving the camera around only to have your cursor wander off to your other computer and then find you’re not even controlling the game anymore.

Unfortunately that scenario is what was happening to me. Though it didn’t seem to have anything to do with how I was moving the mouse (it would happen even if I was moving the mouse away from the other screen, for example) and just seemed to be more that Synergy doesn’t know what’s going on and then reverts to the default state of having the cursor on the server’s screen. For instance, after it did this the cursor would always start in the exact middle of the screen, not the edge or anything.

I certainly could plug in a dedicated mouse and keyboard for Windows…in fact I did exactly that when I was still setting it up. But I would really rather avoid doing this not because I don’t have extra mice and keyboards but because I don’t want to waste the desk space. Where do you guys even put them? My keyboard of choice is a classic Model M and I like it but it’s not small…

Synergy seemed like an ideal solution in theory and a lot of people praise it but it’s been quite buggy for me…

A few more updates…

So I tried disabling Vega in the Windows device manager before shutting down the VM to see if it would help reinitialize Vega. It didn’t. I didn’t go as far as uninstalling Vega entirely because then you would have to reinstall it to get it back and I’m pretty sure that would require rebooting the VM (and therefore the host because of this problem…) anyway.

I did notice something weird though. After disabling Vega and shutting down the Windows VM, doing an lspci just gives “unknown header type 7f” where Vega should be. If I shut down the Windows VM normally it doesn’t do that.
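A rough Python sketch of where that “7f” comes from, as far as I understand it: the header type lives in byte 0x0E of the PCI configuration space, and lspci masks off the top (multi-function) bit before reporting it. When a device stops answering config reads, every byte comes back as 0xFF, and 0xFF & 0x7F is 0x7F. That reading of the symptom is an assumption on my part, not a confirmed diagnosis for Vega, and the function names are invented for this example.

```python
from pathlib import Path

HEADER_TYPE_OFFSET = 0x0E  # standard PCI config space offset of the header type


def read_config(address, root="/sys/bus/pci/devices"):
    """Read the raw config space of a device such as '0000:0b:00.0'."""
    return (Path(root) / address / "config").read_bytes()


def header_type(config_bytes):
    """Return the header layout: the low 7 bits of byte 0x0E."""
    if len(config_bytes) <= HEADER_TYPE_OFFSET:
        raise ValueError("config space dump is too short")
    return config_bytes[HEADER_TYPE_OFFSET] & 0x7F


def looks_unresponsive(config_bytes):
    """True when the whole 64-byte standard header reads back as 0xFF."""
    head = config_bytes[:64]
    return len(head) == 64 and all(b == 0xFF for b in head)
```

Running `looks_unresponsive(read_config(address))` on the host after the VM shuts down would show whether the card has dropped off the bus without having to eyeball lspci output.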

Now on to input. I have successfully passed USB devices to Windows but I wasn’t doing a whole controller. If I want to try using something like a switch for my keyboard and mouse I’m going to need the whole controller to have hotplug capability. So looking at lspci there appear to be 3 USB controllers. One of which is grouped with stuff I don’t want to pass through like my SATA controller but the other two are…kind of on their own. I hesitate there because there is some random other shit:

IOMMU Group 4 00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
IOMMU Group 4 00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1454]
IOMMU Group 4 0c:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device [1022:145a]
IOMMU Group 4 0c:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Device [1022:1456]
IOMMU Group 4 0c:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] USB3 Host Controller [1022:145c]

OK, host and PCI bridges are normal and I don’t think I have to worry about them. I have no idea what non-essential instrumentation is but it doesn’t sound very essential. The encryption controller I’m really not sure about. In any event, if I try passing just the USB controller, the VM won’t start at all, crapping out with some long error message I don’t quite remember. But it’s definitely on the virt-manager side of things; it doesn’t even get to the point of trying to boot Windows.
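To make the grouping question concrete, here is a small Python sketch that parses a listing in the format above and reports which other non-bridge devices share a group with a given device. As I understand vfio, the endpoints in a group have to be handed over together (bound to vfio-pci or left without a host driver) while bridges are exempt, which may be why passing only the USB controller fails; treat that as an assumption. The function names and regex are written just for this example.

```python
import re

# Matches lines such as:
# IOMMU Group 4 0c:00.3 USB controller [0c03]: ... [1022:145c]
LINE_RE = re.compile(
    r"IOMMU Group (?P<group>\d+) "
    r"(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"(?P<desc>.+?) \[(?P<cls>[0-9a-f]{4})\]: "
    r".*\[(?P<ids>[0-9a-f]{4}:[0-9a-f]{4})\]$"
)

BRIDGE_CLASSES = {"0600", "0604"}  # host bridge, PCI-to-PCI bridge


def parse_listing(text):
    """Return a list of dicts, one per recognised line of the listing."""
    devices = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            devices.append(match.groupdict())
    return devices


def group_companions(text, address):
    """Addresses of non-bridge devices sharing an IOMMU group with `address`."""
    devices = parse_listing(text)
    groups = {d["group"] for d in devices if d["addr"] == address}
    return [
        d["addr"]
        for d in devices
        if d["group"] in groups
        and d["addr"] != address
        and d["cls"] not in BRIDGE_CLASSES
    ]
```

Fed the group 4 listing above and the address `0c:00.3`, this returns `0c:00.0` and `0c:00.2`, i.e. the instrumentation and encryption functions that sit alongside the USB controller.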

Now if I try passing the USB controller and that mysterious Encryption controller it actually starts booting Windows but then it hangs. And everything in that group is left in that same “unknown header type 7f” state and I have to reboot the whole machine to get them back. So that’s a failure too…and it seems unrelated to whether or not I pass through the GPU.

Ok, you don’t find the GPU under the Windows taskbar item where you can eject USB sticks etc? Disabling it in device manager might not do the required work. (but it’s weird that it changes the GPU’s state in a way visible from Linux). If you cannot eject it via the taskbar then I don’t think that the method works for primary passthrough, unfortunately.

Thanks for trying out the passing through of onboard USB controllers. So it seems that it is not possible. Luckily there should be enough PCIe lanes to support a discrete USB card alongside two GPUs. But depending on whether the x1 PCIe slots are grouped with the chipset we may have to put it in the third x16 slot on the typical TR4 motherboard, which in turn exhausts our expansion options… so it’s not optimal but seemingly tractable.

Re the mouse question, I see your point, then my tablet suggestion would probably not be relevant. I used a KVM-switch to avoid having several heads with the USB solution, but of course this implies a cost.

Wow, that’s…really a thing. I guess it never even occurred to me to try and eject a GPU before…

Still doesn’t help though.

I don’t remember the details but I realise that it might be that it only works if you eject it from a working state, before rebooting the VM the first time. Then you have to reboot from the emulated display, unless you schedule a reboot before ejecting. I don’t remember exactly whether it was also possible to do the eject maneuver after the card has failed to initialise.

A final fallback could be to use Seabios instead of OVMF, in case you are not already doing this, and try again. If this does not work then it might be a different problem with Vega compared to previous radeon cards.

Yes, ejecting a PCI device kind of creeped me out too the first times.

I am interested in the Unknown header type 7f
I do not have Threadripper but Ryzen and still have problems with it.
And it’s not only on Vega but also on the integrated audio.
Funny thing for Vega is it happens after the host is shutdown/rebooted.
The audio dies even before and is not visible to guest.

I had an R9 290 in the slot where I am now using Vega and did not have the problem with the PCI header.
So it may have something to do with infinity fabric protocol and pci reset?

If you are willing, I would like you to test the following:
Bind the audio and SATA (make sure it’s not in use) from the same IOMMU group to vfio-pci and pass the audio controller. Check lspci -v for errors.
And if you can reliably do the same with Vega, try passing the GPU only, without the audio. The audio needs to have snd_hda_intel loaded for it, not bound to vfio-pci.
I found that if I do this I do not get the Unknown header type 7f and I can restart/reboot the VM 1 time.
Same for audio when I split it using the ACS override patch. But there I can restart/reboot the VM any number of times.
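Since the test above depends on which host driver each function is bound to, here is a tiny Python sketch for checking that. It reads the `driver` symlink that sysfs keeps under each PCI device; the function name is made up for this example and the address in the usage line is a placeholder, not a device from this thread.

```python
import os


def current_driver(address, root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None.

    `address` includes the domain, e.g. '0000:0c:00.3'. None means the
    device has no driver bound (or does not exist under `root`).
    """
    link = os.path.join(root, address, "driver")
    if not os.path.islink(link):
        return None
    # The symlink points at .../bus/pci/drivers/<driver-name>.
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # Placeholder address; substitute the function you care about.
    print(current_driver("0000:00:00.0"))
```

For the test described above you would expect `vfio-pci` for the GPU function and `snd_hda_intel` for its audio function before starting the VM.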

But I see the following in the qemu log: vfio: Cannot reset device 0000:2c:00.0, depends on group 25 which is not owned.
No other problems observed.
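If anyone wants to scan their own qemu logs for that reset warning, a short Python sketch follows. The regex is built only from the one message quoted above, so other wordings or kernel versions may not match; the function name is invented for this example.

```python
import re

RESET_RE = re.compile(
    r"vfio: Cannot reset device "
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]), "
    r"depends on group (?P<group>\d+) which is not owned"
)


def find_reset_blockers(log_text):
    """Return (device_address, blocking_group) pairs found in a qemu log."""
    return [
        (match["device"], int(match["group"]))
        for match in RESET_RE.finditer(log_text)
    ]
```

The group number it reports is the one whose devices are still held by the host; looking that group up under /sys/kernel/iommu_groups shows which device would have to be handed to vfio-pci as well for the reset to go through.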