VFIO in 2019 -- Pop!_OS How-To (General Guide though) [DRAFT]

Ah, I hadn't seen that one… is what I'd like to say, but I just noticed that if I try to start the VM after getting my D3 error, I get the 127 error. However, I suspect this might just be due to bad housekeeping from the previous error. I'm not sure the patch would help in my case.

I've been trying to pass through a PCIe SATA/USB controller, on which sits the SSD for the VM installation below. Following your guide, I have some differences, as you can see.

I've done the identical steps, but the command differs a little for me due to Debian 10:

lsinitramfs /boot/initrd.img-4.19.0-5-amd64 | grep vfio

etc/modprobe.d/vfio_pci.conf
scripts/init-top/bind_vfio.sh
usr/lib/modules/4.19.0-5-amd64/kernel/drivers/vfio
usr/lib/modules/4.19.0-5-amd64/kernel/drivers/vfio/pci
usr/lib/modules/4.19.0-5-amd64/kernel/drivers/vfio/pci/vfio-pci.ko
usr/lib/modules/4.19.0-5-amd64/kernel/drivers/vfio/vfio.ko
usr/lib/modules/4.19.0-5-amd64/kernel/drivers/vfio/vfio_iommu_type1.ko
usr/lib/modules/4.19.0-5-amd64/kernel/drivers/vfio/vfio_virqfd.ko

As you can see, I don't have the mdev entry at all, but I have others you don't. Where/what is the problem here?
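For context, the bind_vfio.sh in that listing is the usual small initramfs-tools hook from these guides, roughly like this (a sketch, with the IDs being the ones from my modprobe config):

#!/bin/sh
# Standard initramfs-tools prerequisite boilerplate
PREREQ=""
prereqs() { echo "$PREREQ"; }
case "$1" in
    prereqs) prereqs; exit 0 ;;
esac

# Load vfio-pci early and claim the devices before any other driver can
modprobe vfio
modprobe vfio_iommu_type1
modprobe vfio-pci ids=1002:67df,1002:aaf0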

P.S. 1: Maybe the /etc/initramfs-tools/modules content is useful:
softdep amdgpu pre: vfio vfio_pci

vfio
vfio_iommu_type1
vfio_virqfd
options vfio_pci ids=1002:67df,1002:aaf0
vfio_pci ids=1002:67df,1002:aaf0
vfio_pci
amdgpu

These lines are left over from a guide I followed when trying to pass through an AMD GPU. This time I'm trying to pass through the controller!
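One thing worth noting about that file: after editing /etc/initramfs-tools/modules, the initramfs has to be regenerated, or the image checked above won't contain the changes:

sudo update-initramfs -u

Then the lsinitramfs | grep vfio check from before can confirm what actually made it in.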

P.S. 2: I don't know why, but lspci -nnv doesn't output as much info as yours in the guide. In particular, the SATA controller doesn't have the kernel driver and kernel module info. It looks like this:
05:00.0 USB controller [0c03]: Etron Technology, Inc. EJ168 USB 3.0 Host Controller [1b6f:7023] (rev 01) (prog-if 30 [XHCI])
Subsystem: Etron Technology, Inc. EJ168 USB 3.0 Host Controller [1b6f:7023]
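Should lspci show it if I ask for the driver info explicitly with -k? E.g. for that controller (slot address from the output above):

lspci -nnk -s 05:00.0

From what I can tell, the "Kernel driver in use:" and "Kernel modules:" lines only appear when a driver is actually bound, so maybe nothing is bound to it yet?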

What do you mean? Like it auto-switches without having to pass it through?

Well, you do pass it through, but you can just do it in the QEMU/KVM UI, the same way you would add your dedicated GPU (after assigning the vfio-pci driver to it, in the GPU's case). When the VM is off, all the devices connected to that PCIe USB card are available in the host OS.

As an example, if I had a keyboard and mouse connected to that USB card, I could use them in the host OS as well while the virtual machine is off; as soon as I start the virtual machine they are assigned to, they get detached from the host OS and passed to the VM.
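Under the hood, virt-manager just adds a managed PCI hostdev entry to the domain XML; managed='yes' is what makes libvirt detach the device from the host when the VM starts and hand it back on shutdown. Roughly like this (your bus/slot address will differ):

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
  </source>
</hostdev>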

Did you use the PCI ID method like in this guide, or did you follow other steps?

I have big issues even starting the VM installation: the whole PC freezes, and a hard shutdown with the power button is the only thing possible.

The USB card you literally add in the GUI, no need to mess with configs. For the GPU I originally followed Wendell's guide, which I updated with a bit more stuff: https://forum.level1techs.com/t/elementary-os-5-0-ubuntu-18-xx-vfio-pcie-passthrough-guide-tutorial/131420

So I've run into issues that are so bad they render my desktop unusable.

So far the behavior only seems to show up when starting/running/shutting down the VM. I still have to verify that by letting it go a little longer. It could be unrelated, but for the time being the VM is my current hunch.

Without warning, my 10-gig NIC will just take itself offline. It still reports that it's there but states no cables are connected. Nothing I do short of a system restart brings the ports back online. Almost every restart has the kernel showing me a heap of errors coming from the downed interfaces.

With the VM running in the background, if I'm just watching a video the whole system will randomly lock up/freeze.

Just 5 mins ago starting the VM was followed by a spontaneous system crash/black screen & restart. I didn’t touch anything.

So far all of this behavior has only been observed while doing something with the VM. I haven’t seen any of it during normal operation so I’m uncertain exactly what the culprit is.

Clone your VM, remove any passed-through PCI/USB devices from the clone, and run it to see if you get the same symptoms? This would at least determine whether it's a problem with IOMMU.
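Something like this should do it (VM names here are just examples):

virt-clone --original win10 --name win10-test --auto-clone

Then delete the PCI/USB host devices from the clone in virt-manager before booting it.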

Honestly, though, if you're getting errors at boot without even starting a VM (assuming you don't have it launch at boot?), odds are it's something else.

I don't have the VM start at boot. I've only observed these issues after manually booting the VM, and when shutting down or restarting the system after using the VM.

I do have one thing that's unrelated to the VM itself, though starting the VM may be aggravating it. My 5V rail is reporting around 4.5~4.6V. That's -10%; it should be within tolerance, but that's under no load in the BIOS. Start the VM and it may drop enough further to cause unexplainable errors. Interestingly, a cheap multimeter still reads 5V-ish, so I don't know what to trust. I'll have to plug in my old PSU and see if anything changes.

As for errors at system boot. Nope, none. Everything’s hunky dory till I start the VM.

Something I would like to do, if possible, is to actually pass through the port to the VM instead of using a network bridge. virt-manager shows the dual-port NIC as independent interfaces, but trying to pass through just one runs me into IOMMU issues, since both ports are in the same group. Being a server NIC, I really believe it's possible to split them in such a way as to let me pass through just the one port on the card.

I took that first quote to mean you were having problems with your nic even at boot up.

It's possible it might do so with SR-IOV… but that's something I've never played with. SR-IOV did start out as a means of sharing ports of multi-port NICs with VMs, though.
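From what I understand, on cards that support it you enable SR-IOV by writing the number of virtual functions you want into sysfs, something like (interface name is just an example):

echo 2 | sudo tee /sys/class/net/enp65s0f0/device/sriov_numvfs

Each VF then shows up as its own PCI device, normally in its own IOMMU group, which you could pass to the VM instead of the whole two-port card.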

I'll have to look into it if I can somehow stabilize the VM. Right now the behavior doesn't tell me the NIC is directly at fault; rather, its malfunction is the result of something else giving out.

So I got most things set up correctly; however, I run into an issue where, after I boot my VM, lspci -nnv shows my GPU's IOMMU group with "Unknown header type 7f."

Before that, it shows the group bound properly to vfio-pci. It also doesn't forward the PCI devices properly to the VM; they show up as generic "PCI Device" entries in Device Manager.

I was going through this and had everything working; then I installed Looking Glass and updated my system, and it stopped working.

I double-checked all the confirmation steps and everything is as it should be, yet the GPU is not binding to VFIO on boot like it did earlier. Any ideas why? I really want to get away from Windows completely, but there are a few games I still need Windows for.

I'll paste a pastebin here with what I have:
pastebin dot com/aZxVVK62

You can see that I went through all the steps a second time,

but my GPU will not bind!

Any help would be appreciated.
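In case it helps, these are the generic checks for seeing what actually grabbed the card at boot:

lspci -nnk | grep -A 3 VGA
dmesg | grep -i vfio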

So this combination of Looking Glass, VFIO, and QEMU has been working pretty well. It feels very nearly native (though I've had some issues with Linux crashing while starting the VM).

But that's not why I'm commenting right now. Keeping this as short as possible: I wanted to participate in BOINC. Unfortunately, BOINC will not register any GPUs with the default Ubuntu AMD driver. I managed to install the proper AMD GPU driver for Linux (19.30-855429). The GPU now registers with BOINC and is running like a dream (yay!)… but I knew this might happen: it "broke" the GPU I passed through to the VM:

Error starting domain: internal error: Failed to probe QEMU binary with QMP: /usr/bin/qemu-system-x86_64: symbol lookup error: /lib/x86_64-linux-gnu/libvirglrenderer.so.0: undefined symbol: drmPrimeHandleToFD


Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 75, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 111, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/libvirtobject.py", line 66, in newfn
    ret = fn(self, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/domain.py", line 1400, in startup
    self._backend.create()
  File "/usr/lib/python3/dist-packages/libvirt.py", line 1080, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirt.libvirtError: internal error: Failed to probe QEMU binary with QMP: /usr/bin/qemu-system-x86_64: symbol lookup error: /lib/x86_64-linux-gnu/libvirglrenderer.so.0: undefined symbol: drmPrimeHandleToFD
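My first guess is that the AMD driver install replaced that library, so I'd check which package owns it and reinstall it, something like this (libvirglrenderer0 is my guess at the Ubuntu package name; the dpkg -S output is the authoritative answer):

dpkg -S /lib/x86_64-linux-gnu/libvirglrenderer.so.0
sudo apt install --reinstall libvirglrenderer0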

I'm wondering: does anybody think that if I simply go over the pass-through instructions again, it would "flush" the driver that I think got partially installed on the card?

So I figured out the problem. What I don't know is how to fix it. What broke isn't the GPU pass-through; what broke is QEMU/KVM. It no longer recognizes that it's installed:

Warning: KVM is not available. This may mean the KVM package is not installed, or the KVM kernel modules are not loaded. Your virtual machines may perform poorly.

I tried uninstalling various KVM/QEMU packages and reinstalling them:

apt remove qemu-system qemu-kvm kvm
apt install qemu-system qemu-kvm kvm

And nothing has fixed it. Google isn't yielding results of anyone else having this issue (where uninstall/reinstall doesn't help), so the AMDGPU driver install seriously broke something, and I'm now 100% lost as to what's causing it to not work when it IS installed.
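For anyone else who hits this, the basic sanity checks for whether KVM is actually available seem to be:

lsmod | grep kvm          # kvm plus kvm_amd or kvm_intel should be loaded
ls -l /dev/kvm            # the device node should exist
kvm-ok                    # from the cpu-checker package

If the modules are loaded and /dev/kvm exists, the breakage is more likely on the QEMU/libvirt side than in KVM itself.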

I thought I'd share on this page a guide/tutorial I recently started working on for VFIO on Pop!_OS, using systemd-boot (the native bootloader on Pop!_OS). I have the wiki page set up for the VFIO attachment, but I'm not done with configuring VMs, Windows, and a few other things I want to share on it. I added credit on the README of the GitHub page to Wendell, as well as a few other guides that helped me out. https://github.com/rmayobre/PopOs-VFIO-Tutorial

I'm currently where @Rayxcer was: vfio_pci correctly binds to my 970, but the card does and shows nothing until you shut off the VM, leaving it stuck with an unknown header type and the VM launching with:

Domain: internal error: Unknown PCI header type '127'

(127 is just the decimal form of the "7f" header type mentioned above.)

My current hardware:
AMD Threadripper 1950X
ASRock X399 Taichi
Sapphire HD 7870 GHz Edition (Host GPU)
Gigabyte Gaming G1 GTX 970

My current BIOS version is P2.00, and AGESA is now at v1.0.0.4. Has the passthrough bug ever been fixed in later BIOS versions, or is it still worth downgrading or patching my current BIOS?

Can you post the full ls-iommu.sh output so we can see your IOMMU groups? Is IOMMU set to Enabled, or just Auto, in UEFI?

Any overclocks other than XMP?
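If you don't have ls-iommu.sh handy, the version floating around these guides is roughly this:

#!/bin/bash
# Print "IOMMU Group <n>" plus the lspci line for every device
for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=${d#*/iommu_groups/*}; n=${n%%/*}
    printf 'IOMMU Group %s ' "$n"
    lspci -nns "${d##*/}"
done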

Full ls-iommu.sh dump is here, but these are the relevant snippets:

IOMMU Group 14 09:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 970] [10de:13c2] (rev ff)
IOMMU Group 14 09:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev ff)

IOMMU Group 22 41:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Pitcairn XT [Radeon HD 7870 GHz Edition] [1002:6818]
IOMMU Group 22 41:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series] [1002:aab0]

IOMMU, Virtualization and SR-IOV (even though the 970 doesn’t support it) are all explicitly set to enabled, yes. Booted in UEFI mode too.

No overclocks but XMP yes, 3200.

Passing through both devices in group 14? Can you try a different PCIe slot, just for experimentation?