TRX40 Nvidia Single GPU Passthrough

Sorry, deleted previous post due to editing issue.

Longtime Linux user here, but with zero experience with passthrough. Stuck inside in quarantine, I decided to try setting up single-GPU VFIO passthrough on my headless Threadripper 3960X development workstation. I read through the Arch wiki guide to PCI passthrough via OVMF, but I’m unable to get my guest working and seem to be hitting some sort of error on guest reboot.

System:
3960x
TRX40 AORUS XTREME
NVIDIA GeForce RTX 2070 Super

My GRUB config contains the following:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=on iommu=pt vfio-pci.ids=10de:1e84,10de:10f8,10de:1ad8,10de:1ad9 video=efifb:off,vesafb:off"
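
For anyone following along, here is a minimal sketch for checking that the IDs you expect actually made it onto the running kernel's command line. The function names `parse_vfio_ids` and `missing_ids` are made up for illustration; the only assumption about the system is that the command line is readable from /proc/cmdline.

```python
from typing import List


def parse_vfio_ids(cmdline: str) -> List[str]:
    """Return the vendor:device IDs listed in a vfio-pci.ids= parameter."""
    for token in cmdline.split():
        key, _, value = token.partition("=")
        # The kernel treats '-' and '_' in parameter names as equivalent.
        if key.replace("-", "_") == "vfio_pci.ids":
            return [v.lower() for v in value.split(",") if v]
    return []


def missing_ids(cmdline: str, expected: List[str]) -> List[str]:
    """Return the expected IDs that are not present on the command line."""
    present = set(parse_vfio_ids(cmdline))
    return [e for e in expected if e.lower() not in present]
```

On a live host you would call it as `parse_vfio_ids(open("/proc/cmdline").read())` and compare the result against the IDs that `lspci -nn` reports for the GPU's functions.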

nouveau and i2c_nvidia_gpu are blacklisted in a file in /etc/modprobe.d

When I boot my Windows 10 guest for the first time since booting my host, I get the dreaded Nvidia error 43.

As soon as I shut down or reboot the Windows guest to try troubleshooting, booting the guest again just gives me a black screen on the monitor connected to the GPU until I fully reboot the host. Once, seemingly at random, I got a second boot to work by hard resetting rather than rebooting the guest, but I couldn’t reproduce that, and the next attempt went back to the black screen (until rebooting the host). I’m puzzled, as everything I’ve read about reset bugs applies to AMD GPUs, not Nvidia.

Here are the contents of my libvirt xml config file:

<features>
  <acpi/>
  <apic/>
  <hyperv>
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='8191'/>
    <vendor_id state='on' value='whatever'/>
  </hyperv>
  <kvm>
    <hidden state='on'/>
  </kvm>
  <vmport state='off'/>
  <ioapic driver='kvm'/>
</features>
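
The two settings in that block usually cited for hiding the hypervisor from the Nvidia driver are the KVM `hidden` state and the Hyper-V `vendor_id`. Here is a rough sketch that reads them back out of a `<features>` fragment or a full domain XML dump (e.g. from `virsh dumpxml`), so you can confirm they survived editing. The function name `passthrough_flags` is hypothetical.

```python
import xml.etree.ElementTree as ET


def passthrough_flags(xml_text: str) -> dict:
    """Report the hypervisor-hiding settings found in libvirt XML."""
    root = ET.fromstring(xml_text)
    hidden = root.find(".//kvm/hidden")
    vendor = root.find(".//hyperv/vendor_id")
    vendor_on = vendor is not None and vendor.get("state") == "on"
    return {
        # True only if <kvm><hidden state='on'/></kvm> is present.
        "kvm_hidden": hidden is not None and hidden.get("state") == "on",
        # The vendor_id string if enabled, otherwise None.
        "vendor_id": vendor.get("value") if vendor_on else None,
    }
```

For the config above this should report `kvm_hidden` as true and a `vendor_id` of `whatever`; it only checks that the elements are present, not that they cure error 43.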

Any help would be greatly appreciated!

Because you are passing through the boot GPU, you should also be passing the vBIOS for that GPU. Try doing that.

If all else fails, run journalctl -f in a terminal, boot the VM, wait one minute, and post that log here.

Okay, so I read out my vBIOS and provided it in my libvirt config, and that fixed error 43. However, I’m still unable to start my guest again after shutting it down once until I reboot the host entirely. Whenever I boot the guest a second time, my monitor lights up but only shows black.
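
If you want to sanity-check a dumped vBIOS before pointing libvirt at it, here is a small sketch. It relies only on the standard PCI expansion ROM layout: a 0x55 0xAA signature at offset 0 and, at offset 0x18, a little-endian pointer to a "PCIR" data structure that carries the vendor and device IDs. The function name `check_option_rom` is made up, and passing this check does not guarantee the image works for passthrough.

```python
import struct


def check_option_rom(data: bytes) -> str:
    """Return 'ok vvvv:dddd' for a plausible PCI option ROM, else a reason."""
    if len(data) < 0x1A or data[0:2] != b"\x55\xaa":
        return "no 0x55AA signature at offset 0"
    # Offset 0x18 holds a 16-bit pointer to the PCI data structure.
    (pcir,) = struct.unpack_from("<H", data, 0x18)
    if len(data) < pcir + 8 or data[pcir:pcir + 4] != b"PCIR":
        return "missing PCIR data structure"
    # Vendor and device IDs follow the 4-byte "PCIR" signature.
    vendor, device = struct.unpack_from("<HH", data, pcir + 4)
    return "ok {:04x}:{:04x}".format(vendor, device)
```

Usage would be something like `check_option_rom(open("vbios.rom", "rb").read())`, where `vbios.rom` is a placeholder for wherever you saved the dump. If the signature is missing at offset 0, the dump may have extra header bytes in front of the actual ROM image.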

Here’s the output of journalctl -f:

May 27 00:49:26  audit[4111]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=4111 comm="apparmor_parser"
May 27 00:49:26  kernel: audit: type=1400 audit(1590540566.535:104): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=4111 comm="apparmor_parser"
May 27 00:49:26  kernel: xhci_hcd 0000:45:00.0: remove, state 4
May 27 00:49:26  kernel: usb usb6: USB disconnect, device number 1
May 27 00:49:26  kernel: usb 6-1: USB disconnect, device number 2
May 27 00:49:26  kernel: xhci_hcd 0000:45:00.0: USB bus 6 deregistered
May 27 00:49:26  kernel: xhci_hcd 0000:45:00.0: remove, state 1
May 27 00:49:26  kernel: usb usb5: USB disconnect, device number 1
May 27 00:49:26  kernel: usb 5-1: USB disconnect, device number 2
May 27 00:49:26  kernel: usb 5-1.3: USB disconnect, device number 3
May 27 00:49:26  kernel: usb 5-1.4: USB disconnect, device number 4
May 27 00:49:27  kernel: usb 5-1.5: USB disconnect, device number 5
May 27 00:49:27  kernel: xhci_hcd 0000:45:00.0: USB bus 5 deregistered
May 27 00:49:27  audit[4119]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=4119 comm="apparmor_parser"
May 27 00:49:27  kernel: audit: type=1400 audit(1590540567.307:105): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=4119 comm="apparmor_parser"
May 27 00:49:27  audit[4123]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=4123 comm="apparmor_parser"
May 27 00:49:27  kernel: audit: type=1400 audit(1590540567.431:106): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=4123 comm="apparmor_parser"
May 27 00:49:27  audit[4127]: AVC apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=4127 comm="apparmor_parser"
May 27 00:49:27  kernel: audit: type=1400 audit(1590540567.559:107): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=4127 comm="apparmor_parser"
May 27 00:49:27  systemd-networkd[1508]: vnet0: Link UP
May 27 00:49:27  systemd-networkd[1508]: vnet0: Gained carrier
May 27 00:49:27  networkd-dispatcher[1570]: WARNING:Unknown index 15 seen, reloading interface list
May 27 00:49:27  kernel: virbr0: port 2(vnet0) entered blocking state
May 27 00:49:27  kernel: virbr0: port 2(vnet0) entered disabled state
May 27 00:49:27  kernel: device vnet0 entered promiscuous mode
May 27 00:49:27  kernel: virbr0: port 2(vnet0) entered blocking state
May 27 00:49:27  kernel: virbr0: port 2(vnet0) entered listening state
May 27 00:49:27  systemd-udevd[4113]: Using default interface naming scheme 'v245'.
May 27 00:49:27  dnsmasq[1892]: reading /etc/resolv.conf
May 27 00:49:27  systemd-udevd[4113]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
May 27 00:49:27  dnsmasq[1892]: using nameserver 127.0.0.53#53
May 27 00:49:27  dnsmasq[1892]: reading /etc/resolv.conf
May 27 00:49:27  dnsmasq[1892]: using nameserver 127.0.0.53#53
May 27 00:49:27  dnsmasq[1892]: reading /etc/resolv.conf
May 27 00:49:27  dnsmasq[1892]: using nameserver 127.0.0.53#53
May 27 00:49:27  audit[4136]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=4136 comm="apparmor_parser"
May 27 00:49:27  kernel: audit: type=1400 audit(1590540567.683:108): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=4136 comm="apparmor_parser"
May 27 00:49:27  libvirtd[3710]: Domain id=3 name='win10' uuid=c6ed8f05-60af-4c8d-a314-f11dbecfc789 is tainted: host-cpu
May 27 00:49:27  systemd-machined[1599]: New machine qemu-3-win10.
May 27 00:49:27  systemd[1]: Started Virtual Machine qemu-3-win10.
May 27 00:49:28  kernel: vfio-pci 0000:21:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
May 27 00:49:28  kernel: vfio-pci 0000:21:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
May 27 00:49:29  systemd-networkd[1508]: vnet0: Gained IPv6LL
May 27 00:49:29  dnsmasq[1892]: reading /etc/resolv.conf
May 27 00:49:29  dnsmasq[1892]: using nameserver 127.0.0.53#53
May 27 00:49:29  kernel: virbr0: port 2(vnet0) entered learning state
May 27 00:49:29  kernel: vfio-pci 0000:45:00.0: vfio_ecap_init: hiding ecap 0x19@0x200
May 27 00:49:29  kernel: vfio-pci 0000:45:00.0: vfio_ecap_init: hiding ecap 0x1e@0x400
May 27 00:49:31  kernel: virbr0: port 2(vnet0) entered forwarding state
May 27 00:49:31  kernel: virbr0: topology change detected, propagating
May 27 00:49:31  systemd-networkd[1508]: virbr0: Gained carrier
May 27 00:49:31  dnsmasq[1892]: reading /etc/resolv.conf
May 27 00:49:31  dnsmasq[1892]: using nameserver 127.0.0.53#53

More suspiciously, here’s the output of journalctl -f covering the first guest boot after a host reboot (the one that works) through to shutting the guest down. Note the PCI errors at the end.

May 27 01:07:21 audit[2209]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=2209 comm="apparmor_parser"
May 27 01:07:21 kernel: kauditd_printk_skb: 21 callbacks suppressed
May 27 01:07:21 kernel: audit: type=1400 audit(1590541641.523:33): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=2209 comm="apparmor_parser"
May 27 01:07:21 kernel: xhci_hcd 0000:45:00.0: remove, state 4
May 27 01:07:21 kernel: usb usb6: USB disconnect, device number 1
May 27 01:07:21 kernel: usb 6-1: USB disconnect, device number 2
May 27 01:07:21 kernel: xhci_hcd 0000:45:00.0: USB bus 6 deregistered
May 27 01:07:21 kernel: xhci_hcd 0000:45:00.0: remove, state 1
May 27 01:07:21 kernel: usb usb5: USB disconnect, device number 1
May 27 01:07:21 kernel: usb 5-1: USB disconnect, device number 2
May 27 01:07:21 kernel: usb 5-1.3: USB disconnect, device number 3
May 27 01:07:21 kernel: usb 5-1.4: USB disconnect, device number 4
May 27 01:07:22 kernel: usb 5-1.5: USB disconnect, device number 5
May 27 01:07:22 kernel: xhci_hcd 0000:45:00.0: USB bus 5 deregistered
May 27 01:07:22 audit[2217]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=2217 comm="apparmor_parser"
May 27 01:07:22 kernel: audit: type=1400 audit(1590541642.291:34): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=2217 comm="apparmor_parser"
May 27 01:07:22 audit[2221]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=2221 comm="apparmor_parser"
May 27 01:07:22 kernel: audit: type=1400 audit(1590541642.419:35): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=2221 comm="apparmor_parser"
May 27 01:07:22 kernel: audit: type=1400 audit(1590541642.543:36): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=2225 comm="apparmor_parser"
May 27 01:07:22 audit[2225]: AVC apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=2225 comm="apparmor_parser"
May 27 01:07:22 systemd[1]: Started Virtual machine log manager.
May 27 01:07:22 systemd-udevd[2229]: Using default interface naming scheme 'v245'.
May 27 01:07:22 systemd-udevd[2229]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
May 27 01:07:22 networkd-dispatcher[1544]: WARNING:Unknown index 7 seen, reloading interface list
May 27 01:07:22 systemd-networkd[1491]: vnet0: Link UP
May 27 01:07:22 dnsmasq[1843]: reading /etc/resolv.conf
May 27 01:07:22 kernel: virbr0: port 2(vnet0) entered blocking state
May 27 01:07:22 kernel: virbr0: port 2(vnet0) entered disabled state
May 27 01:07:22 kernel: device vnet0 entered promiscuous mode
May 27 01:07:22 kernel: virbr0: port 2(vnet0) entered blocking state
May 27 01:07:22 kernel: virbr0: port 2(vnet0) entered listening state
May 27 01:07:22 systemd-networkd[1491]: vnet0: Gained carrier
May 27 01:07:22 dnsmasq[1843]: using nameserver 127.0.0.53#53
May 27 01:07:22 dnsmasq[1843]: reading /etc/resolv.conf
May 27 01:07:22 dnsmasq[1843]: using nameserver 127.0.0.53#53
May 27 01:07:22 dnsmasq[1843]: reading /etc/resolv.conf
May 27 01:07:22 dnsmasq[1843]: using nameserver 127.0.0.53#53
May 27 01:07:22 dnsmasq[1843]: reading /etc/resolv.conf
May 27 01:07:22 dnsmasq[1843]: using nameserver 127.0.0.53#53
May 27 01:07:22 dnsmasq[1843]: reading /etc/resolv.conf
May 27 01:07:22 dnsmasq[1843]: using nameserver 127.0.0.53#53
May 27 01:07:22 dnsmasq[1843]: reading /etc/resolv.conf
May 27 01:07:22 dnsmasq[1843]: using nameserver 127.0.0.53#53
May 27 01:07:22 audit[2239]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=2239 comm="apparmor_parser"
May 27 01:07:22 kernel: audit: type=1400 audit(1590541642.682:37): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c6ed8f05-60af-4c8d-a314-f11dbecfc789" pid=2239 comm="apparmor_parser"
May 27 01:07:22 libvirtd[1712]: Domain id=1 name='win10' uuid=c6ed8f05-60af-4c8d-a314-f11dbecfc789 is tainted: host-cpu
May 27 01:07:22 systemd-machined[1560]: New machine qemu-1-win10.
May 27 01:07:22 systemd[1]: Started Virtual Machine qemu-1-win10.
May 27 01:07:23 kernel: vfio-pci 0000:21:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
May 27 01:07:23 kernel: vfio-pci 0000:21:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
May 27 01:07:23 kernel: vfio-pci 0000:21:00.1: enabling device (0000 -> 0002)
May 27 01:07:23 kernel: vfio-pci 0000:21:00.3: enabling device (0000 -> 0002)
May 27 01:07:24 systemd-networkd[1491]: vnet0: Gained IPv6LL
May 27 01:07:24 dnsmasq[1843]: reading /etc/resolv.conf
May 27 01:07:24 dnsmasq[1843]: using nameserver 127.0.0.53#53
May 27 01:07:24 kernel: virbr0: port 2(vnet0) entered learning state
May 27 01:07:24 kernel: vfio-pci 0000:45:00.0: vfio_ecap_init: hiding ecap 0x19@0x200
May 27 01:07:24 kernel: vfio-pci 0000:45:00.0: vfio_ecap_init: hiding ecap 0x1e@0x400
May 27 01:07:26 kernel: virbr0: port 2(vnet0) entered forwarding state
May 27 01:07:26 kernel: virbr0: topology change detected, propagating
May 27 01:07:26 systemd-networkd[1491]: virbr0: Gained carrier
May 27 01:07:26 dnsmasq[1843]: reading /etc/resolv.conf
May 27 01:07:26 dnsmasq[1843]: using nameserver 127.0.0.53#53
May 27 01:07:46 dnsmasq-dhcp[1843]: DHCPREQUEST(virbr0) 192.168.122.3 52:54:00:ed:e2:4f
May 27 01:07:46 dnsmasq-dhcp[1843]: DHCPACK(virbr0) 192.168.122.3 52:54:00:ed:e2:4f DESKTOP-S18UO7M
May 27 01:09:08 systemd-networkd[1491]: vnet0: Link DOWN
May 27 01:09:08 systemd-networkd[1491]: vnet0: Lost carrier
May 27 01:09:08 kernel: virbr0: port 2(vnet0) entered disabled state
May 27 01:09:08 kernel: device vnet0 left promiscuous mode
May 27 01:09:08 kernel: virbr0: port 2(vnet0) entered disabled state
May 27 01:09:08 systemd-networkd[1491]: rtnl: received neighbor for link '7' we don't know about, ignoring.
May 27 01:09:08 systemd-networkd[1491]: rtnl: received neighbor for link '7' we don't know about, ignoring.
May 27 01:09:08 systemd-networkd[1491]: virbr0: Lost carrier
May 27 01:09:08 dnsmasq[1843]: reading /etc/resolv.conf
May 27 01:09:08 dnsmasq[1843]: using nameserver 127.0.0.53#53
May 27 01:09:08 dnsmasq[1843]: reading /etc/resolv.conf
May 27 01:09:08 dnsmasq[1843]: using nameserver 127.0.0.53#53
May 27 01:09:08 libvirtd[1712]: internal error: End of file from qemu monitor
May 27 01:09:09 kernel: pcieport 0000:20:03.1: DPC: containment event, status:0x1f01 source:0x0000
May 27 01:09:09 kernel: pcieport 0000:20:03.1: DPC: unmasked uncorrectable error detected
May 27 01:09:09 libvirtd[1712]: internal error: Unknown PCI header type '127' for device '0000:21:00.0'
May 27 01:09:09 libvirtd[1712]: internal error: Unknown PCI header type '127' for device '0000:21:00.1'
May 27 01:09:09 libvirtd[1712]: internal error: Unknown PCI header type '127' for device '0000:21:00.2'
May 27 01:09:09 libvirtd[1712]: internal error: Unknown PCI header type '127' for device '0000:21:00.3'
May 27 01:09:10 kernel: pcieport 0000:20:03.1: AER: Device recovery successful
May 27 01:09:10 systemd[1]: machine-qemu\x2d1\x2dwin10.scope: Succeeded.
May 27 01:09:10 systemd-machined[1560]: Machine qemu-1-win10 terminated.
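
To pick the interesting lines out of a capture like this without reading through all the apparmor and dnsmasq noise, here is a rough sketch. The patterns are written against the exact messages in the log above and are not an exhaustive list of PCIe error messages; `scan_journal` is a made-up name. As far as I understand it, header type 127 (0x7f) is what you get when config space reads back as all ones, which usually means the device has stopped responding on the bus.

```python
import re

PATTERNS = {
    # pcieport <bridge>: DPC: containment event, ...
    "dpc": re.compile(r"pcieport (\S+): DPC: containment event"),
    # libvirtd: Unknown PCI header type '127' for device '<bdf>'
    "dead_config_space": re.compile(
        r"Unknown PCI header type \W?127\W? for device \W?([0-9a-f:.]+)"
    ),
}


def scan_journal(lines):
    """Collect PCI addresses mentioned in DPC and header-type-127 messages."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits[name].append(match.group(1))
    return hits
```

You could feed it the lines of a saved `journalctl` capture. On the log above it should point at the bridge 0000:20:03.1 for the DPC event and at all four 0000:21:00.x functions for the unreadable config space.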

So, what exactly lives at address 20:03.1? That isn’t the GPU, since your GPU is at 21:00.x. Have you checked that your GPU’s IOMMU group contains only the GPU? Try removing all extra devices, including USB devices, from the VM config, leaving just the GPU and the bare minimum required for a VM. The objective right now is to boot the VM, get the TianoCore logo, and reach the Windows login screen at least two times in a row.

PS. The reason I suggest removing USB devices is that I was passing through a Das Keyboard, which has a USB hub in it, to my VM. That caused the screen to stay black during boot until it magically appeared at the login screen: no TianoCore boot logo, no spinning wheel. Later that keyboard would freeze my VM. Passing through a much simpler keyboard solved all of that.

Edit: I just remembered briefly glancing over a Threadripper black screen thread before. Here it is, and it seems like the issue you are having.
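
For the IOMMU group check, here is a minimal sketch that walks /sys/kernel/iommu_groups, where each group is a numbered directory with a `devices` subdirectory containing one entry per PCI address. The names `iommu_groups` and `group_mates` are made up for illustration, and the `root` parameter is only there so the function can be pointed at a test directory.

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map IOMMU group number -> sorted list of PCI addresses in that group."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU disabled, or not a Linux host
    for name in os.listdir(root):
        devices = os.path.join(root, name, "devices")
        if name.isdigit() and os.path.isdir(devices):
            groups[int(name)] = sorted(os.listdir(devices))
    return groups


def group_mates(bdf, groups):
    """Return the other devices sharing an IOMMU group with bdf."""
    for members in groups.values():
        if bdf in members:
            return [m for m in members if m != bdf]
    return []
```

Calling `group_mates("0000:21:00.0", iommu_groups())` should then list only the GPU's own sibling functions (21:00.1 and so on) if the group is clean.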

Wow, thanks for the pointer!

That PCI device (20:03.1) isn’t even one that’s being passed through!

It is this:
20:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge

But I was able to fix this by no longer passing through the Nvidia card’s USB port (passing through the ASMedia hub my motherboard supports was fine). What an odd bug.

Reading through that thread, it sounds to me like a motherboard BIOS issue; not sure if it’s similar to the one AGESA had with Ryzen CPUs a while back. Are you using the latest BIOS?
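
On why a bridge that isn't passed through shows up in the errors: presumably 20:03.1 is the port the GPU hangs off, so a fault on the link gets reported there. Here is a small sketch for confirming that from sysfs, relying on the fact that each entry in /sys/bus/pci/devices is a symlink whose resolved path nests a device under its parent bridge. `upstream_bridge` is a made-up name, and `sysfs_root` is a parameter only so it can be tested against a fake tree.

```python
import os


def upstream_bridge(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI address of the bridge directly above bdf, or None."""
    link = os.path.join(sysfs_root, bdf)
    if not os.path.exists(link):
        return None
    # e.g. .../pci0000:20/0000:20:03.1/0000:21:00.0 -> parent is 0000:20:03.1
    parent = os.path.basename(os.path.dirname(os.path.realpath(link)))
    # Devices on a root bus sit directly under a "pciDDDD:BB" directory.
    return None if parent.startswith("pci") else parent
```

If `upstream_bridge("0000:21:00.0")` comes back as `0000:20:03.1` on this board, that would match the DPC messages in the log.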

Hello

I have the same motherboard (Gigabyte TRX40 Designare) and I’m trying to pass through a GPU to a Windows 10 virtual machine.

I have the same problem, with a DPC error on the 20:03.1 device.

How did you deal with it?

Thanks

Try not passing through the USB port on your GPU if you have one. TRX40 has an issue with certain types of USB passthrough unless you apply a kernel patch or run with PCI ASPM off (see the thread linked to above).
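
If you go the ASPM route, here is a rough sketch for checking what the host is currently doing. It assumes "ASPM off" means booting with the `pcie_aspm=off` kernel parameter, and that the active policy is the bracketed word in /sys/module/pcie_aspm/parameters/policy; both function names are made up.

```python
import re


def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """True if pcie_aspm=off appears as a kernel command-line token."""
    return "pcie_aspm=off" in cmdline.split()


def active_aspm_policy(policy_text: str):
    """Extract the bracketed policy, e.g. 'default' from '[default] performance ...'."""
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None
```

On a live system the inputs would come from /proc/cmdline and the policy file respectively. This only tells you what is configured, not whether it fixes the DPC errors.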

Hello and thank you for your answer

I don’t have a USB port on my graphics card… It’s an AMD RX 480 I’m trying to pass through… I only have the GPU (21:00.0) and the HDMI sound card (21:00.1) in the IOMMU group… nothing more.

Didn’t think this problem was limited to Nvidia cards?

Oh, well the RX480 has a reset issue so if you’re seeing this after guest reboots that’s probably the problem. If you search for ‘polaris reset bug’ you should be able to find some resources. In my opinion, the TL;DR is that you shouldn’t bother trying to use AMD GPUs for passthrough.