TRX40 Related - OVMF/QEMU Failing to boot - Pauses / Black Screen when starting VM

Hi all, I’m pretty sure I’ve taken all the steps I can before asking for help and I’m now in need of it. I think I’ve included all the needed details, but I started with the QEMU log as it has some error output.

The problem… When I first start up my host and then start the VM in virt-manager it runs for a second then immediately pauses. Powering the VM off then starting it back up just leaves it at a black screen, but no pause. No output via Spice or the GPU. It seems like OVMF is getting hung up trying to initialize the GPU.

If I remove the GPU passthrough from the VM it starts up as expected - TianoCore into Windows and complete control via spice.

I have tried passing a vBIOS ROM file from TechPowerUp to the VM, but got the same result.

I think I’ve included every detail below, but please let me know if I need to add something else.

qemu log (maybe the juicy part):

-chardev spicevmc,id=charredir1,name=usbredir \
-device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 \
-device vfio-pci,host=0000:4b:00.0,id=hostdev0,bus=pci.5,addr=0x0,romfile=/usr/share/vgabios/EVGA.RTX2080.8192.181022.rom \
-device vfio-pci,host=0000:4b:00.1,id=hostdev1,bus=pci.6,addr=0x0 \
-device vfio-pci,host=0000:4b:00.2,id=hostdev2,bus=pci.7,addr=0x0 \
-device vfio-pci,host=0000:4b:00.3,id=hostdev3,bus=pci.8,addr=0x0 \
-device virtio-balloon-pci,id=balloon0,bus=pci.4,addr=0x0 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2020-02-12 13:55:58.198+0000: Domain id=2 is tainted: host-cpu
char device redirected to /dev/pts/3 (label charserial0)
2020-02-12T13:56:04.361898Z qemu-system-x86_64: vfio_err_notifier_handler(0000:4b:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
2020-02-12T13:56:04.362794Z qemu-system-x86_64: vfio_err_notifier_handler(0000:4b:00.3) Unrecoverable error detected. Please collect any data possible and then kill the guest
2020-02-12T13:56:04.362835Z qemu-system-x86_64: vfio_err_notifier_handler(0000:4b:00.2) Unrecoverable error detected. Please collect any data possible and then kill the guest
2020-02-12T13:56:04.362878Z qemu-system-x86_64: vfio_err_notifier_handler(0000:4b:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest
2020-02-12T13:56:06.052811Z qemu-system-x86_64: vfio_err_notifier_handler(0000:4b:00.3) Unrecoverable error detected. Please collect any data possible and then kill the guest
2020-02-12T13:56:06.052846Z qemu-system-x86_64: vfio_err_notifier_handler(0000:4b:00.2) Unrecoverable error detected. Please collect any data possible and then kill the guest
2020-02-12T13:56:06.052852Z qemu-system-x86_64: vfio_err_notifier_handler(0000:4b:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest
2020-02-12T13:56:06.052857Z qemu-system-x86_64: vfio_err_notifier_handler(0000:4b:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
2020-02-12T14:17:39.917346Z qemu-system-x86_64: terminating on signal 15 from pid 1428 (/usr/bin/libvirtd)
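
To see at a glance which passed-through functions raised the error, the log can be summarised per PCI address. This is only a rough sketch: `count_vfio_errors` is a made-up helper name, and the log path assumes libvirt's usual per-domain location for a domain called win10.

```shell
# count_vfio_errors (hypothetical helper): read a QEMU log on stdin and print
# how many "Unrecoverable error" notifications each PCI function raised.
count_vfio_errors() {
    sed -n 's/.*vfio_err_notifier_handler(\([^)]*\)) Unrecoverable error.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Assumed default libvirt log path for the "win10" domain; adjust as needed.
log=/var/log/libvirt/qemu/win10.log
if [ -r "$log" ]; then
    count_vfio_errors < "$log"
fi
```

Run against the excerpt above, this would report two notifications for each of the four functions, which suggests the fault hits the whole device rather than a single function.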

Relevant hardware:

Gigabyte AORUS XTREME TRX40 - BIOS F2 AGESA 1.0.0.1 and F4A AGESA 1.0.0.3
Host GPU: XFX RX580
Guest GPU: EVGA GeForce RTX 2080 XC

The RX580 is in the first slot and the RTX 2080 is in the third slot; both are PCI-E 4.0 x16 slots. BIOS has SVM and IOMMU enabled. CSM is disabled (I tried it enabled as well, with the same effect).

Software Details:

Manjaro KDE
Kernel version 5.4.18-1
QEMU version 4.2.0-1
libvirt version 5.10.0-2
ovmf version 1:r26976.bd85bf54c2-1
virt-manager version 2.2.1-2

lspci -nnk for 2080:

4b:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] [10de:1e87] (rev a1)
        Subsystem: eVga.com. Corp. TU104 [GeForce RTX 2080 Rev. A] [3842:2182]
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau
4b:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f8] (rev a1)
        Subsystem: eVga.com. Corp. Device [3842:2182]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
4b:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad8] (rev a1)
        Subsystem: eVga.com. Corp. Device [3842:2182]
        Kernel driver in use: vfio-pci
        Kernel modules: xhci_pci
4b:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad9] (rev a1)
        Subsystem: eVga.com. Corp. Device [3842:2182]
        Kernel driver in use: vfio-pci
        Kernel modules: i2c_nvidia_gpu

IOMMU Group:

IOMMU Group 60:
        4b:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] [10de:1e87] (rev a1)
        4b:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f8] (rev a1)
        4b:00.2 USB controller [0c03]: NVIDIA Corporation Device [10de:1ad8] (rev a1)
        4b:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device [10de:1ad9] (rev a1)
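
For anyone wanting to produce a similar grouping on their own host, here is a minimal sketch that walks sysfs. The helper name `list_iommu_groups` is made up for illustration, and it prints only PCI addresses (no lspci descriptions) to stay dependency-free.

```shell
# list_iommu_groups (hypothetical helper): print one line per device with its
# IOMMU group. The optional argument overrides the sysfs root, which is handy
# for trying the function against a fake directory tree.
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local dev
    local group
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # glob did not match: no groups exposed
        group="${dev%/devices/*}"      # strip "/devices/<address>"
        group="${group##*/}"           # keep only the group number
        printf 'IOMMU Group %s: %s\n' "$group" "${dev##*/}"
    done | sort -k3,3n                 # order numerically by group number
}

list_iommu_groups
```

If nothing is printed, the IOMMU is probably not enabled (check the BIOS setting and the amd_iommu=on kernel parameter).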

modprobe:

blacklist nouveau
options nouveau modeset=0
blacklist nvidia
options nvidia modeset=0
options vfio-pci ids=10de:1e87,10de:10f8,10de:1ad8,10de:1ad9

GRUB / Kernel Parameters:

apparmor=1 security=apparmor udev.log_priority=3 amd_iommu=on iommu=pt video=efifb:off

dmesg for VFIO:

[    4.142913] VFIO - User Level meta-driver version: 0.3
[    4.147677] vfio-pci 0000:4b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[    4.163403] vfio_pci: add [10de:1e87[ffffffff:ffffffff]] class 0x000000/00000000
[    4.180041] vfio_pci: add [10de:10f8[ffffffff:ffffffff]] class 0x000000/00000000
[    4.196707] vfio_pci: add [10de:1ad8[ffffffff:ffffffff]] class 0x000000/00000000
[    4.213497] vfio_pci: add [10de:1ad9[ffffffff:ffffffff]] class 0x000000/00000000
[   11.226451] vfio-pci 0000:4b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[  233.596604] vfio-pci 0000:4b:00.0: enabling device (0000 -> 0003)
[  233.703453] vfio-pci 0000:4b:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[  233.703479] vfio-pci 0000:4b:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[  233.726602] vfio-pci 0000:4b:00.1: enabling device (0000 -> 0002)
[  233.763274] vfio-pci 0000:4b:00.3: enabling device (0000 -> 0002)

Virtual Machine Basics:

Windows 10
KVM hypervisor
Chipset Q35
Firmware OVMF
SATA disk by-id

Virtual Machine XML:

<domain type="kvm">
  <name>win10</name>
  <uuid>2a213354-e82e-49b1-8342-b860db8b03f9</uuid>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://microsoft.com/win/10"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit="KiB">16384000</memory>
  <currentMemory unit="KiB">16384000</currentMemory>
  <vcpu placement="static">16</vcpu>
  <os>
    <type arch="x86_64" machine="pc-q35-4.2">hvm</type>
    <loader readonly="yes" type="pflash">/usr/share/ovmf/x64/OVMF_CODE.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/win10_VARS.fd</nvram>
    <boot dev="hd"/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state="on"/>
      <vapic state="on"/>
      <spinlocks state="on" retries="8191"/>
      <vendor_id state="on" value="123456"/>
    </hyperv>
    <kvm>
      <hidden state="on"/>
    </kvm>
    <vmport state="off"/>
    <ioapic driver="kvm"/>
  </features>
  <cpu mode="host-passthrough" check="none">
    <topology sockets="1" cores="16" threads="1"/>
  </cpu>
  <clock offset="localtime">
    <timer name="rtc" tickpolicy="catchup"/>
    <timer name="pit" tickpolicy="delay"/>
    <timer name="hpet" present="no"/>
    <timer name="hypervclock" present="yes"/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled="no"/>
    <suspend-to-disk enabled="no"/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type="block" device="disk">
      <driver name="qemu" type="raw" cache="none" io="native"/>
      <source dev="/dev/disk/by-id/ata-CT500MX500SSD1_1803E10B4260"/>
      <target dev="sda" bus="sata"/>
      <address type="drive" controller="0" bus="0" target="0" unit="0"/>
    </disk>
    <disk type="file" device="cdrom">
      <driver name="qemu" type="raw"/>
      <source file="/var/lib/libvirt/images/Win10_1809_LTSC_64bit.iso"/>
      <target dev="sdb" bus="sata"/>
      <readonly/>
      <address type="drive" controller="0" bus="0" target="0" unit="1"/>
    </disk>
    <controller type="usb" index="0" model="qemu-xhci" ports="15">
      <address type="pci" domain="0x0000" bus="0x02" slot="0x00" function="0x0"/>
    </controller>
    <controller type="sata" index="0">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x1f" function="0x2"/>
    </controller>
    <controller type="pci" index="0" model="pcie-root"/>
    <controller type="pci" index="1" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="1" port="0x10"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0" multifunction="on"/>
    </controller>
    <controller type="pci" index="2" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="2" port="0x11"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x1"/>
    </controller>
    <controller type="pci" index="3" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="3" port="0x12"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x2"/>
    </controller>
    <controller type="pci" index="4" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="4" port="0x13"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x3"/>
    </controller>
    <controller type="pci" index="5" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="5" port="0x14"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x4"/>
    </controller>
    <controller type="pci" index="6" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="6" port="0x15"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x5"/>
    </controller>
    <controller type="pci" index="7" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="7" port="0x16"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x6"/>
    </controller>
    <controller type="pci" index="8" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="8" port="0x17"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x7"/>
    </controller>
    <controller type="virtio-serial" index="0">
      <address type="pci" domain="0x0000" bus="0x03" slot="0x00" function="0x0"/>
    </controller>
    <interface type="network">
      <mac address="52:54:00:c4:5e:71"/>
      <source network="default"/>
      <model type="e1000e"/>
      <address type="pci" domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
    </interface>
    <serial type="pty">
      <target type="isa-serial" port="0">
        <model name="isa-serial"/>
      </target>
    </serial>
    <console type="pty">
      <target type="serial" port="0"/>
    </console>
    <channel type="spicevmc">
      <target type="virtio" name="com.redhat.spice.0"/>
      <address type="virtio-serial" controller="0" bus="0" port="1"/>
    </channel>
    <input type="tablet" bus="usb">
      <address type="usb" bus="0" port="1"/>
    </input>
    <input type="mouse" bus="ps2"/>
    <input type="keyboard" bus="ps2"/>
    <graphics type="spice" autoport="yes">
      <listen type="address"/>
      <image compression="off"/>
    </graphics>
    <sound model="ich9">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x1b" function="0x0"/>
    </sound>
    <video>
      <model type="qxl" ram="65536" vram="65536" vgamem="16384" heads="1" primary="yes"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x0"/>
    </video>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x4b" slot="0x00" function="0x0"/>
      </source>
      <rom file="/usr/share/vgabios/EVGA.RTX2080.8192.181022.rom"/>
      <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x4b" slot="0x00" function="0x1"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x4b" slot="0x00" function="0x2"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x4b" slot="0x00" function="0x3"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x08" slot="0x00" function="0x0"/>
    </hostdev>
    <redirdev bus="usb" type="spicevmc">
      <address type="usb" bus="0" port="2"/>
    </redirdev>
    <redirdev bus="usb" type="spicevmc">
      <address type="usb" bus="0" port="3"/>
    </redirdev>
    <memballoon model="virtio">
      <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
    </memballoon>
  </devices>
</domain>

Looks like there are some other errors when I try to start the VM (dmesg | grep error):

[ 4884.738275] pcieport 0000:40:03.1: DPC: unmasked uncorrectable error detected
[ 4886.456572] pcieport 0000:40:03.1: DPC: unmasked uncorrectable error detected

PCI-E port it seems to be referring to is:

40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge

lspci for that device:

40:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
        Kernel driver in use: pcieport

IOMMU group for that device:

IOMMU Group 40:
        40:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
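
One way to confirm that this bridge really is the port directly above the GPU is to resolve the GPU's sysfs path and take its parent directory. The sketch below assumes the 2080 sits directly under 40:03.1; the example path is illustrative, not copied from this host, and `upstream_bridge` is a made-up helper name.

```shell
# upstream_bridge (hypothetical helper): given the resolved sysfs path of a PCI
# device, print the address of the bridge directly above it.
upstream_bridge() {
    local parent="${1%/*}"      # drop the device's own path component
    printf '%s\n' "${parent##*/}"
}

# On a live host the path would come from readlink, e.g.:
#   upstream_bridge "$(readlink -f /sys/bus/pci/devices/0000:4b:00.0)"
# Assumed example path, for illustration only:
upstream_bridge "/sys/devices/pci0000:40/0000:40:03.1/0000:4b:00.0"
```

`lspci -tv` shows the same relationship as a tree, if that is easier to read.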

dmesg | grep iommu:

[  276.499497] vfio_iommu_type1_attach_group: No interrupt remapping support.  Use the module param "allow_unsafe_interrupts" to enable VFIO IOMMU support on this platform

I added this to my vfio modprobe:

options vfio_iommu_type1 allow_unsafe_interrupts=1

Rebuilt with:

mkinitcpio -p linux54

Then grub too for good measure:

grub-mkconfig -o /boot/grub/grub.cfg
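
After a reboot, it may be worth checking that the option actually took effect, since a modprobe.d change only helps if it ends up in the initramfs that loads the module. A small sketch, where `module_param` is a made-up helper name; boolean module parameters show up in sysfs as Y or N.

```shell
# module_param (hypothetical helper): print the current value of a kernel
# module parameter. The optional third argument overrides the sysfs module
# directory so the function can be tried against a fake tree.
module_param() {
    local file="${3:-/sys/module}/$1/parameters/$2"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "unavailable"   # module not loaded, or parameter not exposed
    fi
}

module_param vfio_iommu_type1 allow_unsafe_interrupts
```

A result of Y means the override is active; N or "unavailable" means the option was not picked up.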

Jumped up to kernel version 5.5.2-1 just to see if it made any difference.

Same results on the logs.

VM starts for a second then pauses. Can’t resume, have to power off the VM.

Just rebuilt without using OVMF and just went with SeaBIOS Q35.

Same result.

Interestingly, I had some brief progress. I re-enabled CSM support in my BIOS, and both SeaBIOS and OVMF were booting and detecting the 2080 with the vBIOS ROM file, though I was getting error 43.

To verify whether the vBIOS file alone was what was needed, I disabled CSM again, and neither worked.

Now I reverted back to CSM being enabled and the same XML changes, but I’m still having the same problem.

Here’s the hostdev section:

<hostdev mode="subsystem" type="pci" managed="yes">
  <source>
    <address domain="0x0000" bus="0x4b" slot="0x00" function="0x1"/>
  </source>
  <rom bar="off" file="/usr/share/vgabios/EVGA.RTX2080.8192.181022.rom"/>
  <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
</hostdev>

Do I need ROM bar on or off?

Is CSM needed?

After chatting with a few helpful people in the VFIO discord they were able to figure out that something is wrong with passing the USB portion of the GPU through.

It could well be a platform issue. I’m going to try to keep this thread up to date.

If anyone else out there hits this sort of dead end, pass through only the VGA and audio functions of your GPU to get the VM to start.

Any feedback or help is still appreciated.

So it seems trying to pass ANY USB controller through vfio_pci causes the guest not to boot.

I just tried to pass a Renesas controller that’s on a StarTech card. It’s worked in the past, just like the rest of this stuff really.

dmesg for the VM not booting with that controller is here: https://clbin.com/vcsjd

Could you try booting with ‘pcie_aspm=off’ and then with ‘pci=noaer’ while passing the USB controller? If that works, could you try using the USB port of the GPU?
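
On a GRUB setup like the one described earlier in the thread, these parameters would typically be appended to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by regenerating grub.cfg with grub-mkconfig and rebooting. To confirm which of them the running kernel actually received, something like the following sketch could be used (`has_kernel_param` is a made-up helper name):

```shell
# has_kernel_param (hypothetical helper): succeed if the exact parameter appears
# as a whole word in the given command line (default: the running kernel's).
has_kernel_param() {
    local param="$1"
    local cmdline="${2:-$(cat /proc/cmdline 2>/dev/null)}"
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *)            return 1 ;;
    esac
}

for p in pcie_aspm=off pci=noaer; do
    if has_kernel_param "$p"; then
        echo "$p: active"
    else
        echo "$p: not set"
    fi
done
```

Testing one parameter per boot makes it clear which of the two is responsible for any change in behaviour.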

I have the TRX40 Designare but can’t use it yet because other parts are missing.
I have a feeling that it will be a very long road to get VFIO working on it :neutral_face:


I’ll make sure to add them to grub today and report back.

@maximlevitsky that did it.

I’m able to pass both the USB controller on the GPU and the separate Renesas one.

I’ll put together a full write up of my setup in my next reply just in case someone runs into this again or you need it as a reference.


Did the VM just boot, or were you also able to attach some devices to the controllers? Which setting helped?

@maximlevitsky I had the VM booting when I didn’t pass anything but the VGA and Audio portion of my GPU. It seems related to anything USB, so the one on the GPU and the extra one I have in there.

With ASPM off I’m able to pass everything, including the problematic USB controllers.

Everything is now working as expected.

It seems the first-gen and third-gen Threadripper have that fix in common.

No no.
I mean when you passed the USB controllers, did the VM just boot, or were you also able to attach
some devices to the controllers and see them in the guest?

Fascinating! Another way to pass through USB on Threadripper 3 / Ryzen 3000 is to use the FLR patch, e.g.: https://www.reddit.com/r/VFIO/comments/eba5mh/workaround_patch_for_passing_through_usb_and/

Given we run the same board, I’ll test this later this week.

@maximlevitsky I am able to attach USB devices to the USB controllers and see them in the guest. Keyboard, mouse, USB DAC, and USB mic all work as expected on either the 2080’s USB controller (Type-C to Type-A to a hub) or the StarTech PEXUSB3S24 (Renesas).

@FutureFade that is interesting. I wonder which method is more stable. Right now I haven’t run it for more than a couple of hours, but that’s usually about as long as any gaming session would go. We’ll see how it does on a full day of SolidWorks or something similar. Also, I’m on BIOS F3 (AGESA 1.0.0.3), not the latest F4A (AGESA 1.0.0.3A?).

This is wonderful news!
Just to be sure, did you use only pcie_aspm=off, or both that and pci=noaer (the latter hides PCIe errors)?

Using pcie_aspm=off doesn’t work for these devices:

  • Matisse USB 3.0 Host Controller
  • Starship/Matisse HD Audio Controller
  • Starship USB 3.0 Host Controller

Other than those, you’ll be good to go. You can pass through ASMedia controllers without pcie_aspm=off, though.

To conclude this little investigation: if you have a 20-series Nvidia card, you’ll need pcie_aspm=off to use the built-in Type-C controller.

Although I am a little surprised that it worked, because pcie_aspm is an Active State Power Management option, something that shouldn’t really affect passthrough.

I finally got to test the TRX40 Designare and I managed to get all the USB controllers working (with ACS and FLR overrides).

@vljio could you check something small for me? I don’t have the RTX series card so I can’t check it yet.

Could you check if ‘pcie_aspm.policy=performance’ works the same as pcie_aspm=off on the kernel command line?

Sadly, pcie_aspm=off here disables Thunderbolt hotplug, which otherwise works very well.
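
When comparing the two settings, it can help to confirm which ASPM policy the kernel ended up with. The policy file in sysfs lists all policies and marks the active one in square brackets, e.g. "[default] performance powersave powersupersave". A minimal sketch, with `active_aspm_policy` as a made-up helper name:

```shell
# active_aspm_policy (hypothetical helper): print the bracketed entry from a
# policy line such as "[default] performance powersave powersupersave".
active_aspm_policy() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

policy_file=/sys/module/pcie_aspm/parameters/policy
if [ -r "$policy_file" ]; then
    active_aspm_policy "$(cat "$policy_file")"
else
    echo "ASPM policy not exposed"
fi
```

This only reports the policy in effect; whether ‘pcie_aspm.policy=performance’ avoids the passthrough errors still has to be tested on the actual hardware.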

Yeah I can check tomorrow for you. I’ll report back.
