TRX40 Related - OVMF/QEMU Failing to boot - Pauses / Black Screen when starting VM

@maximlevitsky I had the VM booting when I didn’t pass through anything but the VGA and audio functions of my GPU. It seems related to anything USB, so both the controller on the GPU and the extra one I have in there.

With ASPM off I’m able to pass everything, including the problematic USB controllers.

Everything is now working as expected.

It seems the first gen and third gen threadripper have that fix in common.

No no.
I mean when you passed the USB controllers, did the VM just boot, or were you also able to attach
some devices to the controllers and see them in the guest?

Fascinating! Another way to get USB passthrough working on Threadripper 3 / Ryzen 3000 is to use the FLR patch, e.g.:
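As a side note, before relying on an FLR patch you can check whether a device actually advertises Function Level Reset in its PCIe capabilities. This is just a diagnostic sketch; the bus address below is a placeholder you’d replace with your controller’s address from `lspci`:

```shell
# FLReset+ means the device advertises Function Level Reset,
# FLReset- means it does not. 0b:00.3 is a placeholder address.
sudo lspci -vv -s 0b:00.3 | grep -i "FLReset"
```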

Given we run the same board, I’ll test this later this week.

@maximlevitsky I am able to attach USB devices to the passed-through controllers and see them in the guest. Keyboard, mouse, USB DAC, and USB mic all work as expected on either the 2080’s USB controller (Type-C to Type-A to a hub) or the StarTech PEXUSB3S24 (Renesas).

@FutureFade that is interesting. I wonder which method is more stable. Right now I haven’t run it for more than a couple of hours, but that’s usually about as long as any gaming session would go. We’ll see how it does on a full day of SolidWorks or something similar. Also, I’m on BIOS F3 (older AGESA), not the latest F4A (newer AGESA).

This is wonderful news!
Just to be sure, did you only use pcie_aspm=off, or did you use both that and pci=noaer (the latter hides PCIe errors)?

Using pcie_aspm=off doesn’t work for these devices:

  • Matisse USB 3.0 Host Controller
  • Starship/Matisse HD Audio Controller
  • Starship USB 3.0 Host Controller

Other than those, you’ll be good to go. You can also pass through ASMedia controllers without pcie_aspm=off.

To conclude this little investigation: if you have a 20-series Nvidia card, you’ll need pcie_aspm=off to use the built-in Type-C controller.

Although I am a little surprised that it worked, because pcie_aspm is an active-state power management option, something that shouldn’t really affect passthrough.

I finally got to test the TRX40 Designare and I managed to get all the USB controllers working (with ACS and FLR overrides).

@vljio could you check something small for me? I don’t have the RTX series card so I can’t check it yet.

Could you check if ‘pcie_aspm.policy=performance’ works the same as pcie_aspm=off on the kernel command line?
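For anyone testing this: one way to confirm which ASPM policy the kernel actually picked up is the standard sysfs node below (a sketch; it only exists if your kernel was built with ASPM support):

```shell
# The bracketed entry is the active policy,
# e.g. "[performance] default powersave powersupersave"
cat /sys/module/pcie_aspm/parameters/policy
```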

Sadly pcie_aspm=off here disables thunderbolt hotplug which otherwise works very well.

Yeah I can check tomorrow for you. I’ll report back.


@maximlevitsky ‘pcie_aspm.policy=performance’ does not work. VM pauses right after I start it.

That sucks.
Could you then boot with ‘pci=noaer’ only and see if hiding AER errors makes it work?
(I assume you don’t currently use pci=noaer)

Could you look at this? I can’t test it, and it might be a bummer for me if it doesn’t work.

Sorry for the slow reply.

I’ll check it out tomorrow for you and report back.



@vljio Thanks a lot!!

@maximlevitsky pci=noaer does not have any effect, unfortunately.

Did you type it correctly? Thank you, and I guess it will be fun making it work :frowning:
Thanks again!

I got it working.
I compiled my own kernel with the DPC error driver disabled, and voilà: both my RTX 2070S cards work, and I don’t need to disable ASPM, so my Thunderbolt driver works too.

By “work” I mean the USB controller passes through cleanly, and so do all the other Nvidia devices. Now my only wish is that Nvidia keeps that little USB port on the RTX 3000 series I am waiting for (I will sell this card when the RTX 3000 series is released).

Most likely the same can be done by blacklisting and/or poking at sysfs to disable that driver.
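For reference, the build-time version of this (disabling the PCIe DPC service driver when compiling a custom kernel) is just a config change. A sketch, run from the kernel source tree; `scripts/config` ships with the kernel sources:

```shell
# Disable the PCIe Downstream Port Containment service driver,
# then let Kconfig resolve any options that depended on it.
scripts/config --disable CONFIG_PCIE_DPC
make olddefconfig
```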

These DPC errors are probably just bogus, leftovers from some enterprise EPYC features.

Hi—coming here from TRX40 Nvidia Single GPU Passthrough where I seemed to be hitting a similar issue (although weirdly mine worked completely fine for the first host boot but then failed after the guest was shut down the first time). Did you ever resolve this with blacklisting or anything other than a custom kernel?

Hi @allu, what ultimately solved it for me was either not passing through the USB controller on my RTX 2080 or turning the PCIe ASPM setting off.

Use kernel version 5.4, that’s what I’m using today.

Here’s my vfio.conf (/etc/modprobe.d/vfio.conf):
options vfio-pci ids=10de:1e87,10de:10f8,10de:1ad8,10de:1ad9,1912:0015,8086:1528
options vfio_iommu_type1 allow_unsafe_interrupts=1

Here’s the GRUB_CMDLINE_LINUX_DEFAULT piece that matters for you (/etc/default/grub):
amd_iommu=on iommu=pt video=efifb:off pcie_aspm=off pci-stub.ids=144d:a808,144d:a801

Make sure to run `sudo mkinitcpio -p linux54` (or linuxXX, whatever version you’re using) after you edit your grub config, then reboot.
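After rebooting, a quick sanity check is to confirm the options actually made it onto the running kernel’s command line:

```shell
# Should include pcie_aspm=off (and the rest) if grub picked up your edit
cat /proc/cmdline
```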


How have you dealt with the DPC Error with the device 20.03.1

When I try to pass through a GPU I always get this error


I was facing this same issue on a Z390 system while passing through a Renesas (StarTech) card. Adding pcie_aspm=off to grub fixed it. Thanks!