PCIe Passthrough - CPU maxed, black screen

I’m trying to set up PCIe passthrough on a Ryzen 2700X with a secondary Vega 64 graphics card on Ubuntu 19.10. I installed Windows, then added the PCI devices. The problem is that when the guest VM boots, the VM’s display window goes blank, there’s no signal to either monitor, and the CPU reading in virt-manager maxes out at 100%.

The vfio driver appears to be loaded for the Vega card:

29:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c1) (prog-if 00 [VGA controller])
	Subsystem: Tul Corporation / PowerColor Vega 10 XL/XT [Radeon RX Vega 56/64] [148c:2387]
	Flags: fast devsel, IRQ 44
	Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Memory at f0000000 (64-bit, prefetchable) [size=2M]
	I/O ports at d000 [size=256]
	Memory at f7900000 (32-bit, non-prefetchable) [size=512K]
	Expansion ROM at f7980000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: vfio-pci
	Kernel modules: amdgpu
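
For anyone wanting to double-check the binding without lspci, here is a minimal sketch that reads the driver symlink in sysfs. The helper name bound_driver and its optional second argument are made up for illustration, and 0000:29:00.1 is only an assumed address for the card’s HDMI audio function (it is not shown in the output above).

```shell
# Print the kernel driver bound to a PCI device by reading its sysfs symlink.
# Usage: bound_driver 0000:29:00.0 [sysfs_root]
# The optional sysfs_root is a hypothetical hook for testing; it defaults to /sys.
bound_driver() {
    dev="$1"
    root="${2:-/sys}"
    link="$root/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Check the GPU and (assumed) its audio function.
for fn in 0000:29:00.0 0000:29:00.1; do
    printf '%s -> %s\n' "$fn" "$(bound_driver "$fn")"
done
```

If either function reports something other than vfio-pci, the host driver may still be holding part of the card.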

Passthrough is set on the virtual machine:

<hostdev mode="subsystem" type="pci" managed="yes">
  <driver name="vfio"/>
  <source>
    <address domain="0x0000" bus="0x29" slot="0x00" function="0x0"/>
  </source>
  <alias name="hostdev0"/>
  <address type="pci" domain="0x0000" bus="0x00" slot="0x08" function="0x0"/>
</hostdev>

If I remove the video card device in virt-manager, Windows boots normally. If I remove the vfio configuration, the card works without problems. What am I doing wrong?

So I downgraded to Ubuntu 18.04, and tried again, following the guide here. I have also tried following the guide here in the past.

In both cases, we get a little further, but end up with the same problem. The Windows install goes ok, the desktop appears on the alternate monitor, and everything reboots. But after a certain period of time, Windows just locks up. Then it’s the same problem - processor 100% maxed, no video. This time it made it 40% of the way through the graphics card install before it stopped.

At one point the VM wouldn’t start, giving me a weird error about an “unknown PCI code 127”. Rebooting the host seems to have fixed that for now. I feel like I’m missing something fundamental here…

So I tried switching to Fedora 31, roughly following the guides here and here. I verified that both the audio and video were using the vfio-pci driver at startup.

The same thing happened, but with a slight twist. I installed Windows, tried to install the AMD drivers again, and again saw them lock up at 40%. Right before the lockup, though, instead of desktops extended on the other monitors connected to the guest GPU, there was a brief flash and I saw “snow” (as in the old-time TV kind of snow) on one of them. It persisted for a minute, then it went black, and boom… lockup.

Again I saw PCI error 127 when trying to force off/power on the VM, and now, we’re stuck again - black screen, CPU maxed out. Whatever I’m doing wrong, I’ve now done it wrong against 3 different distributions. I hate to think it could be hardware. The video card works fine on the host system. I can’t think of what else to check…

PCI Error 127 is normal if you haven’t applied the VEGA Reset Patch!

Regarding the black screen, do you know if the VEGA is the primary video card? Can you grab the dmesg output? Just run dmesg | tail -n40 (wait at least 20 sec) after starting the VM.
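
If the tail is too noisy, a rough sketch like the one below narrows the log down to the relevant lines. The helper name vfio_lines is made up, and the default address prefix 0000:29:00 is just the one from the lspci output posted above.

```shell
# Keep only kernel log lines that mention vfio or the passed-through device.
# Reads log text on stdin; the optional argument is a PCI address prefix.
vfio_lines() {
    # "|| true" keeps the exit status clean when nothing matches.
    grep -iE "vfio|${1:-0000:29:00}" || true
}

# Typical use on the host (dmesg may need root on some distributions):
#   dmesg | vfio_lines
```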

You don’t see the TianoCore boot screen, right? Or does it go black when Windows boots?

Vega reset patch… I’d heard that there was a problem, no idea we had a patch for it yet. The Vega is the secondary video card; the primary video card is a 1080ti. I didn’t see anything in dmesg, but I’ll paste the output here.

When I enable the boot menu, I do see the TianoCore boot screen, but it freezes or goes black when the Windows bootloader takes over.

Try HDMI instead of DisplayPort. DisplayPort sometimes has some funkiness with UEFI boot sequences.

That is good to know. I fixed it by setting multifunction=on in the PCI config.

Also:

You could also try removing the HDMI-Audio-Device first and then boot the VM. I guess not having the Root-Device set to multifunction stops the AMD Driver from loading!
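
As a rough illustration of what that looks like in the domain XML: the guest-side address of the GPU carries multifunction="on" on function 0x0, and the audio function shares the same guest slot as function 0x1. The slot numbers below are only examples, and count_multifunction is a made-up helper that counts how many guest addresses have the attribute set.

```shell
# Count guest PCI addresses with multifunction="on" in domain XML read on stdin.
count_multifunction() {
    # grep -c exits non-zero on a count of 0, hence the "|| true".
    grep -cE "<address [^>]*multifunction=[\"']on[\"']" || true
}

# Sample guest addresses (illustrative values): GPU on function 0x0 with
# multifunction on, HDMI audio on function 0x1 of the same guest slot.
count_multifunction <<'EOF'
<address type="pci" domain="0x0000" bus="0x00" slot="0x08" function="0x0" multifunction="on"/>
<address type="pci" domain="0x0000" bus="0x00" slot="0x08" function="0x1"/>
EOF
```

On a real host the input would come from something like virsh dumpxml for the VM instead of the sample here-document.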

But yeah dunno about the DP stuff tho.

EDIT: If you want a quick way to compile the Fedora Kernel with the VEGA Patch applied u can use the following Dockerfile (works pretty well with podman):

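FROM fedora:31

# Kernel build to rebuild with the patch applied
ENV KVER=kernel-5.3.9-301.fc31

# Build steps run as root inside the container, so sudo is not needed
RUN dnf -y install fedpkg fedora-packager rpmdevtools ncurses-devel pesign
# Fetch the kernel source RPM from Koji
RUN rpmdev-setuptree && cd ~ && koji download-build --arch=src $KVER.src.rpm
# Unpack the source RPM and download the Vega reset patch next to the other sources
RUN useradd -s /sbin/nologin mockbuild && rpm -Uvh ~/$KVER.src.rpm && cd ~/rpmbuild/SOURCES && curl -o vega.patch https://pastebin.com/raw/nWkzJcGj
# Install the kernel's build dependencies
RUN cd ~/rpmbuild/SPECS/ && dnf -y builddep kernel.spec
# Set the build ID to .vega and register the patch in the spec file
RUN cd ~/rpmbuild/SPECS/ && sed -i -e 's/# define buildid .local/%define buildid .vega/g' kernel.spec && sed -i "s/# END OF PATCH DEFINITIONS/# END OF PATCH DEFINITIONS\nPatch9001: vega.patch/" kernel.spec
# Build the binary kernel RPMs
RUN cd ~/rpmbuild/SPECS/ && rpmbuild -bb --without debug --target=x86_64 kernel.spec

# At run time, copy the built RPMs to /output (mount a host directory there)
ENTRYPOINT cp ~/rpmbuild/RPMS/x86_64/kernel* /output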

Man, I didn’t know virt-manager had an XML view mode…

Someone should seriously fork virt-manager to Qt since it was declared EOL/Deprecated.

Yeah u just gotta activate it in the settings! But apparently it’s not present in all distros for some reason…

Dunno I’m fine with GTK ^^

Yeah, but the official GTK version is officially dead as per Red Hat. Someone needs to fork and port it to Qt. I run KDE and Qt would be better integrated into KDE themes.

Oh I didn’t know that! But then it would be badly integrated with GTK :rofl:

I’d prefer better KDE integration over GTK any day.

GNOME FTW :wink: :stuck_out_tongue:

I tried both using HDMI and only adding the video portion of the card. In both cases, I got the same thing. CPU is maxed, no video.

The problem seems to happen when booting Windows. When I enable the boot menu, I do see the TianoCore screen; it’s only when it hands off to the Windows bootloader that the CPU spikes and the screen goes blank. In some cases, the TianoCore logo just hangs there forever.

dmesg output follows…

[   24.778048] Bluetooth: RFCOMM ver 1.11
[   25.361835] rfkill: input handler disabled
[   26.275989] FAT-fs (sdb1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[   26.317302] EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: (null)
[   67.896789] kauditd_printk_skb: 30 callbacks suppressed
[   67.896790] audit: type=1400 audit(1573933888.486:42): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-c8d1b7b2-fde4-4275-83f8-9cd2f0f687a6" pid=2750 comm="apparmor_parser"
[   68.034966] audit: type=1400 audit(1573933888.622:43): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c8d1b7b2-fde4-4275-83f8-9cd2f0f687a6" pid=2753 comm="apparmor_parser"
[   68.172626] audit: type=1400 audit(1573933888.762:44): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c8d1b7b2-fde4-4275-83f8-9cd2f0f687a6" pid=2756 comm="apparmor_parser"
[   68.310773] audit: type=1400 audit(1573933888.898:45): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-c8d1b7b2-fde4-4275-83f8-9cd2f0f687a6" pid=2759 comm="apparmor_parser"
[   68.320051] virbr0: port 2(vnet0) entered blocking state
[   68.320053] virbr0: port 2(vnet0) entered disabled state
[   68.320087] device vnet0 entered promiscuous mode
[   68.320327] virbr0: port 2(vnet0) entered blocking state
[   68.320328] virbr0: port 2(vnet0) entered listening state
[   68.464468] audit: type=1400 audit(1573933889.050:46): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c8d1b7b2-fde4-4275-83f8-9cd2f0f687a6" pid=2781 comm="apparmor_parser"
[   70.336639] virbr0: port 2(vnet0) entered learning state
[   70.716673] vfio-pci 0000:29:00.0: enabling device (0000 -> 0003)
[   70.716951] vfio_ecap_init: 0000:29:00.0 hiding ecap 0x19@0x270
[   70.716960] vfio_ecap_init: 0000:29:00.0 hiding ecap 0x1b@0x2d0
[   72.352790] virbr0: port 2(vnet0) entered forwarding state
[   72.352805] virbr0: topology change detected, propagating
[  315.927886] virbr0: port 2(vnet0) entered disabled state
[  315.928702] device vnet0 left promiscuous mode
[  315.928703] virbr0: port 2(vnet0) entered disabled state
[  316.259862] audit: type=1400 audit(1573934136.846:47): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="libvirt-c8d1b7b2-fde4-4275-83f8-9cd2f0f687a6" pid=3448 comm="apparmor_parser"
[  511.949858] audit: type=1400 audit(1573934332.538:48): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-c8d1b7b2-fde4-4275-83f8-9cd2f0f687a6" pid=3547 comm="apparmor_parser"
[  512.085979] audit: type=1400 audit(1573934332.674:49): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c8d1b7b2-fde4-4275-83f8-9cd2f0f687a6" pid=3550 comm="apparmor_parser"
[  512.222316] audit: type=1400 audit(1573934332.810:50): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c8d1b7b2-fde4-4275-83f8-9cd2f0f687a6" pid=3553 comm="apparmor_parser"
[  512.362834] audit: type=1400 audit(1573934332.950:51): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-c8d1b7b2-fde4-4275-83f8-9cd2f0f687a6" pid=3556 comm="apparmor_parser"
[  512.365137] virbr0: port 2(vnet0) entered blocking state
[  512.365138] virbr0: port 2(vnet0) entered disabled state
[  512.365210] device vnet0 entered promiscuous mode
[  512.365356] virbr0: port 2(vnet0) entered blocking state
[  512.365357] virbr0: port 2(vnet0) entered listening state
[  512.508904] audit: type=1400 audit(1573934333.098:52): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-c8d1b7b2-fde4-4275-83f8-9cd2f0f687a6" pid=3580 comm="apparmor_parser"
[  514.368391] virbr0: port 2(vnet0) entered learning state
[  514.744686] vfio_ecap_init: 0000:29:00.0 hiding ecap 0x19@0x270
[  514.744696] vfio_ecap_init: 0000:29:00.0 hiding ecap 0x1b@0x2d0
[  516.384399] virbr0: port 2(vnet0) entered forwarding state
[  516.384410] virbr0: topology change detected, propagating

Some more interesting data…

I tried setting multifunction to on, with and without the ROM I dumped from the card. In both cases it did the same thing, maxing out the CPU and going to a black screen. However…

When both the ROM and multifunction=on are specified, Windows starts up the boot repair screen, so it at least knows it can’t boot with the card attached. With the driver that comes with Windows, the card does eventually get detected. This is beginning to look like a Radeon driver problem.

And it’s… sort of fixed, I guess. After finding this thread on Reddit, I found out that the latest Adrenalin drivers require that you use the Q35 option instead of the i440FX option for the chipset of the virtual machine. Now it appears to be running…
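
For anyone who wants to check which chipset an existing VM uses, here is a small sketch that pulls the machine type out of the domain XML. The helper name machine_type and the domain name win10 are placeholders, not anything from my actual setup.

```shell
# Print the machine type from libvirt domain XML read on stdin.
# Handles both single- and double-quoted attribute values.
machine_type() {
    sed -n "s/.*<type[^>]* machine=[\"']\([^\"']*\)[\"'].*/\1/p" | head -n 1
}

# Usage on the host ("win10" is a placeholder domain name):
#   virsh dumpxml win10 | machine_type
# A value starting with "pc-q35-" means Q35; "pc-i440fx-" means i440FX.
```

As far as I can tell, virt-manager only offers the chipset choice when the VM is created, so switching may mean recreating the VM definition.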

But there’s a second problem. It only works with one monitor. Windows detects two monitors (both plugged into the alternate card), but it only displays on one of them. Suggestions?

Managed to somehow get the second monitor working. I’m not quite sure what the magic was…

Maybe it only works with hot plug, and not at boot. Remember the EFI takes over the card for display mode setting at boot.