Attempting single GPU passthrough but only getting a black screen

So I’ve been wanting to try one of the single GPU VFIO guides that have been floating around (mainly so I can ditch my work laptop for WFH). I’ve taken the scripts used in this video and amended them to match my GPU and desktop environment (GNOME).

I’m running Debian 10.7 and am using virt-manager to start the Windows 10 VM. My hardware is:

CPU: Ryzen 2700
Motherboard: Asus TUF GAMING X570-PRO (WI-FI) ATX AM4 Motherboard
GPU: Sapphire Radeon RX VEGA 56 8 GB Video Card (Reference, jet engine edition)

I think I’ve made some obvious error (e.g. probably some variable I’ve missed).

Starting the VM results in a black screen (though I can still SSH to the Debian host). Running “grep libvirtd /var/log/syslog” gives me the following:

Jan 22 16:38:32 userpc kernel: [   49.160548] CPU: 15 PID: 1118 Comm: libvirtd Tainted: G        W         4.19.0-13-amd64 #1 Debian 4.19.160-2
Jan 22 16:38:32 userpc kernel: [   49.160894] CPU: 15 PID: 1118 Comm: libvirtd Tainted: G        W         4.19.0-13-amd64 #1 Debian 4.19.160-2
Jan 22 16:40:11 userpc kernel: [   13.654154] audit: type=1400 audit(1611333607.820:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/libvirtd" pid=807 comm="apparmor_parser"
Jan 22 16:40:11 userpc kernel: [   13.654159] audit: type=1400 audit(1611333607.820:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/libvirtd//qemu_bridge_helper" pid=807 comm="apparmor_parser"
Jan 22 16:40:11 userpc libvirtd[1131]: libvirt version: 5.0.0, package: 4+deb10u1 (Guido Günther <[email protected]> Thu, 05 Dec 2019 00:22:14 +0100)
Jan 22 16:40:11 userpc libvirtd[1131]: hostname: userpc
Jan 22 16:40:11 userpc libvirtd[1131]: Non-executable hook script /etc/libvirt/hooks/qemu
Binary file /var/log/syslog matches

When I saw “Non-executable hook script /etc/libvirt/hooks/qemu”, I assumed I just hadn’t marked that file and the two scripts it triggers as executable. But I had.
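
For what it’s worth, a quick way to check all three files at once (paths as used in this thread):

```shell
# Check the execute bit on the hook and both scripts; libvirt logs
# "Non-executable hook script" when any of them is missing it.
for f in /etc/libvirt/hooks/qemu /bin/vega-vfio-startup.sh /bin/vega-vfio-teardown.sh; do
    if [ -x "$f" ]; then
        echo "ok:  $f"
    else
        echo "fix: $f (run: chmod +x $f)"
    fi
done
```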

So I tried to SSH into my Debian host and run the “/bin/vega-vfio-startup.sh” script manually, which resulted in:

modprobe: FATAL: Module amdgpu is in use

At which point the script hangs. Isn’t unloading it the whole point of the script, though? I know amdgpu is in use; that’s exactly why the script unloads it, so it’s no longer in use and the VM can take over the card.

Here’s the qemu file in /etc/libvirt/hooks/qemu that triggers the two scripts:

#!/bin/bash

# Script for Windows_10_Pro
if [[ $1 == "Windows_10_Pro" ]]; then
    if [[ $2 == "prepare" ]]; then
        /bin/vega-vfio-startup.sh
    fi

    if [[ $2 == "release" ]]; then
        /bin/vega-vfio-teardown.sh
    fi
fi
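
One pitfall with these hooks: `[[ ... ]]` is a bash-ism, and on Debian /bin/sh is dash, which doesn’t support it. So the shebang needs to be #!/bin/bash; with #!/bin/sh the comparisons fail and the scripts never fire. A quick demonstration:

```shell
# [[ ]] only exists in bash/ksh/zsh; dash (Debian's /bin/sh) rejects it.
bash -c '[[ a == a ]] && echo "bash: matched"'
dash -c '[[ a == a ]] && echo "dash: matched"' 2>/dev/null \
    || echo "dash: [[ not supported, branch never taken"
```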

Here is /bin/vega-vfio-startup.sh:

#!/bin/bash
# Helpful to read output when debugging
set -x

# Stop display manager
systemctl stop gdm.service

# Unbind VTconsoles
echo 0 > /sys/class/vtconsole/vtcon0/bind

# Unbind EFI-Framebuffer
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

sleep 5

# Unload AMD drivers
modprobe -r amdgpu

# Unbind the GPU from display driver
virsh nodedev-detach pci_0000_0b_00_0
virsh nodedev-detach pci_0000_0b_00_1

# Load VFIO kernel module
modprobe vfio-pci

Here is the /bin/vega-vfio-teardown.sh script:

#!/bin/bash
set -x

# Unload VFIO-PCI Kernel Driver
modprobe -r vfio-pci
modprobe -r vfio_iommu_type1
modprobe -r vfio

# Re-Bind GPU to AMD Driver
virsh nodedev-reattach pci_0000_0b_00_1
virsh nodedev-reattach pci_0000_0b_00_0

# Rebind VT consoles
echo 1 > /sys/class/vtconsole/vtcon0/bind

# Re-Bind EFI-Framebuffer
echo "efi-framebuffer.0" > /sys/bus/platform/drivers/efi-framebuffer/bind

#Load amd driver
modprobe amdgpu

# Restart Display Manager
systemctl start gdm.service

Here’s the XML of my VM:

<!--
WARNING: THIS IS AN AUTO-GENERATED FILE. CHANGES TO IT ARE LIKELY TO BE
OVERWRITTEN AND LOST. Changes to this xml configuration should be made using:
  virsh edit Windows_10_Pro
or other application using the libvirt API.
-->

<domain type='kvm'>
  <name>Windows_10_Pro</name>
  <uuid>75ab6ccc-91dd-426c-9a21-5ed5e05cc37f</uuid>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://microsoft.com/win/10"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <vcpu placement='static'>14</vcpu>
  <os>
    <type arch='x86_64' machine='pc-q35-3.1'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
    </hyperv>
    <vmport state='off'/>
  </features>
  <cpu mode='host-model' check='partial'>
    <model fallback='allow'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
    <timer name='hypervclock' present='yes'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/home/user/Downloads/Win10_20H2_v2_English_x64.iso'/>
      <target dev='sdb' bus='sata'/>
      <readonly/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/home/user/Media/4TB Internal/KVM Images/Windows_10_Pro.qcow2'/>
      <target dev='sdc' bus='sata'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <controller type='usb' index='0' model='qemu-xhci' ports='15'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x12'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <interface type='network'>
      <mac address='52:54:00:45:65:25'/>
      <source network='default'/>
      <model type='e1000e'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <input type='tablet' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <sound model='ich9'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1b' function='0x0'/>
    </sound>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x0b' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </hostdev>
    <redirdev bus='usb' type='spicevmc'>
      <address type='usb' bus='0' port='2'/>
    </redirdev>
    <redirdev bus='usb' type='spicevmc'>
      <address type='usb' bus='0' port='3'/>
    </redirdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </memballoon>
  </devices>
</domain>

Feel like I’ve not set something really obvious, but not sure what.

Okay, so I’ve now switched my setup to match this guide and this guide. I’m still using the same start and stop scripts though.

In the first GitHub guide they mention issues with GDM. So my systemctl stop gdm command seems to be at fault:

NOTE: Gnome/GDM users. You have to uncomment the line killall gdm-x-session in order for the script to work properly. Killing GDM does not destroy all user sessions like other display managers do.

So it looks like it can’t unload the amdgpu module because the display manager is still running.

I’ve switched my startup script to this but I’m still getting a black screen:

#!/bin/bash
# Helpful to read output when debugging
set -x

# Stop display manager
systemctl stop display-manager.service
killall gdm-x-session

# Unbind VTconsoles
echo 0 > /sys/class/vtconsole/vtcon0/bind

# Unbind EFI-Framebuffer
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

sleep 5

# Unload AMD drivers
modprobe -r amdgpu

# Unbind the GPU from display driver
virsh nodedev-detach pci_0000_0b_00_0
virsh nodedev-detach pci_0000_0b_00_1

# Load VFIO kernel module
modprobe vfio-pci

Unfortunately, it cannot find a process called gdm-x-session, which I’m guessing is because Debian uses Wayland rather than X in GNOME. Not sure what the equivalent process is in that case.

Nevermind, it’s gdm-wayland-session. I’ve amended the script to this:

#!/bin/bash
# Helpful to read output when debugging
set -x

# Stop display manager
systemctl stop display-manager.service
killall gdm-wayland-session

# Unbind VTconsoles
echo 0 > /sys/class/vtconsole/vtcon0/bind

# Unbind EFI-Framebuffer
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

sleep 5

# Unload AMD drivers
modprobe -r amdgpu

# Unbind the GPU from display driver
virsh nodedev-detach pci_0000_0b_00_0
virsh nodedev-detach pci_0000_0b_00_1

# Load VFIO kernel module
modprobe vfio-pci

However, when I run the script over SSH it still gives me the modprobe: FATAL: Module amdgpu is in use error. It also says there is no process with the name gdm-wayland-session. The process name is correct, so I’m assuming it has already been killed by the systemctl stop display-manager.service that runs before it.

So something graphical is still running? Splash screen?

Just looked at htop after running the script over SSH. Yeah, GNOME is still definitely running even after systemctl stop display-manager.service and killall gdm-wayland-session. All my applications are still running alongside GNOME shell etc.

I would have thought stopping the display manager service would kill everything running ¯\_(ツ)_/¯. Not sure how else to end everything in the session. Any suggestions?
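
One approach that might work (assuming systemd-logind, which Debian uses): rather than relying on stopping GDM, ask logind to terminate every session directly. Note this would also kill SSH sessions, which is fine when run from the libvirt hook (libvirtd runs outside any user session) but not from an interactive shell. `kill_user_sessions` is just an illustrative helper name, not something from the guides:

```shell
#!/bin/bash
# Sketch: terminate all logind sessions so nothing graphical survives.
# WARNING: this also terminates SSH sessions; intended for the hook script.
kill_user_sessions() {
    loginctl list-sessions --no-legend 2>/dev/null | awk '{print $1}' |
    while read -r session_id; do
        loginctl terminate-session "$session_id"
    done
}
# kill_user_sessions    # left disabled here; uncomment on the host
```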

Just in case it was an issue exclusive to Wayland, I switched GNOME over to X.Org to test. Also switched to using killall gdm-x-session, obviously.

Unfortunately, it’s still unable to unload the amdgpu module (modprobe: FATAL: Module amdgpu is in use). Not sure what’s using the module, though.

Okay, so I’m able to stop GDM/GNOME by manually running sudo systemctl stop display-manager.service and then sudo pkill gdm-wayland-session. This doesn’t work in the hook script, though, for some reason; the session doesn’t get killed ¯\_(ツ)_/¯. I can’t explain why it works when run manually but not as a hook script.

Anyway, if I run those, it drops me to a black screen with a flashing cursor in the top-left (guessing this is the virtual console?). If I try to run modprobe -r amdgpu over SSH it still fails with modprobe: FATAL: Module amdgpu is in use.

Output of lsmod | grep amdgpu (I can’t unload any of these modules):

amdgpu               3461120  21
chash                  16384  1 amdgpu
gpu_sched              28672  1 amdgpu
i2c_algo_bit           16384  1 amdgpu
ttm                   126976  1 amdgpu
drm_kms_helper        208896  1 amdgpu
drm                   495616  25 gpu_sched,drm_kms_helper,amdgpu,ttm
mfd_core               16384  1 amdgpu
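
The 21 in that lsmod output is amdgpu’s reference count; modprobe -r fails as long as anything (gnome-shell, Xorg, even the framebuffer console) still holds a DRM handle. Some read-only diagnostics to find the holders (fuser is from the psmisc package; the /dev/dri node names can vary per system):

```shell
# Current reference count on amdgpu (0 means it can be unloaded):
cat /sys/module/amdgpu/refcnt 2>/dev/null || echo "amdgpu not loaded on this machine"
# Kernel modules that depend on amdgpu:
ls /sys/module/amdgpu/holders/ 2>/dev/null
# Userspace processes with the DRM nodes open:
fuser -v /dev/dri/card* /dev/dri/renderD* 2>/dev/null || true
```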

If I try to just run the VM with virsh start I just get a black screen.

I’m starting to wonder if my virtual consoles are just not unbinding properly. Again, here’s the script, which does include the commands needed to unbind the vconsoles:

#!/bin/bash
# Helpful to read output when debugging
set -x

# Stop display manager
systemctl stop display-manager.service
pkill gdm-wayland-session

# Unbind VTconsoles
echo 0 > /sys/class/vtconsole/vtcon0/bind

# Unbind EFI-Framebuffer
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

sleep 5

# Unload AMD drivers
modprobe -r amdgpu

# Unbind the GPU from display driver
virsh nodedev-detach pci_0000_0b_00_0
virsh nodedev-detach pci_0000_0b_00_1

# Load VFIO kernel module
modprobe vfio-pci
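
One possible gap in the script above: it only ever unbinds vtcon0, but the framebuffer console isn’t always vtcon0 (often vtcon0 is a dummy device and the framebuffer console is vtcon1). A safer version checks each vtcon’s name first; the actual unbind is left commented out here as a sketch:

```shell
# Unbind whichever vtcon is backed by the frame buffer, not a hardcoded vtcon0.
for vt in /sys/class/vtconsole/vtcon*; do
    [ -e "$vt/name" ] || continue
    if grep -q 'frame buffer' "$vt/name"; then
        echo "framebuffer console is: $vt ($(cat "$vt/name"))"
        # echo 0 > "$vt/bind"    # do this on the host
    fi
done
```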

I’ve installed the vendor-reset DKMS as well to appease the VFIO gods. Just in case. But otherwise I’m lost on this.

Same card, same problem. I’ve been trying to solve this for days and can’t get it to work at all. When I run my script over SSH from my phone (which btw looks similar to yours, it just doesn’t have the unbind commands for the console), I see that my Vega and its HDMI audio device are successfully detached, and that my “Win10” domain starts, but the only thing I get is a black screen.


Have you tried unbinding the audio driver as well?

Yes, that’s virsh nodedev-detach pci_0000_0b_00_1 in the startup hook script. Neither virsh nodedev-detach pci_0000_0b_00_0 nor virsh nodedev-detach pci_0000_0b_00_1 actually completes; even if I run them manually (after running the hook script) as individual commands, they just hang until I Ctrl+C.

I think the problem is that those detach commands will not run until the amdgpu module is unloaded.

If I run the complete script as sudo manually, my screens completely go black but I still can’t unload the amdgpu module.

Here’s the relevant lspci output:

0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] (rev c3)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64]

Have you started the PC but chosen not to log in to GNOME, switched to a text session (Ctrl+Alt+F3 or whatever), then closed the GUI?

There are other ways to unbind and bind the PCI devices that might be worth a shot.

For testing, you could try killing GNOME+GDM and so on, logging in over SSH, and then, with sudo, trying a couple of the echo commands, but using your devices’ addresses?

Can’t believe I didn’t think of that. Yeah, exiting out of the display manager and ending it with systemctl stop display-manager.service killed off all the GNOME processes. So we can rule that out of the equation.

I’ll have a go with that script you mention, substituting the IDs with my own.

I did just notice there are a bunch of PCI entries other than just the GPU and its audio output. Do I need to unbind all of these? Apologies, this is my first VFIO attempt. Not sure if these are just from the motherboard or not.

user@userpc:~$ lspci | grep -E 'PCI|VGA|Vega'
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57ad
03:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57a3
03:05.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57a3
03:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57a4
03:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57a4
03:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57a4
09:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1470 (rev c3)
0a:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1471
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] (rev c3)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64]
0c:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
0d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function

Edit: fixed my lspci output.

You’d only need to unbind the GPU’s devices, and I guess anything else in the same IOMMU group?
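
To see exactly what shares a group with the GPU, the usual IOMMU-group listing snippet works (read-only; it only shows output when the IOMMU is enabled in firmware and on the kernel command line):

```shell
# List every IOMMU group and the devices in it; the GPU's group tells you
# what has to be detached and passed through together (bridges excepted).
for g in /sys/kernel/iommu_groups/*; do
    [ -d "$g" ] || continue
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo "    $(lspci -nns "${d##*/}")"
    done
done
```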

Not sure if this is right, especially since it segfaults, but it does switch off my monitors as normal. I’m unsure if I’m supposed to change “10de 100a” or “10de 0e1a” to something else, since searching for either comes up with forum threads about 750 Tis.

#!/bin/bash

# VGA Controller
echo '0000:0b:00.0' > /sys/bus/pci/devices/0000:0b:00.0/driver/unbind
echo '10de 100a'    > /sys/bus/pci/drivers/vfio-pci/new_id
echo '0000:0b:00.0' > /sys/bus/pci/devices/0000:0b:00.0/driver/bind
echo '10de 100a'    > /sys/bus/pci/drivers/vfio-pci/remove_id

# Audio Controller
echo '0000:0b:00.1' > /sys/bus/pci/devices/0000:0b:00.1/driver/unbind
echo '10de 0e1a'    > /sys/bus/pci/drivers/vfio-pci/new_id
echo '0000:0b:00.1' > /sys/bus/pci/devices/0000:0b:00.1/driver/bind
echo '10de 0e1a'    > /sys/bus/pci/drivers/vfio-pci/remove_id

Interestingly, if I start up my VM with virsh start win10 after this, then trying modprobe -r amdgpu results in:

modprobe: ERROR: ../libkmod/libkmod-module.c:793 kmod_module_remove_module() could not remove 'amdgpu': Device or resource busy

So it seems like something is happening, although virsh start win10 still hangs and there doesn’t seem to be a great deal of disk activity.

Never mind, it’s the vendor ID and device ID. This should be right now, but it’s still segfaulting:

#!/bin/bash

# VGA Controller
echo '0000:0b:00.0' > /sys/bus/pci/devices/0000:0b:00.0/driver/unbind
echo '1002 687F'    > /sys/bus/pci/drivers/vfio-pci/new_id
echo '0000:0b:00.0' > /sys/bus/pci/devices/0000:0b:00.0/driver/bind
echo '1002 687F'    > /sys/bus/pci/drivers/vfio-pci/remove_id

# Audio Controller
echo '0000:0b:00.1' > /sys/bus/pci/devices/0000:0b:00.1/driver/unbind
echo '1002 aaf8'    > /sys/bus/pci/drivers/vfio-pci/new_id
echo '0000:0b:00.1' > /sys/bus/pci/devices/0000:0b:00.1/driver/bind
echo '1002 aaf8'    > /sys/bus/pci/drivers/vfio-pci/remove_id

Edit: I need some sleep. Forgot to change the device ID for the HDMI audio.


Yeah, you’d have to replace the hardware IDs with the ones from your card.

lspci -nnk

should give the hardware IDs, colon-separated, in square brackets:

09:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] [10de:17c8] (rev a1)
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

09:00.1 Audio device [0403]: NVIDIA Corporation GM200 High Definition Audio [10de:0fb0] (rev a1)
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel

The bracketed ones ([10de:17c8] and [10de:0fb0]) are the GPU and audio IDs for me.

I find it a lot easier to use a $20 used GPU as a “main” GPU and to isolate the passthrough one. I had issues with a ROM BAR stopping passthrough for me headless, and it was a pain getting the ROM off the card.

Have you tried isolating the PCI IDs in GRUB and running the machine completely headless, to see if it would work that way?

Not yet. What’s the GRUB option? I’ll give it a shot via SSH.

On the GRUB_CMDLINE_LINUX line in the GRUB config, add your hardware IDs, so it’s something like:

GRUB_CMDLINE_LINUX=" iommu=1 amd_iommu=on rd.driver.pre=vfio-pci vfio-pci.ids=10de:17c8,10de:0fb0"

Like this?

GRUB_CMDLINE_LINUX="iommu=1 amd_iommu=on iommu=pt rd.driver.pre=vfio-pci vfio-pci.ids=1002:687F,1002:aaf8"

Oddly, my GPU is still being used as normal even after update-grub. These are 100% the right device IDs. See: PCI\VEN_1002&DEV_687F - Vega 10 XL/XT [Radeon RX Vega… | Device Hunt
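
For what it’s worth, two things commonly bite here on Debian: after editing /etc/default/grub the command is update-grub, and since Debian builds vfio-pci as a module, vfio-pci.ids on the kernel command line only helps if vfio-pci actually loads before amdgpu, which usually means regenerating the initramfs too. A common modprobe.d approach, sketched against a scratch file so it runs as-is (on the real host the target would be /etc/modprobe.d/vfio.conf, followed by update-initramfs -u and a reboot); the IDs are the ones from this thread:

```shell
# Bind the Vega's two functions to vfio-pci early via modprobe.d instead of
# (or in addition to) the GRUB command line. Written to /tmp here as a demo.
conf=/tmp/vfio.conf    # real target: /etc/modprobe.d/vfio.conf
cat > "$conf" <<'EOF'
options vfio-pci ids=1002:687f,1002:aaf8
softdep amdgpu pre: vfio-pci
EOF
cat "$conf"
# On the host, as root:
#   cp /tmp/vfio.conf /etc/modprobe.d/vfio.conf
#   update-initramfs -u
#   reboot
```

The softdep line asks modprobe to load vfio-pci before amdgpu, so vfio-pci claims the listed IDs first.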