Okay, so I’m able to stop GDM/GNOME by manually running sudo systemctl stop display-manager.service and then sudo pkill gdm-wayland-session. This doesn’t work in the hook script though, for some reason; the session doesn’t get killed ¯\_(ツ)_/¯. I can’t explain why it works when run manually but not as a hook script.
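(Side note: since a hook runs with no terminal attached, one way to see what it’s actually doing is to log everything from the top of the script; a minimal sketch, log path arbitrary:)
#!/bin/bash
# Send all hook output to a log file so failures can be inspected afterwards
exec >> /var/log/libvirt-hook-debug.log 2>&1
set -x
echo "hook invoked with args: $*"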
Anyway, if I run those it drops me to a black screen with a flashing cursor in the top-left (guessing this is the vconsole?). If I try to run modprobe -r amdgpu from SSH, it still gives modprobe: FATAL: Module amdgpu is in use.
Output of lsmod | grep amdgpu (I can’t unload any of these modules):
amdgpu 3461120 21
chash 16384 1 amdgpu
gpu_sched 28672 1 amdgpu
i2c_algo_bit 16384 1 amdgpu
ttm 126976 1 amdgpu
drm_kms_helper 208896 1 amdgpu
drm 495616 25 gpu_sched,drm_kms_helper,amdgpu,ttm
mfd_core 16384 1 amdgpu
If I try to just run the VM with virsh start, I just get a black screen.
I’m starting to wonder if my virtual consoles are just not unbinding properly? Again, here’s the script, which does include the commands needed to unbind the vconsoles:
#!/bin/bash
# Helpful to read output when debugging
set -x
# Stop display manager
systemctl stop display-manager.service
pkill gdm-wayland-session
# Unbind VTconsoles
echo 0 > /sys/class/vtconsole/vtcon0/bind
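# (On many systems vtcon1 is the framebuffer console and may need unbinding too:
#  echo 0 > /sys/class/vtconsole/vtcon1/bind)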
# Unbind EFI-Framebuffer
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind
sleep 5
# Unload AMD drivers
modprobe -r amdgpu
# Unbind the GPU from display driver
virsh nodedev-detach pci_0000_0b_00_0
virsh nodedev-detach pci_0000_0b_00_1
# Load VFIO kernel module
modprobe vfio-pci
I’ve installed the vendor-reset DKMS as well to appease the VFIO gods. Just in case. But otherwise I’m lost on this.
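(To confirm vendor-reset is actually loaded, a quick check; the module name here is an assumption:)
lsmod | grep vendor_reset
dmesg | grep -i vendor_reset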
Same card, same problem. I’ve been trying to solve this for days and can’t get it to work at all. When I run my script over SSH from my phone (which, btw, looks similar to yours, just without the console unbind steps), I see that my Vega and its HDMI thingy are successfully detached, and that it started my “Win10” domain, but the only thing I get is a black screen.
Have you tried unbinding the audio driver as well?
Yes, that’s virsh nodedev-detach pci_0000_0b_00_1 in the startup hook script. Neither virsh nodedev-detach pci_0000_0b_00_0 nor virsh nodedev-detach pci_0000_0b_00_1 actually runs properly; even if I run them manually (after running the hook script) as individual commands, they just hang until I Ctrl+C.
I think the problem is that those detach commands will not run until the amdgpu module is unloaded.
If I run the complete script as sudo manually, my screens completely go black but I still can’t unload the amdgpu module.
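(A quick sanity check worth trying here: see which driver owns each function before the detach; a sketch using the addresses below:)
# If these symlinks still point at .../drivers/amdgpu, nodedev-detach will block
readlink /sys/bus/pci/devices/0000:0b:00.0/driver
readlink /sys/bus/pci/devices/0000:0b:00.1/driver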
Here’s the relevant lspci output:
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] (rev c3)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64]
Have you started the PC, but chosen not to log in to GNOME, switched to a text session (Ctrl+Alt+F3 or whatever), then closed the GUI?
There are other ways to unbind and bind the PCI devices that might be worth a shot.
For testing, you could try killing GNOME+GDM and stuff, logging in with SSH, then with sudo trying a couple of the echo commands, but using your devices’ addresses?
Hey Tree, that would be my guide you’re talking about
Like gordonthree said, you could run a script like this:
#!/bin/bash
# VGA Controller: unbind nvidia and bind vfio-pci
echo '0000:0f:00.0' > /sys/bus/pci/devices/0000:0f:00.0/driver/unbind
echo '10de 100a' > /sys/bus/pci/drivers/vfio-pci/new_id
echo '0000:0f:00.0' > /sys/bus/pci/devices/0000:0f:00.0/driver/bind
echo '10de 100a' > /sys/bus/pci/drivers/vfio-pci/remove_id
# Audio Controller: unbind snd_hda_intel and bind vfio-pci
echo '0000:0f:00.1' > /sys/bus/pci/devices/0000:0f:00.1/driver/unbind
echo '10de 0e1a' > /sys/bus/pci/drivers/vfio-pci/new_id
echo '0000:0f:00.1' > /sys/bus/pci/devices/0000:0f:00.1/driver/bind
echo '10de 0e1a' > /sys/bus/pci/drivers/vfio-pci/remove_id
I’m using the virsh tool in my scripts but this works and is a manual way of accomplishing the same task.
Also, it doesn’t matter if you’re using GPUs with the same drivers or not. The only thing to be careful about is this:
## Unload nvidia
modprobe -r nvidia_drm
modprobe -r nvidia_uvm
modprobe -r nvidia_modeset
^^ Since you’ll be using two Nvidia GPUs, you won’t be able to unload these modules because they’ll have a non-zero usage count (this wasn’t an issue for me because I had an AMD GPU for my host). So go ahead and just delete those lines for your setup.
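(The usage count lsmod prints is what decides this; a quick check, module names assumed:)
# The third lsmod column is the usage count; modprobe -r fails while it is non-zero
lsmod | grep -E '^(nvidia|amdgpu)'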
Copious, January 31, 2021, 11:54am:
Can’t believe I didn’t think of that. Yeah, exiting the display manager and ending it with systemctl stop display-manager.service killed off all the GNOME processes. So we can rule that out of the equation.
I’ll have a go with that script you mention, substituting the IDs with my own.
I did just notice there are a bunch of PCI entries other than just the GPU and its audio output. Do I need to unbind all of these? Apologies, this is my first VFIO attempt. Not sure if these are just from the motherboard or not.
user@userpc:~$ lspci | grep -e 'PCI\|VGA\|Vega'
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57ad
03:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57a3
03:05.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57a3
03:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57a4
03:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57a4
03:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57a4
09:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1470 (rev c3)
0a:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1471
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] (rev c3)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64]
0c:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
0d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
Edit: fixed my lspci output.
You’d only need to unbind the GPU’s devices, and I guess anything else in the same IOMMU group?
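(Listing the group’s members directly shows exactly what has to move together; a sketch using the GPU’s address:)
# Every device in the GPU's IOMMU group must leave the host together
ls /sys/bus/pci/devices/0000:0b:00.0/iommu_group/devices/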
Copious, January 31, 2021, 12:30pm:
Not sure if this is right, especially since it segfaults, but it does switch off my monitors as normal. I’m unsure if I’m supposed to change “10de 100a” or “10de 0e1a” to something else, since searching for either turns up forum threads about 750 Tis.
#!/bin/bash
# VGA Controller
echo '0000:0b:00.0' > /sys/bus/pci/devices/0000:0b:00.0/driver/unbind
echo '10de 100a' > /sys/bus/pci/drivers/vfio-pci/new_id
echo '0000:0b:00.0' > /sys/bus/pci/devices/0000:0b:00.0/driver/bind
echo '10de 100a' > /sys/bus/pci/drivers/vfio-pci/remove_id
# Audio Controller
echo '0000:0b:00.1' > /sys/bus/pci/devices/0000:0b:00.1/driver/unbind
echo '10de 0e1a' > /sys/bus/pci/drivers/vfio-pci/new_id
echo '0000:0b:00.1' > /sys/bus/pci/devices/0000:0b:00.1/driver/bind
echo '10de 0e1a' > /sys/bus/pci/drivers/vfio-pci/remove_id
Interestingly, if I start up my VM with virsh start win10 after this and then try modprobe -r amdgpu, I get:
modprobe: ERROR: ../libkmod/libkmod-module.c:793 kmod_module_remove_module() could not remove 'amdgpu': Device or resource busy
So it seems like something is happening. Although virsh start win10 still hangs and there doesn’t seem to be a great deal of disk activity.
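(For what it’s worth, a quick way to confirm what actually owns the card after the script runs:)
# "Kernel driver in use" should read vfio-pci for both functions if the rebind worked
lspci -nnk -s 0b:00.0
lspci -nnk -s 0b:00.1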
Copious, January 31, 2021, 1:04pm:
Never mind, it’s the device ID and vendor ID. This should be right now, but it’s still segfaulting:
#!/bin/bash
# VGA Controller
echo '0000:0b:00.0' > /sys/bus/pci/devices/0000:0b:00.0/driver/unbind
echo '1002 687F' > /sys/bus/pci/drivers/vfio-pci/new_id
echo '0000:0b:00.0' > /sys/bus/pci/devices/0000:0b:00.0/driver/bind
echo '1002 687F' > /sys/bus/pci/drivers/vfio-pci/remove_id
# Audio Controller
echo '0000:0b:00.1' > /sys/bus/pci/devices/0000:0b:00.1/driver/unbind
echo '1002 aaf8' > /sys/bus/pci/drivers/vfio-pci/new_id
echo '0000:0b:00.1' > /sys/bus/pci/devices/0000:0b:00.1/driver/bind
echo '1002 aaf8' > /sys/bus/pci/drivers/vfio-pci/remove_id
Edit: I need some sleep. Forgot to change the device ID for the HDMI audio.
Yeah, you’d have to replace the hardware IDs with the ones from your card.
lspci -nnk should give the hardware ID with a colon in square brackets:
09:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] [10de:17c8] (rev a1)
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
09:00.1 Audio device [0403]: NVIDIA Corporation GM200 High Definition Audio [10de:0fb0] (rev a1)
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
The highlighted ones are the GPU and audio IDs for me.
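(To print just those two functions with their numeric IDs, something like this works, with the slot addresses from the output above:)
# -nn adds the [vendor:device] IDs, -s limits output to one slot
lspci -nns 09:00.0
lspci -nns 09:00.1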
I find it a lot easier to use a $20 used GPU as the “main” GPU and to isolate the passthrough one. I had issues with the ROM BAR stopping passthrough for me headless, and it was a pain getting it off the card.
Have you tried isolating the PCI IDs in GRUB and running the machine completely headless, to see if it would work that way?
Copious, January 31, 2021, 1:16pm:
Not yet. What’s the GRUB option? I’ll give it a shot via SSH.
On the same line in the GRUB config, add your hardware IDs, so it’s something like:
GRUB_CMDLINE_LINUX="iommu=1 amd_iommu=on rd.driver.pre=vfio-pci vfio-pci.ids=10de:17c8,10de:0fb0"
Copious, January 31, 2021, 1:42pm:
Like this?
GRUB_CMDLINE_LINUX="iommu=1 amd_iommu=on iommu=pt rd.driver.pre=vfio-pci vfio-pci.ids=1002:687F,1002:aaf8"
Oddly my GPU is still being used as normal even after update-grub. These are 100% the right device IDs. See: PCI\VEN_1002&DEV_687F - Vega 10 XL/XT [Radeon RX Vega… | Device Hunt
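(One thing that can cause exactly this symptom: amdgpu loading from the initramfs before vfio-pci gets a chance to claim the IDs. A sketch of the usual Debian-side workaround; the file name is arbitrary:)
# /etc/modprobe.d/vfio.conf
options vfio-pci ids=1002:687f,1002:aaf8
# Ask modprobe to load vfio-pci before amdgpu
softdep amdgpu pre: vfio-pci
Then rebuild the initramfs with sudo update-initramfs -u and reboot.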
That looks right, and on my system it reserves them for vfio, with driver in use: vfio-pci.
Not sure what else to try.
Might it be something to do with the efifb?
Any mention of the IDs or the PCI address in dmesg to say what is or isn’t working at boot time?
Copious, January 31, 2021, 2:05pm:
I’m now 100% convinced my GPU is cursed.
Here’s some dmesg output; not sure if it’s what you’re looking for:
user@userpc:~$ sudo dmesg | grep 1002
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.19.0-13-amd64 root=/dev/mapper/userpc-lv_root ro iommu=1 amd_iommu=on iommu=pt rd.driver.pre=vfio-pci vfio-pci.ids=1002:687F,1002:aaf8 quiet loglevel=3 fsck.mode=auto
[ 0.172355] Kernel command line: BOOT_IMAGE=/vmlinuz-4.19.0-13-amd64 root=/dev/mapper/userpc-lv_root ro iommu=1 amd_iommu=on iommu=pt rd.driver.pre=vfio-pci vfio-pci.ids=1002:687F,1002:aaf8 quiet loglevel=3 fsck.mode=auto
[ 0.503827] pci 0000:0b:00.0: [1002:687f] type 00 class 0x030000
[ 0.504008] pci 0000:0b:00.1: [1002:aaf8] type 00 class 0x040300
[ 1.847255] [drm] initializing kernel modesetting (VEGA10 0x1002:0x687F 0x1002:0x0B36 0xC3).
[ 2.433841] Topology: Add dGPU node [0x687f:0x1002]
[ 2.433903] kfd kfd: added device 1002:687f
user@userpc:~$ sudo dmesg | grep 0000:0b:00
[ 0.503827] pci 0000:0b:00.0: [1002:687f] type 00 class 0x030000
[ 0.503850] pci 0000:0b:00.0: reg 0x10: [mem 0xd0000000-0xdfffffff 64bit pref]
[ 0.503859] pci 0000:0b:00.0: reg 0x18: [mem 0xe0000000-0xe01fffff 64bit pref]
[ 0.503865] pci 0000:0b:00.0: reg 0x20: [io 0xf000-0xf0ff]
[ 0.503871] pci 0000:0b:00.0: reg 0x24: [mem 0xfcc00000-0xfcc7ffff]
[ 0.503878] pci 0000:0b:00.0: reg 0x30: [mem 0xfcc80000-0xfcc9ffff pref]
[ 0.503890] pci 0000:0b:00.0: BAR 0: assigned to efifb
[ 0.503935] pci 0000:0b:00.0: PME# supported from D1 D2 D3hot D3cold
[ 0.504008] pci 0000:0b:00.1: [1002:aaf8] type 00 class 0x040300
[ 0.504024] pci 0000:0b:00.1: reg 0x10: [mem 0xfcca0000-0xfcca3fff]
[ 0.504099] pci 0000:0b:00.1: PME# supported from D1 D2 D3hot D3cold
[ 0.508847] pci 0000:0b:00.0: vgaarb: setting as boot VGA device
[ 0.508847] pci 0000:0b:00.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[ 0.508847] pci 0000:0b:00.0: vgaarb: bridge control possible
[ 0.535543] pci 0000:0b:00.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[ 0.535550] pci 0000:0b:00.1: Linked as a consumer to 0000:0b:00.0
[ 0.535551] pci 0000:0b:00.1: D0 power state depends on 0000:0b:00.0
[ 1.152550] iommu: Adding device 0000:0b:00.0 to group 24
[ 1.152610] iommu: Using direct mapping for device 0000:0b:00.0
[ 1.152695] iommu: Adding device 0000:0b:00.1 to group 25
[ 1.152712] iommu: Using direct mapping for device 0000:0b:00.1
[ 1.847293] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_gpu_info.bin
[ 1.847317] amdgpu 0000:0b:00.0: No more image in the PCI ROM
[ 1.847353] amdgpu 0000:0b:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[ 1.847354] amdgpu 0000:0b:00.0: GART: 512M 0x000000F600000000 - 0x000000F61FFFFFFF
[ 1.847657] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_sos.bin
[ 1.847671] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_asd.bin
[ 1.847706] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_acg_smc.bin
[ 1.847726] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_pfp.bin
[ 1.847736] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_me.bin
[ 1.847746] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_ce.bin
[ 1.847755] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_rlc.bin
[ 1.847793] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_mec.bin
[ 1.847830] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_mec2.bin
[ 1.848455] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_sdma.bin
[ 1.848467] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_sdma1.bin
[ 1.848578] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_uvd.bin
[ 1.849121] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_vce.bin
[ 2.528033] amdgpu 0000:0b:00.0: fb0: amdgpudrmfb frame buffer device
[ 2.540371] amdgpu 0000:0b:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0
[ 2.540372] amdgpu 0000:0b:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0
[ 2.540373] amdgpu 0000:0b:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0
[ 2.540374] amdgpu 0000:0b:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0
[ 2.540374] amdgpu 0000:0b:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0
[ 2.540375] amdgpu 0000:0b:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0
[ 2.540376] amdgpu 0000:0b:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0
[ 2.540377] amdgpu 0000:0b:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0
[ 2.540377] amdgpu 0000:0b:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0
[ 2.540378] amdgpu 0000:0b:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0
[ 2.540379] amdgpu 0000:0b:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1
[ 2.540380] amdgpu 0000:0b:00.0: ring 11(sdma1) uses VM inv eng 5 on hub 1
[ 2.540380] amdgpu 0000:0b:00.0: ring 12(uvd<0>) uses VM inv eng 6 on hub 1
[ 2.540381] amdgpu 0000:0b:00.0: ring 13(uvd_enc0<0>) uses VM inv eng 7 on hub 1
[ 2.540382] amdgpu 0000:0b:00.0: ring 14(uvd_enc1<0>) uses VM inv eng 8 on hub 1
[ 2.540382] amdgpu 0000:0b:00.0: ring 15(vce0) uses VM inv eng 9 on hub 1
[ 2.540383] amdgpu 0000:0b:00.0: ring 16(vce1) uses VM inv eng 10 on hub 1
[ 2.540384] amdgpu 0000:0b:00.0: ring 17(vce2) uses VM inv eng 11 on hub 1
[ 2.541241] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:0b:00.0 on minor 0
[ 15.243902] snd_hda_intel 0000:0b:00.1: Handle vga_switcheroo audio client
[ 15.283308] input: HD-Audio Generic HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/sound/card0/input12
[ 15.283391] input: HD-Audio Generic HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/sound/card0/input13
[ 15.283483] input: HD-Audio Generic HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/sound/card0/input14
[ 15.283538] input: HD-Audio Generic HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/sound/card0/input15
[ 15.283643] input: HD-Audio Generic HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/sound/card0/input16
[ 15.283746] input: HD-Audio Generic HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/sound/card0/input17
You might try the efifb=off or whatever in the GRUB command line?
video=efifb:off
From here:
Figured I would document my recent success in getting a Threadripper system to pass through both GPUs to VMs without requiring a third video card, or any kernel patches. This will be just a quick rundown for those that are looking to do the same, I will not be providing help or support on this, but it should get you started should you be looking to do the same.
Requirements:
The primary video card MUST be something that can be re-posted, such as Intel or NVidia
If you are using a VEGA, it must be the secondary card and must NOT post at boot as at this time FLR on this card is broken.
If your primary is an NVidia card and it won’t post first, enable CSM; this was the only way I could achieve this on my Gigabyte AORUS X399 Gaming 7.
Follow the usual method of setting up your host for VFIO, be sure you have a method of remote access (SSH) as once it’s booted headless you will not be able to use the machine.
As many have found, blacklisting the primary video driver is not enough to allow the video device to be passed through to a VM as the efifb kernel driver will mmap the BAR of the video card. To prevent this you can add the following to your kernel boot arguments:
video=efifb:off
Once this is done, after the bios boots the kernel, you should see no further screen updates, the system will appear frozen, but it isn’t. The card is now free to be passed through into a VM.
If you are using an NVidia card as your primary device, the VGA BIOS will not be accessible after posting; to work around this, dump the BIOS first and provide it to the vfio device in qemu as romfile=/path/to/the/vgabios.rom. You can use the nouveau driver to dump the ROM.
Using the above information I have two VM’s operating, one running Windows 10 with a GeForce 1080Ti, and the other running Linux on an AMD Vega 56. The host is headless as it now has no VGA adapter at all.
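(To verify the option actually took effect, checking the registered framebuffers is a quick test:)
# With video=efifb:off in place, no EFI framebuffer should be listed
cat /proc/fb
dmesg | grep -i efifb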
Copious, January 31, 2021, 3:50pm:
Trooper_ish: video=efifb:off
Nope. It’s still starting as normal.
Grub config:
GRUB_CMDLINE_LINUX="iommu=1 amd_iommu=on iommu=pt rd.driver.pre=vfio-pci vfio-pci.ids=1002:687F,1002:aaf8 video=efifb:off"
Here are the dmesg logs again:
user@userpc:~$ sudo dmesg | grep 1002
[sudo] password for user:
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.19.0-13-amd64 root=/dev/mapper/userpc-lv_root ro iommu=1 amd_iommu=on iommu=pt rd.driver.pre=vfio-pci vfio-pci.ids=1002:687F,1002:aaf8 video=efifb:off quiet loglevel=3 fsck.mode=auto
[ 0.169958] Kernel command line: BOOT_IMAGE=/vmlinuz-4.19.0-13-amd64 root=/dev/mapper/userpc-lv_root ro iommu=1 amd_iommu=on iommu=pt rd.driver.pre=vfio-pci vfio-pci.ids=1002:687F,1002:aaf8 video=efifb:off quiet loglevel=3 fsck.mode=auto
[ 0.505647] pci 0000:0b:00.0: [1002:687f] type 00 class 0x030000
[ 0.505829] pci 0000:0b:00.1: [1002:aaf8] type 00 class 0x040300
[ 1.851449] [drm] initializing kernel modesetting (VEGA10 0x1002:0x687F 0x1002:0x0B36 0xC3).
[ 2.439143] Topology: Add dGPU node [0x687f:0x1002]
[ 2.439206] kfd kfd: added device 1002:687f
user@userpc:~$ sudo dmesg | grep 0000:0b:00
[ 0.505647] pci 0000:0b:00.0: [1002:687f] type 00 class 0x030000
[ 0.505671] pci 0000:0b:00.0: reg 0x10: [mem 0xd0000000-0xdfffffff 64bit pref]
[ 0.505680] pci 0000:0b:00.0: reg 0x18: [mem 0xe0000000-0xe01fffff 64bit pref]
[ 0.505686] pci 0000:0b:00.0: reg 0x20: [io 0xf000-0xf0ff]
[ 0.505692] pci 0000:0b:00.0: reg 0x24: [mem 0xfcc00000-0xfcc7ffff]
[ 0.505698] pci 0000:0b:00.0: reg 0x30: [mem 0xfcc80000-0xfcc9ffff pref]
[ 0.505711] pci 0000:0b:00.0: BAR 0: assigned to efifb
[ 0.505757] pci 0000:0b:00.0: PME# supported from D1 D2 D3hot D3cold
[ 0.505829] pci 0000:0b:00.1: [1002:aaf8] type 00 class 0x040300
[ 0.505845] pci 0000:0b:00.1: reg 0x10: [mem 0xfcca0000-0xfcca3fff]
[ 0.505920] pci 0000:0b:00.1: PME# supported from D1 D2 D3hot D3cold
[ 0.510211] pci 0000:0b:00.0: vgaarb: setting as boot VGA device
[ 0.510211] pci 0000:0b:00.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[ 0.510211] pci 0000:0b:00.0: vgaarb: bridge control possible
[ 0.536690] pci 0000:0b:00.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[ 0.536697] pci 0000:0b:00.1: Linked as a consumer to 0000:0b:00.0
[ 0.536697] pci 0000:0b:00.1: D0 power state depends on 0000:0b:00.0
[ 1.149254] iommu: Adding device 0000:0b:00.0 to group 24
[ 1.149315] iommu: Using direct mapping for device 0000:0b:00.0
[ 1.149398] iommu: Adding device 0000:0b:00.1 to group 25
[ 1.149415] iommu: Using direct mapping for device 0000:0b:00.1
[ 1.851489] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_gpu_info.bin
[ 1.851514] amdgpu 0000:0b:00.0: No more image in the PCI ROM
[ 1.851549] amdgpu 0000:0b:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[ 1.851550] amdgpu 0000:0b:00.0: GART: 512M 0x000000F600000000 - 0x000000F61FFFFFFF
[ 1.851838] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_sos.bin
[ 1.851852] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_asd.bin
[ 1.851888] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_acg_smc.bin
[ 1.851907] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_pfp.bin
[ 1.851918] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_me.bin
[ 1.851926] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_ce.bin
[ 1.851935] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_rlc.bin
[ 1.851971] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_mec.bin
[ 1.852006] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_mec2.bin
[ 1.852606] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_sdma.bin
[ 1.852617] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_sdma1.bin
[ 1.852724] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_uvd.bin
[ 1.853260] amdgpu 0000:0b:00.0: firmware: direct-loading firmware amdgpu/vega10_vce.bin
[ 2.533308] amdgpu 0000:0b:00.0: fb0: amdgpudrmfb frame buffer device
[ 2.545680] amdgpu 0000:0b:00.0: ring 0(gfx) uses VM inv eng 4 on hub 0
[ 2.545681] amdgpu 0000:0b:00.0: ring 1(comp_1.0.0) uses VM inv eng 5 on hub 0
[ 2.545682] amdgpu 0000:0b:00.0: ring 2(comp_1.1.0) uses VM inv eng 6 on hub 0
[ 2.545683] amdgpu 0000:0b:00.0: ring 3(comp_1.2.0) uses VM inv eng 7 on hub 0
[ 2.545683] amdgpu 0000:0b:00.0: ring 4(comp_1.3.0) uses VM inv eng 8 on hub 0
[ 2.545684] amdgpu 0000:0b:00.0: ring 5(comp_1.0.1) uses VM inv eng 9 on hub 0
[ 2.545685] amdgpu 0000:0b:00.0: ring 6(comp_1.1.1) uses VM inv eng 10 on hub 0
[ 2.545685] amdgpu 0000:0b:00.0: ring 7(comp_1.2.1) uses VM inv eng 11 on hub 0
[ 2.545686] amdgpu 0000:0b:00.0: ring 8(comp_1.3.1) uses VM inv eng 12 on hub 0
[ 2.545687] amdgpu 0000:0b:00.0: ring 9(kiq_2.1.0) uses VM inv eng 13 on hub 0
[ 2.545688] amdgpu 0000:0b:00.0: ring 10(sdma0) uses VM inv eng 4 on hub 1
[ 2.545688] amdgpu 0000:0b:00.0: ring 11(sdma1) uses VM inv eng 5 on hub 1
[ 2.545689] amdgpu 0000:0b:00.0: ring 12(uvd<0>) uses VM inv eng 6 on hub 1
[ 2.545690] amdgpu 0000:0b:00.0: ring 13(uvd_enc0<0>) uses VM inv eng 7 on hub 1
[ 2.545690] amdgpu 0000:0b:00.0: ring 14(uvd_enc1<0>) uses VM inv eng 8 on hub 1
[ 2.545691] amdgpu 0000:0b:00.0: ring 15(vce0) uses VM inv eng 9 on hub 1
[ 2.545692] amdgpu 0000:0b:00.0: ring 16(vce1) uses VM inv eng 10 on hub 1
[ 2.545693] amdgpu 0000:0b:00.0: ring 17(vce2) uses VM inv eng 11 on hub 1
[ 2.546554] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:0b:00.0 on minor 0
[ 14.013683] snd_hda_intel 0000:0b:00.1: Handle vga_switcheroo audio client
[ 14.026813] input: HD-Audio Generic HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/sound/card0/input12
[ 14.026871] input: HD-Audio Generic HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/sound/card0/input13
[ 14.026953] input: HD-Audio Generic HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/sound/card0/input14
[ 14.027029] input: HD-Audio Generic HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/sound/card0/input15
[ 14.027111] input: HD-Audio Generic HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/sound/card0/input16
[ 14.027167] input: HD-Audio Generic HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.1/0000:09:00.0/0000:0a:00.0/0000:0b:00.1/sound/card0/input17
I haven’t had time to read through the whole post and I have to take off soon, so hopefully this isn’t pointing you in the wrong direction, but the last “echo” command for both the VGA and Audio devices is unloading the vfio-pci driver; I would expect that in this case either the normal driver (e.g. nouveau or nvidia or whatever) is then loaded, or the device will have no driver at all. I would remove the last “echo” line for both devices, the one writing to the /sys/bus/pci/drivers/vfio-pci/remove_id file, and see if that works. If it doesn’t, I’d also remove the /sys/bus/pci/drivers/vfio-pci/new_id command too. I’m new to this, but from my research I think you only need to use the PCI bus address OR the device ID, not both. So:
echo '0000:0b:00.0' > /sys/bus/pci/devices/0000:0b:00.0/driver/unbind  # unbinds the currently loaded driver, e.g. nvidia
echo '0000:0b:00.0' > /sys/bus/pci/devices/0000:0b:00.0/driver/bind  # loads the vfio-pci driver; this is all that I think is needed
One other thing of note in general (i.e. it doesn’t apply to this specific case): you can also use virsh commands to accomplish the same, but I’ve been having problems where the first time I use the virsh command it cannot find the device and fails; for some reason the second or third time it is able to locate the device and works. So if you use virsh commands in a script, do some checking to see whether it succeeded and, if not, try again. Use a loop or something to try 5 times is my suggestion, something like the sketch below.
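(A sketch of that retry loop; device name assumed:)
#!/bin/bash
# Retry virsh nodedev-detach up to five times, since the first attempt can fail
DEV=pci_0000_0b_00_0
for attempt in 1 2 3 4 5; do
    if virsh nodedev-detach "$DEV"; then
        echo "detached $DEV on attempt $attempt"
        exit 0
    fi
    sleep 1
done
echo "failed to detach $DEV after 5 attempts" >&2
exit 1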