VFIO, Single GPU passthrough, Vega 56 - Black Screen only, PopOS 20.10

Okay, here we go again.
I've been trying to get single GPU passthrough to work for days, but all I get is a black screen. I definitely need help, as I can't figure this out alone anymore.

My system:
MSI X99A Krait, Core i7 5820k, Powercolor Vega 56 Red Dragon, PopOS 20.10 (Tried also with Ubuntu 20.04 but same issue)

IOMMU is enabled in BIOS, as well as in systemd-boot which is used by PopOS.
The boot arguments are:
quiet loglevel=0 systemd.show_status=false splash intel_iommu=on iommu=pt

iommu=pt is needed in my case; otherwise my keyboard doesn't work at boot. Don't ask me why.
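A quick way to confirm that the running kernel actually received these arguments is to look at /proc/cmdline. The following is a minimal sketch; the helper name has_kernel_arg is made up for illustration, and it only checks that the argument was passed, not that the IOMMU is really active.

```shell
# Hypothetical helper: succeed if the given argument appears as a whole
# word in a kernel command line (defaults to the running kernel's).
has_kernel_arg() (
    set -f                      # no glob expansion while word-splitting
    arg=$1
    cmdline=${2-$(cat /proc/cmdline 2>/dev/null)}
    for word in $cmdline; do
        [ "$word" = "$arg" ] && exit 0
    done
    exit 1
)

# Report the two IOMMU-related arguments from the post.
for arg in intel_iommu=on iommu=pt; do
    if has_kernel_arg "$arg"; then
        echo "present: $arg"
    else
        echo "missing: $arg"
    fi
done
```

The second parameter exists only so the function can be tried against a pasted command line instead of the live one.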

Vendor_reset is installed and loaded at boot, verified via lsmod.

My start script looks like this:

sudo systemctl stop display-manager.service


sleep 2
    
sudo virsh nodedev-detach pci_0000_05_00_0 #release the GPU
sudo virsh nodedev-detach pci_0000_05_00_1 #release the GPU audio device
sleep 2
sudo modprobe vfio-pci 

virsh start win10

I run the script over SSH from my phone so I can watch the output live.

It says that GPU and HDMI audio are successfully detached and that the VM has started.

Running lspci -k while the VM is up confirms that vfio-pci is indeed bound to the two devices.
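For anyone who wants to script that check, here is a small sketch that pulls the "Kernel driver in use" line out of lspci -k. The function name driver_in_use is hypothetical; the addresses 05:00.0 and 05:00.1 are the ones from the start script above.

```shell
# Hypothetical helper: print the "Kernel driver in use" value from
# `lspci -k` output on stdin. Prints nothing when no driver is bound.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use:[[:space:]]*//p'
}

# Check both functions of the card (GPU and its HDMI audio).
if command -v lspci >/dev/null 2>&1; then
    for dev in 05:00.0 05:00.1; do
        echo "$dev: $(lspci -k -s "$dev" | driver_in_use)"
    done
fi
```

An empty value after the address would mean that no driver is bound to that function at the moment.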

In QEMU I have of course passed through both of these devices, as well as mouse and keyboard, and got rid of the SPICE stuff.
I also pointed to the right BIOS ROM file via XML editing, which sits in /usr/share/vgabios.
I tried the correct one from techpowerup and my own one extracted via gpuz on Windows, or even completely without pointing to a bios file, no joy.

I also tried various other things, for example switching my BIOS from UEFI+Legacy to UEFI only. Also no joy, always a black screen.

kvm-ok gives:
INFO: /dev/kvm exists
KVM acceleration can be used

When I run htop via SSH while the VM is (apparently) started, it does show up there. I can't add another picture, so basically it says that the VM is running and that it uses 100% CPU on one core and 24 GB of memory, which is what I allocated.

I also modified my start script once so that it looked like this, to unbind the devices from console.

sudo systemctl stop display-manager.service


sleep 2
#Unbind framebuffer
printf 0 > /sys/class/vtconsole/vtcon0/bind || printf 'Failed to unbind vtcon0.\n'
printf 0 > /sys/class/vtconsole/vtcon1/bind || printf 'Failed to unbind vtcon1.\n'
printf 'efi-framebuffer.0' > /sys/bus/platform/drivers/efi-framebuffer/unbind || printf 'Failed to unbind efi-framebuffer.\n'
    
sudo modprobe vfio-pci 
#newstuff    
# Vega 10 XL/XT
printf '1002 687f' > /sys/bus/pci/drivers/vfio-pci/new_id
printf '0000:05:00.0' > /sys/bus/pci/drivers/amdgpu/unbind || printf 'Failed to unbind gpu from amdgpu.\n'
printf '0000:05:00.0' > /sys/bus/pci/drivers/vfio-pci/bind || printf 'Failed to bind gpu to vfio-pci.\n'
printf '1002 687f' > /sys/bus/pci/drivers/vfio-pci/remove_id

# Vega 10 HDMI Audio
printf '1002 aaf8' > /sys/bus/pci/drivers/vfio-pci/new_id
printf '0000:05:00.1' > /sys/bus/pci/drivers/snd_hda_intel/unbind || printf 'Failed to unbind gpu-audio from snd_hda_intel.\n'
printf '0000:05:00.1' > /sys/bus/pci/drivers/vfio-pci/bind || printf 'Failed to bind gpu-audio to vfio-pci.\n'
printf '1002 aaf8' > /sys/bus/pci/drivers/vfio-pci/remove_id
    
    
#sudo virsh nodedev-detach pci_0000_05_00_0 #release the GPU
#sudo virsh nodedev-detach pci_0000_05_00_1 #release the GPU audio device
sleep 2


virsh start win10

Guess what - didn't work.
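One thing that can be checked over SSH after such a script has run is whether the virtual consoles really got unbound. This is only a sketch, assuming the usual sysfs layout that the script above writes to; vtcon_status is a made-up name, and the optional argument exists only so the function can be pointed at a test directory.

```shell
# Hypothetical helper: print "<name> <bind value>" for every virtual
# console under the given sysfs root (defaults to the real /sys).
# A bind value of 0 means the console is unbound.
vtcon_status() (
    root=${1:-/sys}
    for f in "$root"/class/vtconsole/vtcon*/bind; do
        [ -r "$f" ] || continue
        name=$(basename "$(dirname "$f")")
        echo "$name $(cat "$f")"
    done
)

vtcon_status
```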

I am ABSOLUTELY lost and about to give up, you experts are my only hope.
Help is greatly appreciated!

Regards!

Single GPU is an edge case I don’t mess with much, but I’ve heard you need to load a ROM for your GPU since it’s already initialized. You might have to use vendor-reset to reset the GPU and maybe also load an external copy of the ROM instead of the one from the card?

I said that in my post good Sir :smiley:
“I also pointed to the right BIOS ROM file via XML editing, which sits in /usr/share/vgabios.
I tried the correct one from techpowerup and my own one extracted via gpuz on Windows, or even completely without pointing to a bios file, no joy.”

“Vendor_reset is installed and loaded at boot, verified via lsmod.”

That is exactly the reason why I'm stuck.

Everything SEEMS to be loaded correctly and running, but it does not display anything at all. And nobody can help me.

It also can't be the IOMMU groups, since on X99 everything has its own group.
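The grouping can be double-checked by walking /sys/kernel/iommu_groups. Below is a minimal sketch of that commonly used loop; the function name list_iommu_groups and its optional root argument (there only to allow testing against a fake tree) are assumptions for illustration.

```shell
# Hypothetical helper: list every PCI device with its IOMMU group,
# one "<group> <address>" pair per line.
# The optional argument is the sysfs root (defaults to /sys).
list_iommu_groups() (
    root=${1:-/sys}
    for dev in "$root"/kernel/iommu_groups/*/devices/*; do
        [ -e "$dev" ] || continue
        group=$(basename "$(dirname "$(dirname "$dev")")")
        echo "$group $(basename "$dev")"
    done
)

# Sort numerically by group number for readability.
list_iommu_groups | sort -n
```

If nothing is printed, the IOMMU is most likely not enabled for the running kernel.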

I’d try another couple of ROM files. Did you make it yourself or get it from TechPowerUp?

What's going on :smiley:

“I tried the correct one from techpowerup and my own one extracted via gpuz on Windows, or even completely without pointing to a bios file, no joy.”

That’s good, you’ve tried all the usual tricks. Though sometimes it pays to try even more BIOSes if you can track them down.

Do you have a temporary graphics card you can borrow and get it working as the secondary and then work on making it the only gpu? That’s about all I can think to tell you next.

you don’t even see the tianocore or anything right, when you boot the vm?

tried adding nomodeset to your kernel command line?
nofb nomodeset and made sure via lsmod no vesa fb/etc modules have loaded?
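Since lsmod only lists loadable modules, a framebuffer driver built into the kernel would not show up there. As a complement, /proc/fb lists the framebuffer devices that are currently registered. A minimal sketch, with framebuffer_names being a made-up helper name:

```shell
# Hypothetical helper: print the names of registered framebuffer devices,
# one per line. Reads /proc/fb by default; each line there has the form
# "<index> <name>".
framebuffer_names() (
    file=${1:-/proc/fb}
    [ -r "$file" ] || exit 0
    sed 's/^[0-9][0-9]*[[:space:]]*//' "$file"
)

names=$(framebuffer_names)
if [ -n "$names" ]; then
    echo "framebuffers still registered:"
    echo "$names"
else
    echo "no framebuffer registered"
fi
```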

Do your BIOS settings have you booting in EFI mode with CSM disabled? You say you tried UEFI+Legacy, but usually CSM disabled is its own option. And, ideally, EFI with CSM disabled is what you want here.

Tried that once as well, to no avail.

Regarding EFI/CSM: I read some conflicting stuff. One guy on Reddit pointed out that only CSM worked for him, so I tried all available options.

These options for me are “UEFI” and “UEFI+LEGACY”.

And no, I don't see TianoCore or anything, it's really just a black screen.

Is that useful for disabling CSM? The only way having CSM on could fix anything is if it prevents the UEFI from initializing the graphics card into graphics mode. Some BIOSes do that so the BIOS “looks good” in high res in some framebuffer mode.

So maybe also disable “full screen logo display” if you can find it, so it’s just a text mode bios screen without the msi logo or anything like that.

And maybe you can find the above csm option to disable it. “The idea” is to prevent putting the gpu into higher modes, however we want to accomplish that.

Bios is fully up to date? I have had some trouble using passthrough in these edge cases on older hardware.

Adding a temporary gpu to verify passthrough can work is probably the most logical next step. Once it is working, then work on modifying the config to work with one gpu.

If it isn’t some h/w incompatibility, the problem is likely around the BIOS initialization of the GPU, if we really have ruled out the kernel initialization of the GPU.

Is the text mode you see on screen ye olde standard 80x25 text screen? If not, then something is still initing the GPU, and we don’t want that.

UEFI with csm disabled, and those kernel options should for sure be there.

When I use the Windows 8.1 etc. option, the only thing it does is grey out the UEFI+LEGACY option, leaving it stuck on UEFI only.

I do have the option to use a 1660 Super to test vfio with two gpus.

However, if I do that, I’d like to put it in my first PCIe slot for other reasons and use the Vega as secondary.

That shouldn’t be a problem right?

should be fine

When I do that and put in the 1660 in slot 1 and the Vega in slot 2 the OS only sees my Vega …
They do work fine in Windows, so …

Oh gosh …

So when I boot, the signal comes from the 1660 first, then while PopOS is booting it switches to the Vega, and lspci does not even list the 1660. Also my keyboard behaves funky: I have to press every key slowly or it … “chokes”.

WAT

IT RUNS!
IT FINALLY RUNS!

Four days of work … finally!

With two GPUs it's SO much easier.
I used this guide for blacklisting and binding vfio-pci to the Vega. He mentions the nvidia drivers, but I just swapped the commands out and used “amdgpu”: Blacklisting Graphics Driver - Heiko's Blog

I don't have sound yet but … I think I can work the rest out. If not, I'll ask.

Ignore the absolute mess in the picture, but I'm just so happy I wanted to share :smiley:

Edit: Sound is working.
However: I passed through the audio chip of my mainboard, and when I turn off the VM it doesn't give me sound anymore on Linux.
Also: I can only use the VM once. When I shut it down and want to start it back up, nothing happens (or virt-manager crashes).

I DO have vendor-reset installed and loaded, but it doesn't seem to be working at all.

If I can figure these last two things out I'm pretty much good to go.
Any tips?

I sorted out the reset issue.
Before, I tried to bind my GPU and audio to vfio-pci through initramfs-tools; that didn't work at all.
The reason why I could still start the VM is (I guess!) that virt-manager grabbed what I told it to grab, and only THEN did Linux attach the vfio-pci driver. That would have been fine in itself, but vfio-pci not being loaded at boot meant that vendor-reset couldn't do its thing.
So what I did was:
sudo kernelstub -a vfio-pci.ids=1002:687f,1002:aaf8
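The argument is just the vendor:device IDs joined with commas. As a sketch, a tiny helper (vfio_ids_arg is a made-up name) can build it and print the kernelstub command instead of running it, so nothing is changed by accident:

```shell
# Hypothetical helper: build a vfio-pci.ids=... kernel argument from
# vendor:device IDs given as separate parameters.
vfio_ids_arg() (
    IFS=,                       # "$*" joins the parameters with commas
    echo "vfio-pci.ids=$*"
)

# IDs of the Vega and its HDMI audio, as used in the command above.
arg=$(vfio_ids_arg 1002:687f 1002:aaf8)

# Dry run: only print the command.
echo "sudo kernelstub -a $arg"
```

After a reboot, the new argument should show up in /proc/cmdline.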

So that is working now. BUT:

The VM crashes a few minutes in. It's seemingly random; I feel it has to do with my CPU settings in virt-manager. I have a 5820K, 6 cores, 12 threads.
But as I said, just a feeling. I don't really know where to start.
The ROM file also doesn't make any difference, whether I give it one or leave it out completely.

So … help again? :slight_smile:

I got my single gpu pass-through working with:

#!/bin/bash
systemctl stop display-manager
modprobe vfio-pci
virsh nodedev-detach <PCI devices>
./start_vm.sh

I have a running VM now with 2 GPUs, as I said in the post above, still with the Vega as guest GPU though.

However, after a couple of minutes it ALWAYS crashes and I don't know what the problem is. I also don't have an “entry point” to diagnose it.

After fiddling around a lot I figured it has something to do with the GPU, since it always crashes when I tax the card even slightly.
For example, opening OBS or even opening Radeon Settings. When I'm just on the Windows desktop it either doesn't crash at all or takes very long to do so; it only freezes when I engage something that taxes the GPU.

Sometimes it then unfreezes and recovers after a minute or so, but mostly not. I then have to “virsh destroy win10” to get back to Linux.
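A possible entry point for diagnosing such freezes is the host kernel log plus the per-domain QEMU log that libvirt writes. The sketch below makes two assumptions: the log path /var/log/libvirt/qemu/win10.log is the usual default for a system-level domain named win10, and the keyword list is only a guess at what might be relevant. The helper name filter_passthrough_lines is made up.

```shell
# Hypothetical helper: keep only log lines that mention
# passthrough-related subsystems (case-insensitive).
filter_passthrough_lines() {
    grep -Ei 'vfio|iommu|dmar|amdgpu' || true   # no match is not an error
}

# Host kernel log (may need root to read).
if command -v dmesg >/dev/null 2>&1; then
    dmesg 2>/dev/null | filter_passthrough_lines | tail -n 20
fi

# libvirt's QEMU log for the domain (assumed default location).
log=/var/log/libvirt/qemu/win10.log
if [ -r "$log" ]; then
    filter_passthrough_lines < "$log" | tail -n 20
fi
```

Running this over SSH right after a freeze, before “virsh destroy win10”, would show whether the host logged anything at that moment.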

As I said, I don't know what to do.

Alright, so I got it all working now, it SEEMS.
The solution in the end was to pass through the 1660S, sitting in the 2nd PCIe slot, with the BIOS set to “UEFI”, NOT the “Windows 8/10 compatible” mode @wendell suggested. That mode caused PopOS to not boot anymore; it hung at “Gnome 3 started” (although it was installed in UEFI mode).

I tried to force it to use my Vega in slot 1 via /etc/X11/xorg.conf.d/Vega.conf, basically creating an Xorg config for it and pointing it to the PCI bus the Vega was using, while using “video=efifb:off” in the kernel boot options.
Didn't work. Don't ask me why. Couldn't boot at all.

Now my config that WORKS looks like this:

Vega 56 in slot 1, connected to HDMI
1660S in slot 2, connected via DVI to the same monitor

Kernel options:
intel_iommu=on vfio-pci.ids=10de:21c4,10de:1aeb,10de:1aec,10de:1aed splash quiet iommu=pt loglevel=0 video=efifb:off systemd.show_status=false vfio_iommu_type1.allow_unsafe_interrupts=1

No fiddling at all with modprobe or anything else. The kernel gets loaded first, so why mess around with stuff that gets loaded after the kernel? That's what I thought, anyway. Note that, contrary to basically all tutorials out there, I needed to pass through FOUR IDs: not just the GPU itself and the HDMI audio, but also the integrated USB(C) controllers. All of these are in one IOMMU group, in my case number 33.
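To collect all IDs of a multi-function card in one go, the [vendor:device] pairs can be pulled out of lspci -nn. This is a sketch only: pci_ids is a made-up helper name, and the slot address 01:00 is a placeholder, not the address from this system.

```shell
# Hypothetical helper: extract the [vendor:device] IDs from `lspci -nn`
# output on stdin, one ID per line. The class code (e.g. [0300]) has no
# colon, so it is not matched.
pci_ids() {
    grep -Eo '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | sed 's/[][]//g'
}

# All functions of the card at bus 01, device 00 (placeholder address),
# joined with commas so the result can go into vfio-pci.ids=...
if command -v lspci >/dev/null 2>&1; then
    lspci -nn -s 01:00 | pci_ids | paste -sd, -
fi
```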

My VM XML looks like this (only relevant parts):

<domain type="kvm">
  <name>win10</name>
  <uuid>23385e20-1ef4-4798-80a7-31166a5c1d49</uuid>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://microsoft.com/win/10"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit="KiB">24576000</memory>
  <currentMemory unit="KiB">24576000</currentMemory>
  <vcpu placement="static" current="6">48</vcpu>
  <os>
    <type arch="x86_64" machine="pc-q35-5.0">hvm</type>
    <loader readonly="yes" type="pflash">/usr/share/OVMF/OVMF_CODE_4M.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/win10_VARS.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state="on"/>
      <vapic state="on"/>
      <spinlocks state="on" retries="8191"/>
      <vendor_id state="on" value="WowWhataVM"/>
    </hyperv>
    <kvm>
      <hidden state="on"/>
    </kvm>
    <vmport state="off"/>
    <ioapic driver="kvm"/>
  </features>
  <cpu mode="custom" match="exact" check="none">
    <model fallback="forbid">Haswell-noTSX</model>
    <topology sockets="6" dies="1" cores="1" threads="8"/>
    <feature policy="require" name="ibpb"/>
    <feature policy="require" name="md-clear"/>
    <feature policy="require" name="spec-ctrl"/>
    <feature policy="require" name="ssbd"/>
  </cpu>
  <clock offset="localtime">
    <timer name="rtc" tickpolicy="catchup"/>
    <timer name="pit" tickpolicy="delay"/>
    <timer name="hpet" present="no"/>
    <timer name="hypervclock" present="yes"/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled="no"/>
    <suspend-to-disk enabled="no"/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type="file" device="disk">
      <driver name="qemu" type="qcow2"/>
      <source file="/var/lib/libvirt/images/win10.qcow2"/>
      <target dev="sda" bus="sata"/>
      <boot order="1"/>

Note the included vendor ID, hidden state flag and ioapic driver="kvm" line to avoid error 43 during driver installation in Windows.

Also interesting: I thought I had to blacklist the nvidia driver (I work with nouveau because in Linux I don't need supermega GPU performance anyway), but that wasn't the case. Actually it was counterproductive, as it messed with PopOS and again left it unbootable. This was illogical because I didn't use the 1660S as a boot GPU, at that point there wasn't even a monitor cable attached, but it is what it is.
So if you're working with nouveau, don't blacklist it, just add the PCI IDs to the kernel options to load vfio-pci.
The lspci output for it now looks like this, if anyone is curious:

Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau

So nouveau is listed as a module that could handle the card, but it isn't in use, I guess because vfio-pci claimed it before anything else.

Conclusion:
I think I found the easiest way of doing vfio things.
There are a lot of people doing all kinds of hooks and scripts and whatnot, but idk if that's really necessary. Just include what you need in the GRUB or systemd-boot kernel arguments (I had to work with kernelstub, you might have to edit the GRUB config file).
Also: Vega doesn't like vfio at all, the 1660S is flawless.
Another thing I noticed (but that might be mainboard specific): using the GPU in the first PCIe slot for vfio is always trickier than using one in slot 2 (or 3), even if you blacklist its drivers etc.
What is also resolved now is the sound problem I mentioned in an earlier post, just by switching to NVidia.
Also, I didn't need to link to a VBIOS ROM file anymore.

My initial goal of getting single GPU passthrough to work, however, failed. That's what I tried at first. Since then some hours have passed, I learned and read a lot, and my wife asked me why the hell I'm doing all this :upside_down_face:
It could be that it failed because of my specific hardware config which, as stated above, doesn't like vfio with GPUs in slot 1. But single GPU stuff also seems to be much harder to realize in general, as you NEED unbind scripts for the console, for the framebuffer etc., and that creates a lot of unneeded … stress.

I'm now off playing games :smiley:
If anyone has any questions, I am very happy to answer if I can.
This is a VERY difficult topic, especially for someone like me, whose biggest achievement in Linux before was to set up a VPS and install some snaps on it.
I learned a lot and I'm happy to pass on what I now know.

Thanks everyone for the help, especially “Mr. VFIO” @wendell :slight_smile: