Linux host; GPU passthrough fails for no apparent reason

Hey guys,
I’ve been trying to set up a Windows VM for a solid 3 days now.
I’ve managed to set the actual VM up, but for some reason, my GPU isn’t being passed through to the VM.
Oddly enough, the audio device attached to the GPU works perfectly.

So here’s my full situation: Ubuntu 19.10, kernel version 5.3.0-23, ASRock X470 Taichi Ultimate (BIOS version 3.30), Ryzen 7 1700, RX 550 (host GPU), GTX 1080 Ti (the GPU I’m trying to pass through). I hope that’s all the relevant information.

I’ve tried a good number of things, but so far the only thing I’ve managed to fix is the GPU audio.

The only clue I have for what might be causing my troubles is the dmesg output for vfio.
At boot I get expected stuff, namely

[    4.349582] vfio-pci 0000:0e:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    4.368092] vfio_pci: add [10de:1b06[ffffffff:ffffffff]] class 0x000000/00000000
[    4.392444] vfio_pci: add [10de:10ef[ffffffff:ffffffff]] class 0x000000/00000000

But if I try to launch the VM, I get more lines. Most of them seem normal, but two lines confuse me…

[   50.620484] vfio-pci 0000:0e:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[   50.644065] vfio-pci 0000:0e:00.1: enabling device (0000 -> 0002)
[   51.847825] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   51.848054] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.015179] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.015664] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   53.024259] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.024742] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   53.029768] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.030250] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   53.033151] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.037588] vfio-pci 0000:0e:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[   53.037737] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.044279] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.048365] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.048410] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.048466] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   53.048702] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   53.048746] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   53.048966] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.049011] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   53.051824] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.051872] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   53.057102] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.057147] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.057192] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   53.057237] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs

The lines [ 50.620484] vfio-pci 0000:0e:00.0: vfio_ecap_init: hiding ecap 0x19@0x900 and [ 53.037588] vfio-pci 0000:0e:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff confuse me, and I think they might be a clue as to what’s going on.
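
For anyone trying to read a log like this: most of it is the same vfio_bar_restore line repeated, so collapsing duplicates makes the odd ones stand out. Here’s a minimal Python sketch that does that for text copied out of dmesg. The function name summarize_vfio and the regex are my own, and it assumes the default dmesg timestamp format shown above.

```python
import re
from collections import Counter

# Matches lines such as:
# [   53.037588] vfio-pci 0000:0e:00.0: Invalid PCI ROM header signature: ...
# Group 1 is the PCI address, group 2 is the message text.
LINE_RE = re.compile(r"^\[\s*\d+\.\d+\]\s+vfio-pci\s+(\S+?):\s+(.*)$")


def summarize_vfio(dmesg_text):
    """Count identical vfio-pci messages per device (hypothetical helper)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Small excerpt of the log from this post, pasted in as a string.
sample = """\
[   50.620484] vfio-pci 0000:0e:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[   51.847825] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   51.848054] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.015179] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   53.037588] vfio-pci 0000:0e:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
"""

for (device, message), count in summarize_vfio(sample).items():
    print(f"{count:3d}x {device}: {message}")
```

Messages that show up only once (here the ecap and ROM signature lines) are usually the ones worth searching for.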

lspci -nnv also gives a half-expected, half-odd result…

0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: vfio-pci
	Kernel modules: nvidiafb, nouveau

0e:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel

Not sure what to make of the header type, but at least I know vfio has claimed the card.
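
As far as I know, "rev ff", "prog-if ff" and "Unknown header type 7f" together usually mean that reads from the device’s PCI config space are coming back as all ones, i.e. the card isn’t answering on the bus even though vfio-pci is bound to it (7f is just 0xff with the multi-function bit masked off). Here’s a rough Python sketch for spotting this in saved `lspci -nnv` output. The function name is made up for illustration, and it assumes the short bus address format (no domain prefix).

```python
import re

# A device entry starts at column 0 with a bus address such as "0e:00.0".
DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s", re.IGNORECASE)


def unresponsive_devices(lspci_text):
    """Return bus addresses whose entry looks like all-ones config reads.

    Hypothetical helper: flags a device if its header line says "(rev ff)"
    or one of its detail lines says "Unknown header type 7f".
    """
    flagged = []
    current = None
    for line in lspci_text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            current = match.group(1)
            if "(rev ff)" in line and current not in flagged:
                flagged.append(current)
        elif current and "Unknown header type 7f" in line:
            if current not in flagged:
                flagged.append(current)
    return flagged


# The output from this post, pasted in as a string.
sample = """\
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] [10de:1b06] (rev ff) (prog-if ff)
\t!!! Unknown header type 7f
\tKernel driver in use: vfio-pci

0e:00.1 Audio device [0403]: NVIDIA Corporation GP102 HDMI Audio Controller [10de:10ef] (rev ff) (prog-if ff)
\t!!! Unknown header type 7f
\tKernel driver in use: vfio-pci
"""

print(unresponsive_devices(sample))
```

If a device is flagged here after a failed VM start, that would be consistent with the card being stuck after a reset rather than with a driver binding problem.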

Something that’s also been frustrating is the fact that I get the reset bug if I try to boot the VM while attempting to pass the GPU through, forcing me to reboot the host if I wanna try again.

If anyone has a clue as to what’s going on, I’d greatly appreciate your input.

Let me know what happens with that 7f code. I get that on certain gaming laptops.

As for the invalid PCI ROM header, is it possibly a GPU ROM mismatch or something? Or could it be the board rev?

No clue. My guess was that it’s somehow software related, since 0xFFFF seems like a default value your GPU might throw at you in case of an error.

0xFFFF would mean it can’t see anything there. 0x0000 would at least be an addressable ID number, at least I’d think.
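
For reference, the kernel message says it expects 0xaa55, which is the two-byte signature 0x55 0xAA at the start of a PCI option ROM, read as a little-endian 16-bit value. Here’s a small Python sketch of that check, which could be useful if you have a ROM dump saved as a file. The function names are hypothetical, and it only looks at the first two bytes, nothing more.

```python
def rom_signature(data):
    """Return the 16-bit little-endian value at the start of a ROM image."""
    if len(data) < 2:
        raise ValueError("need at least two bytes to read a signature")
    return int.from_bytes(data[:2], "little")


def classify_rom(data):
    """Roughly classify a ROM image by its first two bytes (illustrative only)."""
    signature = rom_signature(data)
    if signature == 0xAA55:
        return "valid"
    if signature == 0xFFFF:
        # All ones: nothing readable came back from the device.
        return "all-ones (nothing readable)"
    return "unexpected"


# Usage with made-up byte strings standing in for a real dump:
print(classify_rom(bytes([0x55, 0xAA, 0x00, 0x00])))
print(classify_rom(b"\xff\xff\xff\xff"))
```

To use it on a real dump, read the file in binary mode and pass the bytes in, e.g. `classify_rom(open(path, "rb").read())` where `path` is wherever you saved the ROM.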

@FurryJackman help us, oh lord of stripes and GPUs.

So I updated my BIOS (even though ASRock’s website advised against it), and now the dmesg output has changed.

[   86.235698] vfio-pci 0000:0e:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[   87.461805] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   87.462010] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   88.616903] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   88.617387] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   88.625962] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   88.626446] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   88.631224] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   88.631707] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs
[   88.634627] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   88.639061] vfio-pci 0000:0e:00.0: No more image in the PCI ROM
[   88.639105] vfio-pci 0000:0e:00.0: No more image in the PCI ROM
[   89.085092] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   89.090927] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   89.095093] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   89.095161] vfio-pci 0000:0e:00.0: vfio_bar_restore: reset recovery - restoring BARs
[   89.095222] vfio-pci 0000:0e:00.1: vfio_bar_restore: reset recovery - restoring BARs

On the bright side, the 0xFFFF code is gone, but now an even more cryptic message has appeared in its place…

Edit: never mind, that was just a fluke. It’s back to the previous messages now.

Maybe this is an indication that the guest GPU should be in the primary slot. There could be something funky going on with the guest GPU in the secondary slot. Initialization might simply not occur correctly on a secondary slot with this board.

On ASRock, you will need to stub out nouveau to boot on a secondary slot with the RX 550.

I always put guest GPUs in the primary slot and boot to the secondary GPU. You can do this with Gigabyte boards without issue. If this is indeed the issue, an X470 Aorus Master is worth looking into.

It’s already in the first slot (I mostly just didn’t feel like moving the 1080 Ti when I dropped in the 550).
I’m also not exactly certain what "stub" means. Is that the same as blacklisting nouveau? Because if so, that’s also already done.

Interesting, then everything should be in order, but you could be on a bad AGESA code. AGESA regressions are not uncommon.

It is indeed a bad AGESA update. Fought this problem yesterday. What you need is AMD AGESA Combo-AM4 1.0.0.4 Patch B

Luckily, for my X470 Taichi it’s available as a beta BIOS, and it fixed the problem.

I actually thought of that, and was hoping very hard that was not the case (since I didn’t know where I could find a beta bios)
Do you have an idea where I could find it? Not getting promising results on ddg

I could also downgrade my bios to a working AGESA version (assuming that’s possible), I suppose. At least until my 3950x gets delivered

https://www.asrock.com/mb/AMD/X470%20Taichi%20Ultimate/index.asp#BIOS

It’s at the bottom of the page, 3.77

I have the regular x470 Taichi and used the beta bios to make it work yesterday.

I feel stupid now, because I’ve looked at that page dozens of times and I’ve never seen that part hahahaha

Update! It worked!
Now to fix that pesky code-43 (aka Market Segmentation Error)

Nice! Yeah, you need both KVM hidden mode turned on for the VM and an MS Hyper-V vendor string set. I believe the string can just be something random.
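
Assuming the VM is managed through libvirt (the thread doesn’t say), those two settings normally live under `<features>` in the domain XML, as `<kvm><hidden state='on'/></kvm>` and `<hyperv><vendor_id state='on' value='...'/></hyperv>`. Here’s a hedged Python sketch that checks whether both are present in XML such as the output of `virsh dumpxml`. The function name and the sample vendor string are placeholders.

```python
import xml.etree.ElementTree as ET


def code43_workarounds(domain_xml):
    """Report whether the two usual settings are enabled (hypothetical helper)."""
    root = ET.fromstring(domain_xml)
    hidden = root.find("./features/kvm/hidden")
    vendor = root.find("./features/hyperv/vendor_id")
    return {
        "kvm_hidden": hidden is not None and hidden.get("state") == "on",
        "hyperv_vendor_id": (
            vendor is not None
            and vendor.get("state") == "on"
            and bool(vendor.get("value"))
        ),
    }


# Trimmed-down example domain; a real one has many more elements.
sample = """\
<domain type='kvm'>
  <features>
    <hyperv>
      <vendor_id state='on' value='0123456789ab'/>
    </hyperv>
    <kvm>
      <hidden state='on'/>
    </kvm>
  </features>
</domain>
"""

print(code43_workarounds(sample))
```

If either value comes back False, the setting probably didn’t make it into the domain definition that’s actually running.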

Is there something else I need to pay attention to? Because I just added that but I’m still getting the error.
My first instinct would be to try uninstalling then reinstalling the driver

Do a full cold boot if you’re still getting the error. It could be PCIe power states. And disable ASPM.

Just did, still the same error. I also can’t find anything relating to ASPM in the bios or in the manual. Could it be one of those settings that every vendor has a different name for? (I doubt it tbh, since it’s a pretty descriptive name already)

Figured it out!
Turns out if I put the GTX in the secondary slot it works without a hitch.
Not too fond of this because now my prettier graphics card is at the bottom of my system where I can’t see it, but hey, function over form.
Thanks a lot for the help guys!

Interesting, so that’s a case where Threadripper 3000 could help you get it working without compromising on PCIe lanes.