RX6800 vfio issue

Hello, I ran into a problem using an RX 6800 for a Windows 10 KVM guest with PCI passthrough. I can install Windows, but after installing the AMD drivers, the system crashes. After a reboot, Windows seems to get stuck on driver loading, the VM pauses, and that's it.

grubFile.txt (1.3 KB) lspci.txt (28.1 KB) uname.txt (113 Bytes)

VM xml config:

VMconfig.txt (7.6 KB)
Thanks in advance

I checked your config, but can’t see anything obviously wrong.
Do you boot UEFI only? Maybe the card isn’t being reset.

Also:

<vcpu placement="static" current="1">8</vcpu>

Not really related, but do you have a reason to enable fewer vCPUs than the maximum?

Edit: Oh, and I’m not sure if AMD really requires hiding the hypervisor (HV) from the OS. I don’t have an AMD card in my rig at the moment to check this.

<vcpu placement="static" current="1">8</vcpu>

Yes, I forgot to change it back after the 6th reconfiguration :grinning:

Yes, the VM boots with OVMF (the 4M image) and Q35.

Edit: Oh, and I’m not sure if AMD really requires hiding the hypervisor (HV) from the OS. I don’t have an AMD card in my rig at the moment to check this.

Yeah, I tried just about everything you can find on similar issues.

So if the card doesn't reset, I probably have to test the kernel hack for AMD's reset bug.

The card seems to get bound:

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.8.0-29-generic root=UUID=c29f54b3-6fb1-474b-a1f9-c9adf6667d62 ro quiet intel_iommu=on kvm.ignore_msrs=1 vfio-pci.ids=1002:73bf,1002:ab28,1002:73a6,1002:73a6
[    0.066906] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.8.0-29-generic root=UUID=c29f54b3-6fb1-474b-a1f9-c9adf6667d62 ro quiet intel_iommu=on kvm.ignore_msrs=1 vfio-pci.ids=1002:73bf,1002:ab28,1002:73a6,1002:73a6
[    0.564011] VFIO - User Level meta-driver version: 0.3
[    0.564099] vfio-pci 0000:05:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[    0.584021] vfio_pci: add [1002:73bf[ffffffff:ffffffff]] class 0x000000/00000000
[    0.603988] vfio_pci: add [1002:ab28[ffffffff:ffffffff]] class 0x000000/00000000
[    0.624028] vfio_pci: add [1002:73a6[ffffffff:ffffffff]] class 0x000000/00000000
[    0.624054] vfio_pci: add [1002:73a6[ffffffff:ffffffff]] class 0x000000/00000000
[   61.790989] vfio-pci 0000:05:00.0: enabling device (0100 -> 0103)
[   61.791368] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[   61.791374] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[   61.791378] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[   61.791380] vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
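
One thing that stands out in the command line above: 1002:73a6 is listed twice in vfio-pci.ids, and vfio_pci adds it twice in dmesg. That is probably harmless by itself, but if the card exposes four functions it may mean one function's ID never made it into the list, so it seems worth comparing against the lspci -nn output. A minimal Python sketch of that check, with the command line copied from the post (the helper names vfio_ids and duplicated are made up for illustration):

```python
from collections import Counter

# Kernel command line exactly as posted in the dmesg output above.
CMDLINE = (
    "BOOT_IMAGE=/boot/vmlinuz-5.8.0-29-generic "
    "root=UUID=c29f54b3-6fb1-474b-a1f9-c9adf6667d62 ro quiet "
    "intel_iommu=on kvm.ignore_msrs=1 "
    "vfio-pci.ids=1002:73bf,1002:ab28,1002:73a6,1002:73a6"
)


def vfio_ids(cmdline):
    """Return the vendor:device IDs given to vfio-pci.ids= (empty list if absent)."""
    for token in cmdline.split():
        key, _, value = token.partition("=")
        # The kernel treats '-' and '_' in parameter names the same way.
        if key.replace("-", "_") == "vfio_pci.ids":
            return [v.lower() for v in value.split(",") if v]
    return []


def duplicated(ids):
    """Return the IDs that appear more than once, in first-seen order."""
    counts = Counter(ids)
    return [i for i, n in counts.items() if n > 1]


if __name__ == "__main__":
    ids = vfio_ids(CMDLINE)
    print("ids passed to vfio-pci:", ids)
    print("listed more than once: ", duplicated(ids))
```

On a running host the same function could be fed the contents of /proc/cmdline instead of the pasted string.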

I saw that in your config. I meant UEFI only on the hardware side.

Which would be strange, because you probably already know that FLR is supposed to work on the 6xxx series…

Now I noticed that you're on 5.8. Maybe 5.9 is needed for that, and it hasn't been backported to 5.8 yet. I don't have an AMD card currently, so I don't know the details.

@cryoss make sure you are hiding the HV from the OS. If it was working before you loaded the drivers, this is it, not the reset problem.

If you are facing the reset issue, dmesg would show that something barfed after restarting the VM.

Do you see the “TianoCore” logo when restarting the VM?

I've now tried it with the 5.9 kernel: same issue.
Hiding KVM doesn't seem to make a difference.
It works with no driver installed; when the driver installation initialises the card, the VM hangs.
On reboot you can see the Windows dots before the TianoCore, sometimes even "devices getting ready". Then the VM pauses and crashes. Sometimes the host crashes too.

dmesg_afterCrashHost.txt (234.1 KB) dmesg_afterDriverInstall.txt (179.3 KB)

At first glance, I would try to pass “04:00.0 PCI bridge” to the VM along with the card.
Wondering what this is; some new bridge on Navi?

That was something I was wondering too; I will try tomorrow.

Just in case, I would also try removing those other devices you’re passing through by vendor and device ID (extra USB?). Of course, leave all the 05: devices in.

But maybe Wendell can hint more at what needs to be passed with the new Navi, since he has the actual card :wink:

I will double check my config. I did it through the vfio gui! Literally just adding the devices (all 4 of them!) and that’s it.

    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='4096'/>
      <vendor_id state='on' value='Level1VM'/>
    </hyperv>

This is what I added, inside the features section.
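
To double check a config against the block above, something like the following Python sketch could report which hiding-related settings a domain XML actually carries. It looks for the hyperv vendor_id shown above and, as an assumption on my part since it isn't in the snippet, for libvirt's separate `<kvm><hidden state='on'/></kvm>` feature. The function name hiding_summary and the SAMPLE document are made up for illustration; SAMPLE only wraps the posted hyperv block in a bare domain element.

```python
import xml.etree.ElementTree as ET

# Illustrative domain XML: the hyperv block from the post inside a bare <domain>.
SAMPLE = """
<domain type='kvm'>
  <features>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='4096'/>
      <vendor_id state='on' value='Level1VM'/>
    </hyperv>
  </features>
</domain>
"""


def hiding_summary(domain_xml):
    """Report the hyperv vendor_id value (if enabled) and whether kvm hidden is on."""
    root = ET.fromstring(domain_xml.strip())
    vendor = root.find("./features/hyperv/vendor_id")
    hidden = root.find("./features/kvm/hidden")
    vendor_on = vendor is not None and vendor.get("state") == "on"
    return {
        "hyperv_vendor_id": vendor.get("value") if vendor_on else None,
        "kvm_hidden": hidden is not None and hidden.get("state") == "on",
    }


if __name__ == "__main__":
    print(hiding_summary(SAMPLE))
```

Feeding it the output of `virsh dumpxml <domain>` instead of SAMPLE would check the real VM.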

Yeah, I believe I missed that vapic is absent from his config.
Never mind: it's the vendor_id state that is missing.

Also, he gets many AER errors on the 04 device, some Navi PCIe switch.

Is the BIOS fully up to date on that board?

Yeah, the BIOS is on P3.5, microcode 306F2 3C. It's an ASRock X99M Killer with an i7-5820K.
I've tried passing the PCI bridge, but virsh and virt-manager both say “you only can passthrough endpoint devices”. The biggest problem now is that the host often crashes too, and I need to reboot 5 times because the mobo can't initialise the GPU, with error 62 on loading the UEFI screen (on the host). I've tried with Above 4G Decoding on and off, ASPM on and off, and CSM on and off.
Setting the vendor_id state didn't help either.
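
Since the bridge itself can't be handed to the VM, it may help to at least see which devices share an IOMMU group with the card. Below is a rough Python sketch that walks /sys/kernel/iommu_groups, the standard sysfs location; the function names are made up, and 0000:05:00.0 is simply the address from the dmesg output earlier in the thread.

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group name to the sorted list of PCI addresses in it."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU disabled, or not running on Linux
    for name in os.listdir(root):
        dev_dir = os.path.join(root, name, "devices")
        if os.path.isdir(dev_dir):
            groups[name] = sorted(os.listdir(dev_dir))
    return groups


def group_of(bdf, groups):
    """Return (group name, members) for a PCI address, or None if not found."""
    for name, devices in groups.items():
        if bdf in devices:
            return name, devices
    return None


if __name__ == "__main__":
    found = group_of("0000:05:00.0", iommu_groups())
    if found is None:
        print("0000:05:00.0 not found in any IOMMU group")
    else:
        print("group %s: %s" % (found[0], ", ".join(found[1])))
```

If the group lists endpoints other than the card's own 05: functions, those would normally have to be bound to vfio-pci as well.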

I think that mobo has a problem with the new GPU in general, and it has nothing to do with VFIO.

Does it even work properly when you use it “as overlords intended” :wink:

BTW, did you use it before with a different card for passthrough?

If I boot the “VM” (on a physical NVMe) on the bare machine, everything works fine with both GPUs being functional. Yeah, I was wondering if the board just doesn't want to cooperate. The error code 62 with the GPU initialisation is new, so maybe it's a hardware issue with power or the physical PCI slot?

So the mobo works properly under Windows (no shocker there :wink:).
But does it work properly under Linux with the RX 6000 as the host display?

And is this your first passthrough attempt with an RX, or have you used this mobo with other cards before?

I've done it before with a 980 Ti, but I remember having some issues there too (can't remember what they were; it was a little while ago). Linux does recognise the RX 6800; I didn't try to install any drivers (I have to check which drivers Linux used on its own).

Yeah, I would start there.
I mean I would use the RX for the host and the GTX for the VM. And if that works, then I would switch.

Also, I just remembered that a few days ago someone had a problem with PCIe lanes not being properly assigned when an old GTX and an AMD card were both plugged in.
The old GTX was grabbing all the lanes for itself, and the AMD card was not detected at all.

It was on B550, but this is starting to smell exactly like it. So your GTX 670 might not play well together with the RX.
So a solution may be to put one card through the chipset instead of directly on the CPU, but you have to check the mobo manual for how the lanes are connected, and whether that's possible.
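
Before digging through the manual, the negotiated link can be read from sysfs. Here is a small Python sketch, assuming the usual current_link_width / max_link_width attributes under /sys/bus/pci/devices (function names are made up). Note that 0000:05:00.0 seems to sit behind the 04:00.0 bridge, so the slot's real width may only show up on a bridge further upstream.

```python
import os

ATTRS = ("current_link_width", "max_link_width",
         "current_link_speed", "max_link_speed")


def link_status(bdf, root="/sys/bus/pci/devices"):
    """Read the PCIe link attributes of one device; missing ones become None."""
    status = {}
    for attr in ATTRS:
        try:
            with open(os.path.join(root, bdf, attr)) as fh:
                status[attr] = fh.read().strip()
        except OSError:
            status[attr] = None
    return status


def is_narrower_than_max(status):
    """True if the link trained below its maximum width, None if unknown."""
    try:
        return int(status["current_link_width"]) < int(status["max_link_width"])
    except (TypeError, ValueError):
        return None


if __name__ == "__main__":
    s = link_status("0000:05:00.0")
    print(s)
    print("running below max width:", is_narrower_than_max(s))
```

A link speed below the maximum while idle is not necessarily a problem, since cards tend to drop the speed to save power; an unexpectedly narrow width is the more interesting sign.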


tldr;
Is there any chance that you are modifying the BIOS?
AMD cards check the BIOS checksum, and if you make any changes, the driver fails to load, which is pretty much the behavior you are describing…
In some VFIO configurations, people suggest overriding the card's BIOS to make it work; that is why I am asking.

Is there any chance that you are modifying the BIOS?

No, it's a stock Sapphire card, luckily bought on release day.

I will try the card swap, but I doubt I can change anything about the lane allocation, because the board only has two slots I can use.