Installing AMD 6800xt drivers in VM with vfio passthrough leads to host crash/reboot

Add in <features>

<kvm>
      <hidden state="on"/>
</kvm>
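So the whole block ends up looking something like this (just a sketch; keep whatever else your <features> already has, such as <acpi/> or any <hyperv> settings):

<features>
  <acpi/>
  <apic/>
  <kvm>
    <hidden state='on'/>
  </kvm>
</features>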

Also, I would try to remove the NVMe PCIe passthrough and pass it as a block device if it's still crashing after hiding the hypervisor.
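A sketch of what I mean by a block device, instead of a PCIe <hostdev> entry (the source path is a placeholder; use your drive's /dev/disk/by-id entry):

<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/disk/by-id/nvme-YOUR-DRIVE-HERE'/>
  <target dev='vda' bus='virtio'/>
</disk>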

Also this:
rendernode='/dev/dri/by-path/pci-0000:10:00.0-render'
has caused me trouble in the past.

I’ve tried with <hidden state="on"/> as well; I had actually just finished removing it before posting this, because I had read that AMD cards sometimes don’t need the hypervisor to be hidden.

I have a second VM that I tried with just a block device and a fresh install, instead of the passed-through drive, since I thought as well that it may be conflicting. But the same issue persists.

What would you suggest as an alternative to this, since it points to my Vega running the host OS?

Nothing really, at least for now. When troubleshooting, remove as many variables as you can.
Once you get it stable, then you can tweak.

I thought so too, but Wendell said it should be there. Alas, I don’t have a new 6xxx card to test it myself, so I will rely on his word for now.

Edit: Taking into consideration that you have the same mobo and are trying to pass the NVMe as PCIe, I would say you have a similar problem.

So to reiterate, I would try to pass the GPU only, and nothing else. So the disk as block storage, or even a file. Once that is working I would try to add more, one by one. Otherwise you never know what is causing trouble.
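Something like this, and nothing else under <hostdev> (the PCI address is just a placeholder, take yours from lspci; the GPU's audio function gets a second <hostdev> entry the same way):

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x0f' slot='0x00' function='0x0'/>
  </source>
</hostdev>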

Cool, I’ll take a look at the other post and try removing everything but the GPU.

@misiektw I reset my BIOS to the defaults and applied all the IOMMU options the same as I had before; now it’s working with no changes to the OS or VM configs.


Glad to hear it worked out :slight_smile:


Glad to see it was solved!
I have the same mobo and am planning to get the same CPU/GPU for a VM setup, so thankfully, it works :slight_smile:

Hi @robbbot!

Glad that you got that working. Would you mind sharing the XML of the VM you got working? And what do you mean by “reset my BIOS to the defaults”? Q35 and BIOS? Q35 and OVMF?
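For reference, by Q35 + OVMF I mean firmware settings along these lines in the domain XML (a sketch; the loader and nvram paths vary by distro):

<os>
  <type arch='x86_64' machine='q35'>hvm</type>
  <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
  <nvram>/var/lib/libvirt/qemu/nvram/win10_VARS.fd</nvram>
</os>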

I’m trying to pass through my 6900 XT and I’m always getting a Code 43. I tried everything I was able to find (including this thread :wink: ). I’ve completely run out of ideas about what could be wrong.

My VM XML

(I also tried the Q35 chipset + OVMF and i440fx + OVMF; none of them worked)

Kernel parameters:

GRUB_CMDLINE_LINUX="amd_iommu=force_isolation iommu=pt rd.driver.pre=vfio-pci"

(I also tried amd_iommu=on)
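In case it’s relevant, after editing /etc/default/grub I regenerate the config and reboot; the exact command and output path depend on the distro:

# Debian/Ubuntu-style; some distros use grub2-mkconfig and a different path
sudo grub-mkconfig -o /boot/grub/grub.cfg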

I tried three versions of Radeon drivers:

  • 21.2.3 (22.2.2021)
  • 20.12.2 (12/18/2020)
  • 20.12.1 (12/8/2020)

The problem is clearly related to Windows and its drivers. I can pass through the same card to a Linux VM and it works perfectly. Here goes Unigine Heaven running on it.

The XML of the Linux VM, just in case.

IOMMU group separation looks good to me.

My hardware:

  • AMD Ryzen 5950x
  • ASUS ROG Strix x570-F
  • 64 GB RAM DDR4 3600
  • XFX Radeon 6900 XT
  • XFX Radeon 6800 XT

Yes, I want to use the 6800 XT on the host and pass through the 6900 XT. Crazy setup, but as I mentioned, it works for a Linux VM, so I don’t think it’s a problem in itself. :slight_smile:

I would appreciate any help or any ideas.

Hi there! The XML I’m using is linked above in the post. And the BIOS settings aren’t the virt-manager BIOS settings, but rather my actual motherboard’s BIOS settings. I reset the BIOS to the default configuration, then re-applied the IOMMU / SVM options. Basically I had all the right “Linux” / “bootloader” settings in order, but my motherboard settings were screwy despite looking like they were set properly.

The actual bios settings I needed to have set to get it working can be found here: Dev Oops - Odds and Ends


You may also want to include the results of something like lspci -nnv | grep ATI -C 5 to make sure that the vfio-pci driver is assigned properly to the card.
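If it isn’t, one way to bind it per-address is the driverctl utility, if your distro ships it. That’s handy here, since both Navi 21 cards share the 1002:73bf device ID, so the usual vfio-pci ids= modprobe option can’t target just one of them. The addresses below are examples; use your card’s:

sudo driverctl set-override 0000:0e:00.0 vfio-pci   # VGA function
sudo driverctl set-override 0000:0e:00.1 vfio-pci   # HDMI audio function
# repeat for the .2 (USB) and .3 (serial) functions if the card has them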

Thanks for the quick reply.

I didn’t solve the issue yet, but your hint about BIOS settings made me realize that the IOMMU option was on Auto instead of Enabled. SVM was already enabled, CSM already disabled.

My BIOS has no option for setting a “boot GPU”, but I keep my “host” card (6800 XT) in the 1st PCIe slot and the card I want to pass through (6900 XT) in the 2nd slot.

However, that didn’t fix the problem (even when I enabled IOMMU after restoring default settings), but I will check if there are more related options which could help.

I’ve seen a lot of AMD CBS options set to Auto, and the IOMMU setting was in that section. I’ll try enabling them.

This is the fragment of my lspci -nnv output with the devices I want to pass through:

0e:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] [1002:73bf] (rev c0) (prog-if 00 [VGA controller])
	Subsystem: XFX Limited Device [1eae:6901]
	Flags: bus master, fast devsel, latency 0, IRQ 150, IOMMU group 30
	Memory at 7000000000 (64-bit, prefetchable) [size=16G]
	Memory at 7400000000 (64-bit, prefetchable) [size=256M]
	I/O ports at d000 [size=256]
	Memory at fc600000 (32-bit, non-prefetchable) [size=1M]
	Expansion ROM at fc700000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [64] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [200] Physical Resizable BAR
	Capabilities: [240] Power Budgeting <?>
	Capabilities: [270] Secondary PCI Express
	Capabilities: [2a0] Access Control Services
	Capabilities: [2d0] Process Address Space ID (PASID)
	Capabilities: [320] Latency Tolerance Reporting
	Capabilities: [410] Physical Layer 16.0 GT/s <?>
	Capabilities: [440] Lane Margining at the Receiver <?>
	Kernel driver in use: vfio-pci
	Kernel modules: amdgpu

0e:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28]
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28]
	Flags: bus master, fast devsel, latency 0, IRQ 147, IOMMU group 31
	Memory at fc724000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [64] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [2a0] Access Control Services
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel

0e:00.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:73a6] (prog-if 30 [XHCI])
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:73a6]
	Flags: bus master, fast devsel, latency 0, IRQ 148, IOMMU group 32
	Memory at fc500000 (64-bit, non-prefetchable) [size=1M]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [64] Express Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [2a0] Access Control Services
	Kernel driver in use: vfio-pci
	Kernel modules: xhci_pci

0e:00.3 Serial bus controller [0c80]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:73a4]
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:0408]
	Flags: bus master, fast devsel, latency 0, IRQ 151, IOMMU group 33
	Memory at fc720000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [64] Express Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/2 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [2a0] Access Control Services
	Kernel driver in use: vfio-pci

Looks fine; vfio-pci is the driver in use.

For comparison, here is the “host” card:

0b:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] [1002:73bf] (rev c1) (prog-if 00 [VGA controller])
	Subsystem: XFX Limited Device [1eae:6701]
	Flags: bus master, fast devsel, latency 0, IRQ 137, IOMMU group 26
	Memory at 7800000000 (64-bit, prefetchable) [size=16G]
	Memory at 7c00000000 (64-bit, prefetchable) [size=256M]
	I/O ports at e000 [size=256]
	Memory at fcc00000 (32-bit, non-prefetchable) [size=1M]
	Expansion ROM at fcd00000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [64] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [200] Physical Resizable BAR
	Capabilities: [240] Power Budgeting <?>
	Capabilities: [270] Secondary PCI Express
	Capabilities: [2a0] Access Control Services
	Capabilities: [2d0] Process Address Space ID (PASID)
	Capabilities: [320] Latency Tolerance Reporting
	Capabilities: [410] Physical Layer 16.0 GT/s <?>
	Capabilities: [440] Lane Margining at the Receiver <?>
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

0b:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28]
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28]
	Flags: bus master, fast devsel, latency 0, IRQ 144, IOMMU group 27
	Memory at fcd20000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [64] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [2a0] Access Control Services
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

The “host” card (6800 XT) doesn’t have the “USB controller” and “Serial bus controller” functions because it has no USB-C port; the 6900 XT does.

Seems you’re in a similar situation to the one I was in when I made this post: your configs for the most part look good, but you likely have some wonky motherboard BIOS setting that’s not toggled to the right state. Lemme know if you find out what it is.


I fixed the issue.

The problem was the SR-IOV option in my host BIOS being enabled. Disabling it made the Radeon drivers inside the Windows VM work. Which is quite odd, and I don’t really understand why that would be an issue. But that was the only non-default option I enabled manually after the reset, apart from those from the article that @robbbot linked (IOMMU, SVM, CSM). And I have a network card with SR-IOV support which I was hoping to use for the Windows VM.

Thank you @robbbot for the help!


The VirtIO network drivers work super well; try those out for your NIC if you haven’t.
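It’s just a matter of setting the NIC model in the XML (and installing the virtio-win drivers in the guest); a minimal sketch:

<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
</interface>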


I had a similar issue with 5700 XT passthrough. I guess when you enable SR-IOV, the system switches to what AMD calls Infinity Fabric. SR-IOV splits a PCIe card’s functionality into virtual functions for many VMs without losing performance. It is mainly used for Ethernet card performance, so that you can run many VMs with high-performance virtio Ethernet tapped into a single physical card.
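For context, those virtual functions get created through sysfs; a sketch, where the interface name is a placeholder:

# create 4 virtual functions on an SR-IOV capable NIC (interface name is an example)
echo 4 | sudo tee /sys/class/net/enp5s0/device/sriov_numvfs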

However, when you run:

lspci -tvv

Edit:
Found an example in web:

           +-02.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-03.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
           +-03.1-[2a-2c]----00.0-[2b-2c]----00.0-[2c]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
           |                                            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio

you see a cool structure where downstream and upstream ports are connected to buses, which connect to the card’s GPU and audio functions.

If you emulate these downstream and upstream ports and see the same structure in a Linux VM, you are done.

I don’t have the card with me; however, I have some XML, though I’m not sure if it is the correct one. I am pasting it here as a hint:

<controller type='pci' index='0' model='pcie-root'/>
<controller type='pci' index='1' model='dmi-to-pci-bridge'>
  <model name='i82801b11-bridge'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
</controller>
<controller type='pci' index='2' model='pci-bridge'>
  <model name='pci-bridge'/>
  <target chassisNr='2'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='3' model='pcie-root-port'>
  <model name='pcie-root-port'/>
  <target chassis='3' port='0x8'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
</controller>
<controller type='pci' index='4' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='4' port='0x9'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0' multifunction='on'/>
</controller>
<controller type='pci' index='5' model='pcie-switch-upstream-port'>
  <model name='x3130-upstream'/>
  <target chassis='5' port='0x10'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='6' model='pcie-switch-downstream-port'>
  <model name='xio3130-downstream'/>
  <target chassis='6' port='0x11'/>
  <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='7' model='pcie-root-port'>
  <model name='pcie-root-port'/>
  <target chassis='7' port='0x8'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/>
</controller>
<controller type='pci' index='8' model='pcie-root-port'>
  <model name='pcie-root-port'/>
  <target chassis='8' port='0x9'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
</controller>

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
  </source>
  <rom file='/usr/share/vgabios/5700XT.rom'/>
  <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0' multifunction='on'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x0b' slot='0x00' function='0x1'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x1'/>
</hostdev>

I was going to glom onto this, but instead posted a new post where I’m having similar issues.

Unfortunately I started having a similar problem again…

I can’t update Radeon Software. Every time I try, whether through the current version of AMD Radeon Software or by downloading a new installer, the driver installer just crashes and causes a hard reboot of my VM.

Of course the reason I need newer drivers is Battlefield 2042. :smirk:

I didn’t change anything in the VM XML, and each time I updated the BIOS (on the Asus ROG Strix X570-F) I used the default config plus the changes mentioned above (IOMMU enabled, SVM enabled, CSM disabled, SR-IOV disabled). Does anyone have any idea what’s different about the newest AMD drivers?

To be precise, the versions I tried to install are:

Both had the same effect - crash and hard reboot when starting the installer.

Never mind, I fixed it by changing the following BIOS settings:

Advanced → AMD CBS → NBIO Common Options → ACS Enable
Advanced → AMD CBS → NBIO Common Options → PCIe ARI Support
Advanced → AMD CBS → NBIO Common Options → PCIe ARI Enumeration

I found them… by searching “vfio bios” on this forum, lol. But I’ve never seen those options in my BIOS before, so they must’ve come to ASUS ROG Strix motherboards with some recent BIOS update… or maybe I was blind and lucky to have VFIO working without these options, heh.

Did you enable them or disable them? In my BIOS they are on Auto.

You need ACS Enable set to Enabled. What this does is create properly separated IOMMU groups, in the same way the so-called ACS override patch does for mainboards that do not have this feature. The other two options listed in the post above should not be needed for PCIe passthrough as far as I understand them. I have them enabled on one of my computers as well, but not for PCIe passthrough; rather, to use an SR-IOV capable device. On the other hand, the software we call UEFI has shown terrible software quality in many cases, so if you have problems with passthrough you could try toggling them regardless.
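To check what grouping you actually ended up with after toggling it, a common shell snippet is:

for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    lspci -nns "${d##*/}"
  done
done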
