VFIO 2023 / Radeon 7000 Edition [WIP]

Hey, I’m following the guide on the Arch Wiki. After enabling “virtualization technology” in the BIOS and editing the kernel parameters, I got this output from dmesg | grep -i -e DMAR -e IOMMU. Is it joever? I’m using an ASRock X670E Steel Legend and a Ryzen 7900 CPU. The GPU is an ASRock Phantom Gaming 6800 XT.

[    0.025411] Kernel command line: initrd=\amd-ucode.img initrd=\initramfs-linux.img root=PARTUUID=15968a6a-9471-4cf8-88e2-0ce79becfd56 zswap.enabled=0 rw rootfstype=ext4 iommu=pt
[    0.373199] iommu: Default domain type: Passthrough (set via kernel command line)
[    0.453115] AMD-Vi: AMD IOMMUv2 functionality not available on this system - This is not a bug.

EDIT: I found it. Besides enabling virtualization in the CPU section, you also have to enable the IOMMU under Advanced → AMD CBS → NBIO → IOMMU.
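
Once the IOMMU is enabled in firmware, the kernel exposes the groups under /sys/kernel/iommu_groups. A minimal sketch for listing them follows; the function name and the optional sysfs root parameter (only there so the function can be pointed at a test tree) are assumptions of this example, not something from the thread.

```shell
#!/usr/bin/env bash
# List every PCI device address grouped by IOMMU group.
# $1: sysfs root, defaults to /sys (parameter is an assumption of this sketch).
list_iommu_groups() {
  local sysfs_root="${1:-/sys}"
  local group dev found=0
  for group in "$sysfs_root"/kernel/iommu_groups/*; do
    # An unmatched glob stays literal, so skip anything that is not a directory.
    [ -d "$group" ] || continue
    found=1
    echo "IOMMU Group ${group##*/}:"
    for dev in "$group"/devices/*; do
      [ -e "$dev" ] || continue
      echo "  ${dev##*/}"
    done
  done
  if [ "$found" -eq 0 ]; then
    echo "No IOMMU groups found; the IOMMU is probably still disabled" >&2
    return 1
  fi
}
```

Calling list_iommu_groups with no arguments reads the live system. An empty result usually means the firmware option is still off.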

I had this issue on a 7900 XTX: after the VM exited, the passed-through GPU was unusable and libvirt gave this same error.

echo "0" > "/sys/bus/pci/devices/${VIRSH_GPU_VIDEO}/d3cold_allowed"
Putting the line above in my VM start script fixed it for me. (I change it back with echo "1" after the VM exits but before rebinding to amdgpu.)

I’ve assigned vfio-pci to it directly and am not using it on the host (amdgpu is blacklisted). I tried it anyway, to no avail.
So far I can only ever boot the VM once; after any reboot or shutdown of the VM, it can’t boot properly again (a CPU core gets pinned at 100% on start).
Same behavior when using a dumped GPU ROM via . What seems strange, though, is that I can’t assign one sub-device of the GPU to vfio-pci. It’s the one with index 3, which gets i2c-designware-pci assigned to it (even if I also blacklist that module).

lspci

0d:00.3 Serial bus controller [0c80]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:7444]
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:0408]
Kernel driver in use: i2c-designware-pci
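
One generic way to steer a single PCI function to vfio-pci, regardless of which driver claims it first, is the kernel's driver_override sysfs attribute. The sketch below is an illustration under assumptions: the function name and the sysfs root parameter are made up for this example, vfio-pci must already be loaded, and whether it helps with the 0d:00.3 function above is untested.

```shell
#!/usr/bin/env bash
# Force one PCI function onto vfio-pci via driver_override.
# $1: PCI address, e.g. 0000:0d:00.3
# $2: sysfs root, defaults to /sys (parameter is an assumption of this sketch).
# On a real system this must run as root.
bind_to_vfio() {
  local addr="$1"
  local sysfs_root="${2:-/sys}"
  local dev="$sysfs_root/bus/pci/devices/$addr"

  [ -d "$dev" ] || { echo "no such device: $addr" >&2; return 1; }

  # 1. Tell the PCI core that only vfio-pci may claim this function.
  echo "vfio-pci" > "$dev/driver_override"

  # 2. Release the function from whatever driver currently holds it.
  if [ -e "$dev/driver" ]; then
    echo "$addr" > "$dev/driver/unbind"
  fi

  # 3. Ask the PCI core to probe the function again.
  echo "$addr" > "$sysfs_root/bus/pci/drivers_probe"
}
```

After bind_to_vfio 0000:0d:00.3, lspci -k should report the driver actually in use, which is the quickest way to see whether the override took effect.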

I tested it earlier with the pci-stub driver, not with vfio-pci directly, and I always rebound it before VM start. Whether I used amdgpu or pci-stub, I had to disable d3cold_allowed. Without disabling it, I get an Unknown PCI header type '127' error somewhere in the VM log when I exit the VM.
If I disable d3cold_allowed before the VM starts, it works fine and I can start and restart the VM many times.
I’m using a VBIOS dumped with GPU-Z.

Have you tried disabling d3cold_allowed before starting the VM? Maybe in udev rules, since you’re not using amdgpu for it at all.

I’ve tried this in my qemu hook script of libvirt:

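# libvirt passes the guest name as $1 and the operation (prepare, stopped, ...) as $2
command="$2"
if [[ $command == "prepare" ]]; then
  # keep the GPU out of D3cold while the VM owns it
  echo "0" > "/sys/bus/pci/devices/0000:0d:00.0/d3cold_allowed"
elif [[ $command == "stopped" ]]; then
  # allow D3cold again once the guest has stopped
  echo "1" > "/sys/bus/pci/devices/0000:0d:00.0/d3cold_allowed"
fi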

Also, it’s not necessarily an Unknown PCI header type '127' issue, as I was able to bind amdgpu to the GPU and use it after the VM no longer worked with it (after guest OS shutdown).
How did you dump the GPU BIOS via GPU-Z? Did you do it after booting, from inside the guest, or did you boot Windows on bare metal with another GPU?

I used Windows on bare metal to dump the VBIOS.

Hey, I bought a 7900 XTX today and I’m not able to get video when I boot up my Windows VM. I’m able to VNC into it and use it normally from my laptop, but on my host’s hardware I get no video, only audio. I couldn’t find many leads on how to debug this, but one thing out of the ordinary is that I can’t consistently get video back on my host even after I turn off the VM. I have to force a shutdown of my computer, and only then do I get video back. Executing the setup and teardown scripts manually gets me back to my login screen as it should, but if I boot the VM my display is gone until restart.

Update: Looking through the logs, I found these weird messages:

Oct 28 01:33:43 selhar kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=io+mem
Oct 28 01:33:43 selhar kernel: vfio-pci 0000:03:00.0: vgaarb: deactivate vga console
Oct 28 01:24:27 selhar kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=io+mem
Oct 28 01:24:27 selhar kernel: vfio-pci 0000:03:00.0: vgaarb: deactivate vga console
Oct 28 01:08:31 selhar kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
Oct 28 01:07:52 selhar kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=io+mem
Oct 28 01:07:52 selhar kernel: vfio-pci 0000:03:00.0: vgaarb: deactivate vga console
Oct 28 00:51:03 selhar kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
Oct 28 00:50:28 selhar kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=io+mem

There is an article mentioning them on the Arch Wiki; however, neither turning off my displays during boot nor adding the kernel parameter helped. I still get nothing on my screen whenever the VM turns on. I found out my keyboard still works, though, so I’m able to turn the VM off through VNC, type the shortcut to open my terminal and request a poweroff.

I have a very simple setup (I had a working VM with my 6800 XT and built a new one in the hope of finding out what was wrong, without success): just a basic Windows install with drivers installed. Here’s the XML:

<domain type="kvm">
  <name>win11</name>
  <uuid>5824117f-8a2e-442e-b312-6076b6bce0af</uuid>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://microsoft.com/win/11"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit="KiB">28672000</memory>
  <currentMemory unit="KiB">28672000</currentMemory>
  <vcpu placement="static">20</vcpu>
  <os firmware="efi">
    <type arch="x86_64" machine="pc-q35-8.1">hvm</type>
    <firmware>
      <feature enabled="no" name="enrolled-keys"/>
      <feature enabled="yes" name="secure-boot"/>
    </firmware>
    <loader readonly="yes" secure="yes" type="pflash">/usr/share/edk2/x64/OVMF_CODE.secboot.4m.fd</loader>
    <nvram template="/usr/share/edk2/x64/OVMF_VARS.4m.fd">/var/lib/libvirt/qemu/nvram/win11_VARS.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv mode="custom">
      <relaxed state="on"/>
      <vapic state="on"/>
      <spinlocks state="on" retries="8191"/>
    </hyperv>
    <vmport state="off"/>
    <smm state="on"/>
  </features>
  <cpu mode="host-passthrough" check="none" migratable="on">
    <topology sockets="1" dies="1" cores="20" threads="1"/>
  </cpu>
  <clock offset="localtime">
    <timer name="rtc" tickpolicy="catchup"/>
    <timer name="pit" tickpolicy="delay"/>
    <timer name="hpet" present="no"/>
    <timer name="hypervclock" present="yes"/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled="no"/>
    <suspend-to-disk enabled="no"/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type="file" device="disk">
      <driver name="qemu" type="qcow2" discard="unmap"/>
      <source file="/var/lib/libvirt/images/win11.qcow2"/>
      <target dev="sda" bus="virtio"/>
      <boot order="2"/>
      <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
    </disk>
    <controller type="usb" index="0" model="qemu-xhci" ports="15">
      <address type="pci" domain="0x0000" bus="0x02" slot="0x00" function="0x0"/>
    </controller>
    <controller type="pci" index="0" model="pcie-root"/>
    <controller type="pci" index="1" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="1" port="0x10"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0" multifunction="on"/>
    </controller>
    <controller type="pci" index="2" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="2" port="0x11"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x1"/>
    </controller>
    <controller type="pci" index="3" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="3" port="0x12"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x2"/>
    </controller>
    <controller type="pci" index="4" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="4" port="0x13"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x3"/>
    </controller>
    <controller type="pci" index="5" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="5" port="0x14"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x4"/>
    </controller>
    <controller type="pci" index="6" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="6" port="0x15"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x5"/>
    </controller>
    <controller type="pci" index="7" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="7" port="0x16"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x6"/>
    </controller>
    <controller type="pci" index="8" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="8" port="0x17"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x7"/>
    </controller>
    <controller type="pci" index="9" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="9" port="0x18"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x0" multifunction="on"/>
    </controller>
    <controller type="pci" index="10" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="10" port="0x19"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x1"/>
    </controller>
    <controller type="pci" index="11" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="11" port="0x1a"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x2"/>
    </controller>
    <controller type="pci" index="12" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="12" port="0x1b"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x3"/>
    </controller>
    <controller type="pci" index="13" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="13" port="0x1c"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x4"/>
    </controller>
    <controller type="pci" index="14" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="14" port="0x1d"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x5"/>
    </controller>
    <controller type="sata" index="0">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x1f" function="0x2"/>
    </controller>
    <controller type="virtio-serial" index="0">
      <address type="pci" domain="0x0000" bus="0x03" slot="0x00" function="0x0"/>
    </controller>
    <interface type="network">
      <mac address="52:54:00:07:30:51"/>
      <source network="default"/>
      <model type="e1000e"/>
      <address type="pci" domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
    </interface>
    <serial type="pty">
      <target type="isa-serial" port="0">
        <model name="isa-serial"/>
      </target>
    </serial>
    <console type="pty">
      <target type="serial" port="0"/>
    </console>
    <input type="tablet" bus="usb">
      <address type="usb" bus="0" port="1"/>
    </input>
    <input type="mouse" bus="ps2"/>
    <input type="keyboard" bus="ps2"/>
    <graphics type="spice" autoport="yes">
      <listen type="address"/>
    </graphics>
    <sound model="ich9">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x1b" function="0x0"/>
    </sound>
    <audio id="1" type="none"/>
    <video>
      <model type="qxl" ram="65536" vram="65536" vgamem="16384" heads="1" primary="yes"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x0"/>
    </video>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x03" slot="0x00" function="0x0"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x03" slot="0x00" function="0x1"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
    </hostdev>
    <watchdog model="itco" action="reset"/>
    <memballoon model="virtio">
      <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
    </memballoon>
  </devices>
</domain>

Hello, I’m posting here about my experiences with RX 7900 XTX and VFIO to give others (and write down for myself) some more data about this card’s behavior.

Some specs

  1. Ryzen 5700X
  2. MSI MEG X570 UNIFY, BIOS 7C35vAH (2023-07-03)
  3. ASRock Radeon RX 7900 XTX Phantom Gaming OC
  4. 128GB RAM (static hugepages were necessary for consistent performance at high host uptimes)

Host setup

  1. ReBAR disabled in UEFI, otherwise default, IIRC
  2. Debian 12, kernel 6.1.55-1, QEMU 7.2+dfsg-7+deb12u2, libvirt 9.0.0-4, no GUI
  3. vfio_pci vfio vfio_iommu_type1 added to /etc/initramfs-tools/modules
  4. GRUB_CMDLINE_LINUX="vfio-pci.ids=1002:744c,1002:ab30 video=efifb:off hugepages=17408" in /etc/default/grub
  5. VBIOS downloaded from here: techpowerup dot com/vgabios/254835/asrock-rx7900xtx-24576-221128-1 (a Taichi card in Quiet mode) and put in /usr/share/vgabios to let QEMU use it without disabling AppArmor
  6. CPU core isolation via systemd, toggled on demand
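
The hugepages=17408 value in item 4 can be sanity-checked against the guest size. A minimal sketch, assuming the default 2 MiB hugepage size on x86_64 (the function name hugepages_needed is made up for illustration):

```shell
#!/usr/bin/env bash
# Number of hugepages needed to back a guest of a given size.
# $1: guest memory in KiB (the unit libvirt uses in <memory unit='KiB'>)
# $2: hugepage size in KiB, defaults to 2048 (assumed default on x86_64)
hugepages_needed() {
  local guest_kib="$1"
  local page_kib="${2:-2048}"
  # Round up so a guest that is not a multiple of the page size still fits.
  echo $(( (guest_kib + page_kib - 1) / page_kib ))
}
```

For the 33554432 KiB (32 GiB) guest defined in the main XML, hugepages_needed 33554432 prints 16384, so the 17408 pages reserved on the kernel command line (34 GiB) leave 1024 pages, i.e. 2 GiB, of headroom.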

Guest setup

  1. main: Arch Linux, kernel 6.5.9-arch2-1, KDE Plasma 5, Wayland session
  2. occasional: Windows 10

Main libvirt XML

<domain type='kvm'>
  <name>arch</name>
  <uuid>uuid</uuid>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://archlinux.org/archlinux/rolling"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit='KiB'>33554432</memory>
  <currentMemory unit='KiB'>33554432</currentMemory>
  <memoryBacking>
    <hugepages/>
  </memoryBacking>
  <vcpu placement='static'>10</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='3'/>
    <vcpupin vcpu='1' cpuset='11'/>
    <vcpupin vcpu='2' cpuset='4'/>
    <vcpupin vcpu='3' cpuset='12'/>
    <vcpupin vcpu='4' cpuset='5'/>
    <vcpupin vcpu='5' cpuset='13'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='14'/>
    <vcpupin vcpu='8' cpuset='7'/>
    <vcpupin vcpu='9' cpuset='15'/>
    <emulatorpin cpuset='2,10'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-q35-7.2'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE_4M.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/arch_VARS.fd</nvram>
    <bootmenu enable='yes'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <vmport state='off'/>
  </features>
  <cpu mode='host-passthrough' check='none' migratable='off'>
    <topology sockets='1' dies='1' cores='5' threads='2'/>
    <feature policy='require' name='topoext'/>
  </cpu>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
      <source dev='/dev/disk/by-uuid/AAAA-AAAA'/>
      <target dev='vda' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
      <source dev='/dev/disk/by-uuid/BBBB-BBBB'/>
      <target dev='vdb' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='qemu-xhci' ports='15'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0xa'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0xb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0xc'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0xd'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0xe'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x6'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0xf'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x7'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='9' port='0x10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='10' port='0x11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='11' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='11' port='0x12'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='12' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='12' port='0x13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='13' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='13' port='0x14'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='14' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='14' port='0x15'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='xx:xx:xx:xx:xx:xx'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <interface type='network'>
      <mac address='yy:yy:yy:yy:yy:yy'/>
      <source network='default'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <audio id='1' type='none'/>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x30' slot='0x00' function='0x0'/>
      </source>
      <rom file='/usr/share/vgabios/ASRock.RX7900XTX.24576.221128_1-taichi-quiet.rom'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x30' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x32' slot='0x00' function='0x3'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <address type='pci' domain='0x0000' bus='0x0c' slot='0x00' function='0x0'/>
    </rng>
  </devices>
</domain>

Setup quirks

  1. The card wouldn’t get handled properly by amdgpu in the VM unless I gave it the VBIOS downloaded from the techpowerup site linked above. Using my own VBIOS, dumped with GPU-Z on a bare-metal Windows install, didn’t help at all - I think I was getting Unknown PCI header immediately after the first VM shutdown. With that Taichi VBIOS, and the AppArmor issues handled by putting the .rom in the proper directory, it started working immediately, even with the proprietary Nvidia drivers still installed in Arch.
  2. The Windows VM required me to add a virtual display, run DDU to remove the Nvidia drivers and then install drivers from the AMD website. Before that, with only the device IDs swapped from the previous RTX 3070, I was getting no video output whatsoever.
  3. As a general quirk of using Debian’s OVMF in newly created VMs, I had to manually specify OVMF_CODE_4M.fd as the firmware. OVMF_CODE_4M.ms.fd couldn’t boot ISOs or existing drives used earlier in VMs created and run on an Arch host, no idea why.

Performance

  1. On Linux it’s satisfying (imo), as long as the host isn’t doing too much in the background. Core isolation helps keep GPU usage closer to 100% and reduces stutters, notably in Cyberpunk and RDR2. I’m considering upgrading to a 5900X/5950X to dedicate a whole CCX to the VM. Static hugepages were a necessity for me because otherwise kcompactd0 ran at 100% and overall performance deteriorated as host uptime increased.
  2. On Windows, which I use only for VR, it’s a bit weird, but I’m not sure whether that’s down to the GPU. Once I started using static hugepages, the 3070 (on SteamVR before the 2.0 update) was enough for a stable 144Hz in Beat Saber at high graphical settings; with the Radeon (and SteamVR 2.0) I had to disable additional effects to get the frametimes to a somewhat stable 1ms. I don’t want to judge it right now, just saying, maybe it simply needs more tinkering.
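
The post does not show how core isolation "via systemd" is toggled. One common approach on cgroup v2 hosts is to restrict the host slices with the AllowedCPUs property. The sketch below assumes that approach, derives the CPU lists from the vcpupin/emulatorpin values in the XML above (guest and emulator on 2-7 and 10-15, leaving 0-1 and 8-9 for the host on a 16-thread CPU), and uses made-up function names; the SYSTEMCTL override exists only to allow a dry run.

```shell
#!/usr/bin/env bash
# Restrict host processes to a CPU set while the VM runs, then release them.
# Requires cgroup v2 and root. Set SYSTEMCTL=echo for a dry run
# (the variable is an assumption of this sketch).
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

set_host_cpus() {
  local cpus="$1" unit
  for unit in user.slice system.slice init.scope; do
    # --runtime keeps the change out of persistent configuration.
    "$SYSTEMCTL" set-property --runtime -- "$unit" "AllowedCPUs=$cpus"
  done
}

# CPU lists assumed from the pinning in the XML above.
isolate_host() { set_host_cpus "0-1,8-9"; }
release_host() { set_host_cpus "0-15"; }
```

isolate_host would typically be called from a libvirt hook before the guest starts and release_host after it stops. Unlike the isolcpus kernel parameter, this can be toggled without rebooting the host.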

Stability

This is my biggest concern. I use the host as a 24/7 server and occasionally run VMs on it as a game console. This card is prone to 3 failure modes I’ve seen so far:

  1. Driver timeouts when a GPU reset happens, of which only around 10% are recoverable (e.g. I’ve observed this with 100% successful failures in COD MW2 Campaign Remastered in the second mission after the tutorial, as the screen goes white). amdgpu will complain in dmesg, and sometimes it can recover from this state; however, most of the time a full VM stop & start cycle is required.
  2. Bugging out when the card is “detached” from the VM. This can happen when the VM is destroy --graceful’ed and started again, or reset - both performed through virsh. Normal reboots from inside the VM have been stable so far. When this issue happens, dmesg/journalctl will show stuff like:
  • vfio-pci 0000:30:00.{0,1}: Unable to change power state from D0 to D3hot, device inaccessible
  • vfio-pci 0000:30:00.{0,1}: Unable to change power state from D3cold to D0, device inaccessible
  • libvirtd[7297]: internal error: Unknown PCI header type '127' for device '0000:30:00.0'
    It’s quite annoying because a full host shutdown is required to get the card working again, even in host UEFI/GRUB. If the power going to the card isn’t cut off, it’ll stay unusable. However, that’s still not the biggest issue, as I can schedule a host power cycle to fix this. There’s still the following:
  3. Sudden host reboots. It seems that for this to happen the card must fall into some bad state (reset?) and possibly crash QEMU. I’ve observed this only once or twice, so pardon me if my recollection isn’t perfect, but it was roughly this: one CPU core goes to 100% on some kernel thread (kvm?), stays like that for a few seconds and then the host suddenly reboots. Unfortunately I don’t have any logs from those occurrences, but IIRC one time there was something about a kernel panic or oops logged to dmesg as I was watching it.
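
The "header type '127'" message in failure mode 2 is what a device that has dropped off the bus looks like: configuration reads return all ones, so the header type byte 0xff masked with 0x7f comes out as 127 and the vendor ID reads as ffff. A minimal sketch for checking this from the host follows; the function name and the sysfs root parameter are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Report whether a PCI function still answers configuration reads.
# $1: PCI address, e.g. 0000:30:00.0
# $2: sysfs root, defaults to /sys (parameter is an assumption of this sketch).
# Returns 0 if responding, 1 if config space reads all ones, 2 if unreadable.
pci_responds() {
  local addr="$1"
  local sysfs_root="${2:-/sys}"
  local cfg="$sysfs_root/bus/pci/devices/$addr/config"
  local vendor

  [ -r "$cfg" ] || { echo "cannot read $cfg" >&2; return 2; }

  # The first two bytes of config space hold the vendor ID.
  vendor="$(od -An -tx1 -N2 "$cfg" | tr -d ' \n')"
  if [ "$vendor" = "ffff" ]; then
    echo "$addr: inaccessible (config space reads all ones)"
    return 1
  fi
  echo "$addr: responding"
}
```

Running pci_responds 0000:30:00.0 after a VM shutdown, before starting the VM again, gives a quick indication of whether the card is in the state that needs a host power cycle.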

Especially this last issue is making me reconsider 4070 Ti, 4080 or going back to 3070 and finding less GPU intensive games to play :slight_smile:

Attempted fixes to stability issues

  • echo "0" > "/sys/bus/pci/devices/0000:30:00.{0,1}/d3cold_allowed" before or after starting VM - I don’t think it’s made any difference.
  • echos to /sys/bus/pci/devices/0000:30:00.{0,1}/{reset,remove}, trying to rescan PCIe devices afterwards - no luck
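
For anyone repeating these attempts: the {0,1} in the paths above is shorthand. The shell does not expand braces inside double quotes, and a redirection takes a single target, so each function has to be written separately. A sketch of both attempts, with made-up function names and a sysfs root argument added for testing; the poster reports that neither step helped on this card.

```shell
#!/usr/bin/env bash
# Write a value to d3cold_allowed for each listed PCI function.
# $1: value (0 or 1), $2: sysfs root (normally /sys), rest: PCI addresses.
set_d3cold() {
  local value="$1" sysfs_root="$2" addr
  shift 2
  for addr in "$@"; do
    echo "$value" > "$sysfs_root/bus/pci/devices/$addr/d3cold_allowed"
  done
}

# Remove the listed functions from the PCI tree, then rescan the bus.
# $1: sysfs root (normally /sys), rest: PCI addresses.
remove_and_rescan() {
  local sysfs_root="$1" addr
  shift
  for addr in "$@"; do
    echo 1 > "$sysfs_root/bus/pci/devices/$addr/remove"
  done
  echo 1 > "$sysfs_root/bus/pci/rescan"
}
```

On the host described here the calls would be set_d3cold 0 /sys 0000:30:00.0 0000:30:00.1 and remove_and_rescan /sys 0000:30:00.0 0000:30:00.1, both as root.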

So far it seems that if I had just one VM that I never had to stop or shut down, and I knew I’d never get a GPU timeout caused by (I assume) a malfunctioning game, then I could have an OK experience. But for how much this card costs, it really shouldn’t be so brittle…
vendor-reset doesn’t support 7000 series cards (yet) and 6000 series apparently didn’t need this kernel module. I’m sort of hoping that RDNA3 refresh won’t have this issue, in a tick-tock cycle. :eyes:

For comparison, the 3070 on the same machine was pretty much rock solid. It used to (very occasionally) hang the entire VM and not work until a host reboot (but without requiring power to be cut off), though once I started using hugepages this problem never occurred again. Maybe it was related, maybe it wasn’t. It sucks that I have to reserve RAM at boot time, but I’m OK with this tradeoff as it improves performance overall.

Other minor issues

  1. Unlike the 3070, this card doesn’t show anything while in OVMF EFI, systemd-boot or before the Windows desktop. Linux console output is visible, but only after amdgpu takes over the card, I suspect. virsh console vm-name can be used to navigate OVMF/systemd-boot, so it’s not a huge problem, just a bit annoying.
  2. I haven’t found a way to properly unbind the GPU from amdgpu on the host. I don’t remember what the issue was, but the card wasn’t handled properly by amdgpu inside the VM; maybe it failed at loading firmware into the GPU?
  3. Just like with the 3070, unless a VM with a GPU driver is running, the idle power consumption is 10 or 20 watts higher, I think.

AMD Product Verification Tool protip

Also just for future reference (since Starfield promotional offer is now over), I was able to run AMD PVT in Windows VM (instead of baremetal) but only after adding the following to libvirt config:

  <os>
  [...]
    <smbios mode='host'/>
  </os>

Otherwise the program would just attempt to autoupdate, and then shut down without any user facing error message.

I hope that the above information is correct, had to recall some of it from memory. Maybe we’ll figure out what’s stopping those cards from being 100% reliable in VMs :slight_smile:

Soon I’ll build my AMD 7950X 16-core system. For gaming, should I just yolo it and assign cores as I please, or should I think about pinning and isolation? With my current i9 9900K and Proxmox everything is snappy, even VR, without taking additional measures.

Using <rom bar="off"/> fixed the issue regarding reset.
The VM only shows output on the display once the AMD driver in the guest takes over, but I can now reboot or shut down and boot the guest OS without hitting the reset bug again.
In the past few months, I had difficulties with GPU passthrough until I reset my UEFI. It turns out that enabling Resizable BAR causes the UEFI to initialize the non-primary GPU too (as evidenced by POST and systemd service startup being printed on its display). Disabling Resizable BAR doesn’t resolve the issue; only resetting the UEFI does (and then re-configuring all the same options except Resizable BAR support).

Considering these two factors, passing through RDNA3/7000 series works well in my setup.

For me, skipping pinning only caused issues with audio (for which I use an emulated card and LG for SPICE audio passthrough). On a 3950X (also 16 cores) I pin 8 of the cores (2 threads each) without isolating them, and keep the emulator threads on a separate core.

Holy crap, the extra efi reset is the missing variable!

I’d try enabling Core Isolation in Windows instead of enabling Hyper-V. It worked wonders for me… Same effect, a little annoying, but no performance problems (at least none that I could tell).

I am not sure what it does. The reason I want Hyper-V is that the anti-cheat of some games (e.g. PUBG) does not complain (kick) if it is enabled. Does Core Isolation offer the same advantage?

What do you mean by resetting UEFI? On your host bios or somewhere else?

The Resizable BAR option is configured in the host’s UEFI, hence you reset the host’s UEFI.

I also have a 7900 XTX, and the card is 100% stable as long as I don’t overdo it with the OC; even then, most of the time only the driver crashes, and I’ve never had to restart the host.
One obvious thing to try is configuring the bus topology correctly in libvirt.
I have seen people with your configuration complain about poor OpenCL performance; maybe that is also the reason why your card is not stable.
I also use the BIOS from my own card and not one from another model. That doesn’t have to be the problem, but I would eliminate it as a possible cause before buying another card.


 <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
      </source>
      <rom file='/etc/libvirt/navi31.rom'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0' multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x05' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x1'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x05' slot='0x00' function='0x2'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x2'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x05' slot='0x00' function='0x3'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x3'/>
    </hostdev>

GRUB_DISTRIBUTOR="Manjaro"
GRUB_CMDLINE_LINUX_DEFAULT="verbose iommu=pt amdgpu.sg_display=0 mitigations=off udev.log_priority=3"

Thank you for looking into my report. My model doesn’t have the USB-C port present in some other 7900 XTX cards, and therefore shows only 2 PCIe devices on host:

IOMMU Group 31:
        30:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX] [1002:744c] (rev c8)
IOMMU Group 32:
        30:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 HDMI/DP Audio [1002:ab30]
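
For anyone wanting to produce a similar listing, here is a rough sketch that walks /sys/kernel/iommu_groups. The function name `list_iommu_groups` and the optional root argument (used only for testing) are my own. It prints bare PCI addresses, so the device names shown above would need something like `lspci -nns <address>` on top.

```sh
#!/bin/sh
# Print one "IOMMU Group N: <pci address>" line per device.
# $1: optional iommu_groups directory (hypothetical override for testing).
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        # Skip the unexpanded glob when no groups exist (IOMMU disabled).
        [ -e "$dev" ] || continue
        group="${dev%/devices/*}"   # strip "/devices/<address>"
        group="${group##*/}"        # keep only the group number
        printf 'IOMMU Group %s: %s\n' "$group" "${dev##*/}"
    done
}
```

No output at all usually means the IOMMU is not enabled in firmware or on the kernel command line.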

As you suggested, I’ve fixed the topology so that the audio device is a function of the GPU device, and switched back to the VBIOS I dumped from my card months ago:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x30' slot='0x00' function='0x0'/>
      </source>
      <rom file='/usr/share/vgabios/ASRock7900XTX-my-dump.rom'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0' multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x30' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x1'/>
    </hostdev>
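
The same layout can be generated rather than typed by hand. This is only a sketch: `gen_hostdevs` is a hypothetical helper, it assumes PCI domain 0000 and slot 00 on both host and guest, and it leaves out the `<rom file=.../>` line, which would still need adding to function 0 manually.

```sh
#!/bin/sh
# Emit one libvirt <hostdev> per function found on host bus $1, placing them
# all on guest bus $2, slot 0, with multifunction='on' set on function 0 only.
# $3: optional PCI devices directory (hypothetical override for testing).
gen_hostdevs() {
    host_bus="$1"
    guest_bus="$2"
    dev_root="${3:-/sys/bus/pci/devices}"
    for path in "$dev_root/0000:$host_bus:00."[0-7]; do
        [ -e "$path" ] || continue
        fn="${path##*.}"
        multi=""
        [ "$fn" = "0" ] && multi=" multifunction='on'"
        cat <<EOF
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x$host_bus' slot='0x00' function='0x$fn'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x$guest_bus' slot='0x00' function='0x$fn'$multi/>
    </hostdev>
EOF
    done
}
```

For the card above, `gen_hostdevs 30 07` should give two entries matching the hand-written ones apart from the missing rom line.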

I made those changes soon after your post and have used the VM as usual for a few weeks. The card still mostly works, so I must have had something else misconfigured when I was using my VBIOS dump earlier. I haven’t run OpenCL benchmarks before, so no comment on that matter.

Unfortunately my biggest issue remains unsolved: the card can still crash and require a host power cycle. To expand on this problem (noted in my original post as the 2nd “failure mode”): the card can fail after a normal reboot of the OS within the VM, but for that to happen it must (?) first encounter a hang or GPU reset reported in guest dmesg (for example by toying with llama.cpp and overusing GPU layer offloading) and survive a virsh reset. A later reboot inside the VM can then make the card die until the host is power cycled. These lines repeat in host dmesg after that last reboot of the guest:

vfio-pci 0000:30:00.1: vfio_bar_restore: reset recovery - restoring BARs
vfio-pci 0000:30:00.0: vfio_bar_restore: reset recovery - restoring BARs
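
To see how often each device hits this, something along these lines could tally the messages per PCI address. It is a sketch only: `count_bar_restores` is a made-up name, and it assumes the message format shown above, with or without a leading dmesg timestamp.

```sh
#!/bin/sh
# Read kernel log text on stdin and print "<pci address> <count>" for every
# device that logged a vfio_bar_restore reset recovery message.
count_bar_restores() {
    grep 'vfio_bar_restore: reset recovery' \
        | sed -n 's/.*vfio-pci \([0-9a-fA-F:.]*\): vfio_bar_restore.*/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

Typical use would be `dmesg | count_bar_restores` on the host right after the guest reboot that kills the card.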

I’m not sure if the Proxmox Support Forum thread “vfio_bar_restore: reset recovery - restoring BARs” is relevant to my case, especially since the poster uses a 6700 XT, but they reported that simply replacing the Gigabyte Aorus 6700 XT Elite 12G card with an MSI Radeon RX 6700 XT Mech 2X 12GB helped.

Still speaking of 6700XT, there’s also this, which suggests that some VBIOS can have working reset:

Which 7900XTX model do you have exactly, @Janos?

So, I have the 7900 XTX running perfectly with Proxmox 8.1.3 (kernel 6.5.11) as a passthrough to a Fedora guest VM. I cannot get it to work with a Windows 11 guest VM, though. It is driving me crazy.

Actually, I had Windows 11 working last night, but this morning, I went to use it, and it wasn’t working again (after a week of working to get it functional).

Now, it is back to a non-working state. The best I can get is garbled/distorted/artifact screens (random colored lines and boxes) when Windows starts up. Again, Fedora starts up with the same GPU totally fine.

Some other notes:

  • With Windows I get the same result whether I have a romfile or not.
  • When I RDP into the Windows machine, I see the driver in Device Manager and it is working.
  • I was able to install the Adrenalin driver, but I had to use the offline installer.

Fedora is passing through the entire device, and all functions. Here is the config:

acpi: 1
agent: 1
balloon: 0
bios: ovmf
boot: order=virtio0
cores: 8
cpu: host
efidisk0: vmstorage:1500/vm-1500-disk-1.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:03:00,pcie=1,rombar=0
hostpci1: 0000:69:00,pcie=1
hostpci2: 0000:6d:00.0,pcie=1
machine: q35
memory: 32768
meta: creation-qemu=8.0.2,ctime=1691801072
name: personal-fedora38
net0: virtio=CA:84:67:5A:A5:F3,bridge=vmbr10,firewall=1
numa: 0
ostype: l26
scsihw: virtio-scsi-single
smbios1: uuid=d3124f76-9e9e-4d62-b354-82b0f47c315e
sockets: 1
startup: order=1
tpmstate0: vmstorage:1500/vm-1500-disk-0.raw,size=4M,version=v2.0
usb0: host=320f:504b
usb1: host=046d:c52b
vga: none
virtio0: vmstorage:1500/vm-1500-disk-2.qcow2,iothread=1,size=250G
virtio1: vmstorage:1500/vm-1500-disk-3.qcow2,iothread=1,size=500G
vmgenid: ab64fe49-720b-4f94-85b2-48db44191e91

Windows is only getting passed the first two functions, GPU and Audio. Here is that config:

agent: 1
args: -cpu 'host,-hypervisor,kvm=off'
balloon: 0
bios: ovmf
boot: order=virtio0;ide0;ide2;net0
cores: 8
cpu: host
efidisk0: vmstorage:1003/vm-1003-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
hostpci0: 0000:03:00.0,pcie=1,rombar=0,romfile=vbios_1002_744c.bin,x-vga=1
hostpci2: 0000:03:00.1,pcie=1
hostpci1: 0000:69:00.0
hostpci3: 0000:6d:00.0
ide0: iso:iso/virtio-win-0.1.229.iso,media=cdrom,size=522284K
ide2: iso:iso/Win11_22H2_English_x64v2.iso,media=cdrom,size=5705260K
machine: q35
memory: 32768
meta: creation-qemu=8.1.2,ctime=1702607309
name: HeNe
net0: virtio=ca:84:67:5a:a5:f3,bridge=vmbr10,firewall=1
numa: 0
ostype: win11
scsihw: virtio-scsi-single
smbios1: uuid=527f8787-942b-4510-b875-c31c63f7548b
sockets: 1
spice_enhancements: foldersharing=1,videostreaming=filter
startup: order=1,up=10
tpmstate0: vmstorage:1003/vm-1003-disk-1.qcow2,size=4M,version=v2.0
vga: none
virtio0: vmstorage:1003/vm-1003-disk-2.qcow2,iothread=1,size=500G
vmgenid: 24dee5c5-022b-485b-9716-90dbf774fffa

Any ideas?

DDU the driver (perhaps via a spice display temporarily) and reinstall it. There’s also Windows safe mode and trying another/fresh Windows install, at least for ruling out general Windows compatibility.
I’d also try adding all functions.
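
On Proxmox, a hostpci entry without a function suffix (like `0000:03:00` in the Fedora config) passes all functions of the device, while one with a suffix (`0000:03:00.0`) passes only that function. A quick way to spot the single-function entries in a VM config might look like this. `single_function_hostpci` is a hypothetical helper name, and the VMID in the usage line is only guessed from the disk names in the config above.

```sh
#!/bin/sh
# Read a Proxmox VM config on stdin and print the hostpci lines that name one
# specific PCI function (address ends in ".N") rather than the whole device.
single_function_hostpci() {
    grep -E '^hostpci[0-9]+: *([0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]'
}
```

For example: `single_function_hostpci < /etc/pve/qemu-server/1003.conf`. Any GPU line it prints could be shortened to the bus:slot form to hand every function to the guest.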