Hello, I’m posting here about my experiences with RX 7900 XTX and VFIO to give others (and write down for myself) some more data about this card’s behavior.
Some specs
- Ryzen 5700X
- MSI MEG X570 UNIFY, BIOS 7C35vAH (2023-07-03)
- ASRock Radeon RX 7900 XTX Phantom Gaming OC
- 128GB RAM (static hugepages were necessary for consistent performance at high host uptimes)
Host setup
- ReBAR disabled in UEFI, otherwise default, IIRC
- Debian 12, kernel 6.1.55-1, QEMU 7.2+dfsg-7+deb12u2, libvirt 9.0.0-4, no GUI
-
vfio_pci vfio vfio_iommu_type1
added to /etc/initramfs-tools/modules
-
GRUB_CMDLINE_LINUX="vfio-pci.ids=1002:744c,1002:ab30 video=efifb:off hugepages=17408"
in /etc/default/grub
- VBIOS downloaded from here: techpowerup dot com/vgabios/254835/asrock-rx7900xtx-24576-221128-1 (a Taichi card in Quiet mode) and put in
/usr/share/vgabios
to let QEMU use it without disabling AppArmor
- CPU core isolation via systemd, toggled on demand
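The hugepages=17408 figure can be sanity-checked against the guest's memory size. The sketch below assumes the x86_64 default hugepage size of 2 MiB (no hugepagesz= is set on the command line above). The function names are made up for illustration.

```shell
#!/bin/sh
# Assumption: default 2 MiB hugepages (2048 KiB each).
HUGEPAGE_KIB=2048

# Number of hugepages needed to back a guest of $1 KiB (rounded up).
pages_for_kib() {
    echo $(( ($1 + HUGEPAGE_KIB - 1) / HUGEPAGE_KIB ))
}

# Pages left over from a reservation of $1 pages after backing a $2 KiB guest.
spare_pages() {
    echo $(( $1 - $(pages_for_kib "$2") ))
}

# Example calls:
#   pages_for_kib 33554432        # the 32 GiB guest from the libvirt XML
#   spare_pages 17408 33554432    # headroom left in the boot-time reservation
#   grep -i hugepages /proc/meminfo   # what the host actually reserved
```

With these numbers the 32 GiB guest needs 16384 pages, so the reservation of 17408 pages (34 GiB) leaves 1024 pages, i.e. 2 GiB, unused while that guest runs.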
Guest setup
- main: Arch Linux, kernel 6.5.9-arch2-1, KDE Plasma 5, Wayland session
- occasional: Windows 10
Main libvirt XML
<domain type='kvm'>
<name>arch</name>
<uuid>uuid</uuid>
<metadata>
<libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
<libosinfo:os id="http://archlinux.org/archlinux/rolling"/>
</libosinfo:libosinfo>
</metadata>
<memory unit='KiB'>33554432</memory>
<currentMemory unit='KiB'>33554432</currentMemory>
<memoryBacking>
<hugepages/>
</memoryBacking>
<vcpu placement='static'>10</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='3'/>
<vcpupin vcpu='1' cpuset='11'/>
<vcpupin vcpu='2' cpuset='4'/>
<vcpupin vcpu='3' cpuset='12'/>
<vcpupin vcpu='4' cpuset='5'/>
<vcpupin vcpu='5' cpuset='13'/>
<vcpupin vcpu='6' cpuset='6'/>
<vcpupin vcpu='7' cpuset='14'/>
<vcpupin vcpu='8' cpuset='7'/>
<vcpupin vcpu='9' cpuset='15'/>
<emulatorpin cpuset='2,10'/>
</cputune>
<os>
<type arch='x86_64' machine='pc-q35-7.2'>hvm</type>
<loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE_4M.fd</loader>
<nvram>/var/lib/libvirt/qemu/nvram/arch_VARS.fd</nvram>
<bootmenu enable='yes'/>
</os>
<features>
<acpi/>
<apic/>
<vmport state='off'/>
</features>
<cpu mode='host-passthrough' check='none' migratable='off'>
<topology sockets='1' dies='1' cores='5' threads='2'/>
<feature policy='require' name='topoext'/>
</cpu>
<clock offset='utc'>
<timer name='rtc' tickpolicy='catchup'/>
<timer name='pit' tickpolicy='delay'/>
<timer name='hpet' present='no'/>
</clock>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>destroy</on_crash>
<pm>
<suspend-to-mem enabled='no'/>
<suspend-to-disk enabled='no'/>
</pm>
<devices>
<emulator>/usr/bin/qemu-system-x86_64</emulator>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
<source dev='/dev/disk/by-uuid/AAAA-AAAA'/>
<target dev='vda' bus='virtio'/>
<boot order='1'/>
<address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
</disk>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
<source dev='/dev/disk/by-uuid/BBBB-BBBB'/>
<target dev='vdb' bus='virtio'/>
<address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
</disk>
<controller type='usb' index='0' model='qemu-xhci' ports='15'>
<address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='0' model='pcie-root'/>
<controller type='pci' index='1' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='1' port='0x8'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/>
</controller>
<controller type='pci' index='2' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='2' port='0x9'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
</controller>
<controller type='pci' index='3' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='3' port='0xa'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
</controller>
<controller type='pci' index='4' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='4' port='0xb'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
</controller>
<controller type='pci' index='5' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='5' port='0xc'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
</controller>
<controller type='pci' index='6' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='6' port='0xd'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x5'/>
</controller>
<controller type='pci' index='7' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='7' port='0xe'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x6'/>
</controller>
<controller type='pci' index='8' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='8' port='0xf'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x7'/>
</controller>
<controller type='pci' index='9' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='9' port='0x10'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
</controller>
<controller type='pci' index='10' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='10' port='0x11'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
</controller>
<controller type='pci' index='11' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='11' port='0x12'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
</controller>
<controller type='pci' index='12' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='12' port='0x13'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
</controller>
<controller type='pci' index='13' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='13' port='0x14'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
</controller>
<controller type='pci' index='14' model='pcie-root-port'>
<model name='pcie-root-port'/>
<target chassis='14' port='0x15'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
</controller>
<controller type='scsi' index='0' model='virtio-scsi'>
<address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
</controller>
<controller type='sata' index='0'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
</controller>
<controller type='virtio-serial' index='0'>
<address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</controller>
<interface type='bridge'>
<mac address='xx:xx:xx:xx:xx:xx'/>
<source bridge='br0'/>
<model type='virtio'/>
<address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</interface>
<interface type='network'>
<mac address='yy:yy:yy:yy:yy:yy'/>
<source network='default'/>
<model type='virtio'/>
<address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
</interface>
<serial type='pty'>
<target type='isa-serial' port='0'>
<model name='isa-serial'/>
</target>
</serial>
<console type='pty'>
<target type='serial' port='0'/>
</console>
<channel type='unix'>
<target type='virtio' name='org.qemu.guest_agent.0'/>
<address type='virtio-serial' controller='0' bus='0' port='1'/>
</channel>
<input type='tablet' bus='usb'>
<address type='usb' bus='0' port='1'/>
</input>
<input type='mouse' bus='ps2'/>
<input type='keyboard' bus='ps2'/>
<audio id='1' type='none'/>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x30' slot='0x00' function='0x0'/>
</source>
<rom file='/usr/share/vgabios/ASRock.RX7900XTX.24576.221128_1-taichi-quiet.rom'/>
<address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x30' slot='0x00' function='0x1'/>
</source>
<address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x32' slot='0x00' function='0x3'/>
</source>
<address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
</hostdev>
<memballoon model='virtio'>
<address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
</memballoon>
<rng model='virtio'>
<backend model='random'>/dev/urandom</backend>
<address type='pci' domain='0x0000' bus='0x0c' slot='0x00' function='0x0'/>
</rng>
</devices>
</domain>
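The vcpupin entries above pair host CPUs such as 3 and 11, which suggests that CPU N and CPU N+8 are SMT siblings on this machine. A small sketch for checking that on the host via sysfs follows. It assumes thread_siblings_list uses the comma-separated form (e.g. 3,11) rather than a range, and the SYSFS_CPU variable and function names are hypothetical.

```shell
#!/bin/sh
# Root of the CPU sysfs tree; overridable so the helpers can be tried on a copy.
SYSFS_CPU="${SYSFS_CPU:-/sys/devices/system/cpu}"

# Print the SMT sibling list of host CPU $1, e.g. "3,11".
siblings_of() {
    cat "$SYSFS_CPU/cpu$1/topology/thread_siblings_list"
}

# Succeed if host CPU $2 appears in the sibling list of host CPU $1.
# Assumption: comma-separated list, not a range like "0-1".
are_siblings() {
    case ",$(siblings_of "$1")," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: check the first vCPU pair from the XML.
#   are_siblings 3 11 && echo "3 and 11 share a core"
```

If a pair turns out not to share a core, the guest's threads='2' topology would not match the physical layout and the pinning would need adjusting.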
Setup quirks
- The card wasn’t handled properly by amdgpu in the VM until I gave it the VBIOS downloaded from the techpowerup page linked above. Using my own VBIOS, dumped with GPU-Z on a baremetal Windows install, didn’t help at all - I think I was getting
Unknown PCI header
immediately after the first VM shutdown. With the Taichi VBIOS, and with the AppArmor issue handled by putting the .rom in the proper directory, it started working immediately, even with the proprietary Nvidia drivers still installed in Arch.
- The Windows VM required me to add a virtual display, run DDU to remove the Nvidia drivers and then install the drivers from AMD’s website. Before that, with just the device IDs swapped from the previous RTX 3070, I was getting no video output whatsoever.
- As a general quirk related to using Debian’s OVMF in newly created VMs - I had to manually specify
OVMF_CODE_4M.fd
as firmware. With OVMF_CODE_4M.ms.fd
the VM couldn’t boot ISOs or existing drives that had earlier been used in VMs created and run on an Arch host, no idea why.
Performance
- On Linux it’s satisfying (imo), as long as host isn’t doing too much in background. Core isolation helps keep GPU usage closer to 100% and reduce stutters, notably in Cyberpunk and RDR2. I’m considering upgrading to 5900X/5950X for a whole dedicated CCX to use with VM. Static hugepages were a necessity for me because of
kcompactd0
running at 100% otherwise, and the overall performance deteriorating with host uptime increasing.
- On Windows, which I use only for VR, it’s a bit weird, but I’m not sure if that’s down to the GPU. Once I started using static hugepages, the 3070 (on SteamVR before the 2.0 update) was enough for a stable 144Hz in Beat Saber at high graphical settings. With the Radeon (and SteamVR 2.0) I had to disable additional effects to get the frametimes to a somewhat stable 1ms. I don’t want to judge it right now, maybe it simply needs more tinkering.
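For reference, on-demand core isolation with systemd is commonly done through the AllowedCPUs cgroup property (cgroup v2). The sketch below is an assumption about what such a toggle can look like, not necessarily the exact commands used here. The host CPU set 0-2,8-10 is simply everything the XML does not pin to vCPUs, and the function names and the APPLY switch are made up.

```shell
#!/bin/sh
# Assumption: vCPUs are pinned to 3-7 and 11-15, so the host keeps 0-2,8-10.
HOST_CPUS="${HOST_CPUS:-0-2,8-10}"
ALL_CPUS="${ALL_CPUS:-0-15}"
UNITS="system.slice user.slice init.scope"

# Restrict (or widen) the host's own slices to the CPU list in $1.
# Prints the commands by default; set APPLY=1 to run them (needs root).
set_allowed_cpus() {
    for unit in $UNITS; do
        if [ "${APPLY:-0}" = "1" ]; then
            systemctl set-property --runtime -- "$unit" AllowedCPUs="$1"
        else
            echo "systemctl set-property --runtime -- $unit AllowedCPUs=$1"
        fi
    done
}

# Call before starting the VM: keep host tasks off the guest's cores.
isolate_cores() { set_allowed_cpus "$HOST_CPUS"; }

# Call after stopping the VM: give the host all cores back.
release_cores() { set_allowed_cpus "$ALL_CPUS"; }
```

Because --runtime is used, the restriction does not survive a host reboot, which fits a setup where isolation is only wanted while a VM is running.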
Stability
This is my biggest concern. I use the host as 24/7 server and occasionally VMs on it as a game console. This card is prone to 3 failure modes I’ve seen so far:
- Driver timeouts followed by a GPU reset, recoverable only around 10% of the time (e.g. I can reproduce this every single time in COD MW2 Campaign Remastered, in the second mission after the tutorial, as the screen goes white). amdgpu will complain in dmesg, and sometimes it can recover from this state, however most of the time a full VM stop & start cycle is required.
- Bugging out when the card is “detached” from the VM. This can happen when the VM is stopped with virsh destroy --graceful and started again, or reset with virsh reset. Normal reboots happening inside the VM have been stable so far. When this issue happens, dmesg/journalctl will show stuff like:
vfio-pci 0000:30:00.{0,1}: Unable to change power state from D0 to D3hot, device inaccessible
vfio-pci 0000:30:00.{0,1}: Unable to change power state from D3cold to D0, device inaccessible
libvirtd[7297]: internal error: Unknown PCI header type '127' for device '0000:30:00.0'
It’s quite annoying because a full host shutdown is required to get the card working again, even in host UEFI/GRUB. If power to the card isn’t cut, it stays unusable. However, that’s still not the biggest issue, as I can schedule a host power cycle to fix it. There’s still the following:
-
Sudden host reboots. It seems that for this to happen the card must fall into some bad state (reset?) and possibly crash QEMU. I’ve observed this only once or twice, so pardon me if my recollection isn’t perfect, but it was roughly this: one CPU core goes to 100% on some kernel thread (kvm?), stays like that for a few seconds, and suddenly the host reboots. Unfortunately I don’t have any logs from those occurrences, but IIRC one time there was something about a kernel panic or oops logged to dmesg as I was watching it.
This last issue especially is making me reconsider a 4070 Ti or 4080, or going back to the 3070 and finding less GPU-intensive games to play.
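A note on the “Unknown PCI header type '127'” message from the second failure mode: most likely it just means the card no longer answers config-space reads. An inaccessible PCI device reads back as all ones, and 0xFF with the multi-function bit masked off is 0x7F, i.e. 127. The sketch below reads the header byte straight from sysfs to detect that state. The function names are made up, and the masking is my understanding of how libvirt arrives at the number, not something confirmed from its source.

```shell
#!/bin/sh
# Print the raw header-type byte (config space offset 0x0e) of PCI device $1
# as two hex digits, e.g. "80" for a healthy multi-function endpoint.
read_header_byte() {
    od -An -tx1 -j14 -N1 "/sys/bus/pci/devices/$1/config" | tr -d ' \n'
}

# Mask off bit 7 (multi-function flag) of the hex byte in $1 and print
# the remaining header type as a decimal number.
header_type() {
    echo $(( 0x$1 & 0x7f ))
}

# Succeed if device $1 looks inaccessible (header type reads as 127).
gpu_looks_dead() {
    [ "$(header_type "$(read_header_byte "$1")")" -eq 127 ]
}

# Example, run on the host:
#   gpu_looks_dead 0000:30:00.0 && echo "card needs a full power cycle"
```

A check like this could run before starting a VM, so the broken state is noticed before libvirt fails on it.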
Attempted fixes to stability issues
-
for fn in 0 1; do echo 0 > "/sys/bus/pci/devices/0000:30:00.$fn/d3cold_allowed"; done
before or after starting VM - I don’t think it’s made any difference.
- echos to
/sys/bus/pci/devices/0000:30:00.{0,1}/{reset,remove}
, trying to rescan PCIe devices afterwards - no luck
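Spelled out, the remove-and-rescan attempt from the last bullet looks roughly like the sketch below. It is not a fix (it didn’t help with this card), just the sequence written down for anyone who wants to try it on theirs. SYSFS_PCI, GPU_FUNCS, RESCAN_DELAY and the function name are made up for illustration. The sysfs files themselves are the standard remove and rescan nodes, and writing to them needs root.

```shell
#!/bin/sh
# Root of the PCI sysfs tree; overridable so the function can be dry-run elsewhere.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"
# Both functions of the card: VGA (.0) and HDMI audio (.1).
GPU_FUNCS="${GPU_FUNCS:-0000:30:00.0 0000:30:00.1}"

# Drop both functions from the PCI tree, wait a moment, then ask the
# kernel to rescan the bus so the devices get re-enumerated.
remove_and_rescan() {
    for dev in $GPU_FUNCS; do
        echo 1 > "$SYSFS_PCI/devices/$dev/remove"
    done
    sleep "${RESCAN_DELAY:-1}"
    echo 1 > "$SYSFS_PCI/rescan"
}

# Example, run as root on the host with the VM stopped:
#   remove_and_rescan
```

Once the card is in the “device inaccessible” state, the rescan here brings back devices that still don’t respond, which matches the observation that only cutting power helps.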
So far it seems that if I had just one VM that I never had to stop or shut down, and I knew I’d never get a GPU timeout caused by (I assume) a malfunctioning game, then I could have an ok experience. But for how much this card costs, it really shouldn’t be this brittle…
vendor-reset
doesn’t support 7000 series cards (yet) and 6000 series apparently didn’t need this kernel module. I’m sort of hoping that RDNA3 refresh won’t have this issue, in a tick-tock cycle.
For comparison, the 3070 on the same machine was pretty much rock solid. It used to (very occasionally) hang the entire VM and not work until a host reboot (but without requiring power to be cut off), though once I started using hugepages this problem never occurred again. Maybe it was related, maybe it wasn’t. It sucks that I have to reserve RAM at boot time, but I’m ok with this tradeoff as it improves performance overall.
Other minor issues
- Unlike 3070, this card doesn’t show anything while in OVMF EFI, systemd-boot or before Windows desktop. Linux console output is visible, but only after amdgpu takes care of the card, I suspect.
virsh console vm-name
can be used to navigate OVMF/systemd-boot, so it’s not a huge problem, just a bit annoying.
- I haven’t found a way to properly unbind the GPU from amdgpu on the host. I don’t remember exactly what the issue was, but the card wasn’t handled properly by amdgpu inside the VM afterwards, maybe it failed at loading firmware into the GPU?
- Just like 3070, unless a VM with GPU driver is running, the idle power consumption is 10 or 20 watts higher, I think.
AMD Product Verification Tool protip
Also just for future reference (since Starfield promotional offer is now over), I was able to run AMD PVT in Windows VM (instead of baremetal) but only after adding the following to libvirt config:
<os>
[...]
<smbios mode='host'/>
</os>
Otherwise the program would just attempt to autoupdate, and then shut down without any user-facing error message.
I hope that the above information is correct, I had to recall some of it from memory. Maybe we’ll figure out what’s stopping those cards from being 100% reliable in VMs.