2024 VFIO Performance Loss vs Bare Metal

Dear Exalted Techies,

I’ve been experimenting with VFIO for some time but have been unable to achieve the fabled “bare-metal” performance described by others. I will use the game Squad as my primary example here. I am certain this is a case of me just needing to tweak something, but I’m at my wits’ end trying to work out what that is. Sadly, VFIO is still a niche topic; information is sparse and often dated. Help me, Level1Techs, you’re my only hope!

System Spec
CPU: i7-10700
GPU: GTX 1070TI
RAM: 32GB
Host: Debian 12 (Stable)
Linux Kernel ver: 6.1.0-28
QEMU emulator ver: 7.2.13
Looking Glass Ver: B7-rc1 (d12 capture)

Background
In demanding games (Squad) I see a 30-40% loss in performance versus bare metal. In older titles (CS:Source, World in Conflict) the performance drop is more modest at maybe 10-20%. For example - on the same hardware, on a bare metal Windows 10 install I get a very solid 60 FPS on Squad irrespective of what is happening in the game. If I transpose this Windows 10 install into a VM I get maybe 45 FPS in quieter scenes, with the frame rate absolutely tanking to the mid 30s during battles.

Now, I suspect something funky is going on because Squad is not using all of my VM’s resources. According to Windows Task Manager I’m using ~40% of my allocated CPU resources & 50% of my GPU resources. Disk usage, according to Windows, is negligible. HWMonitor indicates that individual cores are hitting turbo frequency. At the same time my host machine is essentially “idling”, running no applications other than looking-glass & virt-manager (excluding background processes etc).

It is frustrating that people say VFIO can achieve “bare-metal-like” performance without actually quantifying what that means, i.e. whether I should expect to lose 1% of my total performance or 10%.

VM Configuration
Grub kernel parameters:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt vfio-pci.ids=10de:1cb1,10de:0fb9"
Hugepages:

$ cat /proc/meminfo | grep Huge
AnonHugePages:    544768 kB
ShmemHugePages:   684032 kB
FileHugePages:         0 kB
HugePages_Total:    8192
HugePages_Free:     8192
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        16777216 kB
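
As a sanity check on the numbers above: 8192 pages of 2048 kB each come to exactly the 16 GiB reported as Hugetlb, and HugePages_Free being equal to HugePages_Total means none of the pool was in use when this snapshot was taken (presumably the VM was shut down at that moment, which is an assumption). A minimal Python sketch of that arithmetic, working on meminfo-style text; the helper name `hugepage_summary` is made up for illustration:

```python
def hugepage_summary(meminfo_text):
    """Summarise the static hugepage pool from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    total = fields["HugePages_Total"]
    free = fields["HugePages_Free"]
    size_kib = fields["Hugepagesize"]
    return {
        "pool_kib": total * size_kib,   # size of the whole reserved pool
        "in_use_pages": total - free,   # pages currently backing something
    }


# The relevant lines from the output posted above.
SAMPLE = """\
HugePages_Total:    8192
HugePages_Free:     8192
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""

print(hugepage_summary(SAMPLE))
```

Feeding it the contents of /proc/meminfo while the VM is running should show `in_use_pages` rise if the guest memory is really being backed by hugepages.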

CPU Pinning:
NB: Many guides suggest using emulatorpin cpuset & iothreadpin; however, I’ve tested the VM with & without these explicitly defined and it doesn’t seem to have a substantial effect on performance. I’ve also tried upping iothreads to 2 (and using two iothreadpins) without any luck.

  <vcpu placement="static">14</vcpu>
  <iothreads>1</iothreads>
  <cputune>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="8"/>
    <vcpupin vcpu="2" cpuset="1"/>
    <vcpupin vcpu="3" cpuset="9"/>
    <vcpupin vcpu="4" cpuset="2"/>
    <vcpupin vcpu="5" cpuset="10"/>
    <vcpupin vcpu="6" cpuset="3"/>
    <vcpupin vcpu="7" cpuset="11"/>
    <vcpupin vcpu="8" cpuset="4"/>
    <vcpupin vcpu="9" cpuset="12"/>
    <vcpupin vcpu="10" cpuset="5"/>
    <vcpupin vcpu="11" cpuset="13"/>
  </cputune>
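
One thing that can be checked mechanically in the excerpt above: it declares 14 vCPUs but only carries vcpupin entries for vCPUs 0 to 11. As far as I understand libvirt, a vCPU without a vcpupin entry is not tied to a single host CPU. A minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `unpinned_vcpus` is hypothetical and the embedded XML is just the excerpt wrapped in a bare domain element:

```python
import xml.etree.ElementTree as ET


def unpinned_vcpus(domain_xml):
    """Return the vCPU indices that have no <vcpupin> entry."""
    root = ET.fromstring(domain_xml)
    vcpu_count = int(root.findtext("vcpu"))
    pinned = {int(p.get("vcpu")) for p in root.iterfind("cputune/vcpupin")}
    return [n for n in range(vcpu_count) if n not in pinned]


# The pinning excerpt from the post, wrapped so it parses on its own.
EXCERPT = """
<domain type="kvm">
  <vcpu placement="static">14</vcpu>
  <cputune>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="8"/>
    <vcpupin vcpu="2" cpuset="1"/>
    <vcpupin vcpu="3" cpuset="9"/>
    <vcpupin vcpu="4" cpuset="2"/>
    <vcpupin vcpu="5" cpuset="10"/>
    <vcpupin vcpu="6" cpuset="3"/>
    <vcpupin vcpu="7" cpuset="11"/>
    <vcpupin vcpu="8" cpuset="4"/>
    <vcpupin vcpu="9" cpuset="12"/>
    <vcpupin vcpu="10" cpuset="5"/>
    <vcpupin vcpu="11" cpuset="13"/>
  </cputune>
</domain>
"""

print(unpinned_vcpus(EXCERPT))
```

The same function could be pointed at the output of `virsh dumpxml` for the domain to check the live definition.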

Hyperv
NB: Most guides I’ve found specify using “related state=‘on’”, but virt-manager throws an “unsupported hyperv enlightenment feature” error if I try to use this.

    <hyperv mode="custom">
      <vapic state="on"/>
      <spinlocks state="on" retries="8191"/>
      <vpindex state="on"/>
      <runtime state="on"/>
      <synic state="on"/>
      <stimer state="on"/>
      <reset state="on"/>
      <frequencies state="on"/>
    </hyperv>

Virtual Disk Config

<disk type="file" device="disk">
  <driver name="qemu" type="raw"/>
  <source file="/mnt/SSD//VMs/Gamer/gamer.img"/>
  <target dev="sda" bus="sata"/>
  <boot order="1"/>
  <address type="drive" controller="0" bus="0" target="0" unit="0"/>
</disk>
Full VM XML
<domain type="kvm">
  <name>Gamer</name>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://microsoft.com/win/10"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit="KiB">16777216</memory>
  <currentMemory unit="KiB">16777216</currentMemory>
  <memoryBacking>
    <hugepages/>
  </memoryBacking>
  <vcpu placement="static">14</vcpu>
  <iothreads>1</iothreads>
  <cputune>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="8"/>
    <vcpupin vcpu="2" cpuset="1"/>
    <vcpupin vcpu="3" cpuset="9"/>
    <vcpupin vcpu="4" cpuset="2"/>
    <vcpupin vcpu="5" cpuset="10"/>
    <vcpupin vcpu="6" cpuset="3"/>
    <vcpupin vcpu="7" cpuset="11"/>
    <vcpupin vcpu="8" cpuset="4"/>
    <vcpupin vcpu="9" cpuset="12"/>
    <vcpupin vcpu="10" cpuset="5"/>
    <vcpupin vcpu="11" cpuset="13"/>
  </cputune>
  <os>
    <type arch="x86_64" machine="pc-q35-7.2">hvm</type>
    <loader readonly="yes" type="pflash">/usr/share/OVMF/OVMF_CODE_4M.ms.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/Gamer_VARS.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv mode="custom">
      <vapic state="on"/>
      <spinlocks state="on" retries="8191"/>
      <vpindex state="on"/>
      <runtime state="on"/>
      <synic state="on"/>
      <stimer state="on"/>
      <reset state="on"/>
      <frequencies state="on"/>
    </hyperv>
    <vmport state="off"/>
  </features>
  <cpu mode="host-passthrough" check="none" migratable="off">
    <topology sockets="1" dies="1" cores="7" threads="2"/>
  </cpu>
  <clock offset="localtime">
    <timer name="rtc" present="no" tickpolicy="catchup"/>
    <timer name="pit" present="no" tickpolicy="delay"/>
    <timer name="hpet" present="no"/>
    <timer name="kvmclock" present="no"/>
    <timer name="hypervclock" present="yes"/>
    <timer name="tsc" present="yes" mode="native"/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled="no"/>
    <suspend-to-disk enabled="no"/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type="file" device="disk">
      <driver name="qemu" type="raw"/>
      <source file="/mnt/SSD//VMs/Gamer/gamer.img"/>
      <target dev="sda" bus="sata"/>
      <boot order="1"/>
      <address type="drive" controller="0" bus="0" target="0" unit="0"/>
    </disk>
    <controller type="usb" index="0" model="qemu-xhci" ports="15">
      <address type="pci" domain="0x0000" bus="0x02" slot="0x00" function="0x0"/>
    </controller>
    <controller type="pci" index="0" model="pcie-root"/>
    <controller type="pci" index="1" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="1" port="0x10"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0" multifunction="on"/>
    </controller>
    <controller type="pci" index="2" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="2" port="0x11"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x1"/>
    </controller>
    <controller type="pci" index="3" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="3" port="0x12"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x2"/>
    </controller>
    <controller type="pci" index="4" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="4" port="0x13"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x3"/>
    </controller>
    <controller type="pci" index="5" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="5" port="0x14"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x4"/>
    </controller>
    <controller type="pci" index="6" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="6" port="0x15"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x5"/>
    </controller>
    <controller type="pci" index="7" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="7" port="0x16"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x6"/>
    </controller>
    <controller type="pci" index="8" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="8" port="0x17"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x7"/>
    </controller>
    <controller type="pci" index="9" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="9" port="0x18"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x0" multifunction="on"/>
    </controller>
    <controller type="pci" index="10" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="10" port="0x19"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x1"/>
    </controller>
    <controller type="pci" index="11" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="11" port="0x1a"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x2"/>
    </controller>
    <controller type="pci" index="12" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="12" port="0x1b"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x3"/>
    </controller>
    <controller type="pci" index="13" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="13" port="0x1c"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x4"/>
    </controller>
    <controller type="pci" index="14" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="14" port="0x1d"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x5"/>
    </controller>
    <controller type="pci" index="15" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="15" port="0x1e"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x6"/>
    </controller>
    <controller type="pci" index="16" model="pcie-to-pci-bridge">
      <model name="pcie-pci-bridge"/>
      <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
    </controller>
    <controller type="sata" index="0">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x1f" function="0x2"/>
    </controller>
    <controller type="virtio-serial" index="0">
      <address type="pci" domain="0x0000" bus="0x03" slot="0x00" function="0x0"/>
    </controller>
    <interface type="bridge">
      <mac address="(redacted)"/>
      <source bridge="virbr0"/>
      <model type="e1000e"/>
      <link state="up"/>
      <address type="pci" domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
    </interface>
    <channel type="spicevmc">
      <target type="virtio" name="com.redhat.spice.0"/>
      <address type="virtio-serial" controller="0" bus="0" port="1"/>
    </channel>
    <input type="mouse" bus="ps2"/>
    <input type="keyboard" bus="ps2"/>
    <graphics type="spice" autoport="yes">
      <listen type="address"/>
      <image compression="off"/>
    </graphics>
    <sound model="ich9">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x1b" function="0x0"/>
    </sound>
    <audio id="1" type="spice"/>
    <video>
      <model type="none"/>
    </video>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x01" slot="0x00" function="0x1"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0"/>
    </hostdev>
    <memballoon model="none"/>
    <shmem name="looking-glass">
      <model type="ivshmem-plain"/>
      <size unit="M">64</size>
      <address type="pci" domain="0x0000" bus="0x10" slot="0x01" function="0x0"/>
    </shmem>
  </devices>
</domain>

Check that you are not using VBS/core isolation/etc. On bare-metal Windows those features rely on hardware virtualisation, but when Windows itself runs in a VM they rely on nested virtualisation. I had performance issues with this, especially with GPU-related tasks. One way to make sure there is no nested virtualisation is to disable svm (AMD) or vmx (Intel) in the config. I think it goes like this:

  <cpu mode="host-passthrough" check="none" migratable="off">
    <topology sockets="1" dies="1" cores="7" threads="2"/>
    <feature policy="disable" name="vmx">    
  </cpu>

I have svm instead of vmx since I’m on AMD, so I’m not 100% sure if vmx is the correct name.

But 50% GPU use seems low in games, there might be some issue there… Try to test CPU performance separately, that should be within a couple percent of expected bare metal performance (correcting for the cores left to the host of course).

@quilt You are an absolute legend. That seems to have done the trick.

The XML excerpt was missing a closing slash, but with that fixed the change improves performance substantially. The amended XML:

  <cpu mode="host-passthrough" check="none" migratable="off">
    <topology sockets="1" dies="1" cores="6" threads="2"/>
    <feature policy="disable" name="vmx"/>
  </cpu>
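
For anyone copying the snippet: a missing closing slash like the one in the earlier excerpt makes the XML malformed, and a parser will flag it before the definition is ever saved. A minimal sketch with Python's standard `xml.etree.ElementTree`; the name `is_well_formed` is just illustrative, and this only checks XML syntax, not whether libvirt accepts the feature name:

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# <feature ...> is never closed, so </cpu> arrives as a mismatched tag.
BROKEN = """<cpu mode="host-passthrough" check="none" migratable="off">
  <topology sockets="1" dies="1" cores="7" threads="2"/>
  <feature policy="disable" name="vmx">
</cpu>"""

# Same element, self-closed with a trailing slash.
FIXED = """<cpu mode="host-passthrough" check="none" migratable="off">
  <topology sockets="1" dies="1" cores="6" threads="2"/>
  <feature policy="disable" name="vmx"/>
</cpu>"""

print(is_well_formed(BROKEN), is_well_formed(FIXED))
```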

Re low GPU usage: I should have specified that I was running everything on the lowest settings to rule out the game simply being too graphically intensive for my system. Prior to your fix, if I cranked up the graphics I would see a corresponding increase in GPU usage, but it did not substantially affect the frame rate one way or the other (+/- 5 fps).

Anyway, Cheers for your help!

Happy to help! I struggled with this exact issue for months, and it took a while to figure out (it seemed to affect only some games/applications but not all). So I know the frustration :smiley: