Increasing VFIO VGA Performance

I am working with git master.

Thanks. I had to add void pcie_negotiate_link(PCIDevice *dev); to pcie.h.

The nasty patch is working; the CUDA bandwidth test shows no change.

screenshot

https://www.monitos.cz/tmp/nasty_patch_PCIe3.0.png

Has anyone tried CPU (die) pinning for the VM? The relevant AMD patch was merged upstream in August (QEMU 3.0.0; -machine must be pc-q35-3.0).

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=897054
https://www.reddit.com/r/VFIO/comments/9iaj7x/smt_not_supported_by_qemu_on_amd_ryzen_cpu/

Yes. 2700X here with an R9 390, on Debian with a custom rt kernel, as you need one to enable nohz_full and rcu_nocbs.
vCPUs are pinned to the isolated cores with QEMU 3.0, along with isolcpus, nohz_full, and rcu_nocbs.

Just make sure you've got <feature policy='require' name='topoext'/> in your libvirt config and you should be set.
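For anyone hunting for where that line lives, a minimal sketch of the relevant <cpu> block (the core/thread counts here are placeholders; match your own layout):

  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='4' threads='2'/>
    <feature policy='require' name='topoext'/>
  </cpu>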

A few things to note from my testing, which is why I made my account; I've been following this thread and feel I should make note of these:
Having the hypervisor bit off (-hypervisor) results in higher DPC latency. I'm seeing an average of 87-130us vs 20-30us process interrupts, based on the LatencyMon GUI.

Having <timer name='hypervclock' present='yes'/> with the hypervisor bit on, i.e. the guest being aware it's in a hypervisor, determines whether I get 100fps or 125fps in Cinebench (a poor bench, I know). All CPU benches show an uplift in performance (CPU-Z, Cinebench). Watching the PCIe link in the host with lspci -s 0b:00.0 -nvv | grep LnkSta shows the graphics link scaling with load: it's 2.5GT/s while idle and scales up to 8GT/s while playing. The guest GPU is also connected to a root port as discussed above. GPU-Z also shows the correct configuration of "PCIe x16 3.0 @ x8 3.0".
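If you want to watch the same thing on your own host, a simple polling loop does it (substitute your GPU's PCI address for 0b:00.0):

  # Poll the negotiated link status of the GPU once per second
  watch -n1 "lspci -s 0b:00.0 -vv | grep LnkSta:"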

I understand that specifying <timer name='tsc' present='yes' mode='native'/> and <feature policy='require' name='invtsc'/> causes that timer to be used only when the hypervisor is off.
Yet the tsc feature shows up in both my hypervisor-off and hypervisor-on configs in coreinfo.exe. Screenshots to follow.
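As a host-side sanity check before requiring invtsc in the guest: the invariant TSC shows up as constant_tsc/nonstop_tsc in cpuinfo, and the kernel should have picked the TSC as its clocksource.

  grep -m1 -oE 'constant_tsc|nonstop_tsc' /proc/cpuinfo
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource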

With hypervisor on and the Hyper-V timer on

Summary

Hypervisor off with -hypervisor,migratable=no,+invtsc

Summary

Both have it available to them.

Configs: Hypervisor on

Summary
<domain type='kvm'>
  <name>win10-clone</name>
  <uuid>54a91832-1fae-4ae0-b14f-e10b169479d7</uuid>
  <memory unit='KiB'>10000000</memory>
  <currentMemory unit='KiB'>10000000</currentMemory>
  <memoryBacking>
    <hugepages/>
  </memoryBacking>
  <vcpu placement='static'>12</vcpu>
  <iothreads>2</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='8'/>
    <vcpupin vcpu='1' cpuset='9'/>
    <vcpupin vcpu='2' cpuset='10'/>
    <vcpupin vcpu='3' cpuset='11'/>
    <vcpupin vcpu='4' cpuset='12'/>
    <vcpupin vcpu='5' cpuset='13'/>
    <vcpupin vcpu='6' cpuset='14'/>
    <vcpupin vcpu='7' cpuset='15'/>
    <vcpupin vcpu='8' cpuset='4'/>
    <vcpupin vcpu='9' cpuset='5'/>
    <vcpupin vcpu='10' cpuset='6'/>
    <vcpupin vcpu='11' cpuset='7'/>
    <emulatorpin cpuset='0-3'/>
    <iothreadpin iothread='1' cpuset='0-1'/>
    <iothreadpin iothread='2' cpuset='2-3'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='2' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='3' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='4' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='5' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='6' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='7' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='8' scheduler='fifo' priority='1'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-q35-3.0'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/win10-clone_VARS.fd</nvram>
    <bootmenu enable='yes'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vpindex state='on'/>
      <synic state='on'/>
      <stimer state='on'/>
      <reset state='on'/>
      <vendor_id state='off'/>
      <frequencies state='on'/>
    </hyperv>
    <kvm>
      <hidden state='off'/>
    </kvm>
    <vmport state='off'/>
  </features>
  <cpu mode='host-passthrough' check='full'>
    <topology sockets='1' cores='6' threads='2'/>
    <cache mode='passthrough'/>
    <feature policy='require' name='invtsc'/>
    <feature policy='require' name='svm'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='apic'/>
    <feature policy='require' name='topoext'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' present='no' tickpolicy='catchup'/>
    <timer name='pit' present='no' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
    <timer name='kvmclock' present='no'/>
    <timer name='hypervclock' present='yes'/>
    <timer name='tsc' present='yes' mode='native'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/sda2'/>
      <target dev='vda' bus='scsi'/>
      <boot order='2'/>
      <address type='drive' controller='1' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/sdb'/>
      <target dev='vdb' bus='scsi'/>
      <boot order='1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0xb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x12'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x14'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x15'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0x16'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6' multifunction='on'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='9' port='0x17'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='10' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='11' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='11' port='0x9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='pci' index='12' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='12' port='0xa'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='13' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='13' port='0xc'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
    </controller>
    <controller type='pci' index='14' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='14' port='0xd'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x5'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <driver queues='4' iothread='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </controller>
    <controller type='scsi' index='1' model='virtio-scsi'>
      <driver queues='4' iothread='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='usb' index='0' model='nec-xhci'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <interface type='network'>
      <mac address='52:54:00:e7:8b:ca'/>
      <source network='default'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </interface>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0' multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x0b' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x1'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x0c' slot='0x00' function='0x3'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x0d' slot='0x00' function='0x3'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </hostdev>
    <memballoon model='none'/>
    <shmem name='looking-glass'>
      <model type='ivshmem-plain'/>
      <size unit='M'>32</size>
      <address type='pci' domain='0x0000' bus='0x0e' slot='0x00' function='0x0'/>
    </shmem>
  </devices>
</domain>

Config: Hypervisor off

Summary
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>win10-clone</name>
  <uuid>54a91832-1fae-4ae0-b14f-e10b169479d7</uuid>
  <memory unit='KiB'>10000000</memory>
  <currentMemory unit='KiB'>10000000</currentMemory>
  <memoryBacking>
    <hugepages/>
  </memoryBacking>
  <vcpu placement='static'>12</vcpu>
  <iothreads>2</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='8'/>
    <vcpupin vcpu='1' cpuset='9'/>
    <vcpupin vcpu='2' cpuset='10'/>
    <vcpupin vcpu='3' cpuset='11'/>
    <vcpupin vcpu='4' cpuset='12'/>
    <vcpupin vcpu='5' cpuset='13'/>
    <vcpupin vcpu='6' cpuset='14'/>
    <vcpupin vcpu='7' cpuset='15'/>
    <vcpupin vcpu='8' cpuset='4'/>
    <vcpupin vcpu='9' cpuset='5'/>
    <vcpupin vcpu='10' cpuset='6'/>
    <vcpupin vcpu='11' cpuset='7'/>
    <emulatorpin cpuset='0-3'/>
    <iothreadpin iothread='1' cpuset='0-1'/>
    <iothreadpin iothread='2' cpuset='2-3'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='2' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='3' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='4' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='5' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='6' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='7' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='8' scheduler='fifo' priority='1'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-q35-3.0'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/win10-clone_VARS.fd</nvram>
    <bootmenu enable='yes'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vpindex state='on'/>
      <synic state='on'/>
      <stimer state='on'/>
      <reset state='on'/>
      <vendor_id state='off'/>
      <frequencies state='on'/>
    </hyperv>
    <kvm>
      <hidden state='off'/>
    </kvm>
    <vmport state='off'/>
  </features>
  <cpu mode='host-passthrough' check='full'>
    <topology sockets='1' cores='6' threads='2'/>
    <cache mode='passthrough'/>
    <feature policy='require' name='invtsc'/>
    <feature policy='require' name='svm'/>
    <feature policy='disable' name='hypervisor'/>
    <feature policy='require' name='apic'/>
    <feature policy='require' name='topoext'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' present='no' tickpolicy='catchup'/>
    <timer name='pit' present='no' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
    <timer name='kvmclock' present='no'/>
    <timer name='hypervclock' present='no'/>
    <timer name='tsc' present='yes' mode='native'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/sda2'/>
      <target dev='vda' bus='scsi'/>
      <boot order='2'/>
      <address type='drive' controller='1' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/sdb'/>
      <target dev='vdb' bus='scsi'/>
      <boot order='1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0xb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x12'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x14'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x15'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0x16'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6' multifunction='on'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='9' port='0x17'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='10' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='11' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='11' port='0x9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='pci' index='12' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='12' port='0xa'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='13' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='13' port='0xc'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
    </controller>
    <controller type='pci' index='14' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='14' port='0xd'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x5'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <driver queues='4' iothread='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </controller>
    <controller type='scsi' index='1' model='virtio-scsi'>
      <driver queues='4' iothread='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='usb' index='0' model='nec-xhci'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <interface type='network'>
      <mac address='52:54:00:e7:8b:ca'/>
      <source network='default'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </interface>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0' multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x0b' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x1'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x0c' slot='0x00' function='0x3'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x0d' slot='0x00' function='0x3'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </hostdev>
    <memballoon model='none'/>
    <shmem name='looking-glass'>
      <model type='ivshmem-plain'/>
      <size unit='M'>32</size>
      <address type='pci' domain='0x0000' bus='0x0e' slot='0x00' function='0x0'/>
    </shmem>
  </devices>
  <qemu:commandline>
    <qemu:arg value='-cpu'/>
    <qemu:arg value='host,-hypervisor,migratable=no,+invtsc'/>
 </qemu:commandline>
</domain>

tl;dr
Does Windows actually run better when it's aware it's in a hypervisor, or is the Hyper-V timer making benchmarks appear better because it's an inaccurate timer? For me the gain is an awful lot of fps, to almost bare metal, and I 'feel' that it's better; but that's subjective. Both configs have +invtsc specified and it shows up in coreinfo.exe. I'd like other AMD users to chime in on this, as Nvidia seems to require the hypervisor to be hidden for their graphics cards to work.


I'm on Nvidia, so I can't speak for everything you added to your configs that I hadn't been using. I did notice a slight performance improvement from adding stuff I saw in your configs, but I have to ask: why are you only using vcpusched on 9 of your 12 passed cores?

I have to know: how in the world are you using this feature, given the current need to use ignore_msrs=1?
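(For context, ignore_msrs is a kvm module parameter; the usual ways to set it are below, the paths being the standard ones.)

  # Persistently, via modprobe configuration:
  echo 'options kvm ignore_msrs=1' >> /etc/modprobe.d/kvm.conf
  # Or at runtime:
  echo 1 > /sys/module/kvm/parameters/ignore_msrs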

When I tested it with every pinned core, it resulted in massive, very noticeable stutters. When I passed only one CCX (4c/8t), it was fine.
I figured out that, for whatever reason, you're fine applying it to the first set of CCX cores that the VM sees; the 2nd set doesn't play nice with the fifo tag. It results in slightly better CPU-Z points and Cinebench scores.

I was testing the first core of the 2nd CCX to see if you can get away with just that, but forgot to take it out of the configs. I'd recommend keeping it off for the 2nd set if you have Ryzen.

I have some options for you to try, then. I have applied vcpusched to all my cores with no problems. I also pass 6 cores of my 2700X, but I keep the first core of each CCX on the host. I also do isolation; I don't know if you are doing that, but I've heard it mentioned that fifo can have issues if the host does anything on those cores while they are pulling a load. Another interesting difference, which I plan on testing more thoroughly tonight, is that on a whim and a rumor I've recently been using 2 3 2 for my topology instead of 1 6 2, the idea being that this hints the CCX layout to the VM so that it can work with it more appropriately. I don't know if it actually helps much, but I haven't run into any stability issues.
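In libvirt terms, the 2 3 2 layout would be a sketch like this (the three numbers multiply out to the 12 vCPUs being passed; treat it as an experiment, not a recommendation):

  <topology sockets='2' cores='3' threads='2'/>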

You only need this flag with the latest versions of Windows; Microsoft changed something that causes this problem.

Has anyone done a more in-depth performance test of 2MB hugepages compared to 1GB for general gaming usage?

Personally I've just switched to 2MB instead of 1GB, with no noticeable performance degradation.
Using 1GB pages, I was only able to allocate 12GB out of my 16GB, as my system has a few reserved memory regions.
Using 2MB pages, I can allocate 15GB out of my 16GB, giving VFIO VMs more effective memory.

My quick test with Heaven Benchmark showed a 2% drop in FPS going from 1GB to 2MB pages. But my theory is that overall I benefit more from having 15GB of RAM in my VM compared to 12GB.
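For reference, 2MB is the default hugepage size on x86, so switching is just a matter of reserving enough pages; a sketch for ~15GB (15 * 1024 / 2 = 7680 pages):

  sysctl -w vm.nr_hugepages=7680
  grep -i huge /proc/meminfo   # verify HugePages_Total and Hugepagesize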

I tried your configuration, reserving the first core of each CCX for Linux and giving the rest to the VM with 'fifo' specified. I also changed the boot parameters to isolate the cores.

isolcpus=2-7,10-15 nohz_full=2-7,10-15 rcu_nocbs=2-7,10-15
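(For anyone replicating this: after rebooting you can confirm the parameters took effect; both paths below are standard on recent kernels.)

  cat /proc/cmdline                      # the parameters should appear here
  cat /sys/devices/system/cpu/isolated   # should list 2-7,10-15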

No changes; stutter still happens under load. This configuration also causes problems with the L3 cache, as Windows doesn't adjust what it sees. Screenshot to follow.

Summary

Windows hasn't adjusted which cores share an L3 cache, only that x amount of cores share one. This means you're forced to use the 2nd CCX as your first set of cores, with the others to follow, as you can't isolate core 0 in Linux. I've tried to correct the wrongly reported L2 and L3 cache, but this is a bug. If you don't have <feature policy='require' name='topoext'/> in your libvirt config, then Ryzen correctly reports the right amounts of L2/L3 cache.

Unfortunately you do want topoext, as it enables SMT. What's important is that Windows knows which core is paired with which and what shares what, for optimal performance. Specifying topoext also seems to make topology changes irrelevant, as they had no effect within the VM.
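In the guest, Sysinternals Coreinfo is a handy way to see what Windows thinks the core pairing and caches look like; from memory, -c maps logical processors to physical cores and -l dumps the cache topology (double-check the flags against coreinfo's own usage output):

  coreinfo.exe -c -l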

Something to note: vcpusched can safely be used on all cores if you aren't using hyperthreading within the VM. There are no stutters and things work great. It was the configuration I used until I learned how to compile QEMU 3.0 for SMT support. These stutter problems with fifo have arisen since I enabled SMT.

I also tested a few other things and managed to pin down the source of the stutters: it was core 8, the first core that Windows sees in the '2nd' CCX, which shares the '2nd' set of L3 cache. Windows seems to have pinned the driver for my mouse to that core, and having two sets of 'fifo' on the same physical core causes the stutters while under load. Screenshot to follow.

Summary

The ISR and DPC counts would only rise if I moved my mouse; otherwise they would remain the same. Removing vcpusched fifo on the hyperthreaded core has fixed the issue. This does beg the question whether these stutters will happen on the other cores under high load, but Windows seems to place most of the interrupts on core 0 and the rest on the first core of the 2nd L3 cache. Everything seems to run great, so I'll leave it at that for now.
This also means, for other configurations (and I speak only for Ryzen), you have to make sure Windows has correctly identified the first core of the 2nd set of L3 cache. While testing urmamasllama's suggested config, LatencyMon still reported that the majority of interrupts were between core 0 and core 8. So do make sure, or performance seems to suffer. I did test a 4c/8t config just to be sure, and sure enough, Windows just dumped it all on core 0.

Overall I'm satisfied with the results: process interrupt latency has come down to 9us at best, with an average low of 13us under no load. Most often it jumps between 13us and 32us under the heaviest load, so I can't complain.
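On the host side, a quick way to double-check which logical CPUs are SMT siblings of the same physical core (standard Linux sysfs; cpu8 as an example):

  cat /sys/devices/system/cpu/cpu8/topology/thread_siblings_list
  lscpu --extended=CPU,CORE,SOCKET   # or the whole map at a glance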

<feature policy='require' name='topoext'/> simply checks that the feature is supported by the CPU or can be emulated. As it is supported by all Zen CPUs, it will pass for host-passthrough.

Anyway, the problem with L3 and Zen is that there is no way to set CCX sizes in QEMU, or at least I don't know how.

Also, Windows expects that at least 4 core IDs will share an L3 on Zen. You can have more but not less; if you set fewer, Windows will ignore it and assume 4 anyway. Real hardware actually skips core IDs for a CCX that has fewer than 4 cores in it; QEMU does not do so.

As for the latency and timing source: it's hard to say whether the latency is really better or the timer used simply distorts the measurement.

Hey, do you mind sharing how to force the x8 link speed for the GPU on the host? I would like to test it out on my own using a 1070 Ti.

It is a mystery how you achieve these low latencies.

I have a 1950X and this is my latency:

Summary

and here is my config:

Summary
<domain type='kvm'>
  <name>win10-gaming</name>
  <uuid>ca042030-d63d-40a5-8b32-7df0948f81ed</uuid>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://microsoft.com/win/10"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit='KiB'>12681216</memory>
  <currentMemory unit='KiB'>12681216</currentMemory>
  <vcpu placement='static'>16</vcpu>
  <iothreads>2</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='16'/>
    <vcpupin vcpu='1' cpuset='17'/>
    <vcpupin vcpu='2' cpuset='18'/>
    <vcpupin vcpu='3' cpuset='19'/>
    <vcpupin vcpu='4' cpuset='20'/>
    <vcpupin vcpu='5' cpuset='21'/>
    <vcpupin vcpu='6' cpuset='22'/>
    <vcpupin vcpu='7' cpuset='23'/>
    <vcpupin vcpu='8' cpuset='24'/>
    <vcpupin vcpu='9' cpuset='25'/>
    <vcpupin vcpu='10' cpuset='26'/>
    <vcpupin vcpu='11' cpuset='27'/>
    <vcpupin vcpu='12' cpuset='28'/>
    <vcpupin vcpu='13' cpuset='29'/>
    <vcpupin vcpu='14' cpuset='30'/>
    <vcpupin vcpu='15' cpuset='31'/>
    <emulatorpin cpuset='0-7'/>
    <iothreadpin iothread='1' cpuset='0-1'/>
    <iothreadpin iothread='2' cpuset='2-3'/>
    <iothreadsched iothreads='2' scheduler='fifo' priority='99'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0-1'/>
  </numatune>
  <os>
    <type arch='x86_64' machine='pc-q35-3.0'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/ovmf/ovmf_code_x64.bin</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/win10-gaming_VARS.fd</nvram>
    <bootmenu enable='yes'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vpindex state='on'/>
      <synic state='on'/>
      <stimer state='on'/>
      <reset state='on'/>
      <vendor_id state='on' value='KVM Hv'/>
    </hyperv>
    <kvm>
      <hidden state='on'/>
    </kvm>
    <vmport state='off'/>
  </features>
  <cpu mode='host-passthrough' check='full'>
    <topology sockets='1' cores='8' threads='2'/>
    <cache level='3' mode='emulate'/>
    <feature policy='require' name='topoext'/>
    <feature policy='require' name='invtsc'/>
    <feature policy='require' name='svm'/>
    <feature policy='disable' name='hypervisor'/>
    <feature policy='require' name='apic'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' present='no' tickpolicy='catchup'/>
    <timer name='pit' present='no' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
    <timer name='kvmclock' present='no'/>
    <timer name='hypervclock' present='no'/>
    <timer name='tsc' present='yes' mode='native'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='writeback' io='threads' discard='unmap' detect_zeroes='unmap'/>
      <source dev='/dev/disk/by-id/ata-Crucial_CT256MX100SSD1_14330CF987C3'/>
      <target dev='vda' bus='virtio'/>
      <boot order='2'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='sdb' bus='sata'/>
      <readonly/>
      <boot order='1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file=''/>
      <target dev='sdc' bus='sata'/>
      <readonly/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-to-pci-bridge'>
      <model name='pcie-pci-bridge'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x12'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x2'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:09:ab:46'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <link state='up'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </interface>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='usb' managed='yes'>
      <source>
        <vendor id='0x1b1c'/>
        <product id='0x1b11'/>
      </source>
      <address type='usb' bus='0' port='2'/>
    </hostdev>
    <hostdev mode='subsystem' type='usb' managed='yes'>
      <source>
        <vendor id='0x046d'/>
        <product id='0xc085'/>
      </source>
      <address type='usb' bus='0' port='3'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x0a' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='usb' managed='yes'>
      <source>
        <vendor id='0x0d8c'/>
        <product id='0x0102'/>
      </source>
      <address type='usb' bus='0' port='1'/>
    </hostdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </memballoon>
    <shmem name='looking-glass'>
      <model type='ivshmem-plain'/>
      <size unit='M'>32</size>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
    </shmem>
  </devices>
</domain>

Maybe somebody is able to shed some light on this.

An extended gaming session revealed to me that FIFO was in fact unstable in my configuration. It took a few hours for any issues to occur, but eventually my host started stuttering and then locked up entirely, while the whole time my VM was perfectly fine. I have since removed the scheduler options, but I still use a 2 3 2 topology because it does actually help brute-force Windows into understanding that there is a performance cost, and from my testing with Cinebench it does have a positive impact on threaded performance. If QEMU ever fixes this issue of the VM not knowing the CCX and cache topology, then this won't be needed anymore.

Are you isolating your pinned cores?
Have you enabled MSI on all passed hardware that supports it?
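(For reference, MSI status is visible from the host; "Enable+" means it's active. The address below is an example.)

  lspci -s 0b:00.0 -vv | grep -E 'MSI(-X)?:'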

Yes, along with the following additional important changes:

  • In the BIOS, set the RAM to "Channel" to expose the NUMA topology.
  • Run QEMU with numactl to ensure QEMU only allocates memory local to the die it's pinned to.
  • Use taskset to pin the vCPU threads to the correct CPU cores, being careful to observe which threads are pinned to HT cores.
  • Pin all other QEMU threads to the same CPU cores so that memory accesses are local for them too.
  • Adjust the IRQ affinity of bound devices to the correct cores via /proc/irq/X/smp_affinity (see the sketch after this list).
  • Disable kernel NUMA balancing or it will unpin your cores: sysctl -w kernel.numa_balancing=0
  • Do not install numad, or it will also unpin your cores.
  • Ensure you have the latest AGESA/BIOS; AMD changed the CPU core indexes in later versions. Instead of 1,2,3,4 with 9,10,11,12 as the HT cores, they are now sequential: 1,2,3,4,5,6,7,8.
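A hedged sketch of the host-side commands for the numactl, taskset, and IRQ points above; the node number, thread ID, IRQ number, and CPU mask are all placeholders you'd look up on your own system (IRQs in /proc/interrupts, QEMU thread IDs via ps -eLf):

  # Launch QEMU with CPU and memory bound to NUMA node 1
  numactl --cpunodebind=1 --membind=1 qemu-system-x86_64 ...
  # Pin a vCPU thread (TID 12345, from ps -eLf) to host core 8
  taskset -pc 8 12345
  # Steer IRQ 44 onto cores 4-7 (hex CPU mask f0)
  echo f0 > /proc/irq/44/smp_affinity
  # Stop automatic NUMA balancing from migrating/unpinning things
  sysctl -w kernel.numa_balancing=0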

No

Don't enable MSI? Doing that was one of the biggest factors for latency on my system.

I have a non-support, LG-related question for you. I've messed around with NVFBC (I know the risks, but science!) and it's quite astounding the difference it makes. I don't hold out much hope that NV will ever give it full approval, but it got me thinking about alternatives. At one point AMD had their own FBC called Direct Encode Mode (DEM), but when they open-sourced their media framework they deprecated the feature. Have you inquired with anyone at AMD about possibly bringing DEM back to their supported features? Since you've apparently mostly fixed the reset bug, if we could get FBC support too it would be huge! I'd sell my 1070 immediately to buy a Vega if it happened.

I experimented with this and found it made little to no difference; perhaps MSI helps if one of the above points has not been taken into consideration.

It certainly is. I am very excited to get it working in the host again, but I need to ensure that "Compatibility" mode (DXGI DD) is 100% first; we want to cast as wide a net as possible with hardware support. Once LG is out of Alpha (A12 will very likely be the last alpha) I will bring the NVFBC capture code up to date.

I briefly looked into this, and from what I understood it only allowed encoding to compressed formats, which we do not want; but I could be wrong here.

It is unlikely AMD will implement a proprietary capture API. DXGI DD is both part of the specification and an API; the video card manufacturer still has to implement it in their drivers. AMD have simply done it according to spec, as Nvidia should have done. It seems that Nvidia cripples DXGI DD performance, likely to protect the commercial value of NVFBC.

I would guess that AMD deprecated DEM because Microsoft added a standard API for hardware encode, which would make maintaining both DXGI and DEM a duplication of effort.

I'm on AM4, not TR4, so the NUMA-focused stuff doesn't apply to me, but I can see how it would help the other guy. I've already got my pinning right, but I'll look into the IRQ affinity, I guess.

So you're saying that AMD drivers have a more performant implementation of DXGI, making it pointless to have DEM? I haven't seen anyone do a performance comparison of this yet. I have a 580 in my host, but it would not be a small task for me to reconfigure to test it myself. I'll see if I can find someone on the VFIO Discord who can get me numbers, but it sounds like a good video for Wendell. Thanks a ton!

Not at all; I am saying that AMD likely dropped DEM due to the duplication of work. This is pure speculation, but it would make sense. It could be faster or slower depending on the hoops AMD would have to jump through to build a compatible interface.