Increasing VFIO VGA Performance

Umm, wow, this is new too. After updating my Linux workstation VM to use this same method, my Vega 10 resets properly! I need to perform further testing, but it seems to have fixed the problem!

@wendell can you please test this also and confirm?

Here is the lspci output in my Linux VM:

00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 SCSI storage controller: Red Hat, Inc Virtio SCSI
00:02.0 Ethernet controller: Red Hat, Inc Virtio network device
00:03.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 0 (rev 02)
00:04.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 0 (rev 02)
00:05.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 0 (rev 02)
00:06.0 RAM memory: Red Hat, Inc Inter-VM shared memory (rev 01)
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Advanced Micro Devices, Inc. [AMD] Device 43c8 (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] (rev c3)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf8
02:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device 145a
02:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
02:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
03:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device 145a
03:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
03:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller

As far as Vega reset goes, this explains why it almost always worked for me, as I usually used the q35 chipset. I lost that accidental discovery when switching hardware around and was puzzled that I, too, was now having the Vega reset problem.


I have always had it with q35, but putting the Vega onto a root bus seems to have fixed it. Initial testing on the Vega also shows a very substantial performance increase now that it can see it’s on a Gen3 bus.

geoff@aeryn:~$ sudo lspci -s 01:00.0 -vv | grep LnkSta:
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

I must have done that accidentally during my initial setup, because I only realised after everyone was like “wait, it works for you?”. Not sure, checking now.

Is there a way to do this via virt-manager or virsh edit [name]?

I currently do GPU passthrough by adding the GPU/audio via virt-manager, which produces the following in the virsh edit XML:

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x1d' slot='0x00' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x1d' slot='0x00' function='0x1'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
</hostdev>

Can anything extra be specified to move this onto a PCIe root port?

I think your method could be replicated via qemu:commandline, but I assume I’d lose the automatic rebinding virt-manager does? Currently it will unbind from amdgpu, bind to vfio-pci, run the VM, then reverse the process on shutdown so I can continue to use the GPU on the host via DRI_PRIME.
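
For reference, that automatic rebinding comes from managed='yes' in the hostdev entries, and the same dance can be reproduced by hand with virsh if qemu:commandline ever gets in the way. A minimal sketch, assuming the host addresses 0000:1d:00.0 and 0000:1d:00.1 from the XML above:

virsh nodedev-detach pci_0000_1d_00_0    # unbind from amdgpu, bind to vfio-pci
virsh nodedev-detach pci_0000_1d_00_1    # same for the HDMI audio function
# ... run the VM ...
virsh nodedev-reattach pci_0000_1d_00_1  # hand the functions back to the
virsh nodedev-reattach pci_0000_1d_00_0  # host driver for DRI_PRIME use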


This way:
https://www.redhat.com/archives/vfio-users/2016-January/msg00301.html

Here’s what I added. Note that multifunction and getting the function ID right (0 for gfx, 1 for audio) seem to be important; still testing.

Controller section (I used bus index 8; you should use whatever makes sense for your system):

<controller type='pci' index='8' model='pcie-root-port'>
  <model name='ioh3420'/>
  <target chassis='1' port='0x1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x1c' function='0x0' multifunction='on'/>
</controller>

and

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0' multifunction='on'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x0b' slot='0x00' function='0x1'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x1'/>
</hostdev>

Still testing, so I might edit this more.


If “index” in the controller definition matches the “bus” used in the hostdev, then it looks like virt-manager by default defines a separate pcie-root-port device for the GPU and GPU-audio PCI devices, rather than assigning both to a single pcie-root-port as functions 0x0 and 0x1.

I’ve tried with this default and also with a config similar to yours above, which places the GPU audio and GPU on the same pcie-root-port.

Two gaming benchmarks and a run of the Superposition benchmark showed no real difference between the two. GPU-Z reports in both cases that the GPU is on a PCIe 3 port running at “PCI-Express x8 v1.1”.

So unless I’ve missed something, it may be that out of the box virt-manager is already setting up the PCI passthrough for the GPU in a suitable way (albeit with audio and GPU on different root ports, which may not be desirable?).

Still, not in vain; I learned more about the XML config format :stuck_out_tongue:

Benchmark-wise, if anyone happens to have similar hardware: an 8GB RX 580 Nitro+, a 2700X and 16GB of 3200 RAM are giving me 70/84/107 min/avg/max in F1 2017 (ultra high settings and both AA options maxed). Superposition (1080p Extreme) gives a score of 2647, utilising 100% of the GPU. I’d be interested in seeing how that compares to others doing passthrough if you have similar hardware.


I am still having reset issues, I think, if I do an unclean shutdown of the guest. A clean shutdown is still OK?

I am yet to try an unclean shutdown, but for me even a clean shutdown wasn’t working at all before these changes.

My lab asploded, but I’m testing right now. May do a live stream for this later. I’ve got tons of friends’ hardware right now because everyone is upgrading and letting me test their stuff.

 -M q35,accel=kvm \
 -smp 32,cores=32,threads=1,sockets=1 \
 -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 \
 -device vfio-pci,host=42:00.0,bus=root.1,addr=00.0,multifunction=on,romfile=/home/pburic/GTX1080Ti_patched.rom \
 -device vfio-pci,host=42:00.1,bus=root.1,addr=00.1 \

I’ve been using that since day one (11/2017 … ugly patch), and it looks the same. Or is there a difference?

This setup is mentioned in most HowTos (it’s a very old one).
https://wiki.debian.org/VGAPassthrough
https://gist.github.com/ArseniyShestakov/2891f5c147298e0b8ffa#file-qemu-cmd

Looks fine, a few differences though.

x-vga doesn’t need to be specified anymore.
multifunction=on doesn’t make any sense on the root port if you are only passing a single device through.
multifunction=on only makes sense on the actual device if you are passing through a device with children, i.e. HDMI audio.

Other than that, if GPU-Z or lspci reports the video card has a link speed other than just “PCI”, you’re already running as you should be.

That example is actually incomplete though; it’s skipping a device which can cause NVidia cards to have a fit (although I can see you have included it anyway). It should be:

 -device ioh3420,bus=pcie.0,addr=1c.0,port=1,chassis=1,id=root.1 \
 -device vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on \
 -device vfio-pci,host=01:00.1,bus=root.1,addr=00.1

For reference, here are my complete QEMU launch arguments.

/usr/local/bin/qemu-system-x86_64 \
  -nographic \
  -machine q35,accel=kvm,usb=off,vmport=off,dump-guest-core=off,kernel-irqchip=on \
  -cpu host,hv_time,hv_vpindex,hv_reset,hv_runtime,hv_crash,hv_synic,hv_stimer,hv_spinlocks=0x1fff,hv_vendor_id=lakeuv283713,kvm=off,l3-cache=on,-hypervisor,migratable=no,+invtsc \
  -drive file=/opt/VM/Windows/ovmf/OVMF_CODE-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on \
  -drive file=/opt/VM/Windows/ovmf/vars.fd,if=pflash,format=raw,unit=1 \
  -realtime mlock=on \
  -pidfile /var/run/vm.Windows.pid \
  -monitor stdio \
  -runas geoff \
  -enable-kvm \
  -name guest=Windows,debug-threads=on \
  -smp 8,sockets=1,cores=4,threads=2 \
  -m 16384 \
  -mem-prealloc \
  -global ICH9-LPC.disable_s3=1 \
  -global ICH9-LPC.disable_s4=1 \
  -no-user-config \
  -nodefaults \
  -rtc base=localtime,driftfix=slew \
  -global kvm-pit.lost_tick_policy=discard \
  -boot strict=on \
  -no-hpet \
  -netdev tap,script=/opt/VM/bin/ovs-ifup,downscript=/opt/VM/bin/ovs-ifdown,ifname=windows.30,id=hostnet0,vhost=on \
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:06:99:25,bus=pcie.0 \
  -soundhw ac97 \
  -device ioh3420,id=root_port1,chassis=0,slot=0,bus=pcie.0 \
  -device vfio-pci,host=45:00.0,id=hostdev1,bus=root_port1,addr=0x00,multifunction=on,romfile=/opt/VM/Windows/1080Ti.rom \
  -device vfio-pci,host=45:00.1,id=hostdev2,bus=root_port1,addr=0x00.1 \
  -drive  id=disk,file=/dev/disk/by-id/md-uuid-18bc7433:b223acbf:a1cbb062:f0b030c9,format=raw,if=none,cache=none,aio=native,discard=unmap,detect-zeroes=unmap,copy-on-read=on \
  -device virtio-scsi-pci,id=scsi \
  -device scsi-hd,drive=disk,bus=scsi.0,rotation_rate=1 \
  -device ivshmem-plain,memdev=ivshmem,bus=pcie.0 \
  -object memory-backend-file,id=ivshmem,share=on,mem-path=/dev/shm/looking-glass,size=128M \
  -spice disable-ticketing,seamless-migration=off,port=5900,addr=192.168.10.50 \
  -device virtio-keyboard-pci,id=input2,bus=pcie.0,addr=0xc

I also run a script that pins the CPU threads to the correct cores and ensures all memory accesses are local to the CPU they are pinned to. For Ryzen this isn’t so critical, but for ThreadRipper it is critical if you wish to obtain the same results I have, as shown below.
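
The pinning script itself isn’t included in the post, but the general technique can be sketched. A rough example (not the actual script), relying on the -pidfile and debug-threads=on flags from the launch arguments above; the core list is a placeholder you would match to the NUMA node your GPU hangs off:

#!/bin/bash
# With debug-threads=on, QEMU names its vCPU threads "CPU n/KVM",
# so each one can be found and pinned individually.
PID=$(cat /var/run/vm.Windows.pid)
CORES=(8 9 10 11 12 13 14 15)   # example host cores; adjust to your topology
i=0
for TID in $(ps -T -p "$PID" -o spid=,comm= | awk '$2 == "CPU" {print $1}'); do
    taskset -pc "${CORES[$i]}" "$TID"   # pin one vCPU thread per host core
    i=$((i + 1))
done
# Keep guest memory on the same node too (migratepages comes with numactl)
migratepages "$PID" all 0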


I’m using the defaults generated by virt-manager and it seems to report the correct PCIe link speed.

[screenshot: reported PCIe link speed]

That said, I cannot get the Vega to reset properly. Running 4.19-rc6, and even when doing a normal reboot it just locks up: the OVMF splash never shows and one of the CPU cores just stays pegged at 100%.

What’s strange is I did have it working at one point. It would be good to know what breaks the reset on Vega :confused: I’ll try your config changes just to see if that fixes it though…


@gnif:
I think the current PCIe speed of the GPU is related to GPU power management. Try running the render test in GPU-Z (the question mark button).

Thanks for the information, but it certainly was not the issue. This is an issue that others have reported: either a PCIe link speed of 0, or reporting as PCI instead of PCIe. The exact cause and resolution are as described in the original post.

Link speed for PCIe is incomplete according to https://www.linux-kvm.org/page/PCITodo, under the heading “Support for different PCI express link width/speed settings”.

The code that causes this behaviour in QEMU is as follows:

Hrmm, I just noticed I was still not seeing PCIe 3.0 speeds. This is due to a hard-coded value in QEMU; the following patch/hack changes these values, but please only apply it if you are actually running at PCIe x16 3.0, or it may cause undesirable behaviour.

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6cbb8fa054..34cfa906cf 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1895,7 +1895,7 @@ static int vfio_setup_pcie_cap(VFIOPCIDevice *vdev, int pos, uint8_t size,
                                    PCI_EXP_TYPE_ENDPOINT << 4,
                                    PCI_EXP_FLAGS_TYPE);
             vfio_add_emulated_long(vdev, pos + PCI_EXP_LNKCAP,
-                                   PCI_EXP_LNK_MLW_1 | PCI_EXP_LNK_LS_25, ~0);
+                                   PCI_EXP_LNK_MLW_16 | PCI_EXP_LNK_LS_80, ~0);
             vfio_add_emulated_word(vdev, pos + PCI_EXP_LNKCTL, 0, ~0);
         }
 
diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
index a95522a13b..902ace0a69 100644
--- a/include/hw/pci/pcie_regs.h
+++ b/include/hw/pci/pcie_regs.h
@@ -35,9 +35,16 @@
 /* PCI_EXP_LINK{CAP, STA} */
 /* link speed */
 #define PCI_EXP_LNK_LS_25               1
+#define PCI_EXP_LNK_LS_50               2
+#define PCI_EXP_LNK_LS_80               3
 
 #define PCI_EXP_LNK_MLW_SHIFT           ctz32(PCI_EXP_LNKCAP_MLW)
-#define PCI_EXP_LNK_MLW_1               (1 << PCI_EXP_LNK_MLW_SHIFT)
+#define PCI_EXP_LNK_MLW_1               (1  << PCI_EXP_LNK_MLW_SHIFT)
+#define PCI_EXP_LNK_MLW_2               (2  << PCI_EXP_LNK_MLW_SHIFT)
+#define PCI_EXP_LNK_MLW_4               (4  << PCI_EXP_LNK_MLW_SHIFT)
+#define PCI_EXP_LNK_MLW_8               (8  << PCI_EXP_LNK_MLW_SHIFT)
+#define PCI_EXP_LNK_MLW_12              (12 << PCI_EXP_LNK_MLW_SHIFT)
+#define PCI_EXP_LNK_MLW_16              (16 << PCI_EXP_LNK_MLW_SHIFT)
 
 /* PCI_EXP_LINKCAP */
 #define PCI_EXP_LNKCAP_ASPMS_SHIFT      ctz32(PCI_EXP_LNKCAP_ASPMS)

link
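
To try the patch, QEMU has to be rebuilt from source. A rough sketch of the steps, assuming a QEMU source tree of that era (pre-Meson build system) and the diff above saved as link-speed.patch:

cd qemu                                     # your QEMU source checkout
git apply link-speed.patch                  # the diff above, saved to a file
./configure --target-list=x86_64-softmmu --enable-kvm
make -j"$(nproc)"
# the patched binary lands at x86_64-softmmu/qemu-system-x86_64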


pretty please?


+1
Subscribed

OK, there is still a bit more to this to make it run at these speeds even still. While GPU-Z reports that it’s at x16 3.0, it is in fact lying. This can be seen by checking the NVidia Control Panel: under “Help -> System Information” the guest is still reporting PCI Express x1.
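
As a sanity check that doesn’t depend on what the guest claims, the host can show what the card has actually negotiated. For example, using the host address from the launch arguments above:

# LnkCap = what the device is capable of; LnkSta = what is negotiated right now
sudo lspci -s 45:00.0 -vv | grep -E 'LnkCap:|LnkSta:'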

I have another patch that fixes this; after some testing I will release it for people to try out.

4 Likes