GPU passthrough straightforward with Coffee Lake + some "duel booting" notes and issues

When searching around I didn't find many GPU passthrough success stories for the Coffee Lake platform, so I figured I should share what I've tested so far.

TL;DR
Bottom line: it works, with no notable surprises. IOMMU grouping is really good on my ASRock Z370M Pro4. Passthrough of an MSI GeForce GTX 970 to Win8.1 Pro N works fine with KVM and VFIO, given the standard Nvidia-spoofing procedures (see below).

Here follow some setup details, tests, and notes.

Basics

Hardware

  • ASRock Z370M Pro4
  • Intel Core i7 8700 (non-K), 3.2GHz (4.3-4.6GHz turbo)
  • Corsair Vengeance LPX white DDR4-2666, 32GB (2x16GB), CL16
  • MSI GeForce GTX 970, 4GB
  • Samsung SM961 256GB NVMe x4 on M2_2 (Xubuntu 17.10)
  • OCZ Vertex4 256GB on SATA3 (Windows 8 Pro N updated to 8.1)
  • BeQuiet Dark Rock 3
  • Seasonic X-750 (yes, kinda overkill for this build)
  • Fractal Design Define Mini C

Linux + KVM subsystem versions

  • Xubuntu 17.10, vanilla everything (no special PPAs)
  • Linux 4.13.0-16-generic
  • libvirtd (libvirt) 3.6.0
  • QEMU emulator version 2.10.1(Debian 1:2.10+dfsg-0ubuntu3)
  • OVMF version 0~20170911.5dfba97c-1

IOMMU groups

IOMMU Group 0 00:00.0 Host bridge [0600]: Intel Corporation Device [8086:3ec2] (rev 07)
IOMMU Group 10 00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a2c9]
IOMMU Group 10 00:1f.2 Memory controller [0580]: Intel Corporation 200 Series PCH PMC [8086:a2a1]
IOMMU Group 10 00:1f.3 Audio device [0403]: Intel Corporation 200 Series PCH HD Audio [8086:a2f0]
IOMMU Group 10 00:1f.4 SMBus [0c05]: Intel Corporation 200 Series PCH SMBus Controller [8086:a2a3]
IOMMU Group 11 00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I219-V [8086:15b8]
IOMMU Group 12 04:00.0 Network controller [0280]: Intel Corporation Wireless 8260 [8086:24f3] (rev 3a)
IOMMU Group 13 05:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [144d:a804]
IOMMU Group 1 00:01.0 PCI bridge [0604]: Intel Corporation Skylake PCIe Controller (x16) [8086:1901] (rev 07)
IOMMU Group 1 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 970] [10de:13c2] (rev a1)
IOMMU Group 1 01:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1)
IOMMU Group 2 00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:3e92]
IOMMU Group 3 00:14.0 USB controller [0c03]: Intel Corporation 200 Series PCH USB 3.0 xHCI Controller [8086:a2af]
IOMMU Group 3 00:14.2 Signal processing controller [1180]: Intel Corporation 200 Series PCH Thermal Subsystem [8086:a2b1]
IOMMU Group 4 00:16.0 Communication controller [0780]: Intel Corporation 200 Series PCH CSME HECI #1 [8086:a2ba]
IOMMU Group 5 00:17.0 SATA controller [0106]: Intel Corporation 200 Series PCH SATA controller [AHCI mode] [8086:a282]
IOMMU Group 6 00:1b.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #17 [8086:a2e7] (rev f0)
IOMMU Group 7 00:1b.2 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #19 [8086:a2e9] (rev f0)
IOMMU Group 8 00:1b.3 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #20 [8086:a2ea] (rev f0)
IOMMU Group 9 00:1b.4 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #21 [8086:a2eb] (rev f0)

I have tried all PCIe slots except the uppermost x1 slot (as it is shadowed by the GTX 970 cooler). All of these slots get their own IOMMU groups, even the x4 one (in an x16 slot). I suppose this means that all slots except the main x16 one sit on the chipset. I also suppose a second discrete GPU could sit in the x16(4) slot and get its own IOMMU group, though running at chipset-constrained speed, but I have not tested this.
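For reference, a listing like the one above can be produced with the usual shell loop over sysfs (a generic snippet, nothing board-specific):

    #!/bin/bash
    # Print every IOMMU group together with the PCI devices it contains
    shopt -s nullglob
    for g in /sys/kernel/iommu_groups/*; do
        for d in "$g"/devices/*; do
            echo "IOMMU Group ${g##*/} $(lspci -nns "${d##*/}")"
        done
    done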

Hardware notes

I use the IGD for host graphics on Linux and the GTX 970 for passthrough. This needs the IGD to be set as primary graphics in UEFI, together with multi-GPU = enabled. The option of using the IGD for host graphics is IMO the main selling point for choosing Coffee Lake over other architectures for a Linux+Windows work/gamestation. It allows a silent and power-efficient system while still providing (barely) enough cores.
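I won't go into my exact host configuration here, but as a rough sketch, reserving the GTX 970 for vfio-pci on Ubuntu usually amounts to something like this (device IDs taken from the lspci listing above; exact files and options may vary):

    # /etc/default/grub -- enable the IOMMU on Intel, then run 'sudo update-grub'
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on"

    # /etc/modprobe.d/vfio.conf -- claim the GTX 970 (GPU + HDMI audio) for vfio-pci
    # and make sure it wins over nouveau; then run 'sudo update-initramfs -u'
    options vfio-pci ids=10de:13c2,10de:0fbb
    softdep nouveau pre: vfio-pci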

Currently I have no good way of organising input, as I use a single keyboard and mouse with a KVM switch. I have a PCIe USB controller from a former passthrough machine, but its ports are misaligned with the chassis back plate, so I had to order a new one. I expect this to be straightforward to add; then the host and the Windows VM will each have exclusive physical USB ports, so no emulation of peripherals will be needed. USB will also handle VM sound; I have not tested QEMU's emulated sound.

The eagle-eyed reader will spot a Wifi card in the device list; it is a Gigabyte add-on card sitting in a PCIe x1 slot, so it is not part of the motherboard.

VM setup notes

I used virt-manager to create the VMs, followed by 'virsh edit' for final tweaks. I have not optimised much yet (no vCPU pinning or hugepages), but hyperv extensions are enabled. Red Hat's virtio drivers are installed in Windows. Most testing below has been carried out with 4 real cores passed through to the VM (host-passthrough, sockets = 1, cores = 4, threads = 2).
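For reference, that CPU configuration corresponds to a libvirt snippet roughly like the following (a sketch of what 'virsh edit' shows, not a verbatim copy of my XML):

    <vcpu placement='static'>8</vcpu>
    <cpu mode='host-passthrough'>
      <topology sockets='1' cores='4' threads='2'/>
    </cpu>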

In principle I could give the VM the entire SATA controller or the entire NVMe SSD, as both are in their own IOMMU groups and the Vertex4 SSD is the only SATA device. I have not tested this, though. Instead the Vertex4 (sda) is handed to the VM like this:

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
      <source dev='/dev/sda'/>
      <target dev='sda' bus='scsi'/>
      <boot order='1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
  ...
    <controller type='scsi' index='0' model='virtio-scsi'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </controller>

Note that the disk itself is not defined as a virtio device, only indirectly, since the emulated SCSI controller it sits on is (model='virtio-scsi'). Virtio-scsi is AFAIK the only way to get Windows TRIM commands to pass through the virtualisation layers down to the hardware, unless (presumably) you do PCIe passthrough of a physical controller.

EDIT 2017-11-18: I had forgotten the option discard='unmap' on the driver line of the above XML. This is needed for TRIM commands from the VM to actually reach the SSD. I have updated it now. /END EDIT
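To sanity-check that discards actually reach the drive, one option (just a suggestion, not something I have verified in detail) is to confirm on the host that the backing device advertises discard support, and then trigger a retrim from inside the guest:

    # Host: non-zero DISC-GRAN/DISC-MAX means /dev/sda accepts discard/TRIM
    lsblk --discard /dev/sda
    # Windows guest (admin prompt): ask the OS to re-send TRIM for free space
    #   defrag C: /L /U /V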

To get GeForce drivers to work in Windows while still using hyperv optimisations, I have to add the following to the libvirt domain XML:

    <hyperv>
      ...
      <vendor_id state='on' value='nvidia_sucks'/> ['value' can be almost any string]
    </hyperv>
    <kvm>
      <hidden state='on'/>
    </kvm>

I include this note as there has been some confusion recently about which workarounds are needed to fool the Nvidia drivers into starting despite passthrough. (This is an ongoing arms race and a reason for caution before buying a GeForce card for passthrough - future drivers may block this usage in new ways.) As far as I know, both of the above additions together are the only way to have any hyperv extensions active while still having a working GeForce passed through. Note that doing this purely in XML as above requires a recent libvirt; not long ago it was only possible with QEMU command line injections.

(basic) Performance

Some very initial performance numbers, baremetal vs virtual Win 8.1 Pro N. Baremetal and virtual performance are recorded from the same installed Win8.1 image (see the note on "duel booting" below).

Unigine Heaven 4 ultra settings, 1080p: no discernible difference between virtual and baremetal - around 1300 points, ~25-110fps (average ~55) in both cases

Flash performance (CrystalDiskMark), Baremetal vs VM (generic disk model) vs VM (optimised disk model)
(OCZ Vertex4 256GB / SATA3)

[image: CrystalDiskMark results]

Notable differences are primarily in IOPS, where the generic emulated SATA device model suffers. But there is still room for optimisation relative to baremetal, and curiously there are areas where emulated SATA outperforms virtio-SCSI. Note again that many "best practices" optimisations (such as hugepages) are not configured yet.
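For the record, the hugepages setup I have not done yet would look roughly like this: reserve pages on the host (e.g. 'echo 4096 | sudo tee /proc/sys/vm/nr_hugepages' for 8GiB worth of 2MiB pages - illustrative numbers) and add a memoryBacking element to the guest XML (a sketch, not my actual config):

    <memoryBacking>
      <hugepages/>
    </memoryBacking>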

“Duel” booting - some caveats

I tried @wendell's recently suggested "duel booting" procedure: first I installed Win8.1 on the Vertex4 SSD on baremetal, then passed the SSD through to a VM as a raw block device. It works more or less as expected, but:

  1. Booting the VM triggers re-activation of Win8.1, given that it was already activated on baremetal. I have not tried activating Windows in the virtualised state, as I suppose that will just require another re-activation on the next baremetal boot… Any advice on this? I suspect this is MS-intended behaviour and in line with the license, unfortunately.
  2. Every time I switch between VM boot and baremetal boot, Win8 stays at "Windows is reconfiguring your devices" for some time. In addition, if this happens while on baremetal, Windows changes the UEFI boot order so that its disk comes first next time, bypassing the GRUB loader on the other disk. (This issue alone is reason enough to never let Windows near the real hardware ever again!)

Because of these issues I will probably give up duel booting once I'm satisfied with my testing, and run Windows as a VM only. I am also considering adding some spinning storage and backing Windows with ZFS sparse volumes, using the SSD as cache. That would simplify backups and be more disk-space efficient.
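If I go that route, the setup would be something like the following (a sketch with hypothetical pool and volume names, untested on this machine):

    # Create a pool on the (future) spinning disk, add the SSD as cache,
    # and carve out a sparse 200G zvol for the Windows VM
    sudo zpool create tank /dev/disk/by-id/<spinning-disk>
    sudo zpool add tank cache /dev/disk/by-id/<ssd-partition>
    sudo zfs create -s -V 200G tank/win81
    # Then point the libvirt <source dev=.../> at /dev/zvol/tank/win81 instead of /dev/sda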

These are my notes so far from this setup. I am far from done testing, but I see enough of the critical parts working to consider Coffee Lake a viable GPU passthrough platform.


Win10 seemed satisfied with real hardware and VM after activating. I have had reports from users that this is OK on retail Win10 but not OEM, unless you use a Microsoft account.

Keep us posted, looks good. I haven't yet found a Z370 board that has two IOMMU groups for each of the x8 slots straight to the CPU…

That would somewhat surprise me. I suspect this is just because Windows detected some new hardware somewhere (even though the passthrough "should" look like the same system, who knows what internal checks Windows does).

You may be right, it would be nice. Unfortunately I have reached the max activation limit on this key (I have used it for a few years for experiments like this), so I'll have to call MS support before I can try.

My mistake might have been that I activated on baremetal before exposing the same installation to the virtual hardware. So it's not impossible that my Win8 (non-OEM) actually works like that too.

Does Win10 duel-boot seamlessly, without going into a "preparing your devices" type state for a while after switching from virtual to baremetal or back? This phenomenon (issue 2 above) is a bit annoying, especially as it affects the UEFI boot order in an intrusive way.

Not happening for me on Win10, but I did fiddle with the config to be sure the same CPU is being passed through as-is.

Some updates.

2 discrete GPUs in the system

I added a second discrete GPU, only for testing. I am generally happy with using the IGD for the host (in fact, this was the reason I chose Coffee Lake over other platforms), but I wanted to investigate the prospect of using two discrete GPUs with this hardware.

Some time ago I got a working USB 3.0 card, an ICY BOX IB-AC614a (VIA VL805). It is connected to one of the upstream ports of my KVM switch (the other upstream port is connected to the motherboard USB ports). I also tried putting a Radeon 7790 in the remaining empty PCIe x4 slot, which sits on the chipset and thus gets its own IOMMU group. See:

IOMMU Group 0 00:00.0 Host bridge [0600]: Intel Corporation Device [8086:3ec2] (rev 07)
IOMMU Group 10 00:1c.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #1 [8086:a290] (rev f0)
IOMMU Group 11 00:1c.4 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #5 [8086:a294] (rev f0)
IOMMU Group 12 00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a2c9]
IOMMU Group 12 00:1f.2 Memory controller [0580]: Intel Corporation 200 Series PCH PMC [8086:a2a1]
IOMMU Group 12 00:1f.3 Audio device [0403]: Intel Corporation 200 Series PCH HD Audio [8086:a2f0]
IOMMU Group 12 00:1f.4 SMBus [0c05]: Intel Corporation 200 Series PCH SMBus Controller [8086:a2a3]
IOMMU Group 13 00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I219-V [8086:15b8]
IOMMU Group 14 04:00.0 USB controller [0c03]: VIA Technologies, Inc. VL805 USB 3.0 Host Controller [1106:3483] (rev 01)
IOMMU Group 15 05:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [144d:a804]
IOMMU Group 16 07:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Bonaire XT [Radeon HD 7790/8770 / R7 360 / R9 260/360 OEM] [1002:665c]
IOMMU Group 16 07:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:0002]
IOMMU Group 1 00:01.0 PCI bridge [0604]: Intel Corporation Skylake PCIe Controller (x16) [8086:1901] (rev 07)
IOMMU Group 1 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 970] [10de:13c2] (rev a1)
IOMMU Group 1 01:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1)
IOMMU Group 2 00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:3e92]
IOMMU Group 3 00:14.0 USB controller [0c03]: Intel Corporation 200 Series PCH USB 3.0 xHCI Controller [8086:a2af]
IOMMU Group 3 00:14.2 Signal processing controller [1180]: Intel Corporation 200 Series PCH Thermal Subsystem [8086:a2b1]
IOMMU Group 4 00:16.0 Communication controller [0780]: Intel Corporation 200 Series PCH CSME HECI #1 [8086:a2ba]
IOMMU Group 5 00:17.0 SATA controller [0106]: Intel Corporation 200 Series PCH SATA controller [AHCI mode] [8086:a282]
IOMMU Group 6 00:1b.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #17 [8086:a2e7] (rev f0)
IOMMU Group 7 00:1b.2 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #19 [8086:a2e9] (rev f0)
IOMMU Group 8 00:1b.3 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #20 [8086:a2ea] (rev f0)
IOMMU Group 9 00:1b.4 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #21 [8086:a2eb] (rev f0)

We find the GTX 970 in group 1, the Radeon 7790 in group 16, and the USB controller in group 14.
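For reference, handing a whole PCI device like the VL805 USB controller to a VM comes down to a hostdev stanza along these lines (a sketch of the kind of element virt-manager generates; the source address matches the lspci listing above):

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      </source>
    </hostdev>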

Passthrough of 2 GPUs (while using IGD for host graphics)

I tried assigning the GTX 970 and the Radeon 7790 to two different VMs and running them in parallel.

Win8.1: GTX 970 + USB controller
Win7: Radeon 7790

The Radeon 7790 is passed through to the Windows 7 installation as a secondary GPU - this means that SeaBIOS is used and the system always boots from the QEMU-emulated video. It is easy to make Windows use the secondary GPU as if it were primary, however, by turning off the other displays in Windows' native display settings. Should the discrete GPU suddenly stop working for some reason, the emulated video should be re-enabled automatically (I did not test this recently, but that matches past experience at least). In my case the emulated GPU turned off automatically as soon as the 7790 came online.

Installing Radeon drivers in the VM: autodetect of the 7790 worked! There have been problems in the past with AMD's autodetect procedure, but this time it detected the card without issue. This may be related to the absence of reset issues (see below).

GPU-Z reports the card running at PCIe 3.0 x4 when under load, as expected for this slot. So the PCIe speed should not be a bottleneck for this card, though it might still be for some of today's more extreme GPUs… So here are some benchmarks:

7790 benchmark from Unigine heaven 4.0 with Ultra settings, 8x AA & Extreme Tesselation, 1080p (PCIe 3.0 x4):
391pts, 15.5 [7.8-36.9] FPS

This can be compared to my performance using the same card and settings on a K10 platform in the past (PCIe 2.0 x16, K10 Magny-Cours with 4 x 2.6GHz assigned to VM):

389 points, 15.5 [6.8-36.4] FPS

So no discernible difference compared to an older and slower machine.

I also ran the same benchmark on the 7790 and the GTX 970 in parallel, in different VMs, with no notable performance impact on either. This is expected, as they do not compete for any substantial resources.

What happens if we add some competition for the chipset bandwidth while running Heaven 4?

    # Samsung SM961 @ NVMe x4: three passes in the background so it lasts through the GPU benchmark
    ( dd if=/dev/nvme0n1 of=/dev/null bs=1M ; dd if=/dev/nvme0n1 of=/dev/null bs=1M ; dd if=/dev/nvme0n1 of=/dev/null bs=1M ) &
    # OCZ Vertex4 @ SATA3: run in parallel with the NVMe reads
    dd if=/dev/sda of=/dev/null bs=1M

(Make sure you understand what you are doing, especially the if= and of= options to dd, before you try the above.)

So here I read data as fast as possible from both the NVMe and the SATA SSDs in parallel. dd reported reading 2.8GB/s from nvme0 and 510MB/s from sda. Heaven 4.0 results for the 7790:

306pts, 12.1 [2.9-36.6] FPS (substantially lower than before)

So reading about 3.2-3.3GB/s from chipset devices at the same time does have a measurable impact, while running another GPU on the CPU-connected PCIe lanes does not. This makes sense, as everything behind the chipset shares the DMI 3.0 link to the CPU (roughly equivalent to PCIe 3.0 x4, i.e. a bit under 4GB/s). Given the speeds of today's NVMe SSDs, it is therefore not advisable to run a GPU in a chipset-connected slot for anything serious alongside NVMe-based storage.

I expected to encounter the device reset bug with this card, as that has been the case when using it in the past, but not this time: rebooting the VM without rebooting the host worked fine. I believe some workarounds have been added to the vfio_pci module since I last tried; some quick Google results suggested as much.

Note that I have not tested the 7790 with OVMF, only as secondary passthrough with SeaBIOS (the same way I always used Radeon cards for passthrough in the past). Secondary passthrough is an alternative if you have an AMD card and do not want to fiddle with OVMF. However, there are reports that this does not work on Threadripper, and I don't know whether it works with all Radeon cards.

Second GPU as host graphics

I finally tried running the 7790 in the host under Linux, while passing through the GTX 970 to Windows. I did not manage to get UEFI to initialise the 7790 in the chipset slot as primary GPU; I don't think the current UEFI on the ASRock Z370M Pro4 supports that. I then tried to let radeon/AMDGPU bind to it after booting on the IGD, but for some reason this did not work either. However, a baremetal boot of Windows found both discrete GPUs and could use both, so I don't think my problems in Linux were due to hardware limitations. It will likely work after some troubleshooting in Linux, but don't take my word as a promise.

Sorry to pull this up, but I had a couple of questions, since your IOMMU groups were interesting to me.

Are you using the ACS override patch? (doesn’t seem like it but just want to confirm)

If you have an NVMe device in the M2_1 slot instead of the M2_2 slot, what do the groups look like? In other words, does the M2_1 device get its own group, or is it combined with the primary graphics card?

I haven't been able to determine whether Coffee Lake M.2 slots come off the chipset or not, and whether they are separated properly.

They all do. Coffee Lake has 20 PCIe lanes, of which 4 are for DMI; everything else is hanging off of that.

Thanks. I think I was confused because AMD actually has 24 CPU lanes on Ryzen, with 4 going directly to an M.2 slot. It seems kind of crazy for Intel to still be hanging so many things off the chipset when it all has to push through the DMI.

No competition is good for business ¯\_(ツ)_/¯
(Almost) no development cost and still profits :+1: