Intro
In this installment we are going to discuss the technology behind PCI passthrough to VMs. The concept of passthrough is relatively simple: you take a physical device and forward its memory registers to the VM. A simple idea, however, doesn’t make for a simple implementation. There’s a lot that goes into passthrough, and a bunch of extremely talented people have put a lot of time into the software to bring passthrough to the point where it’s a lot easier than it was before, but still not quite plug ‘n’ play.
Conceptual Discussion
Now, what goes into it exactly? To give a brief overview, we’ve got the hardware support for passthrough, the IOMMU or Input-Output Memory Management Unit, which must be supported by both the motherboard and the CPU. (more info on that here) The Linux driver, VFIO, is assigned to the device at boot, preventing the device from being initialized by its regular driver. This will help us when it comes to passing our GPU into the VM. If we’ve got the GPU bound to another driver, we won’t achieve successful passthrough, because we won’t be able to exclusively lock the GPU’s resources to the QEMU VM.
Now, let’s talk about the PCIe bus. The machine I’m going to be using as a reference has an ASUS Z170-a and a 6700k. This gives me 16 PCIe lanes on the CPU to play with. Most GPUs will be happy with 8 lanes, so we shouldn’t have bandwidth issues here. I am going to be passing two devices to my VM: a GPU and a USB 3 controller. The passthrough GPU will be using an 8x connection on the CPU and the USB controller will be using a 4x connection on the PCH. This will allow the GPU I’m using for Linux output to be connected by an 8x connection on the CPU as well.
Handling PCIe passthrough isn’t 100% straightforward. There are limitations and rules about how the IOMMU sees a device and its IO Virtual Addresses (IOVA). Some devices will alias to the same IOVA space, which makes the IOMMU unable to distinguish between the two devices. This becomes problematic when dealing with transaction rerouting. Transactions don’t always make it all the way to the IOMMU, because the PCIe specification allows any downstream PCIe port to re-route a transaction from one device to another. Say you have a USB Host Bus Adapter (HBA) assigned to a VM and you’re also using a SATA HBA that happens to share the same IOVA space: any interconnect could mistakenly redirect a transaction from the USB HBA to the SATA HBA.
This problem is solved by the PCIe specification for Access Control Services (ACS). This spec will limit an interconnect’s reach and control the redirects that could otherwise cause major problems. This is where IOMMU groups come into play, grouping devices capable of untranslated peer-to-peer DMA together. What this means to us is that without modifying the groups, you must pass all devices within a group to the same VM or your system will crash.
This is where Alex Williamson comes in. He’s written a patch which allows the spoofing of IOMMU groups at the kernel level so that every PCIe device appears to be in its own group. Essentially, when this patch is active, the kernel’s IOMMU module ignores the PCIe ACS rules and groups devices manually, allowing the user to either configure which devices are segregated or segregate each device into its own IOMMU group. This is dangerous because while at the kernel level it appears they’re segregated and thus eligible for passthrough individually, a PCIe interconnect can still redirect DMA transactions to another device. This can cause issues with passthrough. The bright side is that it’s not common for devices to perform these transactions.
So, let’s put everything we learned together. If the default groups don’t work for us, we enable the ACS override patch by using a kernel command line argument. This makes the kernel treat each device as if it were in its own IOMMU group, even though the hardware can still redirect transactions as described above. From there, we blacklist our GPU’s regular driver and tell vfio-pci to bind to the GPU. This results in us having a PCI device that’s uninitialized and ready to pass through to a VM.
This isn’t the only way to do it though. You can rebind kernel drivers while the PC is on with certain commands, but that’s going to come in a later guide.
Functional implementations
Now that we’ve gone over the technology and logic behind the PCI bus and passthrough, let’s go ahead and prep a GPU for passthrough. I’m going to be using a GTX 970 on my Z170-a motherboard. As far as software assumptions go, this can be done on almost any distribution out there. I’m currently using Solus but have, in the past, succeeded in passing through a GPU on Ubuntu, Fedora, Arch and Gentoo.
This part of the guide recommends a lot of reboots, which makes it easier to diagnose where a problem occurs. The process can be done in one step without rebooting, but I recommend following the steps as given if you’re not very familiar with the process.
Determining Hardware Compatibility
Not all hardware is compatible with passthrough. Let’s have a look at your system and see if you’ve got what it takes.
First, let’s look at your CPU. If you don’t know exactly which CPU you’ve got in your system, go ahead and execute cat /proc/cpuinfo. This will print a lot of information about each core of your CPU. You’ll be able to see model name in the mix.
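If you only want the model name rather than the whole dump, a small sketch like the following will do. The function name cpu_model is my own invention, and the optional file argument is only there so it can be pointed at a saved copy of the file; by default it reads /proc/cpuinfo.

```shell
# cpu_model [FILE]
# Print each distinct "model name" value found in a cpuinfo-style file.
# FILE defaults to /proc/cpuinfo; the function name is illustrative.
cpu_model() {
    file=${1:-/proc/cpuinfo}
    # Lines look like "model name<TAB>: <CPU description>".
    sed -n 's/^model name[[:space:]]*: //p' "$file" | sort -u
}
```

Calling cpu_model with no arguments prints one line per distinct CPU model in the machine, which is the string you’ll want to search for in the next step.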
If you’ve got an Intel processor, you’ll need to find its ARK page and, under “Advanced Technologies”, you’ll find an entry for Intel Virtualization Technology for Directed I/O (VT-d). If this entry says “No”, you’re out of luck.
Now, AMD is a bit trickier. AMD doesn’t have an official page for this, so you’ll have to refer to this Wikipedia page to get your information. If your CPU shows up on the list, you’re in business.
Now that we’ve sorted out the CPU, we need to make sure we’ve got Motherboard support for all this awesome tech. For this, Intel is more or less straightforward. If you’ve got a Z170, X99 or Qxx chipset, you should have support. There are some other oddities that have support, but you’ll have to consult either your motherboard’s documentation or this Wikipedia page for more information.
On the AMD front, you’ll be looking at this Wikipedia page for information about your motherboard. I wish I were more savvy with AMD’s motherboards, but I’ve been on Intel since the Athlon 64 era ended.
Now, if both your Motherboard and CPU have support, you’ll just need to enable VT-d or AMD-Vi in the BIOS/UEFI config menu, then you’ll be ready to get on to the next step.
A note about GPU compatibility
The state of GPU compatibility is somewhat frustrating. Let me go into the problems surrounding each vendor’s GPUs individually.
AMD GPUs are completely compatible on the software side, and on the hardware side, they’re mostly compatible. The problem comes in when you try to use a reset command on certain AMD GPUs. AMD was lazy when they built this feature on their cards, presumably to save costs, so these cards don’t fully reset and then fail to re-initialize. This means that every time you restart your VM, you need to restart the physical machine.
Nvidia has no physical incompatibilities like some AMD devices, but Nvidia wants you to buy the more expensive Quadro cards if you’re going to be passing through your GPU, so if you’re using the Nvidia drivers in a VM, they fail to initialize a non-Quadro GPU. You’ll experience an “error 43” issue with the device, which can be seen in the Windows device manager. There is a workaround, which I’ll go over in my next installment, at a small (1-2%) CPU performance hit.
Enable the IOMMU
First things first, we need to enable the IOMMU. For this, we need only to edit the kernel command line arguments. This is done differently depending on which bootloader you’re using.
The argument is different depending on which type of CPU you have. For an Intel CPU, you’ll be using intel_iommu=on and for AMD, you’ll be using amd_iommu=on. Make sure you use the proper argument for your vendor.
If you’re using GRUB, you’ll need to edit /etc/default/grub and add the argument you’ve chosen above to GRUB_CMDLINE_LINUX_DEFAULT. Once you’re done editing the file, you’ll need to rebuild your GRUB config with grub-mkconfig -o /boot/grub/grub.cfg.
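If you’d rather script the GRUB edit than do it by hand, here is a minimal sketch. The function name add_kernel_param is hypothetical, and it assumes the GRUB_CMDLINE_LINUX_DEFAULT value in your file is wrapped in double quotes on a single line, which is common but not guaranteed. It leaves a .bak copy next to the file and does nothing if the argument is already there.

```shell
# add_kernel_param FILE PARAM
# Append PARAM to the double-quoted GRUB_CMDLINE_LINUX_DEFAULT value in FILE.
# Assumes the value is on one line and double-quoted; keeps FILE.bak as a backup.
add_kernel_param() {
    file=$1
    param=$2
    # Skip the edit if PARAM is already one of the space-separated arguments.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi
    # Re-emit the existing value with PARAM added before the closing quote.
    sed -i.bak -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\")([^\"]*)\"|\1\2 ${param}\"|" "$file"
}
```

As root, something like add_kernel_param /etc/default/grub intel_iommu=on followed by the grub-mkconfig command above would cover this step. Check the file afterwards before you reboot.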
If you’re using Goofiboot (Solus EFI), you’ll be editing the /boot/efi/loader/entries/solus.conf file and appending the argument to the line that begins with options.
If you’re using systemd-boot, you’ll be doing something similar, editing /boot/loader/entries/entry.conf where entry is the name of your installation entry.
For other bootloaders, have a look at the archwiki kernel parameters page. You should find what you need there.
Once done, you should have something like root=UUID=96b7100c-6b49-4843-bf6c-6c8c78918a3a intel_iommu=on in the kernel command line configuration.
Now reboot so you can look at the IOMMU groups.
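After the reboot, a quick sanity check is to confirm the argument actually reached the running kernel, which exposes its command line in /proc/cmdline. The sketch below uses a made-up function name, cmdline_has, and takes an optional file argument purely so it can be tried against a saved copy.

```shell
# cmdline_has PARAM [FILE]
# Succeed if PARAM appears as a whole argument on the kernel command line.
# FILE defaults to /proc/cmdline; the function name is illustrative.
cmdline_has() {
    param=$1
    file=${2:-/proc/cmdline}
    # Put each space-separated argument on its own line, then match exactly.
    tr -s ' ' '\n' < "$file" | grep -qxF -- "$param"
}
```

For example, cmdline_has intel_iommu=on && echo "argument is active" prints the message only if the bootloader change took effect.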
To check that the IOMMU is enabled, you’ll need to execute the following command:
$ dmesg | grep DMAR
[ 0.000000] ACPI: DMAR 0x0000000076A37E48 000078 (v01 INTEL SKL 00000001 INTL 00000001)
[ 0.000000] DMAR: IOMMU enabled
[ 0.065046] DMAR: Host address width 39
[ 0.065047] DMAR: DRHD base: 0x000000fed90000 flags: 0x1
[ 0.065051] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap d2008c40660462 ecap f050da
[ 0.065051] DMAR: RMRR base: 0x00000076779000 end: 0x00000076798fff
[ 0.065052] DMAR-IR: IOAPIC id 2 under DRHD base 0xfed90000 IOMMU 0
[ 0.065053] DMAR-IR: HPET id 0 under DRHD base 0xfed90000
[ 0.065053] DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
[ 0.065053] DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
[ 0.066337] DMAR-IR: Enabled IRQ remapping in xapic mode
[ 1.038203] DMAR: [Firmware Bug]: RMRR entry for device 03:00.0 is broken - applying workaround
[ 1.038204] DMAR: No ATSR found
[ 1.038537] DMAR: dmar0: Using Queued invalidation
[ 1.038542] DMAR: Setting RMRR:
[ 1.038560] DMAR: Setting identity map for device 0000:00:14.0 [0x76779000 - 0x76798fff]
[ 1.038579] DMAR: Setting identity map for device 0000:03:00.0 [0x76779000 - 0x76798fff]
[ 1.038583] DMAR: Prepare 0-16MiB unity mapping for LPC
[ 1.038598] DMAR: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
[ 1.038601] DMAR: Intel(R) Virtualization Technology for Directed I/O
You’re looking for the line that says DMAR: IOMMU enabled.
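If you don’t feel like reading through the whole log, the check can be reduced to a yes or no. This is only a sketch: the function name iommu_enabled_in_log is mine, and it looks only for the Intel message shown above, so it will not recognise whatever an AMD system logs. On some systems dmesg also needs root.

```shell
# iommu_enabled_in_log
# Read kernel log text on stdin and succeed if the Intel
# "DMAR: IOMMU enabled" line is present. Intel-only by design.
iommu_enabled_in_log() {
    grep -q 'DMAR: IOMMU enabled'
}
```

Usage would be along the lines of dmesg | iommu_enabled_in_log && echo "IOMMU is enabled".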
A look at your IOMMU groups.
Now that you’ve got your IOMMU enabled, let’s have a look at how the motherboard is grouping these devices by default. This archwiki page has a useful bash script that prints out, in a neat list, the IOMMU groups and the devices belonging to those groups.
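# Walk every device entry under each IOMMU group in sysfs.
for d in /sys/kernel/iommu_groups/*/devices/*; do
    # Strip the path down to just the group number.
    n=${d#*/iommu_groups/*}; n=${n%%/*}
    printf 'IOMMU Group %s ' "$n"
    # Describe the device at this PCI address.
    lspci -nns "${d##*/}"
done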
What we’re looking for is the GPU and other devices that we’re going to be passing through. We need to make sure that there are no other devices in the same group as the GPU or the other devices marked for passthrough. If there are, they’ll either need to be passed through as well, or we’re going to need to enable the PCIe ACS override patch. More on compiling this into your kernel is coming soon.
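To answer the “is anything else in this group?” question for one device at a time, here is a rough sketch that walks the same sysfs tree as the script above. The function name group_peers is hypothetical. Note that sysfs uses the full address with the domain prefix (0000:02:00.0), not the short form lspci prints.

```shell
# group_peers ADDR [GROUPS_DIR]
# List the other devices that share an IOMMU group with ADDR.
# ADDR is a full PCI address such as 0000:02:00.0.
# GROUPS_DIR defaults to /sys/kernel/iommu_groups.
group_peers() {
    addr=$1
    groups_dir=${2:-/sys/kernel/iommu_groups}
    for dev in "$groups_dir"/*/devices/"$addr"; do
        # An unmatched glob stays literal, so skip anything that is not there.
        [ -e "$dev" ] || continue
        for peer in "${dev%/*}"/*; do
            # Print every member of the group except ADDR itself.
            [ "${peer##*/}" = "$addr" ] || printf '%s\n' "${peer##*/}"
        done
    done
}
```

An empty result means the device is alone in its group. Anything it prints has to go to the same VM, unless you use the ACS override. A GPU’s own audio function showing up here is expected.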
As an example, this is what my IOMMU groups look like AFTER applying the patch. The devices I’m looking for are 02:00.0, 02:00.1 and 06:00.0.
IOMMU Group 0 00:00.0 Host bridge [0600]: Intel Corporation Device [8086:191f] (rev 07)
IOMMU Group 10 00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a145] (rev 31)
IOMMU Group 10 00:1f.2 Memory controller [0580]: Intel Corporation Device [8086:a121] (rev 31)
IOMMU Group 10 00:1f.3 Audio device [0403]: Intel Corporation Device [8086:a170] (rev 31)
IOMMU Group 10 00:1f.4 SMBus [0c05]: Intel Corporation Device [8086:a123] (rev 31)
IOMMU Group 11 00:1f.6 Ethernet controller [0200]: Intel Corporation Device [8086:15b8] (rev 31)
IOMMU Group 12 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices [AMD] nee ATI Device [1002:7300] (rev cb)
IOMMU Group 12 01:00.1 Audio device [0403]: Advanced Micro Devices [AMD] nee ATI Device [1002:aae8]
IOMMU Group 13 02:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:13c2] (rev a1)
IOMMU Group 13 02:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fbb] (rev a1)
IOMMU Group 14 03:00.0 USB controller [0c03]: ASMedia Technology Inc. Device [1b21:1242]
IOMMU Group 15 04:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge [1b21:1080] (rev 04)
IOMMU Group 16 06:00.0 USB controller [0c03]: Fresco Logic Device [1b73:1100] (rev 10)
IOMMU Group 1 00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 07)
IOMMU Group 2 00:14.0 USB controller [0c03]: Intel Corporation Device [8086:a12f] (rev 31)
IOMMU Group 3 00:16.0 Communication controller [0780]: Intel Corporation Device [8086:a13a] (rev 31)
IOMMU Group 4 00:17.0 SATA controller [0106]: Intel Corporation Device [8086:a102] (rev 31)
IOMMU Group 5 00:1b.0 PCI bridge [0604]: Intel Corporation Device [8086:a167] (rev f1)
IOMMU Group 6 00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:a110] (rev f1)
IOMMU Group 7 00:1c.2 PCI bridge [0604]: Intel Corporation Device [8086:a112] (rev f1)
IOMMU Group 8 00:1c.5 PCI bridge [0604]: Intel Corporation Device [8086:a115] (rev f1)
IOMMU Group 9 00:1d.0 PCI bridge [0604]: Intel Corporation Device [8086:a118] (rev f1)
If you have to apply the ACS patch, edit your kernel command line and add pcie_acs_override=downstream to the line and, if needed, regenerate the GRUB config with grub-mkconfig -o /boot/grub/grub.cfg.
VFIO and blacklisted modules.
VFIO works with GPU drivers (or modules) in a peculiar way. It only works properly* if the drivers are completely blacklisted from the OS. Let’s get started blacklisting modules. Most distributions use the /etc/modprobe.d/ directory to handle configuration and options for kernel modules. We’re going to be creating and working with two files here.
Let’s start by blacklisting the driver for the GPU you’re going to pass through. If you’re passing an Nvidia GPU, we’re going to blacklist the nouveau module by editing /etc/modprobe.d/nouveau.conf to contain blacklist nouveau. If you’re passing an AMD GPU, blacklist the fglrx and amdgpu modules in the /etc/modprobe.d/amdgpu.conf file, each on a separate line, like so:
blacklist fglrx
blacklist amdgpu
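Once you’ve rebooted with the blacklist in place, you can confirm that the module really stayed out of the kernel. The kernel lists loaded modules in /proc/modules, one per line with the name first. The helper below is a sketch with a made-up name, module_loaded, and an optional file argument for trying it on a saved copy.

```shell
# module_loaded NAME [FILE]
# Succeed if NAME is listed as a loaded module.
# FILE defaults to /proc/modules, where the first field is the module name.
module_loaded() {
    name=$1
    file=${2:-/proc/modules}
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}
```

For example, module_loaded nouveau || echo "nouveau is not loaded" should print the message after a successful blacklist. Module names in that file use underscores, so look for vfio_pci rather than vfio-pci.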
Now that we’ve blacklisted the video drivers, we can bind the vfio module to the device. To do this, we need the device IDs of our devices. To find them, we issue lspci -nn which will show us (in square brackets) the device ID of each PCI device connected to the computer. Let’s look at my output, for example.
$ lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation Device [8086:191f] (rev 07)
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 07)
00:14.0 USB controller [0c03]: Intel Corporation Device [8086:a12f] (rev 31)
00:16.0 Communication controller [0780]: Intel Corporation Device [8086:a13a] (rev 31)
00:17.0 SATA controller [0106]: Intel Corporation Device [8086:a102] (rev 31)
00:1b.0 PCI bridge [0604]: Intel Corporation Device [8086:a167] (rev f1)
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:a110] (rev f1)
00:1c.2 PCI bridge [0604]: Intel Corporation Device [8086:a112] (rev f1)
00:1c.5 PCI bridge [0604]: Intel Corporation Device [8086:a115] (rev f1)
00:1d.0 PCI bridge [0604]: Intel Corporation Device [8086:a118] (rev f1)
00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a145] (rev 31)
00:1f.2 Memory controller [0580]: Intel Corporation Device [8086:a121] (rev 31)
00:1f.3 Audio device [0403]: Intel Corporation Device [8086:a170] (rev 31)
00:1f.4 SMBus [0c05]: Intel Corporation Device [8086:a123] (rev 31)
00:1f.6 Ethernet controller [0200]: Intel Corporation Device [8086:15b8] (rev 31)
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices [AMD] nee ATI Device [1002:7300] (rev cb)
01:00.1 Audio device [0403]: Advanced Micro Devices [AMD] nee ATI Device [1002:aae8]
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:13c2] (rev a1)
02:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fbb] (rev a1)
03:00.0 USB controller [0c03]: ASMedia Technology Inc. Device [1b21:1242]
04:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge [1b21:1080] (rev 04)
06:00.0 USB controller [0c03]: Fresco Logic Device [1b73:1100] (rev 10)
Now, the device IDs I’m looking for are for my Nvidia GPU and the USB controller, so they’re going to be 1b73:1100, 10de:13c2 and 10de:0fbb. Now that we’ve got the device IDs, we edit /etc/modprobe.d/vfio.conf and add the following line:
options vfio-pci ids=1b73:1100,10de:13c2,10de:0fbb
Obviously, you’ll want to substitute the device IDs with your own, but you get the idea.
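If picking the IDs out by eye feels error-prone, the lookup can be sketched in a few lines of shell. Both function names here (pci_ids_for and vfio_options_line) are my own, and the sketch assumes the lspci -nn format shown above, where the vendor:device pair is the last bracketed xxxx:xxxx on each line.

```shell
# pci_ids_for ADDR...
# Read `lspci -nn` output on stdin and print the vendor:device ID
# for each PCI address given as an argument, in argument order.
pci_ids_for() {
    listing=$(cat)
    for addr in "$@"; do
        printf '%s\n' "$listing" |
            sed -n "s/^${addr} .*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p"
    done
}

# vfio_options_line
# Join IDs (one per line on stdin) into a modprobe options line.
vfio_options_line() {
    printf 'options vfio-pci ids=%s\n' "$(paste -sd, -)"
}
```

With my addresses, lspci -nn | pci_ids_for 06:00.0 02:00.0 02:00.1 | vfio_options_line produces the same line I added to vfio.conf by hand.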
Now, to make sure these bind early at boot, we’ve got to edit our kernel command line once again. Add the following to the kernel command line section.
vfio-pci.ids=1b73:1100,10de:13c2,10de:0fbb
Now that that’s done, we need to make sure we’ve got all the VFIO modules in the initrd or initcpio. There are a few different ways to do this.
If your system uses dracut (to find out, issue ls /etc/dracut.conf. If it prints /etc/dracut.conf you’re using dracut), edit /etc/dracut.conf and on the add_drivers+= line, make sure you add vfio, vfio_iommu_type1, vfio_pci, vfio_virqfd to the list of drivers, then save and rebuild the initrd with sudo dracut -f.
If your system uses mkinitcpio (use ls /etc/mkinitcpio.conf to find out if you’re using it), edit /etc/mkinitcpio.conf and, in the MODULES section, add vfio, vfio_iommu_type1, vfio_pci, vfio_virqfd to the list of modules and run sudo mkinitcpio -p yourkernelname to rebuild the initrd.
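The two ls checks above boil down to “which config file exists?”, which can be sketched as a single function. The name initramfs_tool is invented for this example, and the directory argument exists only so it can be tested somewhere other than /etc. It only knows about the two tools covered in this guide.

```shell
# initramfs_tool [ETC_DIR]
# Guess the initramfs generator from which config file exists.
# ETC_DIR defaults to /etc. Prints dracut, mkinitcpio or unknown.
initramfs_tool() {
    etc_dir=${1:-/etc}
    if [ -e "$etc_dir/dracut.conf" ]; then
        echo dracut
    elif [ -e "$etc_dir/mkinitcpio.conf" ]; then
        echo mkinitcpio
    else
        echo unknown
    fi
}
```

If it prints unknown, your distribution uses something else and you’ll have to consult its documentation for adding modules to the initrd.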
Now that we’ve got vfio configured, if you’re using GRUB, the last thing you need to do is rebuild the GRUB config with grub-mkconfig -o /boot/grub/grub.cfg.
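After one more reboot, it’s worth confirming that vfio-pci actually claimed the devices. Each PCI device in sysfs has a driver symlink pointing at whatever driver is bound to it, so a sketch of the check looks like this. The function name bound_driver is hypothetical, and the address needs the domain prefix (0000:02:00.0), as sysfs uses it.

```shell
# bound_driver ADDR [DEVICES_DIR]
# Print the name of the driver bound to the PCI device at ADDR,
# or "none" if no driver is bound.
# DEVICES_DIR defaults to /sys/bus/pci/devices.
bound_driver() {
    addr=$1
    devices_dir=${2:-/sys/bus/pci/devices}
    link="$devices_dir/$addr/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name.
        basename "$(readlink "$link")"
    else
        echo none
    fi
}
```

For my GPU, bound_driver 0000:02:00.0 should print vfio-pci. If it prints nouveau or anything else, the blacklist or the early binding didn’t take and it’s time to retrace the steps above.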
*This isn't 100% true, but for the sake of this guide, let's assume it is. This assumption will make things easier going forward. I'll be going into how this becomes false at a later time, once I sort out the details.
Conclusion
If you followed this guide to completion, you’ve got a GPU that’s ready to be passed through to a VM. The last bit is easy, but that’s going to come in the next installment because there’s a lot of supplementary material that’s going to give people in-depth knowledge about Linux and KVM. Here’s hoping this guide has been helpful to you.
This article was posted on the 28th of December, the one-year anniversary of my joining the forum. My first post was one discussing passthrough, and since then I’ve moved entirely away from Windows on the bare metal and have truly embraced Linux and virtualization. It’s only fitting that I celebrate one year with the forum by talking about virtualization.
I have been working on a few projects on the side, so get ready to see some more articles from me.
Feel free to check out smaller updates at my blog at blog.kebrx.tech.