The Pragmatic Neckbeard 3: VFIO, IOMMU and PCIe

Intro


In this installment we're going to discuss the technology behind PCI passthrough to VMs. The concept of passthrough is relatively simple: you take a physical device and forward its memory registers to the VM. A simple idea doesn't make for a simple implementation, though. There's a lot that goes into passthrough, and a bunch of extremely talented people have put a lot of time into software to bring it to the point where it's far easier than it used to be, but still not quite plug 'n' play.

Conceptual Discussion


Now, what goes into it exactly? To give a brief overview, we've got the hardware support for passthrough, the IOMMU or Input-Output Memory Management Unit, which must be supported by both the motherboard and CPU. (more info on that here) The Linux driver, vfio-pci, is assigned to the device at boot, preventing the device from being initialized by its normal driver. This is what lets us pass our GPU into the VM: if the GPU is bound to another driver, passthrough will fail, because we won't be able to exclusively lock the GPU's resources to the QEMU VM.
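As an aside, you can check which driver currently owns a device with lspci's -k flag. Substituting your own GPU's PCI address for mine, you'll see something along these lines before we've changed anything:

$ lspci -nnk -s 02:00.0
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:13c2] (rev a1)
        Kernel driver in use: nouveau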

Now, let's talk about the PCIe bus. The machine I'm going to be using as a reference has an ASUS Z170-A and a 6700k. This gives me 16 PCIe lanes on the CPU to play with. Most GPUs will be happy with 8 lanes, so we shouldn't have bandwidth issues here. I am going to be passing two devices to my VM: a GPU and a USB 3 controller. The passthrough GPU will be using an 8x connection on the CPU and the USB controller will be using a 4x connection on the PCH. This will allow the GPU I'm using for Linux output to be connected by an 8x connection as well.

Handling PCIe passthrough isn't 100% straightforward. There are limitations and rules about how the IOMMU sees a device and its I/O Virtual Addresses (IOVA). Some devices alias to the same IOVA space, which makes the IOMMU unable to distinguish between the two devices. This becomes problematic when dealing with transaction rerouting: transactions don't always make it all the way to the IOMMU, because the PCIe specification allows any downstream PCIe port to re-route a transaction from one device to another. Say you have a USB host bus adapter (HBA) assigned to a VM and a SATA HBA on the host, and the two happen to share the same IOVA space. Any interconnect could mistakenly redirect a transaction meant for the USB HBA to the SATA HBA.

This problem is solved by the PCIe specification for Access Control Services (ACS). ACS limits an interconnect's reach and controls the redirects that could otherwise cause major problems. This is where IOMMU groups come into play, grouping together devices capable of untranslated peer-to-peer DMA. What this means for us is that, without modifying the groups, you must pass all devices within a group to the same VM, or the VM simply won't start (QEMU will complain that the group is not viable).

This is where Alex Williamson comes in. He's written a patch that allows spoofing IOMMU groups at the kernel level so that every PCIe device appears to be in its own group. Essentially, when this patch is active, the kernel's IOMMU code ignores the PCIe ACS rules and groups devices manually, allowing the user either to configure which devices are segregated or to put every device into its own IOMMU group. This is dangerous: while at the kernel level the devices appear segregated, and thus eligible for passthrough individually, a PCIe interconnect can still redirect DMA transactions from one device to another, which can cause issues with passthrough. The bright side is that it's not common for devices to perform these transactions.

So, let's put everything we learned together. We enable the ACS override patch with a kernel command line argument, which makes the kernel treat each device as isolated and place it in its own IOMMU group. From there, we blacklist our GPU's regular driver and tell vfio-pci to bind to the GPU instead. This leaves us with a PCI device that's uninitialized and ready to pass through to a VM.

This isn't the only way to do it, though. You can rebind kernel drivers while the PC is running with a few commands, but that's going to come in a later guide.

Functional implementations


Now that we've gone over the technology and logic behind the PCI bus and passthrough, let's go ahead and prep a GPU for passthrough. I'm going to be using a GTX 970 on my Z170-A motherboard. As far as software assumptions go, this can be done on almost any distribution out there. I'm currently using Solus, but have, in the past, succeeded in passing through a GPU on Ubuntu, Fedora, Arch and Gentoo.

This part of the guide recommends a lot of reboots, to make it easier to pin down where a problem occurs. The whole process can be done in one go without rebooting, but I recommend following the steps as given if you're not very familiar with the process.

Determining Hardware Compatibility


Not all hardware is compatible with passthrough. Let’s have a look at your system and see if you’ve got what it takes.

First, let's look at your CPU. If you don't know exactly which CPU you've got in your system, go ahead and execute cat /proc/cpuinfo. This will print a lot of information about each core of your CPU; you'll be able to see the model name in the mix.
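If you just want the model name without the rest of the noise, grep will pull it out for you:

$ grep -m1 'model name' /proc/cpuinfo
model name      : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

That's the 6700k from my reference machine; yours will obviously differ.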

If you've got an Intel processor, you'll need to find its ARK page; under "Advanced Technologies" you'll find an entry for Intel Virtualization Technology for Directed I/O (VT-d). If this entry says "No", you're out of luck.

Now, AMD is a bit trickier. AMD doesn’t have an official page for this, so you’ll have to refer to this Wikipedia page to get your information. If your CPU shows up on the list, you’re in business.

Now that we've sorted out the CPU, we need to make sure we've got motherboard support for all this awesome tech. For Intel this is more or less straightforward: if you've got a Z170, X99 or Qxx chipset, you should have support. There are some other oddities that have support, but you'll have to consult either your motherboard's documentation or this Wikipedia page for more information.

On the AMD front, you'll be looking at this Wikipedia page for information about your motherboard. I wish I were more savvy with AMD's motherboards, but I've been on Intel since the Athlon 64 era ended.

Now, if both your motherboard and CPU have support, you'll just need to enable VT-d or AMD-Vi in the BIOS/UEFI config menu, and then you'll be ready for the next step.

A note about GPU compatibility

The state of GPU compatibility is somewhat frustrating. Let me go into the problems surrounding each vendor's GPUs individually.

AMD GPUs are completely compatible on the software side, and mostly compatible on the hardware side. The problem comes in when you try to issue a reset command to certain AMD GPUs: AMD cut corners when implementing this feature on their cards, presumably to save costs, and these cards don't fully reset and then fail to re-initialize. This means that every time you restart your VM, you need to restart the physical machine.

Nvidia has no physical incompatibilities like some AMD devices, but Nvidia wants you to buy the more expensive Quadro cards if you're going to be passing through your GPU, so the Nvidia drivers refuse to initialize a non-Quadro GPU when they detect they're running in a VM. You'll see an "Error 43" on the device in the Windows Device Manager. There is a workaround, which I'll go over in my next installment, at a small (1-2%) CPU performance hit.

Enable the IOMMU


First things first, we need to enable the IOMMU. For this, we only need to edit the kernel command line arguments. This is done differently depending on which bootloader you're using.

The argument is different depending on which type of CPU you have: for an Intel CPU you'll be using intel_iommu=on, and for AMD you'll be using amd_iommu=on. Make sure you use the proper argument for your vendor.

If you're using GRUB, you'll need to edit /etc/default/grub and add the argument you've chosen above to GRUB_CMDLINE_LINUX_DEFAULT. Once you're done editing the file, you'll need to rebuild your GRUB config with grub-mkconfig -o /boot/grub/grub.cfg.
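As a sketch, the finished line in /etc/default/grub on an Intel system would look something like this (any other arguments you already have, such as quiet, stay put):

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"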

If you're using Goofiboot (Solus EFI), you'll be editing the /boot/efi/loader/entries/solus.conf file and appending the argument to the line that begins with options.

If you're using systemd-boot, you'll be doing something similar: editing /boot/loader/entries/entry.conf, where entry is the name of your installation's boot entry.
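For reference, a complete systemd-boot entry would look roughly like this; the title and the kernel/initrd paths here are placeholders for whatever your distribution installed, and the UUID is from my system:

title   Linux
linux   /vmlinuz-linux
initrd  /initramfs-linux.img
options root=UUID=96b7100c-6b49-4843-bf6c-6c8c78918a3a rw quiet intel_iommu=on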

For other bootloaders, have a look at the Arch Wiki's kernel parameters page. You should find what you need there.

Once done, you should have something like root=UUID=96b7100c-6b49-4843-bf6c-6c8c78918a3a intel_iommu=on in your kernel command line configuration.

Now reboot so the new command line takes effect, and we can check on the IOMMU.

To check that the IOMMU is enabled, you’ll need to execute the following command:

$ dmesg | grep DMAR
[    0.000000] ACPI: DMAR 0x0000000076A37E48 000078 (v01 INTEL  SKL      00000001 INTL 00000001)
[    0.000000] DMAR: IOMMU enabled
[    0.065046] DMAR: Host address width 39
[    0.065047] DMAR: DRHD base: 0x000000fed90000 flags: 0x1
[    0.065051] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap d2008c40660462 ecap f050da
[    0.065051] DMAR: RMRR base: 0x00000076779000 end: 0x00000076798fff
[    0.065052] DMAR-IR: IOAPIC id 2 under DRHD base  0xfed90000 IOMMU 0
[    0.065053] DMAR-IR: HPET id 0 under DRHD base 0xfed90000
[    0.065053] DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
[    0.065053] DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
[    0.066337] DMAR-IR: Enabled IRQ remapping in xapic mode
[    1.038203] DMAR: [Firmware Bug]: RMRR entry for device 03:00.0 is broken - applying workaround
[    1.038204] DMAR: No ATSR found
[    1.038537] DMAR: dmar0: Using Queued invalidation
[    1.038542] DMAR: Setting RMRR:
[    1.038560] DMAR: Setting identity map for device 0000:00:14.0 [0x76779000 - 0x76798fff]
[    1.038579] DMAR: Setting identity map for device 0000:03:00.0 [0x76779000 - 0x76798fff]
[    1.038583] DMAR: Prepare 0-16MiB unity mapping for LPC
[    1.038598] DMAR: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
[    1.038601] DMAR: Intel(R) Virtualization Technology for Directed I/O


You’re looking for the line that says DMAR: IOMMU enabled.
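Note that the DMAR lines above are Intel-specific. On an AMD system, the equivalent messages are prefixed with AMD-Vi, so check for those instead:

$ dmesg | grep AMD-Vi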

A look at your IOMMU groups


Now that you've got your IOMMU enabled, let's have a look at how the motherboard groups these devices by default. For this, this Arch Wiki page has a useful bash script that prints out, in a neat list, the IOMMU groups and the devices belonging to them.

#!/bin/bash
# For every device node under /sys/kernel/iommu_groups, print its
# group number followed by the lspci description of the device.
for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=${d#*/iommu_groups/*}; n=${n%%/*}   # extract the group number from the path
    printf 'IOMMU Group %s ' "$n"
    lspci -nns "${d##*/}"                 # look up the device by its PCI address
done

What we're looking for is the GPU and the other devices we're going to pass through. We need to make sure that there are no other devices in the same group as our passthrough devices. If there are, they'll either need to be passed through as well, or we're going to need the PCIe ACS override patch. More on compiling that into your kernel is coming soon.

As an example, this is what my IOMMU groups look like AFTER applying the patch. The devices I’m looking for are 02:00.0, 02:00.1 and 06:00.0.

IOMMU Group 0 00:00.0 Host bridge [0600]: Intel Corporation Device [8086:191f] (rev 07)
IOMMU Group 10 00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a145] (rev 31)
IOMMU Group 10 00:1f.2 Memory controller [0580]: Intel Corporation Device [8086:a121] (rev 31)
IOMMU Group 10 00:1f.3 Audio device [0403]: Intel Corporation Device [8086:a170] (rev 31)
IOMMU Group 10 00:1f.4 SMBus [0c05]: Intel Corporation Device [8086:a123] (rev 31)
IOMMU Group 11 00:1f.6 Ethernet controller [0200]: Intel Corporation Device [8086:15b8] (rev 31)
IOMMU Group 12 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices [AMD] nee ATI Device [1002:7300] (rev cb)
IOMMU Group 12 01:00.1 Audio device [0403]: Advanced Micro Devices [AMD] nee ATI Device [1002:aae8]
IOMMU Group 13 02:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:13c2] (rev a1)
IOMMU Group 13 02:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fbb] (rev a1)
IOMMU Group 14 03:00.0 USB controller [0c03]: ASMedia Technology Inc. Device [1b21:1242]
IOMMU Group 15 04:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge [1b21:1080] (rev 04)
IOMMU Group 16 06:00.0 USB controller [0c03]: Fresco Logic Device [1b73:1100] (rev 10)
IOMMU Group 1 00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 07)
IOMMU Group 2 00:14.0 USB controller [0c03]: Intel Corporation Device [8086:a12f] (rev 31)
IOMMU Group 3 00:16.0 Communication controller [0780]: Intel Corporation Device [8086:a13a] (rev 31)
IOMMU Group 4 00:17.0 SATA controller [0106]: Intel Corporation Device [8086:a102] (rev 31)
IOMMU Group 5 00:1b.0 PCI bridge [0604]: Intel Corporation Device [8086:a167] (rev f1)
IOMMU Group 6 00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:a110] (rev f1)
IOMMU Group 7 00:1c.2 PCI bridge [0604]: Intel Corporation Device [8086:a112] (rev f1)
IOMMU Group 8 00:1c.5 PCI bridge [0604]: Intel Corporation Device [8086:a115] (rev f1)
IOMMU Group 9 00:1d.0 PCI bridge [0604]: Intel Corporation Device [8086:a118] (rev f1)

If you have to apply the ACS patch, edit your kernel command line and add pcie_acs_override=downstream, then, if needed, regenerate your GRUB config with grub-mkconfig -o /boot/grub/grub.cfg.
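Keep in mind this argument only does anything on a kernel that's actually been built with the ACS override patch. With it in place, the command line from earlier would end up looking something like:

root=UUID=96b7100c-6b49-4843-bf6c-6c8c78918a3a intel_iommu=on pcie_acs_override=downstream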

VFIO and blacklisted modules


VFIO works with GPU drivers (or modules) in a peculiar way: it only works properly* if the drivers are completely blacklisted from the OS. So let's get started blacklisting modules. Most distributions use the /etc/modprobe.d/ directory to handle configuration and options for kernel modules; we're going to be creating and working with two files here.

Let's start by blacklisting the driver for the GPU you're going to pass through. If you're passing an Nvidia GPU, we're going to blacklist the nouveau module by editing /etc/modprobe.d/nouveau.conf to contain blacklist nouveau. If you're passing an AMD GPU, blacklist the fglrx and amdgpu modules in /etc/modprobe.d/amdgpu.conf, each on its own line, like so:

blacklist fglrx
blacklist amdgpu
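If you'd rather create these files straight from the terminal, a one-liner like this does the trick for the nouveau case:

$ echo "blacklist nouveau" | sudo tee /etc/modprobe.d/nouveau.conf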

Now that we've blacklisted the video drivers, we can bind the vfio-pci module to the devices. To do this, we need the device IDs of our devices. To find them, we issue lspci -nn, which shows (in square brackets) the vendor:device ID pair of each PCI device connected to the computer. Let's look at my output, for example.

$ lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation Device [8086:191f] (rev 07)
00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 07)
00:14.0 USB controller [0c03]: Intel Corporation Device [8086:a12f] (rev 31)
00:16.0 Communication controller [0780]: Intel Corporation Device [8086:a13a] (rev 31)
00:17.0 SATA controller [0106]: Intel Corporation Device [8086:a102] (rev 31)
00:1b.0 PCI bridge [0604]: Intel Corporation Device [8086:a167] (rev f1)
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:a110] (rev f1)
00:1c.2 PCI bridge [0604]: Intel Corporation Device [8086:a112] (rev f1)
00:1c.5 PCI bridge [0604]: Intel Corporation Device [8086:a115] (rev f1)
00:1d.0 PCI bridge [0604]: Intel Corporation Device [8086:a118] (rev f1)
00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a145] (rev 31)
00:1f.2 Memory controller [0580]: Intel Corporation Device [8086:a121] (rev 31)
00:1f.3 Audio device [0403]: Intel Corporation Device [8086:a170] (rev 31)
00:1f.4 SMBus [0c05]: Intel Corporation Device [8086:a123] (rev 31)
00:1f.6 Ethernet controller [0200]: Intel Corporation Device [8086:15b8] (rev 31)
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices [AMD] nee ATI Device [1002:7300] (rev cb)
01:00.1 Audio device [0403]: Advanced Micro Devices [AMD] nee ATI Device [1002:aae8]
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:13c2] (rev a1)
02:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fbb] (rev a1)
03:00.0 USB controller [0c03]: ASMedia Technology Inc. Device [1b21:1242]
04:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge [1b21:1080] (rev 04)
06:00.0 USB controller [0c03]: Fresco Logic Device [1b73:1100] (rev 10)

Now, the device IDs I'm looking for are those of my Nvidia GPU and the USB controller: the controller is 1b73:1100, and the GPU contributes both its video and audio functions, 10de:13c2 and 10de:0fbb. Don't forget the GPU's audio function; it has to come along too. Now that we've got the device IDs, we edit /etc/modprobe.d/vfio.conf and add the following line:

options vfio-pci ids=1b73:1100,10de:13c2,10de:0fbb

Obviously, you'll want to substitute your own device IDs, but you get the idea.

Now, to make sure these bind early at boot, we've got to edit our kernel command line once again. Add the following to it:

vfio-pci.ids=1b73:1100,10de:13c2,10de:0fbb

Now that that's done, we need to make sure all the VFIO modules are in the initrd (or initramfs). There are a few different ways to do this, depending on which tool your distribution uses to build it.

If your system uses dracut (to find out, issue ls /etc/dracut.conf; if it prints /etc/dracut.conf, you're using dracut), edit /etc/dracut.conf, make sure vfio, vfio_iommu_type1, vfio_pci and vfio_virqfd are in the add_drivers+= list, save, and rebuild the initramfs with sudo dracut -f.
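For reference, the finished line in /etc/dracut.conf would look something like the following. Note the spaces just inside the quotes; dracut needs them, since multiple += entries get concatenated together:

add_drivers+=" vfio vfio_iommu_type1 vfio_pci vfio_virqfd "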

If your system uses mkinitcpio (use ls /etc/mkinitcpio.conf to find out), add vfio, vfio_iommu_type1, vfio_pci and vfio_virqfd to the MODULES line and run sudo mkinitcpio -p yourkernelname to rebuild the initrd.
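As a sketch, the edited line in /etc/mkinitcpio.conf would look like this (on Arch the preset is typically just linux, so the rebuild becomes sudo mkinitcpio -p linux):

MODULES="vfio vfio_iommu_type1 vfio_pci vfio_virqfd"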

Now that we've got vfio configured, if you're using GRUB, the last thing to do is rebuild the GRUB config with grub-mkconfig -o /boot/grub/grub.cfg.
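After a final reboot, it's worth confirming that vfio-pci actually claimed your devices. Querying by device ID with lspci (my GPU's ID shown here; substitute your own), you should see something like:

$ lspci -nnk -d 10de:13c2
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:13c2] (rev a1)
        Kernel driver in use: vfio-pci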

*This isn't 100% true, but for the sake of this guide, let's assume it is. This assumption will make things easier going forward. I'll be going into how this becomes false at a later time, once I sort out the details.

Conclusion


If you followed this guide to completion, you've got a GPU that's ready to be passed through to a VM. The last bit is easy, but it's going to come in the next installment, because there's a lot of supplementary knowledge that will give you an in-depth understanding of Linux and KVM. Here's hoping this guide has been helpful to you.

This article was posted on the 28th of December, the one-year anniversary of my joining the forum. My first post was one discussing passthrough, and since then I've moved entirely away from Windows on bare metal and have truly embraced Linux and virtualization. It's only fitting that I celebrate one year with the forum by talking about virtualization.

I have been working on a few projects on the side, so get ready to see some more articles from me.

Feel free to check out smaller updates at my blog at blog.kebrx.tech.


I currently don't have a system that'll work very well with passthrough (4690k and 8GB of DDR3), but I am bookmarking this for when I have a capable system.

The 4690k supports an IOMMU. Just make sure your motherboard supports it, and consider 16GB of RAM.

Have a look here: intel ark 4690k, and find the VT-d entry under Advanced Technologies.


I know my hardware has support for IOMMU, and I have messed around with getting passthrough to work in the past, but I feel like I need at least a couple extra cores to get the most out of it. Plus, memory is a bigger issue. I kind of don't want to get any more memory for this system, because DDR3 is more expensive than it used to be (in some cases it costs more than DDR4 at this point), so my idea is to wait for RYZEN or something similar before I tackle passthrough.

I would recommend waiting to at least see what the options are. New AMD CPUs are usually a bit cheaper than their Intel counterparts, and if the benchmarks from the release stream are accurate, it's going to be damn good.

Yeah, that's because DDR3 is still a major thing in servers. Most servers are still using DDR3, because unless you've bought your hardware this year, that's what the board supported. People are also going hog wild building whiteboxes out of their old gaming hardware. I just turned my 3770k and 8350 rigs into Proxmox nodes and stuffed them to the brim with RAM.

EDIT: The same thing happened with DDR2 around 2012-2013


Yeah, I actually had the same problem with my older computer. I had a Q8300 on an ASUS P5P41D, and it would've cost me around £74 to get the same capacity (4x1GB sticks) that was only slightly faster than the current modules I had, and I couldn't find anything else. What I needed were 4x2GB sticks, but I couldn't find any; to be honest, the 1GB sticks were already super expensive, so imagine how much the 2GB ones would've been if I could have found them.

Funny thing is, I've actually salvaged some 2GB sticks of DDR2, but none of them work in the motherboard, which is a shame; I pulled them out of a 2-DIMM-slot HP OEM motherboard.


I've got a Q6600 with the same problem. It's got working RAM, but I actually decided to retire the CPU and put it in a display case, Tony Stark style.

I think I've got 4x2GB DDR2 (don't quote me on that). I'll check and might be willing to part with it.

I've pretty much retired the machine as well. I actually have two Q8300s, because my brother had one in his old Dell computer. I also have a load of dual-cores from the same era, mostly from old Dell/HP systems, but haven't bothered to display any.

Well, that CPU holds some fond memories. It was my first watercooling experience, back when Danger Den was a thing. I brought it from 2.4GHz up to 3.35GHz stable. It was my first foray into serious overclocking, and I ran that thing as my main system until 2013, when I bought the 3770k.


Good post, thank you.

For the record, PCI passthrough works on the ASUS M5A97 EVO R2.0 (it isn't listed in the Wikipedia article).


Good to know. I should add that you can always check your BIOS for VT-d or AMD-Vi options, but I figured the list was more or less exhaustive.

Woot. Waiting for the continuation. The Core i7 4770 I have supposedly supports IOMMU.

P.S. - Been trying to get it to pass through the Intel GPU for a while now. It was supposed to be working; then it got an update from Citrix, and it claims it is no longer supported. I guess they have a deal with Nvidia.

The iGPU isn't going to pass through easily.

The only way I know of is using Xen, and that's buggy. I'm using a 970 with no problems, and I'll be going into detail on that next week with the next installment. Wish I had better news on the iGPU situation.

That's a shame. The Intel GPU is definitely fast enough to handle media center duties. Oh well, I also have a GTX 760 with a UEFI BIOS. Hopefully that will pass through fine.

P.S. - the other option would be to make the host a media center PC.

The 760 should do just fine. Just be sure to have the host boot from the iGPU.

I'll do some digging on the iGPU to see if I can find anyone who's doing it currently on kernel 4.x.

Thanks, I would appreciate that. As far as I can see XenServer 7 is based on CentOS 7 + kernel 3.10 with patches, so that is probably one of the limiting factors.

Another quick question.

Would, for example, an Ubuntu 16.04-based system with a KVM/Xen-enabled kernel be acceptable, or do I really need to go with Arch/Manjaro for this?

I am not too keen on switching over to Arch/Manjaro.


It really depends on how much work you want to do.

Ubuntu would probably require some compiling from source, but it's really going to boil down to what I find out in my research.

If switching to Arch/Manjaro is not something you want to do, you don't have to do it.

Thanks. Will probably try it out this weekend.

What happens when I want to pass through a PCI-E Intel network adapter and the IOMMU group is the same for both devices?

IOMMU Group 1 01:00.0 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:105e] (rev 06)
IOMMU Group 1 01:00.1 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:105e] (rev 06)

edit: added it twice. Seems to be working. :) The Nvidia card does not fit the case, so I installed Manjaro Desktop. I am going to virtualize the rest of the stuff on top of that and use the base system for media. The passthrough of network devices was trivial. Just used the virt-manager GUI.

Yeah, all you've got to do is pass through both device addresses.
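For anyone scripting QEMU by hand instead of using virt-manager, passing both functions is just two -device entries pointing at the two addresses, along the lines of:

-device vfio-pci,host=01:00.0 -device vfio-pci,host=01:00.1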

Glad to hear it's working for you.

I've been quite ill this week, so I haven't been able to write. Hopefully that will happen tomorrow and I'll be able to get the article out.