Skylake GPU passthrough issues

Hey all,

I'm having some major issues on Arch Linux trying to get PCIe passthrough working properly with my R9 380.

I've got a 6700K on an ASUS Z170-A board. I've got the IOMMU enabled both in UEFI and in the boot parameters, and intel_iommu=igfx_off is also set, but when I add the vfio modules to the initcpio, the host will no longer launch LightDM. I'm using Cinnamon as my DE, but startx fails as well, so I think it's environment-agnostic. When I change my mkinitcpio.conf MODULES definition from "vfio vfio_iommu_type1 vfio_pci vfio_virqfd i915" to just "i915", effectively eliminating the VFIO entries, it boots just fine. I guess what this means is that the Intel graphics device is being grabbed by VFIO, although I don't know enough to confirm this.
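
For reference, the boot parameters I'm describing are set like this (assuming GRUB; the exact file depends on your bootloader, and the rest of the line is omitted):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... intel_iommu=on,igfx_off"

followed by grub-mkconfig -o /boot/grub/grub.cfg and a reboot.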

I'm going to try to pass through the 380 without using VFIO and see what happens. I'll update with my results.

Any input would be helpful. Thanks!

@blanger may or may not be able to help you out. He's pretty smart with this stuff.

Sweet. I used to think I was good with linux, up until I tried to get this working. It's been one heck of an ego-check.

So I started the Windows 10 install (through the SPICE console in virt-manager) and it was going smoothly. I got back from dinner and the screen was off and it wasn't responding to input, so I'm pretty sure it crashed. A quick hard stop and now it can't find /dev/sda. I'm thinking my backplane has gone bad. Hopefully I can get the system to finish booting soon.

I feel like there's a relevant XKCD for this...

I think @Quixotic_Autocrat is really good (much better than myself) at this stuff. You ought to ask him and see if he can provide some insight.

What you're trying to do is difficult because the UEFI has an active switch. You may have to have it initialize the iGPU first, before the 380.

I may have gotten it. I disabled VFIO entirely, used the virt-manager gui to pass the gpu and HDMI audio controller through to the VM, set up the OVMF UEFI and I've got it working, on a very basic level. I haven't gotten it past the UEFI yet because I need to figure out how best to pass through a USB host for a keyboard.


I have a Z170 Pro Gaming board, running Arch Linux with kernel 4.4rcX (I'm following the releases). Make sure your BIOS is set to use the onboard graphics; that should force the Intel driver, but I would just add i915 to the beginning of your modules, followed by the vfio modules. I'm passing through a GTX 780 Ti, but as long as you properly set the IDs for the GPU in /etc/modprobe.d/vfio.conf it should grab it after the Intel driver loads. I also blacklisted nouveau.
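
If it helps, the relevant bits look something like this (the IDs below are examples for a 780 Ti and its HDMI audio; use whatever lspci -nn reports for your card):

# /etc/modprobe.d/vfio.conf - devices vfio-pci should claim, as vendor:device IDs
options vfio-pci ids=10de:100a,10de:0e1a

# /etc/modprobe.d/blacklist.conf - keep nouveau away from the card
blacklist nouveau

# /etc/mkinitcpio.conf - Intel driver first, then the vfio modules
MODULES="i915 vfio vfio_iommu_type1 vfio_pci vfio_virqfd"

Then rebuild the initramfs (mkinitcpio -p linux) and reboot.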

Most likely virt-manager is grabbing them just fine because you've blacklisted radeon? But glad it's working at least! Definitely get UEFI running. I use a DisplayPort USB KVM switch for my guest (a Windows VM for some casual gaming).

Since I had the i915 listed at the end of the string, it was probably loading it after the catch-all VFIO.

I'm happy to announce that I've passed through a spare keyboard and mouse and I'm installing Windows 10 again (over the 380) so it will be configured for UEFI. I'll post results as soon as I've got it up and running. I guess I just needed to work at it a bit more.
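
In case it helps anyone else stuck on the keyboard/mouse part, on the QEMU command line this sort of thing looks like the following (the vendor/product IDs here are placeholders; lsusb gives you the real ones, and in virt-manager it's just Add Hardware > USB Host Device):

-device nec-usb-xhci,id=xhci \
-device usb-host,bus=xhci.0,vendorid=0x046d,productid=0xc31c \
-device usb-host,bus=xhci.0,vendorid=0x046d,productid=0xc52b \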

I think I'll use synergy for now, but I'm debating getting a KVM switch. (or maybe I'll make one myself. Could be a fun project)

I have 3 monitors so I'm also using synergy to go back to controlling the host if needed. :)

Alright, so I've got Windows installed, but it's not quite working/stable. I rebooted after setting up the username/password and whatnot, and the whole host system locked up. It's odd, because the physical reset button on the case does nothing. It's also a bit concerning. I had a VM issue earlier where the CPUs it was assigned to were pinned at 100% with no graphics output, so I switched the initcpio modules back to include all the VFIO components, thinking that would help.

I got the Intel graphics working by putting the i915 entry at the beginning of the initcpio modules definition. That said, the whole system locked up, so there's obviously something going on. This wasn't a lockup like the SNA issue with Intel graphics on some systems; video, audio, input, and the hardware buttons all failed. I had to flip the switch on the PSU.

I've got my Z170 UEFI set to use the iGPU first, so that's sorted, and it seems to boot up properly again. I get the feeling something's wrong with my SATA devices. I've got a spare SATA controller lying around from an old NAS and I'm going to use that for some testing to see whether it's the controller or the drives. (They're a couple of years old, but the Z77 board I had prior to this rig wasn't giving errors.)

I booted it back up and it locked up again, same situation. I also had an issue with the 380 in the VM where it was displaying what looked like 16-color depth at 800x600 for the entire session. I tried to manually install the drivers for the 380 and the system locked up at the stage where the screen would normally flash (don't know what else to call it) during the installation of the driver itself. The whole system locked up; this time I had audio playing in the background and it played back about 1.5 seconds of audio over and over until I flipped the PSU switch.

I wish I could help you, but what you're doing is several levels above the way I got passthrough to work using pci-stub, not to mention I know nothing about Arch. I'm going to assume you looked at the thread the guys made a few months ago about a Skylake build doing passthrough? I'll post the link just in case you haven't.

https://forum.teksyndicate.com/t/gta-v-on-linux-skylake-build-hardware-vm-passthrough-tek-syndicate/87440

I'm sure he's busy but maybe @wendell will see this thread and post a reply.
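
In case it's useful, the pci-stub route I mentioned is basically just a kernel parameter that claims the card at boot before any real driver can, something like this (the IDs are examples; lspci -nn gives you the right ones for your 380 and its HDMI audio):

pci-stub.ids=1002:6939,1002:aad8

If pci-stub is built as a module on your kernel rather than built in, it also needs to be loaded early (e.g. from the initramfs) for that to work.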

I appreciate the help. Honestly, the article was what made me want to get into it. I read a bit about it shortly after I upgraded my system in 2013, but my 3770K didn't have VT-d, so I couldn't do it at the time. I got myself a nice little Christmas gift and thought I'd give it a go this year.

So it's been about three reformats on my Arch install now and I think I've just about got it. After installing unRAID and digging around quite a bit, I found a lot of settings that seem to make it "just work." I'm doing some work on getting the system more stable, and when I have it stable, I'll probably post an article in the tutorials section, because this is more difficult than it should be. I feel like it's something that can be easy if you understand how everything works. I'll probably need to put up a good few hours of videos to explain it.

Also, I've found that AMD cards are an order of magnitude easier to get working (Nvidia drivers are to blame here). Nvidia likes to throw "Error 43," which is pretty much "hey, we've noticed you're passing this through to a VM. We don't like that, for whatever reason." So this fits well with the inbox video that just went up today talking about how to support AMD.

Forgive me for rambling, I'm just happy to post a positive update.

LOL... that sounds about right. Honestly, I went through a lot of grief too; between hardware that was supposed to work and being very new to Linux, it was a challenge. In the end it was worth the effort and I've learned a lot. I built a lot of test KVMs working up to my current configuration; I'm on my 7th, and I'm getting the parts together to add a USB sound card to solve the audio latency problem I have, along with a few other tweaks I want to make.

Yeah, Nvidia is a pain. It is doable, but I just switched to a 270X, which worked out great until Fallout 4 came out, and I've been struggling with driver issues ever since. Every other game runs well, but FO4 has been a pain. Still, I've logged a few hundred hours playing it, and the latest driver update last week seems to have solved most of the problems I've had.

Good luck, and do post an article when you're finished; it will greatly help others. And I would be interested in hearing your thoughts on running the system after you've been using it for a while.

So, I've got some progress: I've got my Intel graphics working again, and I've got the GPU passed through to a VM. The issue now is that the display that's been passed through has major color issues. I'm not sure how to describe it, but it's almost as if it's gone into some failover mode. I'm going to try to pass through the 660 and see what happens there.

Any thoughts there @blanger?

Did you load the drivers for the card? I've had some pretty weird-looking graphics with the Windows default VGA drivers, but I've also had problems with the last couple of AMD drivers (until the update last Sunday). When mine is acting weird I'll get artifacts and color issues, but after rebooting the KVM (do a full shutdown, not a reboot of Windows) the problems usually went away or changed without me changing the drivers, if that makes sense.

One thing you want to watch is rebooting the guest system (rebooting Windows). When you do that, for a short time (seconds) the hardware being passed through via QEMU/virt-manager is unbound from the guest and rebound to the host, then unbound from the host and rebound to the guest. This happens really fast and it can cause problems in both systems. It's recommended that, instead of rebooting Windows running in a KVM, you shut it down and wait a minute or so before restarting the guest, so the host system has time to digest what is happening.

I've gotten into the habit of just choosing shut down in Windows, and when the screen goes black, killing the guest in the virt-manager console, waiting a minute or so, then restarting the KVM (guest) in virt-manager. This seems to cause the least problems for both systems (host and guest).
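
If you wanted to script that routine, it would be roughly this (assuming the guest domain is named win10 in virt-manager; swap in your own name):

virsh shutdown win10     # ask Windows to shut down cleanly
sleep 60                 # give the host a minute to reclaim the passed-through hardware
virsh start win10        # then bring the guest back up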

One question I have is: are you using a separate monitor for your KVM, and if so, what type of connection are you using, i.e. VGA, DP, HDMI, or DVI? My cards and my monitors seem to work best using the DVI interface; HDMI is flaky on my system. DP does work, but I only have one monitor that supports it and it's being used on the host system. (I run 4 monitors... lol)

My monitors are a mishmash of connection types. I've got one connection in the Windows VM that's running DVI. I can switch to either DP or HDMI as well.

I've actually gotten this weird configuration working using unRAID. It seems to be working nicely, but there are a few things I'm not too happy with. I'm now running Arch in a VM and Windows in a VM as well. My GTX 660 is passed through to Arch and the R9 380 is passed into Windows, and it seems to be really stable.

The main problem I was having with the old system was that the GTX 660 and R9 380 were sharing an IOMMU group. I was having a hard time figuring out how to pass through the 380 while keeping the 660 as a host device. Maybe that's a job for pci-stub, but I'm finding pci-stub to be relatively poorly documented. I'll keep working at getting it to actually work within Arch, but I really need to get some work done over the next few days, so I'll be using the unRAID setup I've got. I'll report back and let y'all know how things work out. (By the way, unRAID took about 15 minutes to set up to the point where I had GPU passthrough to a Windows VM.)
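
For anyone checking their own grouping, a quick loop over /sys/kernel/iommu_groups shows which devices are lumped together (this is essentially the snippet floating around the Arch wiki):

#!/bin/bash
# print every IOMMU group and the devices inside it
shopt -s nullglob
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU Group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo -e "\t$(lspci -nns "${d##*/}")"
    done
done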

I'm going to have to give this all another go in a few days' time. It's just getting to be too many days without committing a change to projects at work.

EDIT: I suppose that it's just my luck that the minute I report stability, linux starts throwing segfaults everywhere. Back to the drawing board.

EDIT2: Here's me ignoring a large portion of your response. Sorry about that.

The issue is when I start up the VM. If I go into the OVMF "configuration," if you want to call it that, the graphical errors disappear 100% of the time. If I don't, they stay. Also, I've installed GPU drivers in the VM now and it won't get past the Windows loading screen; it just goes black after that. Although the GPU did crash in the middle of installing drivers, so that may have contributed. (I monitored the CPU usage of QEMU and it kept fluctuating and writing to disk for quite some time after the GPU quit, so I left it for about 40 minutes, then echo q | nc localhost 7100.)
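
(For that last bit to work, the QEMU monitor has to be listening on that port; on a hand-rolled command that's something like

-monitor tcp:127.0.0.1:7100,server,nowait \

so echo q | nc localhost 7100 sends the quit command to the monitor.)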

Since I'm having some major issues with the stability of unRAID, I'm going back to an Arch host. It seems to be working alright, but I'd really like to get my GTX 660 working on the host. For now, I'm going to be working on getting the R9 380 working in the guest. It seems like QEMU (or OVMF?) isn't initializing the device properly. I'm getting green and pink pixels when I start it, and it goes to a really disgusting green background once it starts booting Windows. Everything is in 800x600 resolution (on a 1080p TV).

I'm thinking I may have a bad version of OVMF, or my configuration is bad.

As a note: I got my OVMF code from here. This appears to be a nightly build, so that could definitely be the cause. I'm going to look for a more stable variant.

Here's what I've put for my configuration:

qemu-system-x86_64 \
-serial none \
-parallel none \
-nodefconfig \
-enable-kvm \
-name Windows \
-cpu host,check \
-smp sockets=1,cores=2,threads=2 \
-m 8192 \
-device ich9-usb-uhci3,id=uhci \
-device usb-ehci,id=ehci \
-device nec-usb-xhci,id=xhci \
-rtc base=localtime \
-vga none \
-net bridge,br=bridge0 \
-net nic \
-device ioh3420,bus=pci.0,addr=1c.0,multifunction=on,port=2,chassis=1,id=root.1 \
-device vfio-pci,host=02:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on \
-device vfio-pci,host=02:00.1,bus=root.1,addr=00.1 \
-drive if=pflash,format=raw,readonly,file=/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd \
-drive if=pflash,format=raw,file=/home/sgt/my_vars.fd \
-drive file=/home/sgt/windows10-test.raw,format=raw \
-cdrom /home/sgt/Downloads/virtio-win-0.1.112.iso \
-boot order=d

I'm still puzzling over this, but the minute I get some progress, I'll update.

I've found that one piece that is immensely helpful in increasing the stability of a system is this:

-device ioh3420,bus=pci.0,addr=1c.0,multifunction=on,port=2,chassis=1,id=root.1

This is an emulated PCIe root port, and it lets you attach the PCI devices you're passing through to a proper PCIe slot in the guest machine. Otherwise the GPU ends up hanging off the emulated legacy PCI bus, which Windows doesn't like, and Nvidia REALLY doesn't like. This, along with hiding the hypervisor from the guest, can get you a long way toward getting Nvidia drivers working under a Windows VM.
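
As for hiding the hypervisor: as far as I can tell, the usual way to do it is the kvm=off switch on the CPU definition, i.e. something like

-cpu host,kvm=off

in place of the plain -cpu host,check line above, so the Nvidia driver doesn't see the KVM signature and bail out with Error 43. Treat that as a sketch rather than gospel.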

EDIT: Now I've found that when the Windows VM doesn't have graphical issues, the Linux host does. The odd part is that the host is using the Intel iGPU, so it's not a power issue (850 W PSU anyway, so it should be able to handle two mid-range GPUs and three hard drives). I'm starting to think it's an issue with the way memory addresses are handled by the IOMMU, because it looks like the iGPU memory gets corrupted, or there's some sort of overflow or something, causing the video to be messed up.

Heh, funny, I upgraded from a 3770k which is now in my server for almost the exact same configuration as you. I SOMETIMES get that same kind of lockup where I can't even use the reset button, but I was honestly attributing that to my overclock. I was running 4.7 for a while, sometimes it would be stable for hours, sometimes not, so I dialed it back to 4.6 and it's been more stable but I still sometimes have that hard freeze. I can hold the power button and it will shut off.

Honestly, I'm sorry to hear you're having so many issues! I found the initial learning curve a little steep, nothing crazy, but luckily I was on vacation or I would have just been tinkering for a few hours each night. After a day and a bit I had Arch installed using Btrfs RAID 0 on my SSDs (sweet!), QEMU and KVM running with virt-manager, OVMF-git built (although it built bin files, and since virt-manager uses the QEMU config I needed the built fd files; I got them from the same link you posted), passthrough of my 780 Ti working, and Windows 10 installed. I was golden and ready to go.

I even had issues with updating my BIOS that drove me nuts and I ended up reinstalling Arch a few times before I figured out what the heck was going on with that (hint: secure boot can suck it when I don't know how to sign anything myself for it to work). I think the Z170 boards still have some kinks to work out. I almost regret not going X99 but overall my experience has been positive, it's just some bugs for being on the bleeding edge of everything at all times.

Yeah, I'm getting this lockup at 100% stock speeds, not even using XMP this week, to test for that. I'm thinking there may be some sort of problem in the motherboard's firmware rather than it being a hardware issue. Or it could be Intel's out-of-band management causing problems. (NSA no like us using Linux, #tinfoilhat)

luckily I was on vacation or I would have just been tinkering for a few hours each night

I'm pretty much working on linux from 8 to 5ish on my vacation. Not really vacation to most, but to me it's just fine.

I'm not using UEFI on my host OS. I would recommend you don't either if you can help it; it complicates things tenfold. The Z170 board should let you sign the binaries from within the warning menu somehow. I remember doing that for the Antergos installer when I was messing around.

I can't get my damn BIOS to connect over the onboard NIC to their website to update. I think I'm going to have to find a way to download them and prep a USB drive. :/

I think the Z170 boards still have some kinks to work out. I almost regret not going X99 but overall my experience has been positive, it's just some bugs for being on the bleeding edge of everything at all times.

I sort of agree, though I wouldn't think for a second about X99. I'm thinking I'll just pull out one of my 8320 boards and run the 380 on that, headless, so it can simply be a stream box for my Linux machine, because that's really all I need it for. So glad my 380 only takes one 8-pin rather than the 2x 6-pin they had for a while.

Anyways, thanks for reassuring me that it's not just my board, although I'm not sure this alternative is better. Maybe ASUS can fix it over the air, though. I'm heading out to a New Year's event. I'll check in at some point tomorrow if I'm able to function.

Our boards are incredibly similar. I haven't tried doing an update from within the BIOS, but I know it's there. I also know the network stack in the BIOS is disabled by default; you may have to turn that on. But I did my updates from a USB drive.

It's all good, I'm running UEFI now, just with secure boot turned off. Whenever I would switch from Microsoft Windows (AKA UEFI) to Other OS it would just default back. Not sure why.

I went through the whole PCI passthrough ordeal about a month ago. With @blanger helping me, to boot. Here's my thread: https://forum.teksyndicate.com/t/pci-passthrough-qemu-kvm-virtual-machine-help/91597/54 Maybe you can glean some information from what I did.

Also, I found these links immensely helpful for my situation:



https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#vfio-pci

My passthrough of a GTX 970 and a USB PCIe card to a Windows 8.1 guest has been running smoothly for almost a month now. Once you get it running it is very satisfying. I am playing all my games with native performance in a VM.
