VFIO passthrough guest takes the host with it

Hi all. Here’s my life story.

I’m lost on how to continue with this setup. I’ve had VFIO GPU passthrough running for a while now on two different systems, but one of them is giving me a lot of trouble.

System / What I’ve tried with the hardware

The host CPU is an Intel Xeon E3-1280 v3, and I have tried both an MSI Z97 PC MATE and a Gigabyte Z87 D3HP as motherboards. I have replaced all of the memory and have swapped in a different video card (I’ve used both an R9 390X and a Vega 56). The host is Debian 10 (upgraded from 9; the problems did not start with the upgrade) and the guest is Windows 10. Before all of this, I swapped the power supply for a better unit pulled from another working system.

I don’t suspect the CPU, as this was a stable setup for over a year. During that year there was one motherboard swap that went fine (the previous board died mysteriously, simply failing to boot one day), and these issues only started more than six months after that. One of the USB controllers is disabled on the host and passed into the virtual machine; that didn’t seem to have any effect either and has worked since day one regardless of motherboard.

The video card has been tested in the second PCIe x16 slot (electrically x4) and is now in the first slot, running at x16 (the lane configuration was adjustable in the Gigabyte board’s BIOS). On the MSI board it ran at x8.

Finally, the host OS disk and the VM’s root disk image are on the same drive, a SanDisk SSD. That hasn’t seemed like a problem before, and the disk shows no signs of trouble. My next idea is to reinstall the host OS and import the old VM image. The secondary disk image exists to avoid problems with anti-cheat software that fails when games are run from network storage.

When it started

A few months ago, I began to have issues with artifacting in the guest. It would be brief, and you wouldn’t notice it unless you were watching the screen directly, apart from the audible “pop” from the monitor’s speaker (audio over DisplayPort). This behavior is what led me to swap the power supply, as I suspected power delivery problems. It didn’t help. It happened more and more often, until one day the machine finally hard-reset with no trace of the cause.

It’s not heat related. I’ve watched all of the temperature sensors, and the machine sits in a cool room with plenty of airflow (4U rackmount case with intake fans in the front and exhaust in the rear). The system has also been dusted.

What I’ve tried with the software

I’ve tried the i440FX chipset. I’ve tried different CPU configurations and models, including kvm64 and variations of the Intel Haswell- and Broadwell-era models. I’m allocating only 75% of the host’s physical memory to the guest; at one point I tried 50%, which didn’t help.
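For reference, this is roughly how I’ve been flipping those knobs between tests. The domain name and memory figure below are just placeholders (I normally edit the XML by hand), so treat this as a sketch rather than my exact commands.

# “win10-gpu” is a placeholder domain name; 12 GiB is only an example value
# See what CPU model/topology and memory the domain currently has
virsh dumpxml win10-gpu | grep -A6 '<cpu'
virsh dumpxml win10-gpu | grep -E '<memory|<currentMemory'
# Swap the exposed CPU model between tests (kvm64, Haswell-noTSX, Broadwell-noTSX, ...)
virt-xml win10-gpu --edit --cpu Haswell-noTSX
# Cap the guest at roughly 75% of host RAM (virsh takes KiB by default)
virsh setmaxmem win10-gpu 12582912 --config
virsh setmem win10-gpu 12582912 --config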

As an aside, I do host numerous applications on my NAS, with a 1 Gbit link between the NAS and this system. Critical software (drivers, etc.) isn’t installed there, only user applications (VLC, TS3, monitoring software, the Steam library, and so on).

Log / Domain

This is the kernel log of a recently crashed instance, with the Gigabyte board and Vega 56. https://pastebin.com/zn2Y7m7G … I’m not seeing anything useful. Yes, I know about the data leak problems.

This is the domain XML for the virtual machine. https://pastebin.com/bRH97UiU

SOS

Running FurMark and a game at the same time is a reliable way to crash it. It’s consistent, and both the host and the guest are very slow and unstable after one of these shutdowns. It only ever seems to happen under GPU load.

Thank you to those who read this wall of text. I’ve been fumbling with this for quite some time and the machine is now barely usable for more than web browsing. In short, I’m desperate. :smiley:

The only time I’ve experienced an issue resembling this was when I tried the ACS override to pass through a USB controller for hot swapping. It sounds like you’re using it as well. I recommend going without it and seeing if that helps. A lot of people recommend against it because it can cause stability problems, and I would bet this is the root of your issue. Hot swapping is nice but definitely not worth problems like this.

I removed all passed-through PCI devices other than the video card and its HDMI audio function and allowed the host to load their drivers again. For the keyboard and mouse I switched to USB redirection instead. Unfortunately this hasn’t changed the situation and I still crash just as often.

Thank you for your reply.

edit: also realized I posted this in Linux and not in VFIO. Oops!

but did you actually turn the ACS override option off in the kernel flags?

EDIT2: The short of this is “I never had ACS override enabled.”

Apologies, I should have specified. I realize now that I may never have applied the ACS override patch for this kernel. Based on the boot flags in the log file I posted last night,
Command line: BOOT_IMAGE=/vmlinuz-4.19.0-6-amd64 root=/dev/mapper/SYSTEM--vg-root ro quiet intel_iommu=on intremap=no_x2apic_optout
(side note: intremap has since been removed, with no change; I don’t think it was making any difference, since “Enabled IRQ remapping in x2apic mode” appeared in both logs)
I never had the ACS override parameter specified (and yes, I ran update-grub).
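For completeness, this is the kind of check I did to convince myself of that (standard Debian locations, nothing exotic):

# Is the override on the running kernel's command line?
grep -o 'pcie_acs_override=[^ ]*' /proc/cmdline || echo "no ACS override on the running cmdline"
# Is anything queued up in the GRUB defaults for the next reboot?
grep GRUB_CMDLINE_LINUX /etc/default/grub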


There’s no ACS override patch … but my logs also say this:

Jan 24 00:32:47 SYSTEM kernel: [ 0.228178] pci 0000:00:1c.0: Intel PCH root port ACS workaround enabled
Jan 24 00:32:47 SYSTEM kernel: [ 0.228443] pci 0000:00:1c.4: Intel PCH root port ACS workaround enabled

which leads me to believe that I do somehow.

I don’t recall ever applying the ACS override patch. I installed Debian 9, had no problems for a while, upgraded the distro, and continued to have no problems on Debian 10. I don’t really use this system for anything other than a Docker daemon on the host, so my changes to it have been minimal beyond that.

I’ll give you that my memory isn’t 100% accurate. I don’t think the ACS parameter accepts an “off” but I’m not aware of how I might have it enabled.
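Either way, the grouping itself can be read straight out of sysfs instead of being inferred from boot messages; something along these lines (assuming the usual sysfs layout) shows whether the GPU shares an IOMMU group with anything else:

# List every IOMMU group and the devices inside it
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo "    $(lspci -nns "${d##*/}")"
    done
done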

EDIT: swapped back to the Z97 PC MATE (the more stable of the two) and the ACS workaround message is gone.

Just wanted to post an update on the situation. (tl;dr at the bottom; this turned into a saga for me, spanning multiple weekends.)

Through all of this, I eventually decided to run on the original hardware configuration: Z97 PC MATE and R9 390X.

I reinstalled my host OS onto a different disk, to no avail (I’ve had to RMA two disks of the model I was using within the past three months, so it seemed worth trying). I completely rebuilt the hypervisor, still with no end in sight.

… it was while hooking up the hardware that I made a side comment to myself about how old the PSU cables to the graphics card are: seven years old. You can see where I’m going. To test the theory, I underclocked the snot out of my GPU in a final attempt to see if power delivery was the problem. I did this through the Radeon software, haphazardly dropping the sliders as low as they would go (-40% frequency, -50% power limit).

I had settled on loading a particular game (Farm Together) as the test, since it would consistently crash while loading the multiplayer menu or joining a large multiplayer game. So far, I’ve had no crashes despite running at max settings (a challenge for an R9 390X at -50% power).

I’ve spent more time on this than I’d like to admit and I realize now that I am incredibly lucky: how I haven’t fried any hardware by now, I don’t know!

I’ve ordered replacement power cables. Hopefully this is the end of my dilemma.

tl;dr: The power supply cables for my video card were SEVEN years old. I checked the pin mapping some time ago and it matched, but it looks like all of this trouble came about because the cables were finally failing.

Only 7 years old? I’m still using my ~10 year old Corsair 850 over here.