Hi all. Here’s my life story.
I’m lost on how to move forward with my current setup. I’ve run VFIO GPU passthrough for a while now on two different systems, but one of them is giving me a lot of trouble.
System / What I’ve tried with the hardware
The host system is running an Intel Xeon E3-1280 V3, and I have tried both an MSI Z97 PC MATE and a Gigabyte Z87 D3HP as motherboards. I have replaced all of the memory. I have tried two video cards (R9 390X and Vega 56). The host is Debian 10 (upgraded from 9; the problem wasn’t caused by the upgrade) and the guest is Windows 10. Before any of these swaps, I replaced the power supply with a better unit that I removed from another working system.
I don’t imagine the CPU is the problem, as this was a working, stable setup for over a year. During that time I had one motherboard swap, which worked great (the previous board died mysteriously and just failed to boot one day), and the issues started more than 6 months after that swap. The host USB controller is disabled on the host and passed through to the virtual machine; this has worked since day 1 regardless of motherboard and didn’t seem to have any effect on the problem either.
The video card has been tested in the second PCIe x16 slot (electrically x4) and is now in the first slot (x16; the link width was configurable in the BIOS of the Gigabyte board). On the MSI board it ran at x8.
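For anyone who wants to confirm what link the card actually negotiated in a given slot, here is a minimal sketch. It assumes a Linux host that exposes the standard PCI link attributes in sysfs (`current_link_speed`, `current_link_width`, `max_link_speed`, `max_link_width`). The device address `0000:01:00.0` and the function name `read_link_status` are placeholders I made up for illustration, not values from my system.

```python
from pathlib import Path

# sysfs attributes that describe the negotiated and maximum PCIe link.
LINK_ATTRIBUTES = (
    "current_link_speed",
    "current_link_width",
    "max_link_speed",
    "max_link_width",
)


def read_link_status(device_dir):
    """Return the PCIe link attributes found in one sysfs device directory.

    Attributes that are missing or unreadable are reported as None, since
    not every PCI function exposes them.
    """
    device_dir = Path(device_dir)
    status = {}
    for name in LINK_ATTRIBUTES:
        try:
            status[name] = (device_dir / name).read_text().strip()
        except OSError:
            status[name] = None
    return status


if __name__ == "__main__":
    # Hypothetical address: substitute the one lspci reports for the GPU.
    gpu_dir = "/sys/bus/pci/devices/0000:01:00.0"
    for key, value in read_link_status(gpu_dir).items():
        print(f"{key}: {value}")
```

A current width below the maximum is not necessarily a fault, because the slot itself may be wired narrower than the card, as with the electrically x4 slot mentioned above.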
Finally, the host OS and the VM root disk image are on the same drive, a SanDisk SSD. This hasn’t been a problem before, and the disk is showing no signs of trouble. My next idea is to reinstall the host OS and import the old VM image. The VM’s secondary disk image exists to avoid problems with anti-cheat software failing on files stored over the network.
When it started
A few months ago, I began to have issues with artifacting within the guest. It would be brief, and you wouldn’t notice it if you weren’t directly watching the screen, except for the audible “pop” that came from the monitor’s speaker (audio over DP). This behavior led me to swap the power supply, as I suspected power delivery problems. That didn’t help. It happened more and more often until, one day, it finally ended in a hard reset with no trace.
It’s not heat related. I’ve watched all of the temperature sensors, and the system is in a cool room with plenty of circulation (4U rackmount case with intake fans on the front and exhaust at the rear). The system has also been dusted.
What I’ve tried with the software
I’ve tried an i440FX chipset. I’ve tried different CPU configurations and architectures, including kvm64 and variations of Intel Haswell and Broadwell era processors. I’m allocating only 75% of the host’s physical memory to the guest; at one point I tried 50%, which didn’t help.
As an aside, I do have numerous applications hosted on my NAS. There’s a 1 Gbit link between the NAS and this system. Critical software (drivers, etc.) isn’t installed there, only user applications (VLC, TS3, monitoring software, Steam library, and so on).
Log / Domain
This is the kernel log of a recently crashed instance, with the Gigabyte board and Vega 56. https://pastebin.com/zn2Y7m7G … I’m not seeing anything useful. Yes, I know about the data leak problems.
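Since I’m not spotting anything by eye, here is a minimal sketch of how a saved kernel log could be filtered for lines that might be relevant. The keyword groups (`amdgpu`, `vfio`, AER, DMAR/IOMMU, machine check) are my own assumptions about what could matter for a passthrough crash, not a complete or authoritative list, and `scan_log` is a hypothetical helper name.

```python
import re
import sys

# Assumed keyword groups; these are guesses, not an exhaustive list.
PATTERNS = {
    "amdgpu": re.compile(r"amdgpu", re.IGNORECASE),
    "vfio": re.compile(r"vfio", re.IGNORECASE),
    "pcie_aer": re.compile(r"\bAER\b|PCIe Bus Error", re.IGNORECASE),
    "iommu": re.compile(r"DMAR|IOMMU", re.IGNORECASE),
    "machine_check": re.compile(r"\bmce\b|machine check", re.IGNORECASE),
}


def scan_log(lines):
    """Group log lines by the keyword pattern(s) they match."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of a saved kernel log as the first argument.
    with open(sys.argv[1], errors="replace") as handle:
        for name, matched in scan_log(handle).items():
            print(f"== {name}: {len(matched)} line(s)")
            for entry in matched:
                print(f"   {entry}")
```

An empty result wouldn’t prove the hardware is fine; a hard reset can happen before the kernel gets a chance to log anything.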
This is the domain XML for the virtual machine. https://pastebin.com/bRH97UiU
Running FurMark and a game at the same time is a reliable way to crash it. The crash is consistent, and both the host and the guest are very slow and unstable after one of these shutdowns. It only ever seems to happen under GPU load.
Thank you to those who read this wall of text. I’ve been fumbling with this for quite some time, and now the system is barely usable for more than web browsing. In short, I’m desperate.