Windows VM crashes due to VFIO GPU "surprise disconnect" at 11 minute uptime

I’m running into the most perplexing, amusing-if-it-wasn’t-so-infuriating issue with GPU passthrough on KVM, and I’m running out of ideas as to what else to try.

The setup: Gigabyte B550, Ryzen 5750GE, 2x32GB Kingston ECC RAM, IronWolf NVMe for system, Samsung 970 Pro NVMe & NVidia A4000 passed through to a VM. All this in a Monsterlabo The First passive case.

The system runs Debian 11 Bullseye. There are 4x Samsung QVO 4TB drives for ZFS storage, but the Windows VM runs entirely off its own passed-through NVMe SSD. The host is a headless server; it runs a couple of other services and VMs and doubles as a NAS, and none of those have any issues. The Windows VM, on the other hand…

The Windows VM runs Windows 10 Home, installed onto the NVMe that is passed through along with the GPU. It currently runs perfectly fine on 4 cores 4 threads and 8GB of memory, though the sizing seems irrelevant to the issue.

Then, after precisely 11 minutes of uptime, Windows crashes and reboots. Consistently and reproducibly at the exact same moment, 0:11:00 after booting. From the event log and the minidump it becomes apparent that a 0x00000113 VIDEO_DXGKRNL_FATAL_ERROR happened due to the “sudden unexpected disconnection” of the graphics card. In fact, one can hear the Windows “device disconnected” sound juust before the VM freezes and reboots.

I can confirm that this worked before, and have no idea why or when this started happening.

This is also not a hardware or Windows issue, as I can boot straight into Windows on the NVMe directly and the issue never presents. Hence this has to be specifically related to the passthrough and KVM/Debian somehow. I will share the VM XML below, but please note that I have already massaged the XML, turning features on and off and playing with the CPU, pinning, cache, and hostdev config, all to no avail. Debian bundles QEMU 5.2, which is not very new, but this should not be an issue (also, this has worked before).

If anyone has an idea what else I should try or where to look, feel free to throw it in there.

Libvirt XML:

<domain type="kvm">
  <name>windows</name>
  <uuid>46f8d73b-...</uuid>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://microsoft.com/win/10"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit="KiB">8388608</memory>
  <currentMemory unit="KiB">8388608</currentMemory>
  <vcpu placement="static">8</vcpu>
  <iothreads>1</iothreads>
  <cputune>
    <vcpupin vcpu="0" cpuset="4"/>
    <vcpupin vcpu="1" cpuset="5"/>
    <vcpupin vcpu="2" cpuset="6"/>
    <vcpupin vcpu="3" cpuset="7"/>
    <vcpupin vcpu="4" cpuset="12"/>
    <vcpupin vcpu="5" cpuset="13"/>
    <vcpupin vcpu="6" cpuset="14"/>
    <vcpupin vcpu="7" cpuset="15"/>
    <emulatorpin cpuset="3,11"/>
    <iothreadpin iothread="1" cpuset="3,11"/>
    <vcpusched vcpus="0" scheduler="rr" priority="1"/>
    <vcpusched vcpus="1" scheduler="rr" priority="1"/>
    <vcpusched vcpus="2" scheduler="rr" priority="1"/>
    <vcpusched vcpus="3" scheduler="rr" priority="1"/>
    <vcpusched vcpus="4" scheduler="rr" priority="1"/>
    <vcpusched vcpus="5" scheduler="rr" priority="1"/>
    <vcpusched vcpus="6" scheduler="rr" priority="1"/>
    <vcpusched vcpus="7" scheduler="rr" priority="1"/>
    <iothreadsched iothreads="1" scheduler="fifo" priority="98"/>
  </cputune>
  <os firmware="efi">
    <type arch="x86_64" machine="pc-q35-5.2">hvm</type>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state="on"/>
      <vapic state="on"/>
      <spinlocks state="on" retries="8191"/>
      <vpindex state="on"/>
      <synic state="on"/>
      <stimer state="on"/>
      <reset state="on"/>
      <vendor_id state="on" value="flakicloud00"/>
      <frequencies state="on"/>
    </hyperv>
    <kvm>
      <hidden state="off"/>
    </kvm>
    <vmport state="off"/>
    <ioapic driver="kvm"/>
  </features>
  <cpu mode="host-passthrough" check="none" migratable="on">
    <topology sockets="1" dies="1" cores="4" threads="2"/>
    <cache mode="passthrough"/>
  </cpu>
  <clock offset="localtime">
    <timer name="rtc" tickpolicy="catchup"/>
    <timer name="pit" tickpolicy="delay"/>
    <timer name="hpet" present="yes"/>
    <timer name="hypervclock" present="yes"/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled="yes"/>
    <suspend-to-disk enabled="no"/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <controller type="usb" index="0" model="qemu-xhci" ports="15">
      <address type="pci" domain="0x0000" bus="0x02" slot="0x00" function="0x0"/>
    </controller>
    <controller type="pci" index="0" model="pcie-root"/>
    <controller type="pci" index="1" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="1" port="0x10"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0" multifunction="on"/>
    </controller>
    <controller type="pci" index="2" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="2" port="0x11"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x1"/>
    </controller>
    <controller type="pci" index="3" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="3" port="0x12"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x2"/>
    </controller>
    <controller type="pci" index="4" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="4" port="0x13"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x3"/>
    </controller>
    <controller type="pci" index="5" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="5" port="0x14"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x4"/>
    </controller>
    <controller type="pci" index="6" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="6" port="0x15"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x5"/>
    </controller>
    <controller type="pci" index="7" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="7" port="0x16"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x6"/>
    </controller>
    <controller type="pci" index="8" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="8" port="0x17"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x7"/>
    </controller>
    <controller type="pci" index="9" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="9" port="0x18"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x0" multifunction="on"/>
    </controller>
    <controller type="pci" index="10" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="10" port="0x19"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x1"/>
    </controller>
    <controller type="pci" index="11" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="11" port="0x1a"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x2"/>
    </controller>
    <controller type="pci" index="12" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="12" port="0x1b"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x3"/>
    </controller>
    <controller type="pci" index="13" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="13" port="0x1c"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x4"/>
    </controller>
    <controller type="pci" index="14" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="14" port="0x1d"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x5"/>
    </controller>
    <controller type="pci" index="15" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="15" port="0x8"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x0"/>
    </controller>
    <controller type="pci" index="16" model="pcie-to-pci-bridge">
      <model name="pcie-pci-bridge"/>
      <address type="pci" domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
    </controller>
    <controller type="sata" index="0">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x1f" function="0x2"/>
    </controller>
    <controller type="virtio-serial" index="0">
      <address type="pci" domain="0x0000" bus="0x03" slot="0x00" function="0x0"/>
    </controller>
    <interface type="bridge">
      <mac address="02:f4:aa:ee:b5:db"/>
      <source bridge="vmbr0"/>
      <model type="virtio-non-transitional"/>
      <driver name="vhost" txmode="iothread" packed="on"/>
      <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
    </interface>
    <channel type="unix">
      <target type="virtio" name="org.qemu.guest_agent.0"/>
      <address type="virtio-serial" controller="0" bus="0" port="2"/>
    </channel>
    <input type="mouse" bus="ps2"/>
    <input type="keyboard" bus="ps2"/>
    <graphics type="spice" autoport="yes">
      <listen type="address"/>
      <gl enable="no"/>
    </graphics>
    <video>
      <model type="none"/>
    </video>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <driver name="vfio"/>
      <source>
        <address domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
      </source>
      <boot order="1"/>
      <rom bar="on"/>
      <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <driver name="vfio"/>
      <source>
        <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
      </source>
      <rom file="/usr/share/qemu/gpu-20221001T211042.rom"/>
      <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0" multifunction="on"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x01" slot="0x00" function="0x1"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x1"/>
    </hostdev>
    <redirdev bus="usb" type="spicevmc">
      <address type="usb" bus="0" port="1"/>
    </redirdev>
    <redirdev bus="usb" type="spicevmc">
      <address type="usb" bus="0" port="2"/>
    </redirdev>
    <memballoon model="none"/>
  </devices>
</domain>

Hi,

If you are using MSI Afterburner you need to run MSIAfterburnerSetup463.exe or older.

Regards,
Dan

Did you find a solution for this? I keep encountering a similar problem on Manjaro after installing a new GPU. I just switched the VM’s GPU from an AMD card to an RTX 4060 Ti and I now get random reboots of the VM preceded by a disconnect sound.
Unfortunately I can’t find anything in the Windows logs. I have to admit, though, that I find the Windows event log slightly confusing.
With the old GPU inside the VM everything worked fine.

For whoever finds this issue: it turned out that the problem was caused by Open Hardware Monitor, which was launched on startup and exposed the hardware stats (incl. the passed-through GPU’s thermals and usage) via a locally running web server, to be scraped from outside the VM for monitoring.

Once I disabled the autorun on startup and closed Open Hardware Monitor, the issue never surfaced again.

I sometimes had intermittent problems with the VM crashing when running GPU-Z or benchmarks that access GPU data (FurMark), so I think the two are related, but I don’t have more info on what precisely is causing this.

If the hardware is getting pulled from the VM you may want to check your host logs. I’ve never had this issue and I would be curious to see what if anything is going on under the hood.
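For anyone wanting to do that check, here is a minimal Python sketch (not from the original poster) that pulls the passed-through PCI addresses out of a libvirt domain XML like the one above and keeps only the host kernel log lines that mention one of them or `vfio`. The function names are made up for illustration, and it assumes you feed it the real output of `virsh dumpxml` and of `journalctl -k` or `dmesg` as plain text.

```python
import xml.etree.ElementTree as ET


def hostdev_addresses(domain_xml):
    """Return host PCI addresses (e.g. '0000:01:00.0') of all <hostdev> devices."""
    root = ET.fromstring(domain_xml)
    addresses = []
    # Only the <source><address/> child names the device on the host;
    # the other <address/> under <hostdev> is the guest-side slot.
    for src in root.iterfind("./devices/hostdev/source/address"):
        # libvirt writes these attributes as hex strings such as "0x0000".
        domain = int(src.get("domain", "0"), 16)
        bus = int(src.get("bus", "0"), 16)
        slot = int(src.get("slot", "0"), 16)
        function = int(src.get("function", "0"), 16)
        addresses.append(f"{domain:04x}:{bus:02x}:{slot:02x}.{function:x}")
    return addresses


def relevant_log_lines(log_text, addresses):
    """Keep log lines that mention a passed-through device or vfio."""
    needles = [address.lower() for address in addresses] + ["vfio"]
    return [
        line
        for line in log_text.splitlines()
        if any(needle in line.lower() for needle in needles)
    ]
```

With the XML from this thread, `hostdev_addresses` should yield `0000:07:00.0` (presumably the NVMe, since it carries the boot order), `0000:01:00.0` and `0000:01:00.1` (the GPU with its ROM file and its second function). An empty result from `relevant_log_lines` around the time of the crash would suggest the host did not notice anything.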

There was nothing in the host logs; these crashes were completely invisible from outside the VM (AFAICT).

“…and I now get random reboots of the VM preceded by a disconnect sound.”

I think the ‘hardware going away’ in this case is simply the GPU (driver) crashing. The only thing that was different about mine was that it wasn’t random at all, as it came very precisely at the 11-minute mark. Otherwise it was the same: disconnect sound, then the VM reboots. Windows never even got a chance to do a core dump, but the event logs would show the crash.

Since it was at exactly 11 min, I was going to suggest checking Task Scheduler, but you already found the answer.

@Darin755 you are correct, something is running.

@flaki try loading GPU-Z or anything similar 11-12 minutes after boot and this error should not happen. Super strange.

Using the KVM CPU works, though. If you load GPU-Z/FurMark after 11 minutes it seems to work fine, but if you launch it within the first 11 minutes it will BSOD.

I am still not sure what task is running. I investigated the Task Scheduler, but yeah, something fishy is going on there.
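
If the observation above holds (tools that query the GPU only trigger the crash when started within roughly the first 11 minutes after boot), one possible stopgap is to delay launching the monitoring tool until the guest has been up longer than that. Below is a rough Python sketch of the idea, not something anyone in this thread tested. The 12-minute threshold and all function names are assumptions; `GetTickCount64` is the Win32 call that returns milliseconds since the system started.

```python
import ctypes
import subprocess
import time

# Assumption: wait until a bit past the 11-minute mark reported in the thread.
SAFE_UPTIME_S = 12 * 60


def seconds_to_wait(uptime_s, safe_uptime_s=SAFE_UPTIME_S):
    """How long to keep waiting before the guest is past the risky window."""
    return max(0.0, safe_uptime_s - uptime_s)


def windows_uptime_s():
    """Seconds since boot inside the Windows guest (only works on Windows)."""
    get_tick_count64 = ctypes.windll.kernel32.GetTickCount64
    get_tick_count64.restype = ctypes.c_ulonglong
    return get_tick_count64() / 1000.0


def launch_after_safe_uptime(command):
    """Sleep until the uptime threshold is reached, then start the tool."""
    time.sleep(seconds_to_wait(windows_uptime_s()))
    subprocess.Popen(command)
```

Inside the guest this would be called as `launch_after_safe_uptime([r"C:\path\to\tool.exe"])`, where the path is a placeholder for whatever monitoring tool is being started. This only sidesteps the timing window; it does not explain what happens at the 11-minute mark.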