Kernel panic when starting a VM - vfio-pci fails to claim an Nvidia card - is there any way to prevent this?

I’m getting unexpected kernel panics when booting a Windows VM with GPU passthrough - the most recent one seems to give a meaningful reason: vfio gets stuck waiting for the Nvidia card to become available.

I’ve had Xorg block the Nvidia card before, but it never caused the vfio driver to get stuck and trigger a kernel panic. Has anyone seen anything like this before?

Jul 13 01:03:26 host3 kernel: vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none
Jul 13 01:03:26 host3 kernel: xhci_hcd 0000:01:00.2: remove, state 4
Jul 13 01:03:26 host3 kernel: usb usb4: USB disconnect, device number 1
Jul 13 01:03:26 host3 kernel: xhci_hcd 0000:01:00.2: USB bus 4 deregistered
Jul 13 01:03:26 host3 kernel: xhci_hcd 0000:01:00.2: remove, state 4
Jul 13 01:03:26 host3 kernel: usb usb3: USB disconnect, device number 1
Jul 13 01:03:26 host3 kernel: xhci_hcd 0000:01:00.2: USB bus 3 deregistered
Jul 13 01:03:26 host3 libvirtd[52416]: libvirt version: 9.5.0
Jul 13 01:03:26 host3 libvirtd[52416]: hostname: host3
Jul 13 01:03:26 host3 libvirtd[52416]: Domain id=1 name='win10-off' uuid=aab5554f-8df9-4344-b691-776796a3ab04 is tainted: custom-argv
Jul 13 01:03:26 host3 systemd-machined[922]: New machine qemu-1-win10-off.
Jul 13 01:03:26 host3 systemd[1]: Started Virtual Machine qemu-1-win10-off.
Jul 13 01:03:26 host3 qemu-system-x86_64[52589]: xoauth2_scope is not set
Jul 13 01:03:34 host3 kernel: vfio-pci 0000:01:00.0: not ready 1023ms after resume; waiting
Jul 13 01:03:35 host3 kernel: vfio-pci 0000:01:00.0: not ready 2047ms after resume; waiting
Jul 13 01:03:37 host3 kernel: vfio-pci 0000:01:00.0: not ready 4095ms after resume; waiting
Jul 13 01:03:41 host3 kernel: vfio-pci 0000:01:00.0: not ready 8191ms after resume; waiting
Jul 13 01:03:50 host3 kernel: vfio-pci 0000:01:00.0: not ready 16383ms after resume; waiting
Jul 13 01:03:59 host3 libvirtd[52416]: Cannot start job (query, none, none) in API remoteDispatchConnectGetAllDomainStats for domain win10-off; current job is (async nested, none, start) owned by (52423 remoteDispatchDomainCreate, 0 <null>>
Jul 13 01:03:59 host3 libvirtd[52416]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreate)
Jul 13 01:04:03 host3 systemd[1233]: Started scdaemon-reload-to-clear-cache.service.
Jul 13 01:04:07 host3 kernel: vfio-pci 0000:01:00.0: not ready 32767ms after resume; waiting
Jul 13 01:04:19 host3 kernel: i915 0000:00:02.0: [drm] *ERROR* Atomic update failure on pipe A (start=465751 end=465752) time 1933 us, min 1052, max 1079, scanline start 907, end 121
Jul 13 01:04:29 host3 libvirtd[52416]: Cannot start job (query, none, none) in API remoteDispatchConnectGetAllDomainStats for domain win10-off; current job is (async nested, none, start) owned by (52423 remoteDispatchDomainCreate, 0 <null>>
Jul 13 01:04:29 host3 libvirtd[52416]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreate)

Do you attach vfio-pci at boot, or with a libvirt hook just when the VM starts?

Neither. I’m sharing the Nvidia card with the host (it’s an Optimus laptop, used for render offload), but by default the Nvidia card isn’t used by Xorg (I’ve removed /usr/share/X11/xorg.conf.d/10-nvidia-drm-outputclass.conf and /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json), so it should be free for either the VM or the host to use (though of course not both at the same time).
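(For reference, the driver that currently owns the card can be checked with lspci, using the GPU address 0000:01:00.0 from the log; "Kernel driver in use:" should show nothing, or vfio-pci, before the VM starts:)

lspci -nnk -s 01:00.0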

When the last kernel panic happened, though, I had made sure to remove all the Nvidia driver modules first:

sudo modprobe -r nvidia_uvm nvidia_drm nvidia_wmi_ec_backlight nvidia_modeset nvidia

The command succeeded, so vfio-pci should have been free to grab the card.
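(As an aside: if I wanted to automate that step, a libvirt qemu hook could presumably run the same modprobe -r just before this guest starts. A rough, untested sketch, using the guest name win10-off from the log and the module list above:)

#!/bin/sh
# /etc/libvirt/hooks/qemu -- libvirt invokes this as: <guest> <operation> <sub-operation> <extra>
# Sketch only: unload the NVIDIA modules in the "prepare" phase for this one guest,
# so vfio-pci can claim the GPU before QEMU is launched. The module list matches the
# modprobe -r command above and may differ on other systems.
GUEST="$1"
OPERATION="$2"

if [ "$GUEST" = "win10-off" ] && [ "$OPERATION" = "prepare" ]; then
    modprobe -r nvidia_uvm nvidia_drm nvidia_wmi_ec_backlight nvidia_modeset nvidia || exit 1
fi
exit 0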

Usually, when the card wasn’t free (Xorg held it), the VM simply refused to start.

It would be good to have some way to ensure that libvirt won’t lock up the system again, but I’m not sure where to look for it.
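(The closest thing I can think of is a guard in that same hook’s "prepare" phase that refuses to start the VM unless the GPU is actually free - as far as I know, libvirt aborts the start when the prepare hook exits non-zero. A minimal, untested sketch, again using the device address 0000:01:00.0 from the log:)

#!/bin/sh
# Sketch only: abort the VM start unless the GPU is unbound or already owned by vfio-pci,
# instead of letting vfio-pci wait on a device something else still holds.
DEV=0000:01:00.0
DRV=$(basename "$(readlink "/sys/bus/pci/devices/$DEV/driver" 2>/dev/null)")

case "$DRV" in
    ""|vfio-pci) exit 0 ;;    # no driver bound, or already vfio-pci: fine
    *) echo "GPU still bound to $DRV, refusing to start VM" >&2; exit 1 ;;
esac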

I’m testing it now, and three successive boots of that Windows VM have worked fine. The only change I’ve made so far was removing the shm device (Looking Glass) from the VM’s libvirt XML configuration, but I don’t know why that would be related.
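(For context, the Looking Glass shared-memory device in the libvirt XML usually looks something like the stanza below; the name and size are just the common example values, not necessarily what my config used:)

<shmem name='looking-glass'>
  <model type='ivshmem-plain'/>
  <size unit='M'>32</size>
</shmem>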

No, sorry, no idea. I thought it might be browser hardware acceleration blocking VFIO, but with your config that shouldn’t be the case.
