Dynamic driver unbinding dual Nvidia GPU setup

Yes, yet another post about dynamic driver unbinding with Nvidia GPUs.

Relevant specs

  • Some old 4th gen intel i7
  • GTX 1050 - 0000:06:00.0 - /dev/nvidia1 - The one I want to dynamically bind
  • GTX 1060 6GB - /dev/nvidia0 - Always on host

Software

I’ve tried the following on both X and Wayland. On Wayland I tried two compositors, KWin and Hyprland. I am using the proprietary driver because I need CUDA support for computing and folding.

What I want

I want to pass my 1050 through to a VM (hence the vfio-manager) and rebind it to the host when the VM is not in use. When I start the VM again, I want to unbind the card from the nvidia driver.

What happens

The passthrough is working just fine, and I can unbind the card from the vfio-pci driver and bind it to the nvidia one with:

echo "0000:06:00.0" | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind
echo "0000:06:00.0" | sudo tee /sys/bus/pci/drivers/nvidia/bind 

Then I can use any app on the host on either of the two GPUs, neat. I can see which apps are actively using the GPUs with nvidia-smi.
Note that I say actively, because running lsof /dev/nvidia1 shows that every app launched after the GPU was bound to the host has that file open, which prevents me from running

echo "0000:06:00.0" | sudo tee /sys/bus/pci/drivers/nvidia/unbind 

Because I bind the 1050 after Wayland is already running, the compositor does not appear in that list.
Only when the output of lsof /dev/nvidia1 is empty can I unbind the GPU from the nvidia driver.
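
The check described above could be scripted roughly like this. It is only a sketch under assumptions: the function names and the SYSFS_DRIVERS override (there for dry runs against a scratch directory) are made up for illustration, the address and device node are the ones from this post, and it only looks at /dev/nvidia1, not at any other nvidia device files.

```shell
#!/usr/bin/env bash
# Sketch: unbind the GPU from the nvidia driver only if nothing has its node open.
# GPU_ADDR and GPU_NODE come from the post; SYSFS_DRIVERS is a hypothetical
# override so the logic can be tried against a scratch directory.
GPU_ADDR="${GPU_ADDR:-0000:06:00.0}"
GPU_NODE="${GPU_NODE:-/dev/nvidia1}"
SYSFS_DRIVERS="${SYSFS_DRIVERS:-/sys/bus/pci/drivers}"

# Succeeds when lsof lists no process for the node
# (lsof exits non-zero when it finds nothing, or when the node is missing).
node_is_free() {
    ! lsof "$1" >/dev/null 2>&1
}

# Writes the PCI address to the nvidia unbind file only if the node is free.
# On a real system this has to run as root.
unbind_if_free() {
    if node_is_free "$GPU_NODE"; then
        printf '%s\n' "$GPU_ADDR" > "$SYSFS_DRIVERS/nvidia/unbind"
    else
        echo "$GPU_NODE is still in use:" >&2
        lsof "$GPU_NODE" >&2
        return 1
    fi
}
```

After sourcing this as root, calling unbind_if_free either unbinds the card or prints the processes that still hold the node open and returns non-zero.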

On top of that, under Wayland, killing all the apps that are using /dev/nvidia1 crashes the compositor and freezes the whole computer (Ctrl+Alt+Fx does not work).

Practical example

  • Boot with 1050 bound to vfio-pci and 1060 to nvidia
  • Run firefox
  • Unbind 1050 from vfio-pci and bind to nvidia driver
  • Run kate
  • lsof /dev/nvidia0:
COMMAND     PID USER  FD   TYPE DEVICE SIZE/OFF NODE NAME
Xwayland   1047 user mem    CHR  195,0           846 /dev/nvidia0
plasmashe  1109 user mem    CHR  195,0           846 /dev/nvidia0
xwaylandv  1336 user mem    CHR  195,0           846 /dev/nvidia0
firefox    1379 user 318u   CHR  195,0      0t0  846 /dev/nvidia0
kate       12132 user 17u   CHR  195,1      0t0  878 /dev/nvidia0
  • lsof /dev/nvidia1:
COMMAND     PID USER  FD   TYPE DEVICE SIZE/OFF NODE NAME
kate       12132 user 17u   CHR  195,1      0t0  878 /dev/nvidia1

If I were then to kill kate, the whole system would freeze. I haven’t checked any kernel logs, as I do not want to go down another rabbit hole.

Who to blame

Now, when I was using Xorg I was convinced it was the X server's fault, but since the same behaviour happens on Wayland with two different compositors, I do not know where to start looking.

A glimpse of hope

If I repeat the practical example above but launch FAHControl (the FAHControl-gtk3 port on GitHub) instead of kate, it does not attach itself to /dev/nvidia1! Hence lsof /dev/nvidia1 is empty and I can unbind from the nvidia driver and bind back to vfio-pci.

I am unsure why this app does not open the file. Both apps run natively on Wayland, and other apps using Xwayland do appear in the output of lsof /dev/nvidia0. FAHControl is launched through Python: /usr/bin/python /usr/bin/FAHControl, so maybe the answer lies there.

Anyway, I am asking in case someone has encountered the same issue and solved it, or has an idea of where I could open an issue or bug ticket. Any help or guidance will be appreciated.


This seems not to be a Wayland issue but rather a KWin issue, since the crash does not happen on Hyprland.

Then I can run fuser -k /dev/nvidia1 to kill everything using the 1050 and rebind it to vfio-pci without rebooting.
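
Put together, that sequence might look like the sketch below. The function name and the SYSFS_DRIVERS override are hypothetical, the address and node are the ones from the first post, and it assumes vfio-pci already accepts this card (it was bound to vfio-pci at boot). Note that fuser -k sends SIGKILL by default, so anything still using the 1050 loses unsaved state.

```shell
#!/usr/bin/env bash
# Sketch: hand the 1050 back to vfio-pci without rebooting.
# GPU_ADDR and GPU_NODE come from the first post; SYSFS_DRIVERS is a
# hypothetical override for dry runs against a scratch directory.
GPU_ADDR="${GPU_ADDR:-0000:06:00.0}"
GPU_NODE="${GPU_NODE:-/dev/nvidia1}"
SYSFS_DRIVERS="${SYSFS_DRIVERS:-/sys/bus/pci/drivers}"

release_to_vfio() {
    # Kill every process that has the node open. fuser exits non-zero
    # when nothing had it open, which is fine here.
    fuser -k "$GPU_NODE" >/dev/null 2>&1 || true

    # Unbind from nvidia, then bind to vfio-pci (needs root on a real system).
    printf '%s\n' "$GPU_ADDR" > "$SYSFS_DRIVERS/nvidia/unbind" || return 1
    printf '%s\n' "$GPU_ADDR" > "$SYSFS_DRIVERS/vfio-pci/bind"
}
```

Sourced and called as root (release_to_vfio), this does the same thing as the manual commands, just in one step.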

I know KWin has some problems with freezes on logging out, so I will wait for a fix for that before I open any issues, since I have the feeling it is the same kind of failure.