Yes, yet another post about dynamic driver unbinding with nvidia GPUs.
Relevant specs
- Some old 4th gen Intel i7
- GTX 1050 - 0000:06:00.0 - /dev/nvidia1
  - The one I want to dynamically bind
- GTX 1060 6GB - /dev/nvidia0
  - Always on host
Software
I’ve tried the following on both X and Wayland. On Wayland I tried two compositors, KWin and Hyprland. I am using the proprietary driver because I need CUDA support for computing and folding.
What I want
I want to pass my 1050 through to a VM, hence using the vfio-manager, and rebind the 1050 to the host when the VM is not in use. If I restart the VM, I want to unbind the card from the Nvidia driver again.
What happens
The passthrough is working just fine, and I can unbind from the vfio-pci driver and bind to the nvidia one with:
echo "0000:06:00.0" | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind
echo "0000:06:00.0" | sudo tee /sys/bus/pci/drivers/nvidia/bind
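To confirm that the rebind took effect, the driver that currently owns the card can be read from sysfs. The following is only a minimal sketch: the function name current_driver and the SYSFS_ROOT override are made up for illustration (the override exists only so the function can be tried against a fake directory tree), while the driver symlink under /sys/bus/pci/devices/ is the standard sysfs layout.

```shell
#!/usr/bin/env bash
# Print the name of the kernel driver bound to a PCI device,
# or "none" when no driver is bound.
# SYSFS_ROOT is a hypothetical override for trying this on a fake tree;
# it defaults to the real /sys.
current_driver() {
    local addr="$1"
    local link="${SYSFS_ROOT:-/sys}/bus/pci/devices/${addr}/driver"
    if [ -L "$link" ]; then
        # The symlink points at .../bus/pci/drivers/<name>; keep only <name>.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

After sourcing this, running current_driver 0000:06:00.0 should print vfio-pci before the two commands above and nvidia after them.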
Then I can use any app on the host on either of the two GPUs, neat. I can see which apps are actively using the GPUs with nvidia-smi.
Note that I say actively, because running lsof /dev/nvidia1 shows that every app launched after the GPU was bound to the host holds that file open, which prevents me from running
echo "0000:06:00.0" | sudo tee /sys/bus/pci/drivers/nvidia/unbind
Because I bind the 1050 once Wayland is already running, the compositor is not on that list.
Only when the output of lsof /dev/nvidia1 is empty can I unbind the GPU from the nvidia driver.
On top of that, in Wayland, killing all the apps that are using /dev/nvidia1 makes the compositor crash and freezes the whole computer (Ctrl+Alt+Fx does not work).
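The check I do by hand with lsof before unbinding can be sketched as a small guard. This is only an illustration under stated assumptions: device_holders, safe_unbind and the PROC_ROOT override are hypothetical names, not part of any tool. The sketch looks for the device node both among open file descriptors and among memory mappings (the "mem" rows in lsof), and it silently skips processes whose /proc entries the current user cannot read.

```shell
#!/usr/bin/env bash
# List PIDs that hold a device node open, either through a file descriptor
# or through a memory mapping.
# PROC_ROOT is a hypothetical override for trying this on a fake tree;
# it defaults to the real /proc.
device_holders() {
    local dev="$1" proc="${PROC_ROOT:-/proc}"
    local piddir fd pid
    for piddir in "$proc"/[0-9]*; do
        [ -d "$piddir" ] || continue
        pid="${piddir##*/}"
        # Each open descriptor is a symlink to the opened path.
        for fd in "$piddir"/fd/*; do
            if [ "$(readlink "$fd" 2>/dev/null)" = "$dev" ]; then
                echo "$pid"
                continue 2
            fi
        done
        # In maps, the backing path is the last field of each line.
        if awk -v d="$dev" '$NF == d { found = 1 } END { exit !found }' \
            "$piddir/maps" 2>/dev/null; then
            echo "$pid"
        fi
    done
}

# Refuse to unbind while anything still holds the device node.
safe_unbind() {
    local addr="$1" dev="$2" holders
    holders="$(device_holders "$dev")"
    if [ -n "$holders" ]; then
        echo "still in use by PIDs:" $holders >&2
        return 1
    fi
    echo "$addr" | sudo tee /sys/bus/pci/drivers/nvidia/unbind
}
```

With this sourced, safe_unbind 0000:06:00.0 /dev/nvidia1 would only write to the unbind file once nothing shows up as a holder. It does not solve the underlying problem, it only avoids attempting the unbind while the file is still in use.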
Practical example
- Boot with the 1050 bound to vfio-pci and the 1060 to nvidia
- Run firefox
- Unbind 1050 from vfio-pci and bind to nvidia driver
- Run kate
lsof /dev/nvidia0:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
Xwayland 1047 user mem CHR 195,0 846 /dev/nvidia0
plasmashe 1109 user mem CHR 195,0 846 /dev/nvidia0
xwaylandv 1336 user mem CHR 195,0 846 /dev/nvidia0
firefox 1379 user 318u CHR 195,0 0t0 846 /dev/nvidia0
kate 12132 user 17u CHR 195,1 0t0 878 /dev/nvidia0
lsof /dev/nvidia1:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
kate 12132 user 17u CHR 195,1 0t0 878 /dev/nvidia1
If I were to then kill kate, the whole system would freeze. I haven’t checked any kernel logs, as I do not want to go down another rabbit hole.
Who to blame
Now, when I was using Xorg I was convinced it was the X server's fault, but since the same behaviour happens on Wayland with two different compositors, I do not know where to start looking.
A glimpse of hope
If I repeat the practical example above but launch FAHControl (the FAHControl-gtk3 port on GitHub) instead of kate, it does not attach itself to /dev/nvidia1! Hence the output of lsof /dev/nvidia1 is empty and I can unbind from the nvidia driver and bind back to vfio-pci.
I am unsure why this app does not make use of the file. Both apps run natively on Wayland, and other apps using Xwayland do appear in the output of lsof /dev/nvidia0. FAHControl is launched through Python (/usr/bin/python /usr/bin/FAHControl), so maybe the answer lies there.
Anyway, I am just asking in case someone has encountered the same issue and solved it, or has some idea where I could open an issue or bug report. Any help or guidance will be appreciated.