Swap GPUs at the Press of a Button: Hot dismount an NVIDIA GPU under Linux?

I watched this recent video and was very confused about the ‘hot swapping that just works under Linux’.

I’ve learned a lot from reading about VFIO and Looking Glass, which I really want to set up for the few games and programs that need a GPU and don’t run under Wine/Proton.
I have a dual-GPU setup: the main one is AMD, and I planned to give the 3080 Ti to the VM.
But as far as I know, there wasn’t a way to hot-unload an NVIDIA GPU; you had to blacklist the PCI device before boot.
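
For reference, the boot-time approach is usually done by passing the GPU's vendor:device IDs to vfio-pci on the kernel command line. Below is a minimal sketch that builds that parameter from sysfs. The names `vfio_ids_param` and `SYSFS_PCI` are made up for illustration, and the `0000:0d:00.1` audio function in the usage comment is an assumption about the card's layout, so check `lspci` first.

```shell
#!/bin/sh
# SYSFS_PCI can be overridden for testing; the default is the real sysfs path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Hypothetical helper: print a vfio-pci.ids= kernel parameter for the given
# PCI addresses, built from each device's vendor and device ID files.
vfio_ids_param() {
    vi_ids=""
    for vi_addr in "$@"; do
        vi_vendor=$(cat "$SYSFS_PCI/$vi_addr/vendor")   # e.g. 0x10de
        vi_device=$(cat "$SYSFS_PCI/$vi_addr/device")
        # Strip the 0x prefixes and join the pairs with commas.
        vi_ids="${vi_ids:+$vi_ids,}${vi_vendor#0x}:${vi_device#0x}"
    done
    echo "vfio-pci.ids=$vi_ids"
}

# Usage (assumed addresses: GPU plus its HDMI audio function):
#   vfio_ids_param 0000:0d:00.0 0000:0d:00.1
```

The printed string goes on the kernel command line of the dedicated boot entry, which is exactly the reboot dance this thread is trying to avoid.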

This is very tedious, because I use my NVIDIA GPU for AI tests and to offload some work under Linux most of the time, and I really didn’t want to reboot the computer and select a special GRUB entry just to do VFIO… at that rate it’s simpler to reboot into Windows.

So given this video, what changed? What new magic allows you to hot-swap an NVIDIA GPU under Linux?

I would love to have this now relatively “old” setup serve its intended purpose :smiley:

Does something like this work for you?

Sadly, echo "0000:0d:00.0" > /sys/bus/pci/devices/0000:0d:00.0/driver/unbind hangs forever, with the standard NVRM: Attempting to remove device 0000:0d:00.0 with non-zero usage count! message in the kernel log, and the driver folder vanishes…

So what did I miss?
How does Wendell swap GPUs with a GUI running without issue?

Should I tag him?

What troubleshooting steps did you take after you got the error?

He is currently at Computex, but remind me in about 2 weeks to ask him lol

I need to dig more, but I just looked at what was using the GPU, and it was the display manager, even when I disable the display in the settings.
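
One way to repeat that check: a compositor or display manager typically holds the card through its DRM nodes under /dev/dri, so it helps to list which nodes belong to the GPU and who has them open. This is a minimal sketch, assuming the 0000:0d:00.0 address from this thread and that `fuser` (from psmisc) is installed. The function names and the `SYSFS_PCI` override are hypothetical.

```shell
#!/bin/sh
# SYSFS_PCI can be overridden for testing; the default is the real sysfs path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Hypothetical helper: list the /dev/dri nodes that belong to one PCI GPU.
gpu_drm_nodes() {
    # $1 = PCI address, e.g. 0000:0d:00.0
    for gd_node in "$SYSFS_PCI/$1"/drm/card* "$SYSFS_PCI/$1"/drm/renderD*; do
        # Unmatched globs stay literal, so skip anything that does not exist.
        if [ -e "$gd_node" ]; then
            echo "/dev/dri/${gd_node##*/}"
        fi
    done
}

# Hypothetical helper: show the processes holding each of those nodes open.
# Run as root to see processes owned by other users (e.g. the display manager).
gpu_users() {
    for gu_dev in $(gpu_drm_nodes "$1"); do
        fuser -v "$gu_dev"
    done
}

# Usage:
#   gpu_users 0000:0d:00.0
```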

No worries, I’m overwhelmed with tax forms this week and don’t have time to go in depth right now anyway.

Thank you, I appreciate you chiming in :smiley:

Did you ever figure this out?

No. I need to, but I got caught up in a big network migration at home that took priority.

For now I just no longer play games that don’t work on Linux, since I can’t get this to work.

I have this working with an AMD iGPU + 3060. The trick is to bind to vfio on boot, then rebind to NVIDIA after starting the DE or WM. Unless the DE supports GPU hotswap, that’s the only way. Otherwise the DE latches onto the GPU and won’t let go. But you can still use PRIME offload, and obviously compute, without the DE using the GPU.
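
The rebind step described above can be sketched with the standard sysfs PCI interfaces (`driver/unbind`, `driver_override`, `drivers_probe`). This is an illustration under assumptions, not the poster's actual script. `rebind_pci` and `SYSFS_PCI_BUS` are hypothetical names, the address is the one from earlier in the thread, and the target driver's module is assumed to be loaded already.

```shell
#!/bin/sh
# SYSFS_PCI_BUS can be overridden for testing; the default is the real sysfs path.
SYSFS_PCI_BUS="${SYSFS_PCI_BUS:-/sys/bus/pci}"

# Hypothetical helper: move one PCI device to another driver. Needs root.
rebind_pci() {
    # $1 = PCI address, $2 = target driver (e.g. vfio-pci or nvidia)
    rb_dev="$SYSFS_PCI_BUS/devices/$1"
    # Detach from the current driver, if any. As seen in this thread, this
    # write can hang when something still holds the GPU open.
    if [ -e "$rb_dev/driver" ]; then
        echo "$1" > "$rb_dev/driver/unbind"
    fi
    # Tell the PCI core which driver may claim the device, then re-probe it.
    echo "$2" > "$rb_dev/driver_override"
    echo "$1" > "$SYSFS_PCI_BUS/drivers_probe"
}

# Usage, after the DE has started (module name as shown by lsmod):
#   modprobe nvidia_drm && rebind_pci 0000:0d:00.0 nvidia
# And before starting the VM:
#   rebind_pci 0000:0d:00.0 vfio-pci
```

The GPU's audio function, if it has one, would usually need the same treatment before passthrough.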

Also see this solution, which I took some parts from. It is written for KDE and I’ve had to adapt it a bit for GNOME and Hyprland:

Given the rise of eGPUs some years ago, I thought most DEs would handle this without issue, but from a quick Google search, it doesn’t look like it :frowning:

I need to look into GNOME support for this… Fedora ships a very recent release of it, so maybe there is something.

With Fedora GNOME I have it working when only binding to NVIDIA after starting GNOME… it’s a bit annoying but it works.

This part is important though (if I don’t do this, GNOME WILL spontaneously bind to the GPU and won’t let go anymore):

  • If you don’t want stuff binding to /dev/nvidia0, you can set the __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json envvar in /etc/environment as well, to exclude the nvidia EGL files, since it bypasses KWIN. Just don’t forget to set it to the nvidia file for games.
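
The "set it to the nvidia file for games" part can be done per command instead of globally. This is a minimal sketch. The `10_nvidia.json` path is an assumption about where the NVIDIA driver installs its GLVND vendor file (check your egl_vendor.d directory), and `with_nvidia_egl` is a hypothetical wrapper name.

```shell
#!/bin/sh
# Desktop-wide default, as set in /etc/environment per the quote above:
#   __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json

# ASSUMPTION: location of the NVIDIA EGL vendor file; verify it on your system.
NVIDIA_EGL_JSON="${NVIDIA_EGL_JSON:-/usr/share/glvnd/egl_vendor.d/10_nvidia.json}"

# Hypothetical wrapper: run one command with the NVIDIA EGL vendor file
# instead of the Mesa one, without touching the session-wide setting.
with_nvidia_egl() {
    env __EGL_VENDOR_LIBRARY_FILENAMES="$NVIDIA_EGL_JSON" "$@"
}

# Usage:
#   with_nvidia_egl ./some-game
```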

You mean after the user logs in, or is it OK from the GDM session?

After user login. GNOME will detect GPUs and latch on to them at user login; any point after that should be OK.

Coming back to this because I had 2 free hours to spend on it… and sadly I’m still stuck at the same point.

I’ve followed @quilt’s pointers, but maybe this isn’t fully compatible with GNOME?

My GPU is correctly bound to vfio at boot (I get a “missing nvidia driver, falling back to nouveau” on Plymouth) and my display is “off”.

If I disable all vfio modules and start the nvidia one, the GPU is immediately picked up and added to the GNOME desktop.
Nothing is bound to /dev/nvidia0:

~# lsof /dev/nvidia0
lsof: status error on /dev/nvidia0: No such file or directory

Yet I still can’t unload nvidia_drm.
echo -n "remove" > /sys/bus/pci/devices/0000:0d:00.0/drm/card0/uevent does turn off the display, but that seems to be a KDE trick and it doesn’t do enough for GNOME.
echo "0000:0d:00.0" > /sys/bus/pci/devices/0000:0d:00.0/driver/unbind still hangs with NVRM: Attempting to remove device 0000:0d:00.0 with non-zero usage count!

~# rmmod nvidia_drm    
rmmod: ERROR: Module nvidia_drm is in use

~# lsmod | grep nvidia 
nvidia_drm            135168  3
nvidia_modeset       1650688  1 nvidia_drm
nvidia_uvm           6844416  0
nvidia              72577024  2 nvidia_uvm,nvidia_modeset
video                  81920  3 amdgpu,nouveau,nvidia_modeset
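
In the lsmod output above, nvidia_drm has a use count of 3 but no modules listed after it, which suggests the references come from something other than another module, for example open DRM file handles. A small sketch to pick such modules out of lsmod-style output follows; `held_modules` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read lsmod-style lines on stdin
# (name, size, use count, optional list of holder modules) and print the
# modules whose use count is non-zero although no other module holds them.
held_modules() {
    # The header line is skipped because its third field is not a number.
    awk '$3 ~ /^[0-9]+$/ && $3 > 0 && NF < 4 { print $1, $3 }'
}

# Usage:
#   lsmod | grep nvidia | held_modules
```

Modules reported this way cannot be removed with rmmod until whatever holds those references lets go.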

And nvidia-smi doesn’t report any processes… but GNOME’s display settings still show the display on the NVIDIA GPU.

I think I just don’t have enough free time for something GNOME basically doesn’t support: hot GPU unplug…
I wish Wendell had a magic bullet in that video.