(workaround found, see this comment)
Intro:
I’m working on a Debian 11 system (specifics at the end of the post) where I was hoping to dynamically attach/detach GPUs between the host and VMs on demand using driverctl.
(note: I’ve also posted this on r/VFIO, no need to reply to both if you lurk the Reddit … I’ll update both places with my end result)
Problem:
I can’t use driverctl overrides with the Nvidia proprietary driver.
Using the nouveau driver this has been successful. However, once I install nvidia-driver and firmware-misc-nonfree and try to apply an override with driverctl, the shell hangs and I can’t even kill the driverctl process.
When I start up without the nvidia-driver package installed, the system binds nouveau to my 1060 (I’ve got the PCIe slot 2 device specified as the default video device in the BIOS, so that’s expected and a good thing). I can run these commands and see the override take effect immediately (including blanking my console screen on the 1060):
# 3080 TI ... first = GPU, second = HD Audio
driverctl set-override 0000:0b:00.0 vfio-pci
driverctl set-override 0000:0b:00.1 vfio-pci
# 1060 ... first = GPU, second = HD Audio
driverctl set-override 0000:0c:00.0 vfio-pci
driverctl set-override 0000:0c:00.1 vfio-pci
Likewise I can dynamically switch back to the defaults like this:
# 3080 TI
driverctl unset-override 0000:0b:00.0
driverctl unset-override 0000:0b:00.1
# 1060
driverctl unset-override 0000:0c:00.0
driverctl unset-override 0000:0c:00.1
At which point my 1060 displays the local console login screen again. And I can do this any number of times.
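As a sanity check between switches, the current binding can be confirmed by following the device’s `driver` symlink in sysfs — this is the same state driverctl manipulates, just read directly. The helper function name here is mine, not part of driverctl:

```shell
# Report which driver (if any) is currently bound to a PCI device,
# by resolving the "driver" symlink in its sysfs directory.
bound_driver() {
    # $1 = sysfs device dir, e.g. /sys/bus/pci/devices/0000:0b:00.0
    if [ -L "$1/driver" ]; then
        basename "$(readlink "$1/driver")"
    else
        echo "(none)"
    fi
}

# After a set-override this should print "vfio-pci"; after unset-override,
# "nouveau" (or "nvidia" once the proprietary driver is installed).
bound_driver /sys/bus/pci/devices/0000:0b:00.0
```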
BUT
Once I install nvidia-driver and reboot for it to take effect, if I try ‘driverctl set-override’ on the ‘.0’ GPU device (of either card), the command hangs and the SSH session becomes unresponsive.
I can SSH in a second time, so it isn’t hanging the whole OS. But I can’t:
- Halt the ‘driverctl’ command with ^C
- Background it with ^Z
- kill -9 {driverctl-pid}
- killall driverctl

And top shows ‘driverctl’ constantly running, even when left to run over 18 hours.
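For reference, this is how I’d check whether the hung process is in uninterruptible sleep (“D” state) — my guess, not a confirmed diagnosis, but it would explain why even kill -9 has no effect:

```shell
# Substitute the PID of the hung driverctl process here
# (using $$ only so the snippet runs standalone):
pid=$$

# A STAT of "D" means uninterruptible sleep: the process is blocked inside
# a kernel call (likely the sysfs write that unbinds the driver), and
# SIGKILL cannot take effect until that call returns.
ps -o pid,stat,args -p "$pid"

# As root, /proc/<pid>/stack shows the kernel code path it is blocked in:
# cat /proc/$pid/stack
```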
After a reboot, I am able to apt remove nvidia* and, after another reboot, I’m back to a nouveau setup and things work again.
My workaround:
I can leave the host without the nvidia-driver and just use nouveau when I need a host desktop. But … that’s really not what I was going for. I don’t expect to use the GPUs on the base host often but I did want to have the option to use my full bare metal system when desired.
I’m wondering if either of these is true:
- Is there a known problem with unbinding the nvidia driver using driverctl?
- If it isn’t a known issue and I’m probably doing something wrong: in what order should I do things, and/or is there a step I should insert between installing nvidia-driver and attempting the override with driverctl?
- I’m wondering if maybe there’s a way to blacklist nvidia-driver so it doesn’t bind on boot but can still be bound later with an override.
- But if there is an issue specifically with unbinding nvidia-driver dynamically, then I’d just ram my head into that the first time I try it.
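On the blacklist idea specifically, this is the kind of fragment I had in mind, assuming the module names the Debian nvidia-driver package ships (the filename is arbitrary):

```
# /etc/modprobe.d/blacklist-nvidia.conf  (arbitrary filename)
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm
```

Since this box boots via dracut, the initramfs would need regenerating afterwards (`dracut -f`) plus a reboot. Whether nvidia then binds cleanly via a later ‘driverctl set-override’ is exactly the open question above.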
Notes:
- I haven’t been using driverctl long, just a few days.
- I’m also learning about ZFSBootMenu.
- I’m documenting the entire system config and will publish a guide once I have kinks worked out.
- Treat me like a newb
System
- OS:
  - Debian 11 Bullseye
  - ZFSBootMenu (root-on-ZFS, dracut instead of grub)
- Hardware:
  - Aorus X570 Master rev 1.0, BIOS F34
  - Ryzen 9 5950X
  - Nvidia RTX 3080 TI FE 12GB
  - Gigabyte GTX 1060 6GB
  - 4 x Kingston 32GB ECC DDR4 KSM32ED8/32ME
  - 512GB Samsung 850 EVO boot/root drive (other drives, not yet configured)