Problem: can't use `driverctl` overrides on Nvidia driver

(workaround found, see this comment)

Intro:

I’m working on a Debian 11 system (system specifics at the end of the post) where I was hoping to dynamically attach/detach GPUs between the host and VMs on demand using driverctl.

(note: I’ve also posted this on r/VFIO, no need to reply to both if you lurk the Reddit … I’ll update both places with my end result)


Problem:

I can’t use driverctl overrides on Nvidia driver

With the nouveau driver this works. However, once I install nvidia-driver and firmware-misc-nonfree and try to apply an override with driverctl, the shell hangs and I can’t even kill the driverctl process.

When I start up without the nvidia-driver package installed, the system binds nouveau to my 1060 (since I’ve got PCIE2 specified as the default video device in BIOS, so that’s a good thing). I can run these commands and see the override work immediately (including blanking my console screen on the 1060).

# 3080 TI ... first = GPU, second = HD Audio
driverctl set-override 0000:0b:00.0 vfio-pci
driverctl set-override 0000:0b:00.1 vfio-pci
# 1060 ... first = GPU, second = HD Audio
driverctl set-override 0000:0c:00.0 vfio-pci
driverctl set-override 0000:0c:00.1 vfio-pci

Likewise I can dynamically switch back to the defaults like this:

# 3080 TI
driverctl unset-override 0000:0b:00.0
driverctl unset-override 0000:0b:00.1
# 1060
driverctl unset-override 0000:0c:00.0
driverctl unset-override 0000:0c:00.1

At which point my 1060 displays the local console login screen again. And I can do this any number of times.
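To confirm each switch actually took, it can help to read the bound driver straight from sysfs. This is only a minimal sketch: `bound_driver` and the `SYSFS_ROOT` variable are names made up for this example (the variable exists only so the function can be pointed at a test directory), while the per-device `driver` symlink is standard sysfs.

```shell
#!/bin/sh
# bound_driver ADDR: print the kernel driver bound to a PCI device, or "none".
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
bound_driver() {
    link="${SYSFS_ROOT:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Example (addresses from this post):
#   for d in 0000:0b:00.0 0000:0b:00.1 0000:0c:00.0 0000:0c:00.1; do
#       printf '%s -> %s\n' "$d" "$(bound_driver "$d")"
#   done
```

The "Kernel driver in use:" line from `lspci -nnk -s 0c:00.0` should report the same thing.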

BUT

Once I install nvidia-driver and reboot for it to take effect, if I try to ‘driverctl set-override’ on the ‘.0’ GPU devices (either GPU), the command hangs and the SSH session becomes unresponsive.

I can SSH in a second time, so it isn’t hanging the OS. But I can’t:

  • Halt the ‘driverctl’ command with ^C
  • Background it with ^Z
  • kill -9 {driverctl-pid}
  • killall driverctl
  • (top still shows ‘driverctl’ running, even after letting it run for over 18 hours)
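A process that shrugs off `kill -9` is usually stuck inside the kernel, since signals are only acted on when the process gets back to user space. A rough way to check from the second SSH session is to look at its state in /proc. This is a sketch, not a diagnosis: `proc_state` is a made-up helper name, and whether `/proc/<pid>/stack` is readable depends on the kernel config and on running as root.

```shell
#!/bin/sh
# proc_state PID: print the one-letter process state from /proc.
# R = running, S = interruptible sleep, D = uninterruptible sleep, Z = zombie.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Example (pgrep -o picks the oldest matching process):
#   pid=$(pgrep -o driverctl)
#   proc_state "$pid"
#   sudo cat "/proc/$pid/stack"   # kernel call chain, if the kernel exposes it
```

A state of D, or R with a kernel stack that sits in the driver's remove path, would point at the driver rather than at driverctl itself, but that is an assumption to verify, not something established in this thread.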

After a reboot, I am able to apt remove nvidia* and, after another reboot, I’m back to a nouveau setup and things work again.


My workaround:

I can leave the host without the nvidia-driver and just use nouveau when I need a host desktop. But … that’s really not what I was going for. I don’t expect to use the GPUs on the base host often but I did want to have the option to use my full bare metal system when desired.

I’m wondering about two things:

  • Is there a known problem with unbinding nvidia drivers with driverctl?
  • If it’s not a known issue and I’m probably doing something wrong, what order should I do things in, and/or is there a step I should insert between installing nvidia-driver and attempting to override with driverctl?
    • I’m wondering if maybe there’s a way to blacklist nvidia-driver so it doesn’t bind on boot but can then be bound with an override.
    • But if the issue is specifically with unbinding nvidia-driver dynamically, then I’d just run into the same hang the first time I tried to unbind it.

Notes:

  1. I haven’t been using driverctl long, just a few days.
  2. I’m also learning about ZFSBootMenu.
  3. I’m documenting the entire system config and will publish a guide once I have kinks worked out.
  4. Treat me like a newb :wink:

System

  • OS:
    • Debian 11 Bullseye
    • ZFSBootMenu (root-on-ZFS, dracut instead of grub)
  • Hardware:
    • Aorus X570 Master rev 1.0 BIOS F34
    • Ryzen 9 5950X
    • Nvidia RTX 3080 TI FE 12GB
    • Gigabyte GTX 1060 6GB
    • 4 x Kingston 32GB ECC DDR4 KSM32ED8/32ME
    • 512GB Samsung EVO850 boot/root drive (other drives, not yet configured)

Addendum:

My kvm package install list:

apt -y install qemu-kvm qemu-utils libvirt-daemon-system libvirt-clients virt-manager ovmf driverctl

Replying just in case someone else finds this as a breadcrumb looking for a similar answer.

It appears I can get a working setup. The order of operations is important.

  1. Install driverctl before nvidia-driver
  2. Bind the devices to either vfio-pci or nouveau with an override:
    a. to boot with the GPU bound to vfio-pci (headless):
      * driverctl set-override 0000:0c:00.0 vfio-pci where 0c:00.0 is my GPU
      * driverctl set-override 0000:0c:00.1 vfio-pci where 0c:00.1 is the GPU’s HD audio
    b. OR to boot with the GPU bound to the host for a console:
      * driverctl set-override 0000:0c:00.0 nouveau
      * driverctl set-override 0000:0c:00.1 snd_hda_intel
  3. Install the nvidia-driver for use on host after you’ve successfully set the override
  4. Reboot

BUT

  • … don’t do this to bind to the nvidia driver:
driverctl set-override 0000:0c:00.0 nvidia
driverctl set-override 0000:0c:00.1 snd_hda_intel

Reason: this command persists the config through a reboot, and that’s where the problem is. If the system boots with the GPU bound to the nvidia driver, driverctl will hang when you try to change the override.

Instead, leave the override state set to ‘vfio-pci’ or ‘nouveau’ + ‘snd_hda_intel’, and when you’re looking to use the GPU directly on the host, use ‘--nosave’:

driverctl --nosave set-override 0000:0c:00.0 nvidia
driverctl --nosave set-override 0000:0c:00.1 snd_hda_intel
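If you do this often, the two commands can be wrapped in a small function. A sketch only: `gpu_to_host` and the `DRIVERCTL` variable are names invented here; setting `DRIVERCTL="echo driverctl"` gives a dry run that prints the commands instead of running them.

```shell
#!/bin/sh
# DRIVERCTL is a hypothetical knob for this sketch: set it to "echo driverctl"
# to print the commands instead of executing them.
DRIVERCTL="${DRIVERCTL:-driverctl}"

# gpu_to_host GPU_ADDR AUDIO_ADDR
# Temporarily bind the GPU to nvidia and its audio function to snd_hda_intel.
# --nosave leaves the persisted override (vfio-pci or nouveau) untouched,
# so the next boot does not come up bound to nvidia.
gpu_to_host() {
    $DRIVERCTL --nosave set-override "$1" nvidia &&
        $DRIVERCTL --nosave set-override "$2" snd_hda_intel
}

# Example dry run:
#   DRIVERCTL="echo driverctl" gpu_to_host 0000:0c:00.0 0000:0c:00.1
```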

I’m not sure if the fault lies with me, the nvidia driver (my guess), or driverctl … but … this seems to be a workaround.

Once I have my full config documented and working I’ll add a reply with a link.

I’ve posted this as a possible issue to the nvidia linux forum to see if there is anything more to learn there and/or see if they address it.


Disabling persistence mode with nvidia-smi -pm 0 allows the nvidia kernel module to be unloaded, and driverctl then works as expected.
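One way to see whether something is still holding the module is its use count, which is the third column of /proc/modules. A small sketch, with `module_refcount` being a made-up helper name; the optional file argument is there only so it can be tried against a sample file.

```shell
#!/bin/sh
# module_refcount NAME [FILE]: print the use count of a loaded kernel module,
# or "not loaded". FILE defaults to /proc/modules.
module_refcount() {
    awk -v m="$1" '
        $1 == m { print $3; found = 1 }
        END     { if (!found) print "not loaded" }
    ' "${2:-/proc/modules}"
}

# Example:
#   sudo nvidia-smi -pm 0     # disable persistence mode
#   module_refcount nvidia    # a count of 0 suggests the module can be unloaded
```

If the count stays above zero after disabling persistence mode, something else (an X session, a compute job, another nvidia_* module) is presumably still using the driver.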

Try unloading the driver first.

echo "0000:01:00.0" | sudo tee /sys/bus/pci/devices/0000:01:00.0/driver/unbind

Then use driverctl. If that works, then the Nvidia device is refusing the new driver because it really wants to keep the one it has loaded. Note that this won’t work if you have applications running on that card; check with nvidia-smi.

Not sure it’s a permanent solution, but now you know why.
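For anyone who wants to try the whole switch by hand, the sysfs steps look roughly like this. As far as I understand, this is close to the mechanism driverctl wraps (the kernel's `driver_override` attribute), but treat that as an assumption. `rebind_pci` and `SYSFS_ROOT` are names invented for this sketch, the target driver module is assumed to be loaded already, and on a real system it has to run as root.

```shell
#!/bin/sh
# rebind_pci ADDR DRIVER: move one PCI function to DRIVER using sysfs only.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
rebind_pci() {
    dev="${SYSFS_ROOT:-/sys}/bus/pci/devices/$1"
    # 1. Tell the PCI core which driver is allowed to claim this device.
    echo "$2" > "$dev/driver_override" || return 1
    # 2. Detach the currently bound driver, if there is one.
    if [ -e "$dev/driver" ]; then
        echo "$1" > "$dev/driver/unbind" || return 1
    fi
    # 3. Ask the PCI core to probe the device again so DRIVER picks it up.
    echo "$1" > "${SYSFS_ROOT:-/sys}/bus/pci/drivers_probe"
}

# Example: rebind_pci 0000:01:00.0 vfio-pci
```

Doing the steps one at a time also shows which write blocks, which is more than the single driverctl command reveals.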