
Problem: can't use `driverctl` overrides on Nvidia driver

(workaround found; see the addendum at the end of this post)

Intro:

I’m working on a Debian 11 system (specifics at the end of the post) where I’m hoping to dynamically attach/detach GPUs between the host and VMs on demand using driverctl.

(note: I’ve also posted this on r/VFIO, no need to reply to both if you lurk the Reddit … I’ll update both places with my end result)


Problem:

I can’t use driverctl overrides on Nvidia driver

With the nouveau driver this works. However, once I install nvidia-driver and firmware-misc-nonfree and try to apply an override with driverctl, the shell hangs and I can’t even kill the driverctl process.

When I boot without the nvidia-driver package installed, the system binds nouveau to my 1060 (I have PCIE2 set as the default video device in BIOS, so that’s expected). I can run these commands and see the overrides take effect immediately (including blanking the console screen on the 1060).

# 3080 TI ... first = GPU, second = HD Audio
driverctl set-override 0000:0b:00.0 vfio-pci
driverctl set-override 0000:0b:00.1 vfio-pci
# 1060 ... first = GPU, second = HD Audio
driverctl set-override 0000:0c:00.0 vfio-pci
driverctl set-override 0000:0c:00.1 vfio-pci
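(In case it helps anyone following along: the addresses above come from lspci; this isn’t part of the steps, just how you’d look up your own.)

# list each Nvidia card's GPU and HD audio functions with their PCI addresses
lspci -nn | grep -i nvidia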

Likewise I can dynamically switch back to the defaults like this:

# 3080 TI
driverctl unset-override 0000:0b:00.0
driverctl unset-override 0000:0b:00.1
# 1060
driverctl unset-override 0000:0c:00.0
driverctl unset-override 0000:0c:00.1

At that point my 1060 displays the local console login screen again, and I can repeat this any number of times.
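A quick way to verify which driver is bound at any point (just my sanity check, not a required step):

# 'Kernel driver in use:' flips between nouveau and vfio-pci as overrides are applied
lspci -nnk -s 0000:0c:00.0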

BUT

Once I install nvidia-driver and reboot for it to take effect, any ‘driverctl set-override’ on a ‘.0’ GPU function (either GPU) hangs, and the SSH session becomes unresponsive.

I can SSH in a second time, so the OS itself isn’t hung. But I can’t:

  • halt the ‘driverctl’ command with ^C
  • background it with ^Z
  • kill -9 {driverctl-pid}
  • killall driverctl

top shows ‘driverctl’ still running even after letting it sit for over 18 hours.
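My guess (unconfirmed) is that driverctl is blocked inside the kernel on the unbind, which would explain why no signal gets through: a process stuck in uninterruptible sleep shows state ‘D’ and can’t be killed. You can check from the second SSH session:

# state 'D' = uninterruptible sleep (blocked in the kernel); wchan hints at where
ps -o pid,stat,wchan:30,cmd -C driverctl
# kernel-side stack of the stuck process (needs root)
cat /proc/$(pgrep -o driverctl)/stack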

After a reboot, I am able to apt remove ‘nvidia*’ and, after another reboot, I’m back to a nouveau setup and things work again.
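For the record, the recovery sequence (quoting the glob so the shell doesn’t expand it against files in the current directory):

# run after rebooting out of the hang
apt -y remove 'nvidia*'
reboot
# after this second reboot, nouveau binds again and the overrides work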


My workaround:

I can leave the host without the nvidia-driver and just use nouveau when I need a host desktop. But that’s really not what I was going for: I don’t expect to use the GPUs on the base host often, but I did want the option to use my full bare-metal system when desired.

I’m wondering whether either of these is true:

  • Is there a known problem with unbinding the nvidia driver with driverctl?
  • If it’s not a known issue and I’m doing something wrong, in what order should I do things, and/or is there a step I should insert between installing nvidia-driver and attempting the driverctl override?
    • I’m wondering if there’s a way to blacklist the nvidia modules so they don’t bind on boot but can still be bound later with an override (a sketch of what I mean follows this list).
    • But if the issue is specifically with unbinding the nvidia driver dynamically, then I’d just hit the same wall the first time I tried it.
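For reference, this is the kind of blacklist I mean; blacklisting only stops the modules auto-loading at boot, it doesn’t prevent a later explicit bind (untested on my setup, and the exact module list is my assumption):

# /etc/modprobe.d/blacklist-nvidia.conf
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm

Since this system boots with dracut, the initramfs would likely need regenerating (dracut -f) for the blacklist to apply early in boot.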

Notes:

  1. I haven’t been using driverctl long, just a few days.
  2. I’m also learning about ZFSBootMenu.
  3. I’m documenting the entire system config and will publish a guide once I have kinks worked out.
  4. Treat me like a newb :wink:

System

  • OS:
    • Debian 11 Bullseye
    • ZFSBootMenu (root-on-ZFS, dracut instead of grub)
  • Hardware:
    • Aorus X570 Master rev 1.0 BIOS F34
    • Ryzen 9 5950X
    • Nvidia RTX 3080 TI FE 12GB
    • Gigabyte GTX 1060 6GB
    • 4 x Kingston 32GB ECC DDR4 KSM32ED8/32ME
    • 512GB Samsung 850 EVO boot/root drive (other drives, not yet configured)

Addendum:

My KVM package install list:

apt -y install qemu-kvm qemu-utils libvirt-daemon-system libvirt-clients virt-manager ovmf driverctl
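If you want a quick sanity check after installing, virt-host-validate (shipped with the libvirt packages) reports whether KVM and IOMMU look usable:

virt-host-validate qemu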

Replying to my own thread in case someone else finds this breadcrumb while looking for a similar answer.

It appears I can get a working setup; the order of operations is important (a condensed version of the sequence follows the list).

  1. Install driverctl before nvidia-driver.
  2. Bind the devices to either vfio-pci or nouveau with an override:
    a. to boot with the GPU bound to vfio-pci (headless):
    * driverctl set-override 0000:0c:00.0 vfio-pci (0c:00.0 is my GPU)
    * driverctl set-override 0000:0c:00.1 vfio-pci (0c:00.1 is the GPU’s HD audio)
    b. OR to boot with the GPU bound to the host for a console:
    * driverctl set-override 0000:0c:00.0 nouveau
    * driverctl set-override 0000:0c:00.1 snd_hda_intel
  3. Install nvidia-driver for host use only after the override is successfully set.
  4. Reboot.
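Condensed into one sequence, using my 0c:00.x addresses (substitute your own):

apt -y install driverctl
# persist the override BEFORE the nvidia driver is installed
driverctl set-override 0000:0c:00.0 vfio-pci
driverctl set-override 0000:0c:00.1 vfio-pci
apt -y install nvidia-driver firmware-misc-nonfree
reboot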

BUT

… but don’t do this to bind to the nvidia driver:
driverctl set-override 0000:0c:00.0 nvidia
driverctl set-override 0000:0c:00.1 snd_hda_intel

Reason: set-override persists the config across reboots, and that’s where the problem is. If the system boots with the GPU bound to the nvidia driver, driverctl will hang when you try to change the override.
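If you’ve already persisted the nvidia override and driverctl now hangs on every attempt to change it, my understanding is that driverctl stores persisted overrides as one file per device under /etc/driverctl.d/, so deleting the file and rebooting should recover (I haven’t had to do this myself):

ls /etc/driverctl.d/
rm /etc/driverctl.d/0000:0c:00.0
reboot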

Instead, leave the persisted override set to ‘vfio-pci’ (or ‘nouveau’ + ‘snd_hda_intel’), and when you want to use the GPU directly on the host, use ‘--nosave’:

driverctl --nosave set-override 0000:0c:00.0 nvidia
driverctl --nosave set-override 0000:0c:00.1 snd_hda_intel
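Because --nosave doesn’t touch the persisted state, you can confirm the live binding and the saved override have diverged as intended:

driverctl list-overrides        # persisted overrides: should still show vfio-pci or nouveau
lspci -nnk -s 0000:0c:00.0      # 'Kernel driver in use' shows the live binding (nvidia)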

I’m not sure if the fault lies with me, the nvidia driver (my guess), or driverctl, but this seems to be a workaround.

Once I have my full config documented and working I’ll add a reply with a link.

I’ve posted this as a possible issue on the Nvidia Linux forum to see if there’s anything more to learn there and/or whether they address it.