VFIO GPU passthrough and fan control

I recently noticed that my GPU fans run at full speed when the VM is not running. After a quick search I found that this is expected behavior: vfio-pci is just a stub driver and is not supposed to perform any device-specific function, such as fan control on a GPU.

So I am looking for an alternative solution. Right now the idea is to hand the device over to the nouveau driver (I have a GTX card) while it is not in use, and to switch it back to vfio-pci before running a VM.
I’ve seen some posts where people do this with AMD cards.
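
Before switching anything it helps to know which driver currently owns the card. A minimal sketch of that check follows, assuming the standard sysfs layout where a bound device has a `driver` symlink. The function name `current_driver` and the `SYSFS_ROOT` variable are made up for illustration (the latter only exists so the function can be pointed at a fake tree), and the PCI address in the usage comment is just an example.

```shell
#!/bin/bash
# SYSFS_ROOT is a hypothetical override for trying this against a fake tree;
# on a real system it stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the name of the driver bound to the PCI device at address $1,
# or "none" if no driver is bound.
current_driver() {
  local dev="$SYSFS_ROOT/bus/pci/devices/$1"
  if [ -L "$dev/driver" ]; then
    # The symlink points at .../bus/pci/drivers/<name>
    basename "$(readlink "$dev/driver")"
  else
    echo "none"
  fi
}

# Usage (example address, adjust to your card):
#   current_driver 0000:09:00.0
```

A switching script could call this first and skip the unbind/bind steps when the device is already on the wanted driver.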

I wrote a script loosely based on https://pastebin.com/zLQPHPQk
Switching from vfio-pci to nouveau works fine (the card is bound to vfio-pci at boot), and the fans are quiet.
But as soon as I try to reclaim the device for vfio-pci, the nouveau driver crashes with a NULL pointer dereference while it is being unbound.

trace

Call Trace:
nouveau_bo_move_m2mf.constprop.24+0x121/0x1e0 [nouveau]
nouveau_bo_move+0xaa/0x450 [nouveau]
? nvif_vmm_unmap+0x38/0x60 [nouveau]
? nouveau_vma_unmap+0x20/0x30 [nouveau]
ttm_bo_handle_move_mem+0x28a/0x5a0 [ttm]
ttm_bo_evict+0x171/0x350 [ttm]
? do_detailed_mode+0x24e/0x5a0 [drm]
ttm_mem_evict_first+0x18d/0x210 [ttm]
ttm_bo_force_list_clean+0xa1/0x170 [ttm]
ttm_bo_clean_mm+0x89/0xf0 [ttm]
nouveau_ttm_fini+0x2b/0xc0 [nouveau]
nouveau_drm_unload+0x7b/0xd0 [nouveau]
drm_dev_unregister+0x3f/0xd0 [drm]
drm_put_dev+0x27/0x40 [drm]
nouveau_drm_device_remove+0x47/0x70 [nouveau]
pci_device_remove+0x3b/0xb0
device_release_driver_internal+0x182/0x250
unbind_store+0xb4/0x180
kernfs_fop_write+0x10f/0x190
vfs_write+0xad/0x1a0
ksys_write+0x52/0xc0
do_syscall_64+0x55/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9

So I guess I should pursue the issue with the nouveau developers (I haven’t tried the closed-source nvidia driver yet).

Is this a viable idea?
Does anyone have a better one? How do you guys deal with the GPU fan control issue?

Thanks!


I do not have much of a software answer, and I would also be interested in a solution. On the previous iteration of my PC I went full MacGyver and controlled the fans with an Arduino board and a Python script that would start the fans when the VM was running. Not very elegant, I know.


I am having the same problem and I got the driver switch working by using the following script as a QEMU hook:

#!/bin/bash
TMP_FILE=/tmp/qemu-hook
DOMAIN=$1
VM_STATE=$2

touch "$TMP_FILE"
echo "Event $DOMAIN $VM_STATE" >> "$TMP_FILE"

# Unbinds a PCIe device from its current driver and, if a driver name
# is given, registers the device ID with that driver so it claims the device
# Args:
#   $1 - PCI address (domain:bus:device.function)
#     For example: "0000:03:00.0"
#   $2 - Vendor ID
#     For example: "0x10de"
#   $3 - Device ID
#     For example: "0x17c8"
#   $4 - Driver to attach (optional; if omitted the device is left unbound)
#     For example: "vfio-pci"
attach_driver()
{
  echo "Attach driver: $*"

  # Detach the device from its current driver, if it has one
  if [ -d "/sys/bus/pci/devices/$1/driver/" ]; then
    echo "$1" > "/sys/bus/pci/devices/$1/driver/unbind"
  fi

  # Adding the ID to the target driver makes it try to bind matching devices
  if [ $# -eq 4 ]; then
    echo "$2 $3" > "/sys/bus/pci/drivers/$4/new_id"
  fi
}

# Script for win10
if [[ $DOMAIN == "win10" ]]; then
  if [[ $VM_STATE == "prepare" ]]; then
    echo "Windows 10 VM preparing PCIe devices" >> $TMP_FILE
    attach_driver "0000:09:00.0" "0x10de" "0x17c8" "vfio-pci" # GTX 980 Ti video
    attach_driver "0000:09:00.1" "0x10de" "0x36b6" "vfio-pci" # GTX 980 Ti HDMI sound
  fi

  if [[ $VM_STATE == "release" ]]; then
    echo "Windows 10 VM releasing PCIe devices" >> $TMP_FILE
    attach_driver "0000:09:00.0" "0x10de" "0x17c8" "nouveau" # GTX 980 Ti video
    attach_driver "0000:09:00.1" "0x10de" "0x36b6" # GTX 980 Ti HDMI sound
  fi
fi

To get the hook working just copy it into /etc/libvirt/hooks/, name it qemu and make it executable.
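
The install step can be sketched as a small function. This is only an illustration: `install_hook` and its optional second argument are hypothetical names added so the copy can be tried in a scratch directory, and the `libvirtd` service name in the usage comment is an assumption about the distribution.

```shell
#!/bin/bash
# install_hook SRC [HOOK_DIR]
# Copy SRC to HOOK_DIR/qemu and make it executable.
# HOOK_DIR defaults to /etc/libvirt/hooks; the parameter is a hypothetical
# convenience for trying the function outside of /etc.
install_hook() {
  local src="$1" hook_dir="${2:-/etc/libvirt/hooks}"
  mkdir -p "$hook_dir" &&
    install -m 0755 "$src" "$hook_dir/qemu"
}

# Usage (as root, file name is an example):
#   install_hook ./qemu-hook.sh
#   systemctl restart libvirtd   # may be needed if the hooks directory is new
```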

Unfortunately the nouveau driver isn’t able to control the fans of my GPU (Nvidia GTX 980 Ti).
The proprietary nvidia driver, on the other hand, can control the fan speeds, but the driver switch does not work with it because nvidia does not appear in the /sys/bus/pci/drivers/ folder.

I also tried to use driver_override but that did not work either.

echo "nvidia" > /sys/bus/pci/devices/0000:09:00.0/driver_override
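
For reference, writing to driver_override on its own only records a preference. It neither unbinds the device from its current driver nor triggers a new probe, and a driver only shows up under /sys/bus/pci/drivers/ once its kernel module is loaded. Below is a rough sketch of the full sequence under those assumptions. The `rebind` function and the `SYSFS_ROOT` variable are hypothetical names (the variable only allows a dry run against a fake tree), the `modprobe` hint assumes the module is named like the driver, and this has not been verified against the proprietary nvidia driver, which may need its companion modules or refuse the device for other reasons.

```shell
#!/bin/bash
# SYSFS_ROOT is a hypothetical override for dry runs; normally /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# rebind ADDR DRIVER
# Hand the PCI device at ADDR (e.g. 0000:09:00.0) to DRIVER.
rebind() {
  local addr="$1" drv="$2"
  local dev="$SYSFS_ROOT/bus/pci/devices/$addr"

  # The driver must already be registered, i.e. its module must be loaded.
  if [ ! -d "$SYSFS_ROOT/bus/pci/drivers/$drv" ]; then
    echo "driver $drv is not registered; try: modprobe $drv" >&2
    return 1
  fi

  # Release the device from whatever driver currently owns it.
  if [ -e "$dev/driver/unbind" ]; then
    echo "$addr" > "$dev/driver/unbind"
  fi

  # Restrict matching to the chosen driver, then ask the kernel to probe.
  echo "$drv" > "$dev/driver_override"
  echo "$addr" > "$SYSFS_ROOT/bus/pci/drivers_probe"
}

# Usage (as root, example address):
#   rebind 0000:09:00.0 nvidia     # after the VM shuts down
#   rebind 0000:09:00.0 vfio-pci   # before the VM starts
```

Compared with new_id, this approach does not add a permanent ID match to the driver, so other cards with the same vendor and device ID are left alone.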

Does anybody know how to reassign the proprietary nvidia driver properly?

Some people told me that I should try to override the vBIOS of the card with one that has a manipulated section for fan control.
https://pve.proxmox.com/wiki/Pci_passthrough#romfile
The thing is that I don’t know how to manipulate a vBIOS.

Are there more options to achieve reasonable fan control or even shut the fans down completely while the VM is turned off?