I’ve been running a vfio setup for several years now, and I’m looking for suggestions on how to make runtime bind/unbind a bit more pleasant to work with.
What I have:

1. Two Nvidia GPUs, one of them used only for the guest VM, and another one used by my Arch (btw) host. Swapping the host GPU for a non-Nvidia one is not entirely out of the question, if required.
2. Keep my guest GPU in a power-saving state when the VM is off. This is important, as otherwise it will suck 100W permanently, getting hot and reducing fan lifespan.
What I knew:

- To achieve 1, I must not let VFIO grab the guest GPU early on boot (that is, I have to let the nvidia module have it).
- To achieve 2, the guest GPU has to be bound to the nvidia module, and nvidia-persistenced has to be running. It needs to be kept running; starting it and then stopping it does not work, to my knowledge.
- To be able to launch the VM, I need to unbind the guest GPU from the nvidia module. For that to succeed, no process may be using that GPU (see the sketch after this list).
- X.org will start using any GPU bound to the nvidia module, so the guest GPU must not be bound to the nvidia module when X.org starts.
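For reference, the raw bind/unbind mechanics behind all of this look roughly like the sketch below. The PCI address is a placeholder; find yours with `lspci -nn`.

```sh
#!/bin/bash
# Sketch of flipping the guest GPU between drivers via sysfs.
# 0000:02:00.0 is a placeholder address; the card's HDMI audio function
# (e.g. 0000:02:00.1) usually needs the same treatment.
GPU=0000:02:00.0

# Release the card from whichever driver currently owns it.
echo "$GPU" > "/sys/bus/pci/devices/$GPU/driver/unbind"

# Tell the kernel which driver should claim it next (vfio-pci or nvidia),
# then trigger a probe so that driver actually binds it.
echo vfio-pci > "/sys/bus/pci/devices/$GPU/driver_override"
echo "$GPU" > /sys/bus/pci/drivers_probe
```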
With this info, I made a sketchy setup that roughly consists of the following:
1. VFIO is not started at boot, so nvidia binds both GPUs.
2. A systemd unit is run Before=display-manager.service, which unbinds the guest GPU.
3. Another systemd unit is run After=display-manager.service, which rebinds the guest GPU to nvidia.
4. nvidia-persistenced starts after that, putting the GPU in low-power mode.
5. When the VM is started, libvirt hooks stop nvidia-persistenced, reset the GPU with nvidia-smi, unbind the GPU from nvidia, and bind it to vfio before starting the VM (see the hook sketch below).
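The hook part is roughly this, as a sketch: "win10" is a placeholder domain name, the addresses and GPU index are examples, and /etc/libvirt/hooks/qemu with its domain/phase arguments is standard libvirt behaviour.

```sh
#!/bin/bash
# /etc/libvirt/hooks/qemu -- libvirtd invokes this as: qemu <domain> <phase> <sub-phase>
DOMAIN="$1"; PHASE="$2"
GPU=0000:02:00.0   # placeholder guest GPU address

if [[ "$DOMAIN" == "win10" && "$PHASE" == "prepare" ]]; then
    systemctl stop nvidia-persistenced
    nvidia-smi --gpu-reset -i 0                      # index of the guest GPU on the host
    modprobe vfio-pci
    echo "$GPU"   > "/sys/bus/pci/devices/$GPU/driver/unbind"
    echo vfio-pci > "/sys/bus/pci/devices/$GPU/driver_override"
    echo "$GPU"   > /sys/bus/pci/drivers_probe
elif [[ "$DOMAIN" == "win10" && "$PHASE" == "release" ]]; then
    echo "$GPU"   > "/sys/bus/pci/devices/$GPU/driver/unbind"
    echo nvidia   > "/sys/bus/pci/devices/$GPU/driver_override"
    echo "$GPU"   > /sys/bus/pci/drivers_probe
    systemctl start nvidia-persistenced
fi
```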
I thought this was going to work fine, but it does not. It turns out that some applications (I believe those that use OpenGL?) will attach themselves to the guest GPU even if X.org itself is not using it. Oddly enough, applications only start using the guest GPU if nvidia-persistenced is running (I have no idea why).
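The way I spot the offenders is something like this; /dev/nvidia1 is an example, and which minor number belongs to which card is listed by `nvidia-smi -q` ("Minor Number"):

```sh
# Which processes have the guest GPU's device node open?
fuser -v /dev/nvidia1
```

The process table at the bottom of plain nvidia-smi output shows the same thing per GPU, as long as the nvidia module still owns the card.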
The fan is not turning and the heatsink feels pretty warm but not hot. I have no idea how much power it is actually pulling but it is for sure much less than 100W.
Perhaps you could check bios options for PCI power management?
I do sometimes have the GPU bound to the host, and I also have some apps that keep running on the GPU spontaneously. Usually Firefox or Chrome…
I have a Gigabyte Z390 M GAMING (Intel-based), and the GPU is a 3080 Founders. I’ll take a look at PCI power management, but I suspect the PCI bus power state may not be the same thing as the GPU power state.
I check the GPU power state with nvidia-smi, which, when bound to the nvidia module and with nvidia-persistenced running, says:
The P8 there is a very nice power-saving level. Obviously nvidia-smi cannot check those values while the GPU is bound to vfio, but if I swap from vfio to nvidia and check right after that, it says P0 and ~100W. Most noticeably, the fan is spinning and the GPU is very hot to the touch. I could not keep my finger on the backplate for more than a few seconds.
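For the record, the same check in script-friendly form (these are standard nvidia-smi query fields; -i selects whichever index the guest card has):

```sh
# Power state (P0..P8) and current draw of a single card.
nvidia-smi -i 0 --query-gpu=pstate,power.draw --format=csv,noheader
```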
My guess here is that the 3060 might not get as hot in P0 as the 3080 does.
Does power_state show D3hot while it is bound to vfio? If not, getting it into D3hot under vfio might solve your issue; there are some BIOS options on my board related to that.
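The attribute I mean lives in sysfs and is readable no matter which driver is bound (placeholder PCI address):

```sh
# PCI power state of the device (D0, D3hot, ...), independent of the bound driver.
cat /sys/bus/pci/devices/0000:02:00.0/power_state
```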
While bound to nvidia, I also get D0. Right after binding I see 47 W / 170 W used; after a bit it goes down to 18 W.
Sadly there is no way to monitor power without binding nvidia, AFAIK.
After some testing I find that my card stays cooler with your method too. With both methods the fan does not spin up, but with yours the heatsink is much cooler to touch.
Makes sense, and aligns with the idea that GPU and PCI have different (complementary) power-saving settings. However, using nvidia for power saving is a huge pain in the ass; if you can live with just the PCI power savings, I’d stick with that.
And perhaps using the environment variables of the iGPU explicitly for those programs would work?
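Something along these lines is what I mean; these are the standard GLVND/PRIME render-offload variables, and whether they help depends on which GPU is primary:

```sh
# Pin a program to the Mesa-driven GPU (e.g. the iGPU) instead of the nvidia one.
__GLX_VENDOR_LIBRARY_NAME=mesa glxinfo | grep "OpenGL renderer"

# Or, the other way round, explicitly offload a single program to the nvidia card.
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxgears
```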
Gave this a quick try, but no deal. I remember reading some years ago that the nvidia drivers did work with PRIME as a secondary device, but not as a primary one. I’m not sure if that still applies, but I’d bet it does.
The weirdest thing is that applications do not attach themselves to both GPUs if nvidia-persistenced is not running. I don’t know if there’s anything that could be done around that.
I’ve come up with a really, really stupid workaround for this that somehow works.
I’ve created a dummy Linux VM, to which I pass the guest GPU when the Windows guest is off. This dummy Linux VM just has the nvidia module and nvidia-persistenced enabled. When this VM launches, it grabs the GPU and puts it in power-saving mode.
On the host, the GPU stays bound to the vfio module, so no application does funny things with it.
I still can’t believe how stupid it sounds that I can save 60W of power (and fan lifespan) by starting a VM that does nothing, but there’s that.
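If anyone wants to replicate it, the hand-over can be driven by a small wrapper like the one below rather than from inside the hooks (libvirt’s docs warn against calling virsh from a hook script). The domain names are placeholders.

```sh
#!/bin/bash
# start-windows.sh -- hand the GPU from the parking VM over to the Windows guest.
# "gpu-parking" and "win10" are placeholder domain names.
virsh shutdown gpu-parking 2>/dev/null || true

# Wait until the parking VM is really off and has released the card.
while virsh list --name | grep -qx gpu-parking; do sleep 1; done

virsh start win10
# The opposite direction (win10 shut down -> start gpu-parking again) works the same way.
```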