GPU passthrough works but no fan spin, potentially due to missing a couple of PCI devices but I can't find them in QEMU/KVM PCI Host Device

Update #2:
One thing troubled me after I set up the whole VFIO thing. My host system would freeze if I tried to:

  • move a window
  • open a new app
  • resize a window
  • separate a browser tab from the main browser window
  • probably also when closing a window but I don’t remember

BTW, when my host system froze, my guest VM continued to run just fine, USB passthrough still worked, etc. So it was likely GNOME that hung.

At a certain point I realized all these things happen when a window’s size/location needs to be recalculated. I recalled that before starting the VM, Fedora would actually use both monitors. So I made a change in display settings, so that Fedora only uses the monitor connected to the host GPU.

Good news, my host system freezing issue is resolved.

Bad news, now I cannot get my GPU fan to spin, at all. No matter what I try, rebooting the Windows VM or the host system, changing the display back to both monitors, nothing works. My GPU fan just won’t spin even if I leave it on overnight (I know it sounds weird, but previously I was able to get the GPU fan to spin by simply waiting for anywhere between 1 minute and a few hours after booting up the Windows VM).

I’m not sure if the fact that my host machine boots up with two monitors both displaying at the login screen is related (after logging in, it only displays on 1 monitor).


Update #1:
So I discovered this by accident. I left GPU fan on 100% (but it was not spinning) and went out to do some shopping. When I came back, the GPU fan was spinning REEEEEEEEEEEally loud…yep it was spinning at 100%. I changed it back to automatic control and it appeared to work properly.

Now, if I reboot Windows, it stops working again, but it “fixes” itself after a while.

So the issue is not related to the uninstalled PCI devices or the Vega PCI bridge devices…

I played a few games and most of them were fine except one that suffered severe jittering and/or frame loss.

I guess it’s not unacceptable to wait a while before I can play games, but I still want to know why :joy:


OP:
Fedora 33
X399 Taichi
TR 1950X
Guest GPU is an ASUS Vega 64

Currently the passthrough kind of works. I have output from the VM, I can launch games and FPS seems fine. Problem is there is no fan spin even if I manually crank fan curve up to 100%. As a result, when playing more demanding games, my GPU overheats and shuts down the VM.

I have verified that the card works in another system running Windows bare metal. Fan spins up when temperature goes up. So it’s not a hardware issue unless it’s the X399 Taichi motherboard, which I doubt because my host Linux GPU fan spins just fine.

I suspect it’s because a couple of PCI devices are not passed into the VM properly. I can see them in lspci (09:00.0 and 0a:00.0):

$ lspci | grep -i vega
09:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Vega 10 PCIe Bridge (rev c1)
0a:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Vega 10 PCIe Bridge
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] (rev c1)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon RX Vega 56/64]

And I have put all four of them in the /usr/sbin/vfio-pci-override.sh file.

However, when I try to “Add New Virtual Hardware” to the guest Windows VM, 09:00.0 and 0a:00.0 are not showing up in PCI Host Device list, but 0b:00.0 and 0b:00.1 are.

Meanwhile, in the guest Windows VM’s Device Manager I see two devices without a proper driver installed. I’m not sure if they are related though.

The drivers for this device are not installed. (Code 28)
There are no compatible drivers for this device.

PCI Device (PCI bus 6, device 0, function 0)

Hardware Id: PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01
Device PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01\4&1743037d&0&0015 requires further installation.

PCI Simple Communications Controller (PCI bus 3, device 0, function 0)

Hardware Id: PCI\VEN_1AF4&DEV_1043&SUBSYS_11001AF4&REV_01
Device PCI\VEN_1AF4&DEV_1043&SUBSYS_11001AF4&REV_01\4&1ab0bb95&0&0012 requires further installation.
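A side note on those two hardware IDs: vendor 1AF4 is the Red Hat virtio vendor ID, and modern virtio PCI devices use device ID 0x1040 plus the virtio device type. So both entries look like paravirtual devices that came with the VM definition rather than anything Vega-related. A minimal sketch of decoding them, assuming that numbering; the function name `decode_virtio_hwid` and its short type table are made up for illustration:

```shell
# decode_virtio_hwid HWID
# Hypothetical helper: prints what a Windows hardware ID means if it is a
# modern virtio PCI device (vendor 1AF4, device ID = 0x1040 + virtio type).
decode_virtio_hwid() {
    hwid=$1
    # Pull out the four hex digits that follow VEN_ and DEV_.
    ven=$(printf '%s\n' "$hwid" | sed -n 's/.*VEN_\([0-9A-Fa-f]\{4\}\).*/\1/p')
    dev=$(printf '%s\n' "$hwid" | sed -n 's/.*DEV_\([0-9A-Fa-f]\{4\}\).*/\1/p')
    if [ -z "$ven" ] || [ -z "$dev" ]; then
        echo "unparseable hardware ID"
        return 1
    fi
    case $ven in
        1[Aa][Ff]4) ;;
        *) echo "not a virtio device (vendor $ven)"; return 0 ;;
    esac
    vtype=$(( 0x$dev - 0x1040 ))
    case $vtype in
        -*|0) echo "legacy/transitional or unknown virtio device ID $dev"; return 0 ;;
        1) name="network card" ;;
        2) name="block device" ;;
        3) name="console (virtio-serial)" ;;
        4) name="entropy source" ;;
        5) name="memory balloon" ;;
        *) name="not in this short table" ;;
    esac
    echo "virtio type $vtype: $name"
}

# Example call, with the ID quoted so the shell leaves \ and & alone:
# decode_virtio_hwid 'PCI\VEN_1AF4&DEV_1045&SUBSYS_11001AF4&REV_01'
```

If that reading is right, the two unknown devices are the virtio memory balloon and the virtio-serial controller, which the virtio-win driver package normally covers, and neither would have anything to do with the fan.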

I’ve run out of ideas as to why my GPU fan is not spinning (no rookie errors as far as I’m aware). I’m not sure if the symptoms I observe above are even related. I’d appreciate any help and pointers :slight_smile:

I have not dealt with Vega before. All the info I could find suggests that the bridges shouldn’t need to be part of the passthrough.

– Granted I may be wrong.

Try just passing through the video card only, at address 0b:00.0 and the audio component at 0b:00.1

Remove both the 09:00.0 and 0a:00.0 from the VM and the vfio-pci-override script you have. Any time I’ve seen bridge devices you do not need to pass those through.

Worth a shot?
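To tell bridges from endpoints without relying on the lspci wording, the class code in sysfs can be checked: PCI-to-PCI bridges report base class 06, subclass 04. A rough sketch, where `is_pci_bridge` and its optional sysfs-root argument are names invented for this example (the second argument only exists so it can be tried against a fake tree):

```shell
# is_pci_bridge ADDR [SYSFS_ROOT]
# Hypothetical helper. Returns 0 if ADDR is a PCI-to-PCI bridge, 1 if it is
# some other kind of device, 2 if the class file cannot be read.
is_pci_bridge() {
    addr=$1
    root=${2:-/sys}
    class_file="$root/bus/pci/devices/$addr/class"
    if [ ! -r "$class_file" ]; then
        echo "cannot read $class_file" >&2
        return 2
    fi
    # The file holds a value such as 0x060400 (bridge) or 0x030000 (VGA).
    read -r class < "$class_file"
    case $class in
        0x0604*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example: keep only the non-bridge devices from a candidate list.
# for d in 0000:09:00.0 0000:0a:00.0 0000:0b:00.0 0000:0b:00.1; do
#     is_pci_bridge "$d" || echo "$d"
# done
```

On the system described in this thread, that loop would be expected to leave only 0b:00.0 and 0b:00.1, going by the lspci output in the first post.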

Sounds like your gpu fan is dying, and is having issues starting.

Things like passthrough shouldn’t affect that at all; the card should automatically run the fans either all the time or after a temperature threshold is reached. This can then be modified by drivers within whatever OS has control of the card.

Hey I know there’s a lot of text there, but I did mention testing the card in another system. In fact, to double check, I just tested it again, the card worked fine in the other system. Fan spins normally. I can use Radeon Software to manually change the fan speed and it works.

I don’t think I made rookie mistakes like that. The GPU fan also spins fine when system is booting up. I don’t think the fan is failing.

@flyingdoggy Can you confirm the vfio-pci driver is in use for the Vega card after boot?

lspci -v -s 0b:00.0 | grep -E '(VGA|driver)'

If you followed the F33 guide see my comment on the initramfs creation.

After VM started:

$ lspci -v -s 0b:00.0 | grep -E '(VGA|driver)'
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] (rev c1) (prog-if 00 [VGA controller])
	Kernel driver in use: vfio-pci
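The same check can be made straight from sysfs, which is handy in a script that runs right after boot, before the VM is started. A small sketch; `pci_driver` and its optional sysfs-root argument are made-up names, and the second argument is only there so the function can be pointed at a fake tree:

```shell
# pci_driver ADDR [SYSFS_ROOT]
# Hypothetical helper: prints the kernel driver currently bound to a PCI
# device, or "none" if no driver is bound. SYSFS_ROOT defaults to /sys.
pci_driver() {
    addr=$1
    root=${2:-/sys}
    link="$root/bus/pci/devices/$addr/driver"
    if [ -L "$link" ]; then
        # The symlink points at .../bus/pci/drivers/<name>; keep <name>.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Example: pci_driver 0000:0b:00.0
```

If the early override took effect, this would be expected to print vfio-pci before the VM has ever been started; amdgpu at that point would mean the host driver got to the card first.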

Hmm, mine is like that even before the VM is started. The host doesn’t have control over the 2nd GPU directly. This is why, in my initramfs, I’ve instructed the kernel to bind that card (in my case an RTX 2070S) to vfio-pci instead of nvidia.

Yeah mine is amdgpu or something before VM starts/after VM stops. Let me see if making changes to initramfs works :confused:

To be honest I’m not sure if this will fix the issue with the fan. Also I am using dracut to create my initramfs instead of mkinitcpio - I’m not sure how to do the override with mkinitcpio. If you manage it let me know as I’m interested.

Hmm, if the host is grabbing the graphics card intended for the guest, it sounds like it’s fully initializing the guest GPU rather than blocking it on startup. It only seems to release it when you start the VM.

That would explain why it shows amdgpu. It sounds like the driver isn’t being blocked at all on startup.

If the host is loading the amdgpu driver first and then somehow switching to vfio, it could be “initializing” the card on the host, and then not properly on the guest.

I’d try forcing the host to block every driver but vfio for that card, so that vfio is loaded as early as possible and no driver binding/unbinding happens as the VM is starting. I could honestly see that driver switch causing the problem.

What does your vfio-pci-override file look like?

Standard issue from Wendell’s guide. 0b is my guest GPU, 08 is an SSD.

#!/bin/sh
PREREQS=""
DEVS="0000:0b:00.0 0000:0b:00.1 0000:08:00.0"

for DEV in $DEVS; do
        echo "vfio-pci" > /sys/bus/pci/devices/$DEV/driver_override
done

modprobe -i vfio-pci
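The loop above can also be written so that a mistyped or missing address is reported clearly instead of surfacing as a redirect error. This is only a sketch and not part of the guide: `set_vfio_override` and its leading sysfs-root argument are invented here, and the `modprobe -i vfio-pci` step is left to the caller.

```shell
# set_vfio_override SYSFS_ROOT DEV...
# Hypothetical variant of the override loop. Writes "vfio-pci" into each
# device's driver_override file and returns non-zero if any device was
# skipped because its driver_override file does not exist.
set_vfio_override() {
    root=$1
    shift
    rc=0
    for dev in "$@"; do
        target="$root/bus/pci/devices/$dev/driver_override"
        if [ -e "$target" ]; then
            echo "vfio-pci" > "$target"
        else
            echo "skipping $dev: $target not found" >&2
            rc=1
        fi
    done
    return $rc
}

# Example, mirroring the script above (must run as root, early in boot):
# set_vfio_override /sys 0000:0b:00.0 0000:0b:00.1 0000:08:00.0 && modprobe -i vfio-pci
```

The design choice is just to fail loudly: if an address changes after a BIOS update or a card is moved to another slot, the skipped device shows up in the boot log rather than silently staying on its host driver.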

I’ve followed @artafinde’s comment. Now my splash screen and login screen only display on one monitor, and lspci shows the vfio-pci driver for the guest GPU before the guest VM is started.

Fan’s still not spinning though! :joy:

I don’t think you need to passthrough your SSD - but out of scope for this post :smiley:

So you enabled early KMS for your host GPU :+1:

Starting to think this is more and more like a hardware failure :smiley:

But the fan works fine in another system (Windows 10 running on bare metal) :confused:

I can use Radeon Software to manually control the fan spin speed. It doesn’t make sense to me :joy: