The state of AMD RX 7000 Series VFIO Passthrough (April 2024)

Over the last few months I have had time to dive in and figure out exactly what is going on with the 7000 series GPUs: what works, what does not, and what is really going on here, both at a technical level and at a corporate level inside AMD.

Note: This is not a guide, the idea of this article is to correct some misunderstood issues/problems and hopefully answer some of the questions people have on why these GPUs seem to behave so erratically in VFIO use cases.

The Reset Issue

Yes, it’s back again: the RX 7000 series GPUs do not reset. So what? What are the implications?

  1. If the GPU has crashed due to a fault, a bug, or anything else, it cannot reliably be brought back into a good state. Sometimes it is still recoverable, but not always. This is not such a huge deal for us home gamers; crashes of this nature are somewhat rare, and we all know we should shut down the guest OS when stopping the VM.

  2. If the GPU has been used in Windows or Linux, the drivers upload firmware to the GPU to operate on. The firmware images used are different and incompatible; the firmware for Linux does not work with the Windows drivers, and vice versa. Without the ability to reset the GPU there is no way to unload/reset it to accept a different firmware.

  3. Once the GPU has been posted by either your motherboard BIOS or the VM’s BIOS, allowing it to POST again will corrupt the GPU’s state, usually requiring a cold boot of the host system to recover it.
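As a quick sanity check, you can ask the kernel which reset mechanisms (if any) it believes it has for the card. This is only a sketch: it assumes a kernel new enough to expose the reset_method sysfs file (5.15+), and the 0000:0b:00.0 address is a placeholder for your own GPU’s address from lspci.

```shell
# Report which reset methods the kernel has for a PCI device.
# Assumptions: kernel >= 5.15 (reset_method sysfs file); the default
# address below is a placeholder -- substitute your GPU's address.
reset_methods() {
    f="/sys/bus/pci/devices/$1/reset_method"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "none (no reset_method entry for $1)"
    fi
}

reset_methods "${1:-0000:0b:00.0}"
```

An empty or missing entry here lines up with the behaviour described above: the kernel has no way to put the GPU back into a clean state.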

What is being done about this

Until recently, not much unfortunately; however, I can say that things seem to have started to move.

“The” post I made on Reddit to draw attention to this issue (again) this time fell on the ears of some VERY large organisations, some of which have quite substantial pull with AMD (for one of them, AMD is their client, not the other way around). These clients raised this as an issue with AMD internally, and it has had the attention of management all the way up the tree to Lisa Su herself (so I am told, anyway).

While there is no commitment or comment from AMD on fixing any specific GPU families yet (if it’s even possible for already released generations), it seems they are now acutely aware of the importance of addressing this, not just for VFIO usage but also for compute and ML applications.

Limitations when using this GPU for VFIO passthrough

If your guest is Windows, do not allow the host’s amdgpu kernel module to bind to the GPU, ever. This means no dynamic bind/unbind of the device to use it outside of a Windows VM.

If your guest is Linux, in theory it should be fine to bind/unbind on the host, but I have not tested this. There might be some issue with firmware version mismatches, but this would require testing to confirm.
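One common way to guarantee amdgpu never touches the card is to bind it to vfio-pci at boot via modprobe configuration. A sketch, assuming a 7900 XTX whose vendor:device IDs are 1002:744c (GPU) and 1002:ab30 (its audio function); get your own IDs from lspci -nn before copying this.

```
# /etc/modprobe.d/vfio.conf
# IDs below are examples for a 7900 XTX -- substitute your own from lspci -nn
options vfio-pci ids=1002:744c,1002:ab30
# Make sure vfio-pci loads (and claims the device) before amdgpu does
softdep amdgpu pre: vfio-pci
```

Rebuild your initramfs after changing this so the ordering applies during early boot.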

Allow your BIOS to post the GPU with your system, but prevent the guest VM from attempting to post it again. The easiest way to do this is to provide a VGA ROM to the device that is invalid.

But wait I hear you say, “I set my VGA ROM to the one I dumped from GPU-Z or downloaded from TechPowerUp and it works fine!”.

Well yes, but no.

If your firmware is a full firmware dump (> 1MB) then you have the firmware for the GPU itself, but unlike other GPUs, the ROM BIOS part is not at the start of the firmware. At power-on, the GPU copies out (or maps) only the VGA ROM portion to the ROM BAR.

By setting the ROM BAR to this firmware image you have effectively provided a corrupt VGA ROM which the guest BIOS will ignore and not even attempt to use, solving the problem, by mistake.

If your ROM BIOS is several hundred KB, you likely extracted the actual ROM part, which would normally be correct and what you would want to use; but if you do, you will be letting the card POST twice, corrupting its state.
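You can check which kind of image you have: a PCI expansion ROM must begin with the 0x55 0xAA signature at offset 0, whereas a full firmware dump will not, which is exactly why it behaves as an invalid VGA ROM. A small sketch; the filename is just an example.

```shell
# Check whether a ROM dump begins with the PCI expansion ROM signature
# (0x55 0xAA at offset 0). The default filename is an example only.
has_vga_rom_header() {
    sig=$(dd if="$1" bs=1 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

if has_vga_rom_header "${1:-vbios.rom}"; then
    echo "valid expansion ROM signature (extracted VGA ROM)"
else
    echo "no 55AA signature at offset 0 (full dump, or wrong file)"
fi
```

If the signature is present, the image is the kind that will let the guest POST the card a second time.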

One solution is to disable the ROM BAR entirely, which works for many; however, in my case I have found this is not as reliable as providing a VGA ROM that is all zeros, for instance:

dd if=/dev/zero of=dummy.rom bs=1M count=1

While this seems to be more reliable for me, I have no direct proof that it’s better than just disabling the ROM BAR itself.
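For reference, in a libvirt domain both variants hang off the hostdev’s rom element; a sketch where the PCI address and ROM path are placeholders for your own:

```xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
  </source>
  <!-- point the ROM BAR at the all-zero dummy image: -->
  <rom file='/var/lib/libvirt/dummy.rom'/>
  <!-- or, to disable the ROM BAR instead: <rom bar='off'/> -->
</hostdev>
```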

If at any point the GPU’s state becomes corrupted/invalid, you MUST cold boot your PC to put the GPU back into a clean state. I have seen the GPU load into Windows and report no errors, yet be completely broken with no output, when the cold boot was omitted after it was put into a bad state.

There is no output on my monitor, but Windows reports no driver errors

This could be because of a GPU crash/fault, etc., however it’s more likely because the GPU drivers (Adrenalin) know that you are running in a VM.

Please note that this is NOT an attempt to cripple these devices for VFIO usage by AMD.

The cause is that the GPU drivers do not distinguish between a Virtual Function (SR-IOV) GPU and a normal GPU; instead they detect the VM and assume the GPU is a virtual GPU anyway. This prevents the drivers from initialising the physical GPU outputs, and instead the AMD drivers present a fake monitor to Windows.

This could be used by Looking Glass users as it (in theory) negates the need to attach a display/dummy plug to the virtual machine.

For those that still want to actually have a monitor attached to the GPU, you need to prevent the drivers from detecting the VM environment by using the exact same procedure that was used to avoid the Code 43 issue with NVIDIA GPUs.

<hyperv>
  <vendor_id state="on" value="0123456789ab"/>
</hyperv>
<kvm>
  <hidden state="on"/>
</kvm>

I have raised a bug report internally with AMD and suggested that a flag or registry option be provided so we can set which mode we want the GPU to operate in, defaulting to enabling the physical outputs.

Crashing when rendering in the guest.

This only applies if the “Smart Access Memory” option in Adrenalin is greyed out and you have Resizable BAR enabled on the host, or you enlarged the BAR using the resource2_resize sysfs interface.
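For completeness, the resource2_resize interface mentioned above is a per-device sysfs file. This is only a sketch under some assumptions: the address is a placeholder, the value written is log2 of the desired BAR size in MB (so 8 = 256 MB, 15 = 32 GB), and the device must have no driver bound while resizing.

```shell
# Resize BAR2 of a PCI device via sysfs.
# Assumptions: placeholder address, the written value is log2 of the
# size in MB, and no driver is bound to the device during the resize.
resize_bar2() {
    f="/sys/bus/pci/devices/$1/resource2_resize"
    if [ -w "$f" ]; then
        echo "$2" > "$f"
    else
        echo "no writable resource2_resize for $1 (wrong address, or BAR not resizable)"
    fi
}

resize_bar2 "${1:-0000:0b:00.0}" 15
```

Reading the same file first shows a bitmask of the sizes the device actually supports, which is worth checking before writing.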

Adrenalin seems to detect if SAM can be used or not based on external factors, one of which seems to be the MMIO address space available/reported. We are not entirely sure as to the cause here but have found that enlarging the MMIO address space seems to resolve this and allow the option to be configured in Adrenalin.

The solution is fairly straightforward: the guest BIOS needs to have Above 4G decoding enabled too. Unfortunately the TianoCore BIOS does not expose this as an option, but you can set the value on the QEMU command line (the value is the 64-bit MMIO window size in MB; 65536 = 64 GiB).

<qemu:commandline>
  <qemu:arg value="-fw_cfg"/>
  <qemu:arg value="opt/ovmf/X-PciMmio64Mb,string=65536"/>
</qemu:commandline>

You will know this worked if you see the following option in Adrenalin and you can enable it.

[screenshot: the “Smart Access Memory” option in Adrenalin, no longer greyed out]

Summary

Hopefully this clears some things up. Everything posted here is based on first-hand experience with these GPUs. Please note that even after doing all of this, the GPUs may still not work correctly for VFIO, and this is the limit of what I know about these GPUs at present.


Reserved…

2024-05-14

  • Re-worded “Crashing when rendering in the guest” based on feedback.
  • Suggest a ROM of all zeros as an alternative to just disabling the ROM BAR.

Why not just make a patched VBIOS and implement PCIe resetting correctly?

AMD could literally solve everything just by giving people a fixed blob to dd into their VBIOS.
If AMD is too lazy or “can’t afford to”, then why don’t they just release the source and let the community do their job for them? There are people who would gladly fix AMD’s trash code, if only AMD released it.

Oh awesome, just tried out the dummy VGA ROM trick, that’s a big QoL improvement for my setup!

Before this I would 1) boot up Linux, 2) boot the VM once, but there’d be no video, 3) force shutoff the VM, 4) suspend the host to RAM and wake it up again, then 5) boot the VM, now with video.

If I wanted to shut down the VM and boot it up again, I’d have to suspend to RAM again in between.

This ugly hack also made Windows update a real pain, as Windows’ own reboot procedure wouldn’t just work, at least not with video.

With the dummy VGA ROM I can boot, shutdown, boot again, Windows reboot, it all works :smile:

Thanks for sharing your findings!


I very much appreciate the writeup. I was not looking forward to trying to extract the most salient and up to date conclusions on the matter from deep within various threads.

By setting the ROM BAR to this firmware image you have effectively provided a corrupt VGA ROM which the guest BIOS will ignore and not even attempt to use, solving the problem, by mistake.

:joy: That is a perfect example of the cargo cult solutions we end up trying to rely on for trying to fix extremely specialized and obfuscated tech problems.


Any update on this? Did AMD make a registry entry so I can stop it from assigning my real displays as vDisplays?


Just leaving this here in case anyone else had the same issue I did: for some reason, the only way I was able to get my Linux guest to correctly load the amdgpu driver with a display connected to the GPU (a dummy HDMI) was to enable “fast boot” on the host. No HDMI, no problem, but the second I’d connect an HDMI or try to boot with one connected, amdgpu would either vomit into the kernel log or fatally fail to init, and I’d have to do a cold boot as described by @gnif.

I’ve been doing VFIO passthrough for 10y at this point and I’ve never heard of this before (and I did a quick web search and didn’t notice anybody talking about this). Furthermore, on a much older platform (Intel X99) but with a Windows guest, I had a flawless experience with no ROM BAR shenanigans. (Though, I may have had “fast boot” on there too… I forget.) I’m going to see if I even need the ROM BAR to have stable resets here…

(My system has an AMD X870 chipset running an AMD Zen 5 CPU, and my GPU is a 7900XT.)

Hi, I’m testing a strange use-case here and it is working (almost) flawlessly.

I’m using a Thunderbolt attached 7600XT on my Asus Zenbook 14 with Core Ultra 9 185H.

Tested now with Windows 10 and 11; what I did was blacklist the amdgpu module, then use the hyperv and kvm config OP provided. I can successfully shut down and start other VMs, or the same VM, if and only if:

1- ROM BAR is disabled
2- After each shutdown, rmmod vfio-pci (this turned out to be unnecessary with ROM BAR disabled and amdgpu blacklisted)
3- Voila!

It’s working mostly flawlessly. Now, the vDisplay introduced by AMD can be a nice feature, but I’ve found that it suffers at high refresh rates when used with Looking Glass. I found a dummy HDMI works best here; tested 2880x1800 @ 100Hz, 1440x900 @ 120Hz, and 2560x1440 @ 120Hz, all fine.

I’m actually in the market for a dummy DisplayPort 2.0 plug but haven’t found one yet, to allow 2880x1800 @ 120Hz

Overall a great solution for notebook users, and in the worst case you can unplug the TB cable, power off the GPU, and then try again (no reboot required, AFAICT).

EDIT: Specific device I’m using in case anyone is in the same boat https://www.amazon.com/dp/B0DX6PJTMZ?ref=ppx_yo2ov_dt_b_fed_asin_title