The state of AMD RX 7000 Series VFIO Passthrough (April 2024)

Over the last few months I have had time to dig into the 7000 series GPUs and work out what works, what does not, and what is really going on, both at a technical level and at a corporate level at AMD.

Note: This is not a guide. The idea of this article is to correct some misunderstandings and hopefully answer some of the questions people have about why these GPUs seem to behave so erratically in VFIO use cases.

The Reset Issue

Yes, it’s back again: the RX 7000 series GPUs do not reset. So what? What are the implications?

  1. If the GPU has crashed due to a fault, a bug, or whatever, it can’t be brought back into a good state reliably. Sometimes it is still recoverable, but not always. This is not such a huge deal for us home gamers: crashes of this nature are somewhat rare, and we all know we should shut down the guest OS when stopping the VM.
  2. If the GPU has been used in Windows or Linux, the drivers upload firmware to the GPU to operate on. The firmware images used are different and incompatible: the firmware for Linux does not work with the Windows drivers, and vice versa. Without the ability to reset the GPU there is no way to unload the firmware so that the GPU will accept a different one.
  3. Once the GPU has been posted once, by either your motherboard BIOS or the VM’s BIOS, allowing it to POST again will corrupt the GPU’s state, usually requiring a cold boot of the host system to recover it.
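To see what the host kernel itself advertises for the card, recent kernels (5.15 or later, as far as I know) expose a `reset_method` attribute per device in sysfs. The following is a minimal sketch, not a fix: the PCI address is a placeholder, and `SYSFS_PCI` is a made-up override used only so the function can be pointed at a test directory. A listed method only means the kernel is willing to try it, not that the GPU comes back in a good state.

```shell
#!/bin/sh
# Print the reset methods the kernel advertises for a PCI device.
# SYSFS_PCI is a hypothetical override for testing, not a kernel convention.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

pci_reset_methods() {
    # $1: PCI address of the GPU, e.g. 0000:03:00.0 (placeholder)
    methods=$(cat "$SYSFS_PCI/$1/reset_method" 2>/dev/null)
    if [ -n "$methods" ]; then
        echo "$methods"
    else
        # Attribute missing or empty: the kernel offers no reset for this device.
        echo "none"
    fi
}
```

Calling `pci_reset_methods 0000:03:00.0` with your own address (from `lspci -D`) prints something like `bus` or `none`, depending on the device and kernel.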

What is being done about this

Until recently, not much, unfortunately. However, I can say that things seem to have started to move.

“The” post I made on Reddit to draw attention to this issue (again) this time fell on the ears of some VERY large organisations, some of which have quite substantial pull with AMD (for one of them, AMD is the client, not the other way around). These clients raised this as an issue with AMD internally, and it has had the attention of management all the way up the tree to Lisa Su herself (so I am told, anyway).

While there is no commitment or comment from AMD on fixing any specific GPU families yet (if it’s even possible for already released generations), it seems they now are acutely aware of the importance of addressing this, not just for VFIO usage, but compute and ML applications.

Limitations when using this GPU for VFIO passthrough

If your guest is Windows, do not allow the host’s amdgpu kernel module to bind to the GPU ever. This means, no dynamic bind/unbind of the device to use outside of a Windows VM.

If your guest is Linux, in theory it should be fine to bind/unbind on the host, but I have not tested this. There might be some issue with firmware version mismatches, but this would require testing to confirm.
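A quick way to confirm which host driver currently holds the card is to look at the `driver` symlink in sysfs. This is a minimal sketch under assumptions: the PCI address is a placeholder and `SYSFS_PCI` is a hypothetical override for testing. It only reports the current binding, so it cannot tell you whether amdgpu touched the GPU earlier in the same boot.

```shell
#!/bin/sh
# Report which kernel driver is currently bound to a PCI device.
# SYSFS_PCI is a hypothetical override for testing, not a kernel convention.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

pci_bound_driver() {
    # $1: PCI address of the GPU, e.g. 0000:03:00.0 (placeholder)
    link="$SYSFS_PCI/$1/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "unbound"
    fi
}
```

For a Windows guest you would want this to print `vfio-pci` (or `unbound`) from early boot onwards, never `amdgpu`.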

Allow your BIOS to post the GPU with your system, but prevent the guest VM from attempting to post it again. The easiest way to do this is to provide a VGA ROM to the device that is invalid.

But wait I hear you say, “I set my VGA ROM to the one I dumped from GPU-Z or downloaded from TechPowerUp and it works fine!”.

Well yes, but no.

If your firmware is a full firmware dump (> 1MB) then you have the firmware for the GPU itself, but unlike other GPUs the ROM BIOS part is not at the start of the firmware. The GPU at power on copies out (or maps) only the VGA ROM portion to the ROM BAR.

By setting the ROM BAR to this firmware image you have effectively provided a corrupt VGA ROM which the guest BIOS will ignore and not even attempt to use, solving the problem, by mistake.

I have found the most reliable way to prevent the GPU from faulting due to the double POST issue is to provide a VGA ROM that is just all zeros, for instance:

dd if=/dev/zero of=dummy.rom bs=1M count=1

If your ROM BIOS is several hundred KB, you likely extracted the actual ROM part, which would normally be correct and what you would want to use, but if you do… you will be letting the card POST twice, corrupting its state.
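If you are not sure which kind of file you have, one rough check is the header: a PCI option ROM image starts with the signature bytes 0x55 0xAA. The sketch below reports the file size and whether that signature is present. The function names are made up for illustration, and the assumption that the guest firmware skips an image without the signature is based on the behaviour described above rather than on anything verified here.

```shell
#!/bin/sh
# Inspect a ROM file: does it start with the PCI option ROM signature 55 AA?

rom_signature() {
    # First two bytes of the file as lowercase hex, e.g. "55aa" or "0000".
    od -An -tx1 -N 2 "$1" | tr -d ' \n'
}

rom_describe() {
    # $1: path to the ROM file (your dump, or the all-zero dummy)
    sig=$(rom_signature "$1")
    size=$(wc -c < "$1" | tr -d ' ')
    if [ "$sig" = "55aa" ]; then
        # Looks like a real option ROM: the guest may try to run it.
        echo "option-rom size=$size"
    else
        # No signature: a full firmware dump or an all-zero dummy lands here.
        echo "no-signature size=$size"
    fi
}
```

Running `rom_describe dummy.rom` on the zero-filled file from the `dd` command above should report `no-signature`, while a correctly extracted VGA ROM should report `option-rom`, which is the case to avoid on these cards.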

If at any point the GPU’s state becomes corrupted/invalid, you MUST cold boot your PC to put the GPU back into a clean state. I have seen the GPU load into Windows and report no errors but be completely broken, with no output, if the cold boot was omitted after it was put into a bad state.

There is no output on my monitor, but Windows reports no driver errors

This could be because of a GPU crash/fault, etc… however it’s more likely because the GPU drivers (Adrenalin) know that you are running in a VM.

Please note that this is NOT an attempt to cripple these devices for VFIO usage by AMD.

The cause is that the GPU drivers do not distinguish between a Virtual Function (SR-IOV) GPU and a normal GPU. Instead, they detect the VM and assume that the GPU is a virtual GPU anyway. This prevents the drivers from initialising the physical GPU outputs, and instead the AMD drivers present a fake monitor to Windows.

This could be used by Looking Glass users as it (in theory) negates the need to attach a display/dummy plug to the virtual machine.

For those who still want to actually have a monitor attached to the GPU, you need to prevent the drivers from detecting the VM environment by using the exact same procedure that was used to avoid the Code 43 issue with NVIDIA GPUs.

<hyperv>
  <vendor_id state="on" value="0123456789ab"/>
</hyperv>
<kvm>
  <hidden state="on"/>
</kvm>

I have raised a bug report internally with AMD and suggested that a flag or registry option be provided that we can use to set which mode we want the GPU to operate in, with the default being to enable the physical outputs.

Crashing when rendering in the guest

Unlike NVIDIA, the AMD drivers detect and enable SAM (Smart Access Memory, aka Resizable BAR) support based on what the guest BIOS reports via ACPI. If your guest GPU has already had its BAR resized by your BIOS at boot (i.e., Above 4G enabled, and ReBAR enabled), the guest needs to know this and program the GPU accordingly.
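To check whether the host firmware actually resized the BAR, you can read the sizes back from the device's sysfs `resource` file, which has one `start end flags` line per region. This is a minimal sketch: the function name is made up, and it makes no claim about which BAR index holds the VRAM aperture on a given card. Only the first six lines (the standard BARs) are reported.

```shell
#!/bin/sh
# Print the size of each populated standard BAR (0-5) from a sysfs resource file.

pci_bar_sizes_kib() {
    # $1: path to a resource file, e.g.
    #     /sys/bus/pci/devices/0000:03:00.0/resource (address is a placeholder)
    idx=0
    while read -r start end flags && [ "$idx" -lt 6 ]; do
        # Unused BARs are listed as all zeros; skip them.
        if [ "$end" != "0x0000000000000000" ]; then
            echo "BAR$idx $(( ($end - $start + 1) / 1024 )) KiB"
        fi
        idx=$((idx + 1))
    done < "$1"
}
```

A BAR in the region of the card's full VRAM size (for example 16777216 KiB for 16 GiB) suggests ReBAR is active on the host, whereas a 262144 KiB (256 MiB) aperture suggests it is not.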

The solution is fairly straightforward: the guest BIOS needs to have Above 4G decoding enabled too. Unfortunately the TianoCore BIOS does not expose this as an option, but you can set this value on the QEMU command line.

<qemu:commandline>
  <qemu:arg value="-fw_cfg"/>
  <qemu:arg value="opt/ovmf/X-PciMmio64Mb,string=65536"/>
</qemu:commandline>
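Going by the option name, the value is a size in megabytes, so 65536 works out to a 64 GiB 64-bit MMIO window. If you want a different size, the arithmetic is just GiB × 1024. This is a small sketch with made-up helper names; which size a given card actually needs is not something tested here.

```shell
#!/bin/sh
# Build the fw_cfg argument for a given 64-bit MMIO window size.

mmio64_mb() {
    # $1: desired window size in GiB; prints the value in MB
    echo $(( $1 * 1024 ))
}

fw_cfg_arg() {
    # $1: desired window size in GiB; prints the string for <qemu:arg value="..."/>
    echo "opt/ovmf/X-PciMmio64Mb,string=$(mmio64_mb "$1")"
}
```

`fw_cfg_arg 64` reproduces the value used above. Note that libvirt only accepts `<qemu:commandline>` when the `<domain>` element declares the QEMU namespace, `xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'`.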

You will know this worked if you see the following option in Adrenalin and can enable it.

[Screenshot of the option in Adrenalin]

Summary

Hopefully this clears some things up. Everything posted here is based on first-hand experience with these GPUs. Please note that even after doing all of this, the GPUs still may not work correctly for VFIO, and this is the limit of what I know about these GPUs at present.


Why not just make a patched vBIOS and implement PCIe resetting correctly?

AMD could literally solve everything just by giving people a fixed blob to dd into their vBIOS.
If AMD is too lazy or “can’t afford to”, then why don’t they just release the source and let the community do their job for them? There are people who would gladly fix AMD’s trash code, if only AMD released it.

Oh awesome, just tried out the dummy VGA ROM trick, that’s a big QoL improvement for my setup!

Before this I would 1) boot up Linux, 2) boot the VM once, but there’d be no video, 3) force shutoff the VM, 4) suspend the host to RAM and wake up again, then 5) boot the VM, now with video.

If I wanted to shut down the VM and boot it up again, I’d have to suspend to RAM again in between.

This ugly hack also made Windows update a real pain, as Windows’ own reboot procedure wouldn’t just work, at least not with video.

With the dummy VGA ROM I can boot, shutdown, boot again, Windows reboot, it all works :smile:

Thanks for sharing your findings!


I very much appreciate the writeup. I was not looking forward to trying to extract the most salient and up to date conclusions on the matter from deep within various threads.

“By setting the ROM BAR to this firmware image you have effectively provided a corrupt VGA ROM which the guest BIOS will ignore and not even attempt to use, solving the problem, by mistake.”

:joy: That is a perfect example of the cargo cult solutions we end up trying to rely on for trying to fix extremely specialized and obfuscated tech problems.
