VFIO/Passthrough in 2023 - Call to Arms

AMD Ryzen 9, 7950X3D, X670E

$ dmesg | grep AMD-Vi
[    0.109558] AMD-Vi: Using global IVHD EFR:0x246577efa2254afa, EFR2:0x0
[    1.615958] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    1.617231] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    1.617233] AMD-Vi: Extended features (0x246577efa2254afa, 0x0): PPR NX GT [5] IA GA PC GA_vAPIC
[    1.617237] AMD-Vi: Interrupt remapping enabled
[    1.630358] AMD-Vi: Virtual APIC enabled
[    1.723341] AMD-Vi: AMD IOMMUv2 loaded and initialized

Attempting to contribute. Ryzen 7 1800X, RX 6700 XT, X470 chipset

[ 0.272748] AMD-Vi: Using global IVHD EFR:0xf77ef22294ada, EFR2:0x0
[ 0.495102] iommu: Default domain type: Translated
[ 0.495102] iommu: DMA domain TLB invalidation policy: lazy mode
[ 0.514058] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[ 0.518918] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 0.518925] AMD-Vi: Extended features (0xf77ef22294ada, 0x0): PPR NX GT IA GA PC GA_vAPIC
[ 0.518931] AMD-Vi: Interrupt remapping enabled
[ 0.518938] AMD-Vi: Virtual APIC enabled
[ 0.519154] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[ 0.537222] AMD-Vi: AMD IOMMUv2 loaded and initialized


Hello. These are my findings.

Setup:

  • Guest GPU: Gigabyte 7900 XTX Gaming OC (no USB-C port on the card, so only two devices are present: the video one and the audio one)
  • Host GPU: none
  • CPU: 5950X (one CCD for host, one for guest)
  • MoBo: Gigabyte X570 Aorus Master
  • VM storage device: dedicated passed-through NVMe SSD
  • Resizable BAR: On
  • Boot GPU: Any (1st, 2nd or 3rd PCIe slot; does not matter)
  • Passed-through IOMMU Groups:
    • GPU video device group
    • GPU audio device group
    • Motherboard USB controller group
    • Motherboard sound device group
    • Dedicated NVMe SSD for VM group
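
The group layout above can be enumerated with the usual sysfs loop; a minimal sketch (assumes sysfs is mounted, prints nothing if the IOMMU is disabled):

```shell
# List every IOMMU group and the PCI addresses of its devices.
# Pipe each address through `lspci -nns` if you want full device names.
shopt -s nullglob
for d in /sys/kernel/iommu_groups/*/devices/*; do
  g=${d%/devices/*}   # strip the trailing "/devices/<bdf>"
  echo "IOMMU Group ${g##*/}: ${d##*/}"
done
```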

VM config: originally created for passthrough of a GTX 1080, which worked without issues. A GT 1030 was also used as the host GPU.

Radeon Migration:

  1. The GT1030 has been removed. My initial tests used no host GPU, with a second machine providing SSH access to the host.
  2. Initially I only swapped the IDs of the GPU and its sound device in the bootloader entry file and the VM’s XML. This resulted in the American Megatrends screen and a couple of systemd lines being printed during the host’s boot, after which nothing. Starting the VM cleared the previous output and put the monitor to sleep, presumably because of the reset bug.
  3. To overcome this I unplugged the display cable from the GPU, rebooted the host, started the guest again, and then plugged the display cable back in. Every time this brought me to a Windows recovery menu, which in all exit cases resulted in a reboot, triggering the reset bug again.
  4. After using a virtual GPU to roll back Windows to a working state and setting Resizable BAR to Off, the display-cable unplug/replug trick brought me to the Windows desktop, where I extracted the GPU’s BIOS using GPU-Z.
  5. Adding the VBIOS to the VM’s XML fixed my reset issue. I can now freely reboot the VM without problems, however…
  6. Single GPU passthrough also works without me having to set it up. I boot to the Linux desktop without isolating the GPU, then start the VM from virt-manager. My monitor loses signal; then, without showing the TianoCore boot screen, the VM login screen simply appears. After shutting down the VM I get dropped back to the tty, where the login prompt awaits, and I can start X again without rebooting the host.
  7. Installation of the drivers inside the VM went without issues. The guest is a bit stuttery, but it’s an old install I experimented on, which has been repaired and rolled back countless times. 3D performance seems about right, in the sense that an eyeballed CS:GO average went from 350 FPS (native) to 300 FPS (VM), which can be explained by the CPU being virtualized. I did not spend much time testing, nor did I try a fresh install of Windows.
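
For anyone replicating step 5: in libvirt this is a `<rom file=.../>` element on the GPU’s hostdev entry. A sketch with a placeholder PCI address and ROM path (not my actual values):

```xml
<!-- Passed-through GPU with the dumped VBIOS attached.
     The address and file path below are examples only. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x0c' slot='0x00' function='0x0'/>
  </source>
  <rom file='/var/lib/libvirt/vbios/7900xtx.rom'/>
</hostdev>
```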

My to-do list consists of proper performance and stability testing with a fresh install and trying to enable Resizable BAR.
Hope this helps.


Here is EPYC Milan.

[    0.056958] AMD-Vi: Using global IVHD EFR:0x59f77efa2094ade, EFR2:0x0
[    0.341276] pci 0000:c0:00.2: AMD-Vi: IOMMU performance counters supported
[    0.341995] pci 0000:80:00.2: AMD-Vi: IOMMU performance counters supported
[    0.342535] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    0.343202] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    0.344037] pci 0000:c0:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.344039] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    0.344044] pci 0000:80:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.344045] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    0.344048] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.344049] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    0.344052] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.344053] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    0.344056] AMD-Vi: Interrupt remapping enabled
[    0.344057] AMD-Vi: X2APIC enabled
[    0.540806] AMD-Vi: AMD IOMMUv2 loaded and initialized

Be sure you have set IOMMU to Enabled NOT Auto or you will get the following result:

AMD-Vi: AMD IOMMUv2 functionality not available on this system - This is not a bug.

Correct, this is the AMD reset issue. The host has already initialised the GPU hardware, and the guest BIOS will attempt to do it a second time; since the GPU can’t be reset back to its pre-boot state, it’s like trying to start your car when it’s already running… bad things happen.

Sorry, but again, this is far too late: the GPU’s BIOS executed at system POST regardless of whether a cable was attached or not.

The easiest way I have found to prevent this is to ensure your AMD GPU has only an EFI BIOS (pretty much all do) and boot the system in compatibility mode (PC-BIOS, non-UEFI, or CSM mode), which prevents the GPU’s BIOS from being executed, as legacy firmware can’t run EFI BIOS images.

If you’re lucky, some BIOSes will actually allow you to disable POSTing a device entirely based on slot; usually you only see that on server/workstation motherboards, though.

Note though that once you do this, the guest can only be run once: if the guest crashes or is shut down, you will likely need to cold-boot your system to get your VM working again.

Finally, you could try the vendor-reset project if you have not already, which attempts to reset the GPU using AMD’s internal reset mechanisms extracted from the amdgpu driver. It’s not 100%, but for many people it is enough.
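
If you try vendor-reset, typical usage (as I understand the project’s README; the PCI address here is a placeholder) looks roughly like:

```shell
# Load the out-of-tree vendor-reset module (installed separately, e.g. via DKMS),
# then tell the kernel to use the device-specific reset for the GPU.
GPU=0000:0c:00.0   # placeholder: substitute your GPU's PCI address
modprobe vendor-reset 2>/dev/null || echo "vendor-reset module not available"
# Kernels that expose reset_method let you pick the handler per device (run as root):
if [ -w "/sys/bus/pci/devices/$GPU/reset_method" ]; then
  echo device_specific > "/sys/bus/pci/devices/$GPU/reset_method"
fi
```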

This won’t help: QEMU has no support for it and intentionally hides the feature flag from the guest to prevent major problems. You should not apply any patch to undo this; instead, wait for proper Resizable BAR support to be added to QEMU.


Thanks to all three gents for the replies. I believe I’ve seen sufficient dmesg outputs to validate my thought. A quick summary:

  • Zen 1 does support “IOMMU AVIC”. Congrats to Zen 1/Zen+ owners.

  • Zen 4 seems to support “IOMMU AVIC” as well. Lucky you.

  • Zen 3 seems not to support “IOMMU AVIC”, unfortunately.

So, a conspiracy theory from me: I guess something ‘bad’ happened in the development cycles of Zen 2 and Zen 3, which seem to have been worked on in parallel, and “IOMMU AVIC” was disabled in the final shipping products.

Here is a quick trick for determining support: if you see “Virtual APIC enabled”, then “IOMMU AVIC” is supported by your hardware.

For the hardcore: if you see both the “GA” and “GA_vAPIC” flags, then “IOMMU AVIC” is supported by your hardware. If either of the two is missing, it’s not supported.
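
The two-flag check can be scripted against the “Extended features” dmesg line; a small sketch (the sample line is copied from the 7950X3D post above):

```shell
# Decide IOMMU AVIC support from the "Extended features" dmesg line:
# both GA and GA_vAPIC must be present as separate flags.
# grep -w treats "_" as a word character, so "GA" will not falsely
# match inside "GA_vAPIC".
line='AMD-Vi: Extended features (0x246577efa2254afa, 0x0): PPR NX GT [5] IA GA PC GA_vAPIC'
if echo "$line" | grep -qw GA && echo "$line" | grep -qw GA_vAPIC; then
  verdict="IOMMU AVIC supported"
else
  verdict="IOMMU AVIC not supported"
fi
echo "$verdict"
```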

Is this important for performance? I suppose I don’t have it enabled or on auto as I don’t get either line like that in dmesg.

Which line are you missing? Perhaps paste your “dmesg | grep AMD-Vi” output to clarify your question.

IOMMUv2 supports PCIe devices that have “PCI PRI and PASID interface” functionality. I haven’t looked into what exactly that means, but if your devices don’t support it, then you aren’t missing anything.

IOMMUv2 is a kernel build option, and most distributions seem to have it enabled by default. So if your hardware supports it, it’ll be used.
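
If you want to check your own kernel, the build option in the kernel tree is named CONFIG_AMD_IOMMU_V2; a quick sketch:

```shell
# Look for the AMD IOMMUv2 driver option in the running kernel's configuration.
# Distros expose the config via /proc/config.gz or /boot/config-<release>.
result=$(zgrep AMD_IOMMU_V2 /proc/config.gz 2>/dev/null \
  || grep AMD_IOMMU_V2 "/boot/config-$(uname -r)" 2>/dev/null \
  || echo "kernel config not exposed on this system")
echo "$result"
```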

All Zen processors support IOMMUv2. You don’t see the IOMMUv2 line in my paste above because I had disabled it in my custom kernel earlier this week while investigating this AVIC thing. I’ll turn it back on in my next build.

IOMMU has to be set to “Enabled” in the UEFI setup; otherwise, you can’t do VFIO passthrough.

Hard to imagine passthrough performance being any better on my 5950X but I guess there’s always something. Perhaps it would help with nested virtualization, which has been a pain point for me.

Also, don’t forget your Zen 3/5950X still benefits from “SVM AVIC”. You should enable AVIC in your QEMU config etc. I believe it still benefits Linux and Windows guests quite a bit.

Edit:
Here is a brief guide [0].

You should ignore the “cpu_pm=on” and “preempt=voluntary” bits, since they’re irrelevant to AVIC enablement.
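
For reference, the host-side knob for SVM AVIC is a kvm_amd module parameter; a minimal modprobe.d sketch (interactions with nested SVM and Hyper-V SynIC have varied across kernel versions, so check current docs):

```
# /etc/modprobe.d/kvm-avic.conf -- sketch: enable SVM AVIC
options kvm_amd avic=1
# verify after reloading kvm_amd:  cat /sys/module/kvm_amd/parameters/avic
```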

[0] Reddit - Dive into anything

Does x2apic just work for you, or does it require a kernel parameter or anything beyond setting x2apic in the BIOS?

If I set the APIC mode to “x2apic” on my system, the system hangs shortly after GRUB at “Early Hooks Plymouth”.

# dmesg | grep -i AMD-Vi
[    0.135714] AMD-Vi: Using global IVHD EFR:0x246577efa2254afa, EFR2:0x0
[    0.390234] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    0.390633] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.390633] AMD-Vi: Extended features (0x246577efa2254afa, 0x0): PPR NX GT [5] IA GA PC GA_vAPIC
[    0.390636] AMD-Vi: Interrupt remapping enabled
[    0.667348] AMD-Vi: Virtual APIC enabled
[    0.756574] AMD-Vi: AMD IOMMUv2 loaded and initialized
# zgrep CONFIG_X86_X2APIC /proc/config.gz
CONFIG_X86_X2APIC=y

# journalctl -b | grep -i "irq rem"
Aug 25 01:52:29 manja-02 kernel: x2apic: IRQ remapping doesn't support X2APIC mode

GRUB_CMDLINE_LINUX_DEFAULT="iommu=pt amdgpu.sg_display=0 pcie_aspm=off mitigations=off udev.log_priority=3"

It looks like X2APIC is not only for servers with more than 256 threads; it also seems to be a requirement for ROCm.

Known Impact

If AMD ROCm is installed, the system may report failures or errors when running workloads such as *bandwidth test*, *clinfo*, and *HelloWorld.cl*. Note, it may also result in a system crash.

* IO PAGE FAULT
* IRQ remapping doesn’t support X2APIC mode
* NMI error

In a correct ROCm installation, the system must not encounter the errors mentioned above.

Add “amdgpu.sg_display=0” to your GRUB config and it works just fine.

I use the Zen 4 iGPU for the host and replaced my 6800 XT with a 7900 XTX. I didn’t have to change anything, except that the 7900 XTX needs a VBIOS copy via libvirt.

Yes, I’ve been running with the avic and stimer Hyper-V enlightenments enabled for quite some time. Have you tried Maxim’s patch for kvm_amd.force_avic? (EDIT: actually, it looks like it’s been mainlined.)

ETA: Hmm, no change here. So that option must force the SVM AVIC, which isn’t necessary on my 5950X. I wonder if we could force the IOMMU AVIC as well…


Then you’re all set w.r.t. AVIC. With all-core passthrough, I bet you could benchmark a difference in GB6 or CR23.

I think all Zen processors can do “SVM AVIC”. My Zen 2 does it too (which I wasn’t aware of until I figured out this mess lately).

What’s missing on Zen 2 and Zen 3 is “IOMMU AVIC”. It cannot be forced on because the hardware feature is missing (as per recent kernel code).

Yes, I understand, but it might be possible to bypass the hardware check and enable it regardless. Assuming the feature is implemented in hardware and not fused off or disabled in firmware.

I tried it already during my investigation. The kernel will hang on startup.

That’s a fairly easy problem to solve, but it could be an expensive solution depending on how you have to do it. First, do you want 10 NVMe or SATA SSDs?

If SATA, you just need to get a tower case that was designed to be useful, not one that is solely made to show off tons of RGB LEDs through glass panels. YES, there are still some cases like that being made; you just have to shop around a bit more to find them. You might not be able to find any with 10 drive bays, but there are 5.25" bay adapters that allow you to put 4, 6, or 8 SATA SSDs into one 5.25" bay. I have a 4-in-1 in my system with 4 SSDs. I was lucky that I got it on a REALLY good Newegg sale many years ago. The regular prices of these units are expensive, but if you are really going to use it then it will be worth it. Then of course you have the problem of having enough SATA connections. Most mobos top out at 6 these days, with many having 4 or even 2! So you will need to get one or more PCIe cards that give you the number of SATA ports you need. My mobo has 6 SATA, so I only needed to buy one PCIe SATA card with 4 ports.

If you want to have 10 NVMe SSDs, that’s definitely trickier. I’ll assume your mobo supports two. Ideally you would want a mobo that has 2 x16 slots free. Of course very few mobos have 3 x16 slots, since you need the first one for your gfx card; if this is going to be just a media server and not a gaming PC, then you just freed up a x16 slot. There are PCIe cards that support up to 4 NVMe SSDs on a card that fit into those slots. I presume they will also function in x8 and x4 slots, but you would not get the full speed of the drives if you were using more than one at a time in a x4 slot, or two in an x8 slot. And if you buy Gen 4 SSDs and put them into Gen 3 slots, you will not get the drives’ rated speed no matter how few drives are being used at a time. And these cards are not cheap either. So there you are.
It’s 100% doable, you just have to pick which of the two paths you want to take and figure out if you can afford to do it. Good luck!

[    0.923983] AMD-Vi: Using global IVHD EFR:0x59f77efa2094ade, EFR2:0x0
[    1.311782] pci 0000:c0:00.2: AMD-Vi: IOMMU performance counters supported
[    1.321926] pci 0000:80:00.2: AMD-Vi: IOMMU performance counters supported
[    1.333162] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    1.349477] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    1.363481] pci 0000:c0:00.2: AMD-Vi: Found IOMMU cap 0x40
[    1.363489] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    1.363495] pci 0000:80:00.2: AMD-Vi: Found IOMMU cap 0x40
[    1.363497] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    1.363501] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[    1.363502] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    1.363506] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    1.363507] AMD-Vi: Extended features (0x59f77efa2094ade, 0x0): PPR X2APIC NX GT IA GA PC
[    1.363510] AMD-Vi: Interrupt remapping enabled
[    1.363511] AMD-Vi: X2APIC enabled

So there is no “AMD-Vi: AMD IOMMUv2 loaded and initialized” or “AMD-Vi: AMD IOMMUv2 functionality not available on this system - This is not a bug.” line in dmesg.

That’s what I thought you meant. So read my previous response to you regarding this line.

I thought you also meant this. I didn’t touch on it in my previous response because I think it’s an unnecessary digression for this topic.

If you want a fun read about this line, take a look at this Reddit post [0]. Spoiler alert: the poster got this line on an Intel system…

There might be a bug in earlier kernel versions, which has perhaps been rectified already.

[0] Reddit - Dive into anything


Yeah, sure. I’m not worried about it, just wondering what the “normal” is. Next reboot I’ll try to remember to go into the BIOS, set it to Enabled instead of Auto, and see if anything changes.

My Windows passthrough VM already has 20+ days of uptime, working flawlessly, so on the other hand: why change something that works… :smiley:
