My attempt with pci passthrough

So, I tried to set up PCI passthrough with QEMU and it led to data corruption. I think I need to understand this a little better before I try again. I managed to copy some binaries from another working system and reinstall all apt packages, so I think everything is good now. At least everything seems to be working.

System:
ASUS PRIME x570-P
AMD 5900X
RTX 2080 TI (main gpu, nvidia-driver-535)
RX 580 (passthrough, kernel driver in use: vfio-pci)

I have amd_iommu in my grub config kernel parameters:
GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on kvm.ignore_msrs=1 quiet splash"

Although now that I have read through dmesg, I noticed this line:
[ 0.037921] AMD-Vi: Unknown option - 'on'

The kernel documentation doesn’t list an “on” value for amd_iommu, but it is mentioned in many guides, so I don’t know if that has changed at some point:

        amd_iommu=      [HW,X86-64]
                        Pass parameters to the AMD IOMMU driver in the system.
                        Possible values are:
                        fullflush - Deprecated, equivalent to iommu.strict=1
                        off       - do not initialize any AMD IOMMU found in
                                    the system
                        force_isolation - Force device isolation for all
                                          devices. The IOMMU driver is not
                                          allowed anymore to lift isolation
                                          requirements as needed. This option
                                          does not override iommu=pt
                        force_enable - Force enable the IOMMU on platforms known
                                       to be buggy with IOMMU enabled. Use this
                                       option with care.
                        pgtbl_v1     - Use v1 page table for DMA-API (Default).
                        pgtbl_v2     - Use v2 page table for DMA-API.
                        irtcachedis  - Disable Interrupt Remapping Table (IRT) caching.

source: The kernel's command-line parameters — The Linux Kernel documentation
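
A quick way to catch this kind of mismatch is to compare the amd_iommu value on the command line against the values in the documentation excerpt above. This is only a rough sketch: the function name is made up for illustration, the list is copied from that excerpt (other kernel versions may accept different values), and each amd_iommu= value is treated as a single token.

```shell
# Values copied from the kernel documentation excerpt quoted above.
KNOWN_AMD_IOMMU_OPTS="fullflush off force_isolation force_enable pgtbl_v1 pgtbl_v2 irtcachedis"

# Print every amd_iommu= value on the given command line that is not in
# the documented list. Prints nothing if all values are documented.
# (Hypothetical helper name, not part of any existing tool.)
unknown_amd_iommu_opts() {
    printf '%s\n' "$1" | tr ' ' '\n' | while IFS= read -r word; do
        case $word in
            amd_iommu=*)
                value=${word#amd_iommu=}
                case " $KNOWN_AMD_IOMMU_OPTS " in
                    *" $value "*) ;;              # documented value
                    *) printf '%s\n' "$value" ;;  # not in the documented list
                esac
                ;;
        esac
    done
}
```

On a running system it could be called as unknown_amd_iommu_opts "$(cat /proc/cmdline)".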

I did also find these timeouts which I should probably solve before trying this again:

2023-12-06T17:56:30.375476+02:00 badwolf kernel: [49224.144571] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0000:05:00.0 address=0x10021e600]
2023-12-06T17:56:30.375480+02:00 badwolf kernel: [49224.144577] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0000:05:00.0 address=0x10021e620]
2023-12-06T17:56:30.515501+02:00 badwolf kernel: [49224.285096] AMD-Vi: Completion-Wait loop timed out
2023-12-06T17:56:30.641119+02:00 badwolf kernel: [49224.410727] AMD-Vi: Completion-Wait loop timed out
2023-12-06T17:56:30.766553+02:00 badwolf kernel: [49224.536148] AMD-Vi: Completion-Wait loop timed out
2023-12-06T17:56:30.892141+02:00 badwolf kernel: [49224.661733] AMD-Vi: Completion-Wait loop timed out
2023-12-06T17:56:31.017560+02:00 badwolf kernel: [49224.787154] AMD-Vi: Completion-Wait loop timed out
2023-12-06T17:56:31.143093+02:00 badwolf kernel: [49224.912687] AMD-Vi: Completion-Wait loop timed out
2023-12-06T17:56:31.268516+02:00 badwolf kernel: [49225.038118] AMD-Vi: Completion-Wait loop timed out
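
To see which device these events point at, and whether a settings change makes them go away, the log can be summarized per device. A minimal sketch, assuming the events keep the "IOTLB_INV_TIMEOUT device=..." form shown above; the function name is made up for illustration.

```shell
# Read kernel log text on stdin and print one line per PCI device that
# logged IOTLB_INV_TIMEOUT events, in the form "<address> <count>".
# (Hypothetical helper name; relies on grep -o, which GNU, BSD and
# busybox grep all provide.)
iotlb_timeouts_by_device() {
    grep -o 'IOTLB_INV_TIMEOUT device=[0-9a-fA-F:.]*' \
        | sed 's/.*device=//' \
        | sort | uniq -c \
        | awk '{print $2, $1}'
}
```

For example: journalctl -k | iotlb_timeouts_by_device, or pipe in the syslog file the lines above came from.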

IOMMU group 23 shouldn’t contain anything other than my RX 580, so that shouldn’t be the problem:

IOMMU Group 23:
        05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev e7)
        05:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] [1002:aaf0]
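
For reference, a listing like this can be produced straight from sysfs. The following is a simplified sketch that prints only the PCI addresses in each group (no lspci descriptions). It assumes the usual /sys/kernel/iommu_groups layout; the function name and the optional directory argument are made up for illustration.

```shell
# List each IOMMU group and the PCI addresses it contains.
# The optional argument is the directory to scan; it defaults to the
# standard sysfs location. (Hypothetical helper name.)
list_iommu_groups() {
    root=${1:-/sys/kernel/iommu_groups}
    for group in "$root"/*; do
        [ -d "$group/devices" ] || continue
        printf 'IOMMU Group %s:\n' "${group##*/}"
        for dev in "$group"/devices/*; do
            [ -e "$dev" ] || continue   # skip the unexpanded glob of an empty group
            printf '\t%s\n' "${dev##*/}"
        done
    done
}
```

Groups come out in glob order (so 10 sorts before 2), which is fine for checking that the passthrough GPU shares its group with nothing but its own audio function.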

PCI passthrough kind of works (except when it corrupts my SSD). Also, when I start QEMU my YouTube video freezes for a moment; it usually recovers, but not always. Sometimes my keyboard freaks out and stops working, and it just blinks the numpad light when I press keys. So there is some interference going on and things are not isolated properly, or something like that.

I have tried different settings but this is the script I’m using now:

qemu-system-x86_64 \
-display default,show-cursor=on \
-cpu host,-hypervisor,kvm=off,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff \
-smp $(nproc) \
-vga qxl \
-m 8G \
-machine q35,accel=kvm \
-drive if=pflash,format=raw,unit=0,readonly=on,file=/home/henrixd/winvm/OVMF_CODE_4M.secboot.fd \
-drive if=pflash,format=raw,unit=1,file=/home/henrixd/winvm/OVMF_VARS_4M.ms.fd \
-drive file=ssd.img,format=raw,media=disk \
-device vfio-pci,host=05:00.0,x-vga=on \
-device vfio-pci,host=05:00.1 \
-device usb-host,hostbus=bus,hostport=port \
-device usb-host,hostbus=1,hostport=3

Any hints? Am I doing something obviously stupid here that I’m too dumb to understand?

I noticed that, too. I assume that this option existed for older kernel versions.

I am adding iommu=pt to the kernel command line, confident that I am following official documentation. But as I review AMD’s EPYC tuning guides, I don’t see a clear recommendation for this setting. In fact, it’s only mentioned in the Kafka tuning guide.

I think that on my system (ASUS WS 570X ACE) Linux doesn’t enable iommu without this parameter, but I don’t have the time to validate that right now.

I want to try that too. I have it written down, I just haven’t got that far yet. I did try amd_iommu=pgtbl_v2. It seemed to remove those timeout errors but didn’t make things any better otherwise. When I launched QEMU, it froze my whole system again. I’m kind of scared to try again because I don’t want to corrupt my SSD any further.

My IOMMU seems to be enabled with all of these parameters. The groups look fine, and the vfio-pci driver gets bound to the GPU. It just acts weird. I guess I just have to do some reading and try to learn this more deeply to understand what’s going on.

It might be obvious, but recheck that you are passing the correct PCIe device to your VM - in my case the groups changed because I removed a PCIe USB card, and the VM snatched the host’s NVMe SSD instead - that was fun.
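
One way to guard against that is to check the vendor:device ID at the PCI address before launching the VM, since addresses can shift when cards are added or removed. A minimal sketch, assuming the usual sysfs layout under /sys/bus/pci/devices; the function name and the optional directory argument are made up for illustration, and 1002:67df is the RX 580 ID from the lspci output earlier in the thread.

```shell
# Succeed only if the device at the given PCI address has the expected
# vendor:device ID. (Hypothetical helper name.)
#   $1 = full PCI address, e.g. 0000:05:00.0
#   $2 = expected ID in lspci form, e.g. 1002:67df
#   $3 = optional directory to look in (defaults to the sysfs location)
pci_id_matches() {
    addr=$1
    expected=$2
    root=${3:-/sys/bus/pci/devices}
    [ -r "$root/$addr/vendor" ] && [ -r "$root/$addr/device" ] || return 1
    vendor=$(cat "$root/$addr/vendor")   # sysfs stores this as e.g. 0x1002
    device=$(cat "$root/$addr/device")   # and this as e.g. 0x67df
    actual="${vendor#0x}:${device#0x}"
    [ "$actual" = "$expected" ]
}
```

Placed at the top of the launch script, something like pci_id_matches 0000:05:00.0 1002:67df || exit 1 would stop QEMU from starting if a different device has ended up at that address.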