Pass through error when using PCIe Switch Port

mnovi · March 11, 2021, 7:24pm

Hello,

I am building a proof-of-concept server for GPGPU-related calculations. The idea is to pass through all GPUs to VMs for easier testing of code and better isolation between the program and the host (if anything goes wrong, in the best case only the VM has to be stopped, or at least the host does not hang and can be rebooted without pressing the physical reset button).

The base specs of the machine:

Motherboard: Supermicro X9DRi-LN4F+
CPU: 2x Intel Xeon E5-2670
RAM: 96GB DDR3
GPU: 6x AMD RX570 4GB (Sapphire Nitro+)
SSDs + ZFS + Kernel 5.4.101 (LTS) + VFIO modules + ACS patch

As the motherboard has only 6 PCIe slots, they are populated like this:

CPU0, slot 1: GPU
CPU0, slot 2: GPU
CPU0, slot 3: GPU
CPU1, slot 4: NVMe SSD
CPU1, slot 5: NVMe SSD
CPU1, slot 6: ASM1184e PCIe Switch Port (https://www.amazon.com/XT-XINTE-PCI-express-External-Adapter-Multiplier/dp/B07CWPWDF8)
→ port 1: GPU
→ port 2: GPU
→ port 3: GPU

IOMMU groups: IOMMU groups - Pastebin.com
GPU1 PCI details (same for GPU2-3 except the addresses): GPU1 PCIe Details - Pastebin.com
GPU6 PCI details (differ slightly from GPU1-3; same for GPU4-5 except the addresses): GPU6 PCIe Details - Pastebin.com
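For anyone who wants to reproduce the group listing, here is a minimal sketch that reads the groups from sysfs. It assumes the standard Linux layout under /sys/kernel/iommu_groups; the function name iommu_groups is only illustrative, and this is not necessarily how the pastes above were generated.

```python
from pathlib import Path


def iommu_groups(sysfs_root="/sys/kernel/iommu_groups"):
    """Return {group_number: [pci_address, ...]} as found under sysfs_root."""
    root = Path(sysfs_root)
    groups = {}
    if not root.is_dir():
        # No such directory: IOMMU is disabled or this is not a Linux host.
        return groups
    for group_dir in root.iterdir():
        if not group_dir.name.isdigit():
            continue
        devices_dir = group_dir / "devices"
        if devices_dir.is_dir():
            devices = sorted(entry.name for entry in devices_dir.iterdir())
        else:
            devices = []
        groups[int(group_dir.name)] = devices
    # Sort numerically so group 3 is listed before group 12.
    return dict(sorted(groups.items()))


if __name__ == "__main__":
    for group, devices in iommu_groups().items():
        for address in devices:
            print(f"IOMMU group {group}: {address}")
```

Devices that share a group number have to be assigned to a VM together, which is why the grouping of the GPUs behind the switch matters here.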

I have successfully established VFIO for the GPUs in slots 1-3 (CPU0) with …

/etc/modprobe.d/vfio-pci.conf:
options vfio-pci ids=1002:67df,1002:aaf0 disable_vga=1

/etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_acs_override=id:1b21:1184"

… and with GPU1-3 assigned to a VM (Windows or Linux), everything works perfectly.
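To confirm that the ids= option actually took effect, the bound driver of each matching device can be read from sysfs. This is a minimal sketch, assuming the standard /sys/bus/pci/devices layout (vendor and device files, plus a driver symlink when a driver is bound); the function name bound_drivers is illustrative.

```python
import os
from pathlib import Path


def _hex_id(path):
    """Read a sysfs ID file such as '0x1002' and return it as '1002'."""
    return f"{int(path.read_text().strip(), 16):04x}"


def bound_drivers(ids, sysfs_root="/sys/bus/pci/devices"):
    """Map PCI address -> bound driver name (or None) for matching vendor:device IDs."""
    wanted = {pair.lower() for pair in ids}
    root = Path(sysfs_root)
    found = {}
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            pair = f"{_hex_id(dev / 'vendor')}:{_hex_id(dev / 'device')}"
        except (OSError, ValueError):
            continue  # entry without readable ID files
        if pair not in wanted:
            continue
        link = dev / "driver"
        # The driver symlink points at the driver directory; its last
        # path component is the driver name, e.g. vfio-pci.
        found[dev.name] = os.path.basename(os.readlink(link)) if link.is_symlink() else None
    return found


if __name__ == "__main__":
    # IDs taken from the vfio-pci.conf line above.
    for address, driver in bound_drivers(["1002:67df", "1002:aaf0"]).items():
        print(f"{address}: {driver or 'no driver bound'}")
```

Every GPU and its audio function should report vfio-pci; anything showing amdgpu or snd_hda_intel was claimed by the host before vfio-pci could bind it.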

The problem appears when I also try to pass through GPU4-6 (CPU1), which sit behind the PCIe switch. It doesn’t matter if I pass through only one of those GPUs; the result is the same. When I start the VM, I see this repeated multiple times in dmesg:
DMAR: DRHD: handling fault status reg 40

This line also appears during machine boot, before the line: DMAR-IR: Enabled IRQ remapping in x2apic mode

However, some time into the VM startup I also begin to get kernel errors, which I pasted here: DMAR errors - Pastebin.com (at that point the host hangs and only keeps printing those errors via ssh/dmesg -wH; the errors vary slightly, but I only managed to capture what is in the paste)

I tried a lot of different configs, from changing intel_iommu options (igfx_off, sp_off, …) and allowing unsafe interrupts to changing VM args/settings, and I can’t figure out what is going wrong. I don’t know enough about the kernel and its internals to understand what the errors mean.

I found a partial solution that unfortunately works only for Linux guests, but I’d like to have Windows working as well. The trick is to remove pcie_acs_override and add pci=nommconf. The devices behind the PCIe switch then end up in a single IOMMU group, and when I assign them to a Linux VM there are no errors and everything is recognized. In Windows, on the other hand, I always get an exclamation mark in Device Manager saying that there are not enough resources for the device to work properly. I also first have to disable the pcie option in the VM settings, boot, shut down, and add the pcie option again, because otherwise the VM doesn’t boot at all.
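When switching between these two setups, it is easy to lose track of which parameters the running kernel was actually booted with. Below is a minimal sketch for checking that, assuming /proc/cmdline is readable; the helper names parse_cmdline and read_cmdline are illustrative. It splits on whitespace only, so quoted values containing spaces are not handled, and if a parameter is repeated the last occurrence wins.

```python
def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else True
    return params


def read_cmdline(path="/proc/cmdline"):
    """Return the running kernel's command line, or '' if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    active = parse_cmdline(read_cmdline())
    for name in ("intel_iommu", "iommu", "pcie_acs_override", "pci"):
        print(f"{name} = {active.get(name, '<not set>')}")
```

With the first setup, pcie_acs_override should be reported and pci should not; with the partial workaround it is the other way around.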

For the VM config I also tried another q35 machine type (3.1); kernel irqchip on, off, and split; removing pcie=1; and adding a romfile. For the PCIe switch there is no bifurcation (at least not in the form of splitting lanes in the BIOS); the chip on the device handles this (it is a cheap solution with a PCIe x1 interface and low bandwidth), and this is probably the source of the problems. I can also post “screenshots” of the BIOS options, but with my knowledge I can’t find anything special that could be changed (there are no options for ACS, IOMMU, …, or at least not under those names).

When I added pci=nommconf to the kernel command line, the IOMMU group for the PCIe switch started to contain all GPUs connected to it, which allowed the VMs to boot, but Windows has problems with that (the other 3 GPUs, which are not connected to the switch, stopped working). Here is the list of IOMMU groups with that option set: IOMMU Group 0 00:00.0 Host bridge [0600]: Intel Corporation Xeon E5/Core i7 DMI2 - Pastebin.com

If it’s of any help, here are also the IOMMU groups when no ACS patch is used for the PCIe switch: IOMMU Group 0 00:00.0 Host bridge [0600]: Intel Corporation Xeon E5/Core i7 DMI2 - Pastebin.com

I also tried to pass through the complete PCIe switch (as some people do with USB hubs), but I’m not sure if that is possible (ideally, as I understand it, I would have to load the pci-stub module before the kernel binds pcieport and resolves all devices under the switch, and then pass this device to the VM).

I have now spent almost a week trying to figure out what is causing the issues, and it seems I don’t have enough skill to find that out by myself. I would be very grateful for any help or tip to get this resolved, if that is even possible, or at least for confirmation that there is no way to make this work correctly. If it is possible to pass through the complete PCIe switch with all the GPUs on it, that is also fine (I already tried to do that, but the GPUs are resolved at boot time and I can’t unbind/remove them later on, because the pcieport module is already in use at boot).
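For reference, a minimal sketch of the generic sysfs sequence for moving a single PCI function to vfio-pci at runtime, using the kernel's driver_override, unbind, and drivers_probe files. It assumes root privileges and an already loaded vfio-pci module. The function name and the address in the usage note are hypothetical, and this only illustrates rebinding an endpoint such as a GPU; it is not a confirmed way to assign the switch itself, since to my knowledge vfio-pci only accepts endpoint devices and not bridges.

```python
from pathlib import Path


def rebind_to_vfio(address, sysfs_root="/sys/bus/pci", driver="vfio-pci"):
    """Detach one PCI function from its current driver and hand it to `driver`."""
    bus = Path(sysfs_root)
    device = bus / "devices" / address

    # 1. Tell the PCI core that only `driver` may claim this device.
    (device / "driver_override").write_text(driver)

    # 2. Release the device from whatever driver currently holds it, if any.
    unbind = device / "driver" / "unbind"
    if unbind.exists():
        unbind.write_text(address)

    # 3. Ask the PCI core to probe the device again; the override
    #    from step 1 makes the chosen driver pick it up.
    (bus / "drivers_probe").write_text(address)
```

A call would look like rebind_to_vfio("0000:83:00.0"), where the address is a placeholder for one of the GPUs behind the switch; the GPU's audio function needs the same treatment.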

Thank you very much!