ASUS Zenith X399 Threadripper with multiple Vega 64 passthrough mixed results

Hi Guys,

As the title says:

  • ASUS Zenith X399
  • TR 1950x
  • 3x Vega64’s

-Fedora Server f27
-4.16 Kernel patched with the tr.patch

To start, I'll say that when I first began this project I was able to get this working with ESXi, but performance dropped about 35%. The cards were also really finicky about passing through correctly: sometimes it would work, then after a reboot it wouldn't.

I then wiped that clean, installed Fedora 27 Server, and made an attempt at getting a headless KVM install going. Following the methods from the Ryzen guide, I couldn't get the VFIO driver to take hold of my GPUs. Following a few additional methods (I think this is a more legacy method for stubbing?) listed here:

…I was able to get the VFIO driver assigned and successfully pass through the cards. Performance is at or near 100%, which is amazing. Although I did have to use DDU and perform a clean install before I saw that. I had a lot of driver crashes, but I think this is entirely due to Windows installing a driver at boot.

The issue I'm having is that after adding the tr.patch and rebuilding the kernel from rawhide/master, I still can't seem to get the PCI reset to work correctly. When I reboot the VM, the cards never reset, and I have to do a full reboot of f27 server to get things right again. If I'm doing something wrong, please call me out on it. Would love to hear your opinions! I followed the guide verbatim for patching and installing the kernel. I'll admit I'm not very experienced with compiling and rebuilding the kernel. I tried a few different methods after fedpkg local was complete:

rpm -i kernel- kernel-core kernel-modules

rpm -i everything
(when I did this I received an error regarding the headers file, so I left the cross-headers file out)

dnf install --nogpgcheck kernel- kernel-core kernel-modules

3 builds and no luck with the reset for the Vega 64. Quite certain I’m doing something wrong XD
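The general shape of what I've been attempting, roughly (the x86_64/ directory and the version glob are illustrative, not my exact filenames):

```shell
# Install the kernel RPMs that fedpkg local produced in one dnf
# transaction so their inter-dependencies resolve together.
# The x86_64/ directory and version glob are illustrative.
sudo dnf install --nogpgcheck \
    ./x86_64/kernel-4.16*.rpm \
    ./x86_64/kernel-core-4.16*.rpm \
    ./x86_64/kernel-modules-4.16*.rpm
```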

Have a look at this thread.


I don’t have TR, so I haven’t been following this too closely, wish I could give more assistance.


That guide is what I followed when compiling the kernel. I’m thinking maybe I’m doing something wrong at the end when compiling (rpm -i?, dnf install?).

Side note: there were some dependencies needed after a fresh Fedora install before the fedpkg local build could take place. Bison, m4, a handful of others, but I think all that is pretty trivial, especially since all that's required is a dnf install. Self-explanatory, really.

Everything on that guide looks right. Is the kernel compiling? You should be using dnf install ./filename to install.

Yeah, that's normal. I think you can dnf builddep kernel and it will install the build dependencies (build-dep, with the hyphen, is the apt-get spelling).


Had a scroll through the guide you followed for passthrough. It’s quite outdated.

Essentially, you need to patch the kernel, enable IOMMU, configure your driver bind (this is tricky because AMDGPU doesn't support unbinding, so I don't know how to do it), and make sure you've got the proper OVMF firmware on your system.

Once that’s done, create the VM and you’re off to the races.

Things like the cgroup acl and qemu configuration are not really needed anymore, especially on Fedora since it’s properly configured by default.
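For later readers: on a current Fedora the bind step is usually done by letting vfio-pci claim the device IDs early, rather than unbinding amdgpu after the fact. A sketch, using the Vega 64 IDs from this thread (treat the exact file names as assumptions):

```shell
# /etc/modprobe.d/vfio.conf — vfio-pci claims the Vega 64 GPU and
# its HDMI audio function before amdgpu can bind them:
#   options vfio-pci ids=1002:687f,1002:aaf8
#   softdep amdgpu pre: vfio-pci

# Then pull the vfio modules into the initramfs and rebuild it:
echo 'add_drivers+=" vfio vfio_iommu_type1 vfio_pci "' \
    | sudo tee /etc/dracut.conf.d/vfio.conf
sudo dracut -f --kver "$(uname -r)"
```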


Very cool. Will give that a shot. Pretty sure I have rebuilds in my future :sweat_smile:

The kernel does compile, and I can boot into it from the menu as if everything is fine and dandy on restart. I do notice at the end of the build there's sometimes a broken pipe or something reported. Nothing else looks too erroneous that I can recall. I'll try a new build this evening and paste exactly what I'm trying along with anything that stands out after the build.

If you’ve got a log, feel free to pastebin it, but otherwise, I wouldn’t worry about it.

Can you upload your spec file?

I’m fairly convinced your problem is either the bus reset issue or AMDGPU not being happy about being unbound.

Sure thing. I’m at work for a few more hours but as soon as I get home I’ll get it uploaded.

I think you're exactly right. Initially, on my first go at this, the plan was to make the host headless. I only had the 3 GPUs in the host that I wanted to pass through, and no additional GPUs. I couldn't get a single GPU to pass through until I moved all the devices down a slot (which was a huge PITA since I'm running a custom hardline loop…more cutting and bending was in order :laughing: ) and installed a little NVidia GT 710 in the first slot. This was even after I verified that the VFIO driver was being assigned to the Vega card. Seems it just doesn't want to reset.
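For anyone wanting to check the same thing, the "Kernel driver in use" line from `lspci -nnk` is what to look for. A minimal parse (the sample output below is illustrative, not a capture from my box):

```shell
# Extract "Kernel driver in use" from lspci -nnk style output.
# On a live host you'd feed it: lspci -nnk -d 1002:687f
driver_in_use() {
  awk -F': ' '/Kernel driver in use/ {print $2}'
}

# Illustrative sample of what lspci -nnk prints for a stubbed Vega:
driver_in_use <<'EOF'
43:00.0 VGA compatible controller [0300]: AMD Vega 10 XT [1002:687f]
	Subsystem: ASUSTeK Computer Inc. Device [1043:0555]
	Kernel driver in use: vfio-pci
	Kernel modules: amdgpu
EOF
# prints: vfio-pci
```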

Another thing you could look for is kernel messages on VM shutdown, reboot and start. Sometimes you can get errors in there that help lead you to a solution.

Once that was done, did you blacklist AMDGPU? It’s seeming like you’re definitely running into AMDGPU problems. Is the 710 still in there? If so, I’d leave it in.

I’ll put together some more cohesive information this evening.

  • the kvm.conf file
  • the kernel.spec file
  • the grub file
  • Anything that console spits out on startup and shutdown of the VM as well as the host.

Let me know if you can think of anything else I can provide. Really appreciate your help! I’ve spent countless hours trying to figure this out XD

For this, I’m looking specifically for dmesg output during that time. Some output is normal (kvm messages about vm starting and stopping, and vfio messages about passthrough), but there can also be errors.
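A quick way to narrow the kernel log down to passthrough-relevant lines looks something like this (the here-doc stands in for live `dmesg` output; the sample lines are made up for illustration):

```shell
# Filter kernel messages down to IOMMU/VFIO/KVM events.
# On the live host you'd pipe real output instead:
#   dmesg -T | grep -E 'vfio|AMD-Vi|kvm'
filter_passthrough_msgs() {
  grep -E 'vfio|AMD-Vi|kvm'
}

# Illustrative sample lines, not real output:
filter_passthrough_msgs <<'EOF'
[   12.3] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
[   14.1] vfio-pci 0000:43:00.0: enabling device (0000 -> 0003)
[  120.5] usb 1-2: new high-speed USB device number 4
EOF
# prints only the AMD-Vi and vfio-pci lines
```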

It seems that TR is still not quite there, but we’re getting really close. Have you disabled ASPM in the bios? I remember something about that causing problems.

Got it, I’ll see what I can dig up.

I think I recall someone in a post somewhere mentioning disabling ASPM in the kernel somehow. I actually dug around yesterday evening looking to deactivate ASPM within the Zenith BIOS and couldn't locate anything that spelled it out. From memory, the bits I have enabled are:

Advanced\CPU Configuration\SVM Mode – enabled
Advanced\AMD PBS\Enumerate all IOMMU in IVRS – enabled
Advanced\AMD CBS\NBIO Common Options\NB Configuration\IOMMU – enabled
Advanced\AMD CBS\NBIO Common Options\ACS Enable – enabled
Advanced\AMD CBS\NBIO Common Options\PCIe ARI Support – enabled

I'm mistaken, it's a kernel parameter. You want pcie_aspm=off, according to:

https://wireless.wiki.kernel.org/en/users/documentation/aspm#force_enable_or_disable_aspm


Looks like it can be set in policy:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/power_management_guide/aspm

I’ll give this a go and report in when I can get my configs and such uploaded.

Did you add a custom #define to your kernel spec file, like trpatch, so it gets added to the final RPM package name? You can do that when you're adding tr.patch to the file that Fedora uses to patch its kernels… I suggest you do that to "be sure" your custom kernel is being built and applied properly.

When you run uname -a you'll see something like kernel-123123123123.trpatch.
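A quick sanity check after rebooting could look like this (a sketch; "trpatch" is just the example buildid from above):

```shell
# Report whether the running kernel carries the custom buildid tag.
check_buildid() {
  case "$1" in
    *"$2"*) echo "custom kernel active: $1" ;;
    *)      echo "stock kernel booted: $1" ;;
  esac
}

check_buildid "$(uname -r)" trpatch
```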

Ok, as promised here’s what I’ve gathered so far:

grub:

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap iommu=1 amd_iommu=on iommu=pt rd.driver.pre=vfio-pci video=vesafb:off,efifb:off pci-stub.ids=1002:687f,1002:aaf8 pcie_aspm=off"
GRUB_DISABLE_RECOVERY="true"

kvm.conf:

options vfio-pci ids=1002:687f,1002:aaf8

kernel.spec

https://pastebin.com/kEyZXxWL

Hi Wendell! I did, in fact. See below and in the pastebin.

...
# define buildid .local

%define buildid .trpatch02

# baserelease defines which build revision of this kernel version we're
...

...
# CVE-2018-5750 rhbz 1539706 1539708
Patch651: ACPI-sbshc-remove-raw-pointer-from-printk-message.patch

Patch652: tr.patch

# END OF PATCH DEFINITIONS
...

And of course just a quick check to make sure I’m not a dummy and in the wrong kernel…

# uname -r
4.16.0-0.rc0.git7.1.trpatch02.fc28.x86_64

Rebuilding grub now and rebooting the VM with the cards attached to see what the console spits out. Will have those results shortly!


Okay, so I added pcie_aspm=off and rebuilt grub. Restarted the host and waited a few (the VM attached to the GPUs is set to auto-start). Rebooted the VM, and it never came back. Here's the log and timestamps.

  • Boot up 6:45 - ~6:47

  • ~6:48 Issued a VM restart (from an RDP session)… unable to RDP back into the VM after waiting for the restart to complete. He's dead, Jim.

https://pastebin.com/MtmUxYvB

Thanks for having a look guys!

Hrrmmmmm. UEFI up to date??

Ohh. There is a weird IOMMU option on the Zenith mobo. Needs to be set to "both" or something. Unless it got cut, I mentioned it in the Zenith review. Weirdly, on that board IOMMU is by default only on for one of the two Ryzen packages.

Gotcha, I’m guessing that would be “Enumerate all IOMMU in IVRS” which I have enabled. The tooltip reads:

[Enable] Enables the IOMMU on both CPU dies to map device-visible virtual addresses.

I'm on the latest stable version, but I believe there is a branch of testing builds. I'll try flashing one of those and see if there's any change. I haven't seen anything in the comments for the board about IOMMU changes in those builds, but there's a good chance it's either not mentioned or I missed it.

Update: Nevermind, apparently UEFI version 0901 is the current latest beta :frowning:


You nailed it. That's frustrating. IOMMU groups look good on the GPU otherwise? Can you try it in one of the PCIe 3.0 x8 slots, just for giggles?