-Fedora Server f27
-4.16 Kernel patched with the tr.patch
Initially I will say that when I first started this project I was able to get this working with ESXi, but performance took roughly a 35% hit. The cards were also really finicky about passing through correctly: sometimes it would work, then after a reboot it wouldn’t.
I then wiped that clean, installed Fedora 27 Server, and made an attempt at getting a headless KVM install going. Following the methods from the Ryzen guide I couldn’t get the VFIO driver to take hold of my GPU. Following a few additional methods (I think this is a more legacy method for stubbing?) listed here:
…I was able to get the VFIO driver assigned and successfully pass through the cards. Performance is at or near 100%, which is amazing, although I did have to use DDU and perform a clean driver install before I saw that. I had a lot of driver crashes, but I think that was entirely due to Windows installing a driver at boot.
The issue I’m having is that after adding tr.patch and rebuilding the kernel from rawhide/master, I still can’t seem to get the PCI reset to work correctly. When I reboot the VM, the cards never reset and I have to do a full reboot of the f27 host to get things right again. If I’m doing something wrong, please call me out on it. I’d love to hear your opinions! I followed the guide verbatim for patching and installing the kernel. I’ll admit I’m not very experienced with compiling and rebuilding the kernel. I tried a few different methods after fedpkg local was complete:
rpm -i kernel- kernel-core kernel-modules
rpm -i everything
(when I did this I received an error regarding the headers file, so I left the cross-headers file out)
That guide is what I followed when compiling the kernel. I’m thinking maybe I’m doing something wrong at the end when compiling (rpm -i?, dnf install?).
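For what it’s worth, here’s a minimal sketch of the install step that usually works with the output of fedpkg local (version strings below are illustrative; yours will differ):

```shell
# fedpkg local drops the built RPMs into an arch subdirectory
cd x86_64/

# Install kernel, kernel-core and kernel-modules in one transaction so
# dnf can resolve the dependencies between them (versions illustrative):
sudo dnf install ./kernel-4.16*.rpm ./kernel-core-4.16*.rpm ./kernel-modules-4.16*.rpm
```

Using dnf install on the local files (rather than rpm -i) lets the dependencies between the kernel subpackages resolve cleanly, which may be related to the headers error you hit with `rpm -i everything`.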
Side note: there were some dependencies needed after a fresh Fedora install before the fedpkg local build could take place (Bison, m4, and a handful of others), but I think all that is pretty trivial, since all that’s required is a dnf install. Self-explanatory, really.
Everything on that guide looks right. Is the kernel compiling? You should be using dnf install ./filename to install.
Yeah, that’s normal. I think you can run dnf builddep kernel and it will install the build dependencies (apt has a similar build-dep subcommand).
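A sketch of pulling in the kernel build dependencies this way (on Fedora, builddep is provided by dnf-plugins-core):

```shell
# The builddep subcommand ships in dnf-plugins-core
sudo dnf install dnf-plugins-core

# Install everything the kernel SRPM declares as a build requirement
# (bison, m4, and friends)
sudo dnf builddep kernel
```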
Had a scroll through the guide you followed for passthrough. It’s quite outdated.
Essentially, you need to patch kernel, enable IOMMU, configure your driver bind (this is tricky because AMDGPU doesn’t support unbinding, so I don’t know how to do it.) and make sure you’ve got the proper OVMF firmware on your system.
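Since amdgpu can’t be cleanly unbound at runtime, the usual workaround is to have vfio-pci claim the cards at boot by vendor:device ID. A sketch, where the IDs are placeholders you’d replace with your own from lspci -nn:

```shell
# Find the vendor:device IDs of each GPU and its HDMI audio function
lspci -nn | grep -iE 'vga|audio'

# Bind them to vfio-pci at boot (IDs below are placeholders).
# If vfio-pci is built as a module, you may also need a matching
# options line in /etc/modprobe.d plus a dracut -f rebuild.
sudo grubby --update-kernel=ALL --args="vfio-pci.ids=1002:687f,1002:aaf8"

# Reboot, then confirm vfio-pci owns the devices
lspci -nnk | grep -A3 -i vga
```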
Once that’s done, create the VM and you’re off to the races.
Things like the cgroup acl and qemu configuration are not really needed anymore, especially on Fedora since it’s properly configured by default.
Very cool. Will give that a shot. Pretty sure I have rebuilds in my future
The kernel does compile and I can boot into it from the menu as if everything is fine and dandy on restart. I do notice that at the end of the build there’s sometimes a broken pipe message or something similar. Nothing else looks too erroneous that I can recall. I’ll try a new build this evening and paste exactly what I’m trying and anything that stands out after the build.
Sure thing. I’m at work for a few more hours but as soon as I get home I’ll get it uploaded.
I think you’re exactly right. Initially on my first go at this, the plan was to make the host headless. I only had the 3 GPUs in the host that I wanted to pass through and no additional GPUs. I couldn’t get a single GPU to pass through until I moved all the devices down a slot (which was a huge PITA since I’m running a custom hardline loop…more cutting and bending was in order) and installed a little NVIDIA GT 710 in the first slot. This was even after I verified that the VFIO driver was being assigned to the Vega card. Seems it just doesn’t want to reset.
Another thing you could look for is kernel messages on VM shutdown, reboot and start. Sometimes you can get errors in there that help lead you to a solution.
Once that was done, did you blacklist AMDGPU? It’s seeming like you’re definitely running into AMDGPU problems. Is the 710 still in there? If so, I’d leave it in.
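A minimal sketch of blacklisting amdgpu on the host, assuming the GT 710 is driving the host console so nothing on the host still needs the AMD cards:

```shell
# Stop the host from ever loading amdgpu
echo "blacklist amdgpu" | sudo tee /etc/modprobe.d/blacklist-amdgpu.conf

# Rebuild the initramfs so the blacklist also applies in early boot
sudo dracut -f

sudo reboot

# After reboot, this should print nothing:
lsmod | grep amdgpu
```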
For this, I’m looking specifically for dmesg output during that time. Some output is normal (kvm messages about vm starting and stopping, and vfio messages about passthrough), but there can also be errors.
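Something like this, left running in a host shell while you stop and start the VM, will catch those messages as they happen:

```shell
# Follow kernel messages live with readable timestamps, keeping only
# the lines relevant to passthrough
sudo dmesg -wT | grep -Ei 'vfio|kvm|iommu|amd-vi|reset'
```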
It seems that TR is still not quite there, but we’re getting really close. Have you disabled ASPM in the bios? I remember something about that causing problems.
I think I recall someone in a post somewhere mentioning disabling ASPM in the kernel somehow. I actually dug around yesterday evening looking to deactivate ASPM within the Zenith BIOS and couldn’t locate anything that spelled it out. From memory, the bits I have enabled are:
Advanced\CPU Configuration\SVM Mode – enabled
Advanced\AMD PBS\Enumerate all IOMMU in IVRS – enabled
Advanced\AMD CBS\NBIO Common Options\NB Configuration\IOMMU – enabled
Advanced\AMD CBS\NBIO Common Options\ACS Enable – enabled
Advanced\AMD CBS\NBIO Common Options\PCIe ARI Support – enabled
Did you add a custom define to your kernel spec, like trpatch, so it gets appended to the final RPM package name? You can do that in the same file Fedora uses to patch its kernels, where you added tr.patch. I suggest you do that to “be sure” your custom kernel is being built and applied properly.
When you run uname -a you will see something like kernel-123123123123.trpatch.
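On Fedora that tag is the buildid macro near the top of kernel.spec; from memory (the exact line may differ between releases), the edit looks roughly like this:

```shell
# In the kernel dist-git checkout, kernel.spec ships a commented-out buildid:
#   # define buildid .local
# Uncomment and rename it (note the leading % once active):
sed -i 's/^# define buildid .local/%define buildid .trpatch/' kernel.spec

# Rebuild; the resulting RPMs and `uname -r` will carry the .trpatch tag
fedpkg local
```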
Okay, so I added “pcie_aspm=off” to the kernel command line and rebuilt GRUB. Restarted the host and waited a few minutes (the VM attached to the GPUs is set to auto-start). Rebooted the VM and it never came back. Here’s the log and timestamps.
Boot up 6:45 - ~6:47
~6:48 Issued a VM restart (from an RDP session)… unable to RDP back to the VM after waiting for the restart to complete. He’s dead, Jim.
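For anyone following along, the ASPM change above was roughly this (a sketch assuming the stock Fedora GRUB setup):

```shell
# Append pcie_aspm=off to GRUB_CMDLINE_LINUX in /etc/default/grub, then
# regenerate the config (path shown is for EFI installs; BIOS installs
# use /boot/grub2/grub.cfg):
sudo grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg

# After the reboot, confirm the option is live
grep -o 'pcie_aspm=off' /proc/cmdline
```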
Ohh. There is a weird IOMMU option on the Zenith mobo. It needs to be set to both or something. Unless it got cut, I mentioned it in the Zenith review. Weirdly, on that board IOMMU is only on by default for one of the two Ryzen packages.
Gotcha, I’m guessing that would be “Enumerate all IOMMU in IVRS” which I have enabled. The tooltip reads:
[Enable] Enables the IOMMU on both CPU dies to map device-visible virtual addresses.
I’m on the latest stable version, but I believe there is a branch of testing builds. I’ll try and flash one of those and see if there’s any change. I haven’t seen anything in the comments for the board about IOMMU changes in those builds, but there’s a decent chance it’s either not mentioned or I missed it.
Update: never mind, apparently UEFI version 0901 is the current latest beta.