Threadripper & PCIe Bus Errors

This fixed the problem for me too on my X370 Taichi.

I'm very dismayed that this has barely had any attention from AMD or tech websites.

This has pretty much been swept under the rug and ignored as a non-issue.

However, it's not a non-issue, as it appears to be a hardware issue of some sort or, worse, an issue with the CPUs themselves.

But who knows, because no one can seem to get an answer as to why it occurs or when we will see a fix.

Just simply frustrating.

Kinda? The ASPM disable option has been 100% stable for me since always. I suspect it’s actually an ASMedia/chipset issue…

Some errors are resolved on Threadripper as of AGESA 1005.

Not sure when this was published, but the errata sheet for the Ryzen processors has been released.

Erratum 1080, "PCIe Link Exit to L0 in Gen1 Mode May Incorrectly Trigger NAKs" (on page 54), and also erratum 1083 on page 56, sound like what may be causing the issue.

Unfortunately, no fix is planned.

Good thing it's not serious… I still don't like seeing it, especially when trying to troubleshoot something.

This is the errata PDF for those curious.

New developments are coming ahead of the Threadripper 2 launch. BIOSes with Threadripper 2 compatibility are being pushed out, and they are said to include out-of-the-box fixes for these issues.

https://www.reddit.com/r/Amd/comments/7gp1z7/threadripper_kvm_gpu_passthru_testers_needed/ (Scroll down to Update 8)

Most board vendors are now pushing out official (non-BETA) BIOS updates with AGESA “ThreadRipperPI-SP3r2 1.1.0.0” including the proper fix for this issue. After updating you no longer need to use any of the temporary fixes from this thread. The BIOS updates come as part of the preparations for supporting the Threadripper 2 CPUs, which are due to be released a few weeks from now.

I gave this a try on my new build and it resulted in a lot of issues. Tons of systemd-udevd timeouts. Posted about it in another thread:

I also still get the PCIe errors from the start of this thread, even on a slightly older BIOS. Still not entirely sure what the fix is.

Add pcie_aspm=off to your Linux kernel line in GRUB?
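For anyone unsure what "adding it to the kernel line" looks like in practice, here is a minimal Python sketch. It assumes your distro keeps the persistent kernel arguments in a `GRUB_CMDLINE_LINUX="..."` line in /etc/default/grub (some distros use `GRUB_CMDLINE_LINUX_DEFAULT` instead, which you can pass as `key`). The helper name `add_kernel_param` is made up for illustration. It only transforms text and writes nothing to disk, so you still have to save the result and regenerate the GRUB config with your distro's usual tool.

```python
import re


def add_kernel_param(grub_text: str, param: str = "pcie_aspm=off",
                     key: str = "GRUB_CMDLINE_LINUX") -> str:
    """Return grub_text with param appended to the quoted value of key.

    A line that already contains the parameter is left unchanged.
    Hypothetical helper: it edits a string, not /etc/default/grub itself.
    """
    pattern = re.compile(r'^(' + re.escape(key) + r'=")([^"]*)(")\s*$')
    out = []
    for line in grub_text.splitlines():
        m = pattern.match(line)
        if m and param not in m.group(2).split():
            value = (m.group(2) + " " + param).strip()
            line = m.group(1) + value + m.group(3)
        out.append(line)
    result = "\n".join(out)
    # Preserve a trailing newline if the input had one.
    if grub_text.endswith("\n"):
        result += "\n"
    return result


# Made-up sample content, not taken from a real system.
sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX="rhgb quiet"\n'
print(add_kernel_param(sample), end="")
```

For a one-off test you can also press `e` at the GRUB menu and append pcie_aspm=off to the line starting with `linux` before booting.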

I can confirm that AGESA 1.1.0.0 fixes the PCI FLR reset bug on my system.

The boot line seems to do the trick; I must have messed it up before.

Has anyone seen the Gigabyte board fail to restart with boot code “0d” on the readout? Restarting seems to be hit or miss. The newer BIOS didn’t seem to exhibit the behavior, but then I was plagued by the systemd-udevd issues.

So, it took a little while, but I’m now happy to report here too that the PCIe errors have disappeared on the Gigabyte Aorus X399 after updating the BIOS to F10 (the one preparing for second-generation Threadripper :))

Yay!

The PCI errors are not what this BIOS fixes; it corrects the bus reset problem. The PCI errors are a known erratum and are harmless.

See https://support.amd.com/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf

Simply disable ASPM to avoid the spurious error reports when running in Gen3 (pcie_aspm=off)
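If you are not sure the parameter actually made it onto the running kernel's command line (easy to get wrong when editing GRUB by hand), a quick check is to read /proc/cmdline. A minimal sketch, assuming a Linux system; the function names are made up for illustration:

```python
from pathlib import Path


def aspm_disabled(cmdline: str) -> bool:
    """True if the kernel command line contains the token pcie_aspm=off."""
    return "pcie_aspm=off" in cmdline.split()


def running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with (Linux only)."""
    return Path(path).read_text().strip()


if __name__ == "__main__":
    try:
        print("pcie_aspm=off active:", aspm_disabled(running_cmdline()))
    except OSError:
        print("could not read /proc/cmdline (not a Linux system?)")
```

This only confirms that the kernel was booted with the parameter; it does not by itself prove the error messages are gone, so still check your kernel log.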

Okay, I jumped back to the F10 BIOS update for Gigabyte. There were no options to disable PSP or the PSP mailbox in the BIOS, at least on the Designare motherboard.

I was able to compile a custom kernel with CONFIG_CRYPTO_DEV_SP_PSP=n set. So with that config option set and pcie_aspm=off, things are looking much, much better on my build.
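To double-check that a kernel build really has that option off, you can inspect its .config. A minimal Python sketch, assuming you have the config text at hand (many distros install it as /boot/config-<kernel release>, but that location is an assumption about your system). Note that a generated .config normally records a disabled option as a `# ... is not set` comment rather than `=n`, so the hypothetical helper below handles both forms.

```python
import re


def config_state(config_text: str,
                 option: str = "CONFIG_CRYPTO_DEV_SP_PSP") -> str:
    """Return the option's value ('y', 'm', ...), 'n' if disabled, else 'absent'."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options usually appear as a comment in a generated .config.
        if line == f"# {option} is not set":
            return "n"
        m = re.fullmatch(re.escape(option) + r"=(.*)", line)
        if m:
            return m.group(1)
    return "absent"


# Made-up sample fragment for illustration, not a real system's config.
sample = "CONFIG_CRYPTO_DEV_CCP=y\n# CONFIG_CRYPTO_DEV_SP_PSP is not set\n"
print(config_state(sample))
```

To check a real system, read the config file into a string and pass it in; "absent" can simply mean the kernel version or architecture does not offer the option at all.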

What does PSP have to do with anything? ASPM is not PSP.

Sorry for the confusion, it's a fix for the systemd-udevd timeouts/crashes I was seeing on the latest BIOS. I posted a link to that issue a few replies up.

I’m not very familiar with Linux (or anything boot/EFI/GRUB related), but I haven’t been able to get the Fedora live CD to boot, much less anything else. I’ve tried pci=nomsi, pci=noaer, and pcie_aspm=off. All of them get rid of the PCI errors (pci=nomsi,noaer brings up ATA errors, while pcie_aspm=off doesn’t), but none of them gets the system to finish booting. With pcie_aspm=off, I get as far as it saying “reached basic target system” but then it just hangs.

Anybody have any suggestions on what I should do? Linux Mint Live CD seems to boot without issue. Edit: Seems Linux Mint throws all those pci errors too, but it still boots up pretty quick.

I have a 1950X on a Gigabyte Aorus 7 motherboard running BIOS F10.

BIOS F3j or F3g may be required for now. Fedora 27 would also work.

AMD has shot themselves in the foot accidentally. The new AGESA is awesome, but together with some patches they submitted to the Linux kernel, it is not bootable.

Can you roll back to an older BIOS temporarily?

Ah ok, I was reading through this thread and thought that F10 was what was making it work for others.

I haven’t tried rolling back the BIOS before, but I’ll give it a shot and try out Fedora 27. Thanks for the quick response.

If you update to Fedora 28, it's fine to keep booting kernel 4.15 with it, just not the newer kernels yet.

Gotcha, I knew Ryzen/Threadripper updates were rolled into the later kernels, so I was confused when Mint on 4.15 was booting up while Fedora on 4.17 wasn’t. The later kernels being the issue makes more sense now.

Late to the party with a Gen 2 Threadripper + MSI MEG

  1. The prepackaged 4.17-5 kernel on CentOS 7 seemed to work reasonably well, but KVM does not function, presumably because of the Secure VM bug?

  2. Compiled 4.19-rc2: screen flooded with PCIe errors, terminal unresponsive.

  • Added pcie_aspm=off. The flood of errors is gone, but…
    (I got the impression from reading the above that this should be fixed by now?)

EDIT/UPDATE - this worked:
MSI MEG BIOS:

  • SVM = Enabled
  • PSP = Disabled

4.19-rc kernel

  • Disable PSP in the kernel
    Cryptographic API->Hardware crypto devices->Support AMD Secure Processor[ ]

(Don’t forget to handle the SELinux issues that will drive you bonkers… I tend to set it to Warn while I debug. Ultimately I usually need to deal with libvirt on NFS mounts. I keep that incantation in a file somewhere…)