Threadripper & PCIe Bus Errors

Add pcie_aspm=off to your linux kernel line in grub?

I can confirm that AGESA 1.1.0.0 fixes the PCI FLR reset bug on my system.

The boot line seems to do the trick; I must have messed it up before.

Has anyone seen the Gigabyte board fail to restart with boot code “0d” on the readout? Restarting seems to be hit or miss. The newer BIOS didn’t seem to exhibit the behavior, but then I was plagued by the systemd-udevd issues.

So, it took a little while, but I’m now happy to report that here, too, the PCIe errors have disappeared on the Gigabyte Aorus X399 after updating the BIOS to F10 (the one preparing for second-generation Threadripper :))

Yay!

The PCI errors are not what this BIOS fixes; it corrects the bus reset problem. The PCI errors are a known erratum and are harmless.

See https://support.amd.com/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf

Simply disable ASPM to avoid the spurious error reports when running in Gen3 (pcie_aspm=off)
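
For anyone unsure where pcie_aspm=off actually goes, here is a minimal Python sketch. It assumes a Debian/Ubuntu-style /etc/default/grub in which the relevant variable is GRUB_CMDLINE_LINUX_DEFAULT (Fedora typically uses GRUB_CMDLINE_LINUX instead). The function name add_kernel_param is made up for illustration, and it only transforms text; it does not read or write the real file.

```python
import re


def add_kernel_param(grub_text, param, variable="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with `param` appended to `variable` if it is not there yet.

    Assumes the variable is written on one line as VARIABLE="...".
    """
    pattern = re.compile(
        rf'^({re.escape(variable)}=")([^"]*)(")[ \t]*$', re.MULTILINE
    )

    def _append(match):
        params = match.group(2).split()
        if param not in params:  # keep the edit idempotent
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    new_text, count = pattern.subn(_append, grub_text)
    if count == 0:
        raise ValueError(f"{variable} not found in the given text")
    return new_text


# Illustrative input, not copied from any real system.
sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(sample, "pcie_aspm=off"))
```

After changing the real file you would still need to regenerate the GRUB configuration (update-grub on Debian/Ubuntu, grub2-mkconfig on Fedora) and reboot before the parameter takes effect.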

Okay, I jumped back to the F10 BIOS update for Gigabyte. There were no options to disable PSP or the PSP mailbox in the BIOS, at least on the Designare motherboard.

I was able to compile a custom kernel with CONFIG_CRYPTO_DEV_SP_PSP=n set. So with that config option set and pcie_aspm=off, things are looking much, much better on my build.
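
To double-check that the option really ended up disabled in the built kernel, something like the following Python sketch could help. It relies on the usual .config convention that a disabled symbol appears as a "# CONFIG_... is not set" comment rather than "=n". The helper name kconfig_state and the sample text are illustrative only; on many distros the running kernel's config is available as /boot/config-<version>, but that location is an assumption.

```python
import re


def kconfig_state(config_text, symbol):
    """Report how `symbol` appears in kernel .config text.

    Returns the assigned value (for example 'y' or 'm'), 'n' when the
    symbol is explicitly marked as not set, or 'absent' otherwise.
    """
    set_re = re.compile(rf"^{re.escape(symbol)}=(.*)$", re.MULTILINE)
    unset_re = re.compile(rf"^# {re.escape(symbol)} is not set$", re.MULTILINE)

    match = set_re.search(config_text)
    if match:
        return match.group(1)
    if unset_re.search(config_text):
        return "n"
    return "absent"


# Illustrative fragment, not taken from a real build.
sample_config = (
    "CONFIG_CRYPTO_DEV_CCP=y\n"
    "CONFIG_CRYPTO_DEV_CCP_DD=m\n"
    "# CONFIG_CRYPTO_DEV_SP_PSP is not set\n"
)
print(kconfig_state(sample_config, "CONFIG_CRYPTO_DEV_SP_PSP"))
```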

What does PSP have to do with anything? ASPM is not PSP.

Sorry for the confusion, it’s a fix for the systemd-udevd timeouts/crashes I was seeing on the latest BIOS. I posted a link to that issue a few replies up.

I’m not very familiar with Linux (or anything boot/EFI/GRUB related), but I haven’t been able to get the live CD for Fedora to boot up, much less anything else. I’ve used pci=nomsi, pci=noaer, and pcie_aspm=off. All of them get rid of the PCI errors (pci=nomsi,noaer brings up ATA errors, while pcie_aspm=off doesn’t), but they don’t get the system as far as booting up. With pcie_aspm=off, I get as far as it saying “reached basic target system” but then it just hangs.

Anybody have any suggestions on what I should do? Linux Mint Live CD seems to boot without issue. Edit: Seems Linux Mint throws all those pci errors too, but it still boots up pretty quick.

I have a 1950x on a Gigabyte Aorus 7 MB running bios F10.
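
When juggling parameters like pci=nomsi, pci=noaer and pcie_aspm=off, it can help to confirm what the kernel actually received. Below is a rough Python sketch that parses a command line string such as the contents of /proc/cmdline. The function name parse_cmdline is hypothetical, the sample string is made up, and quoted values containing spaces are deliberately not handled.

```python
def parse_cmdline(cmdline):
    """Map each kernel parameter name to the list of values given for it.

    Bare flags such as 'quiet' map to an empty list, and comma-separated
    values such as 'pci=nomsi,noaer' are split into separate entries.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        values = params.setdefault(name, [])
        if sep:
            values.extend(value.split(","))
    return params


# Illustrative command line, not copied from a real machine.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet pci=nomsi,noaer pcie_aspm=off"
active = parse_cmdline(example)
print("ASPM disabled:", "off" in active.get("pcie_aspm", []))
print("pci options:", active.get("pci", []))
```

On a booted system the same function could be fed the text read from /proc/cmdline. As far as I know, parameter names are matched case-sensitively, so Pcie_aspm=off would not be treated as pcie_aspm=off.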

Bios f3j or f3g may be required for now. Fedora 27 would also work.

AMD has accidentally shot themselves in the foot. The new AGESA is awesome, but together with some patches they submitted to the Linux kernel, the result is not bootable.

Can you roll back to an older bios temporarily?

Ah ok, I was reading through this thread and thought that F10 was what was making it work for others.

I haven’t tried rolling back the bios before, but I’ll give it a shot and try out Fedora 27. Thanks for the quick response.

If you update to F28, it’s fine to keep booting kernel 4.15 with F28, just not the newer ones yet.

Gotcha, I knew Ryzen/Threadripper updates were rolled into the later kernels, so I was confused when Mint on 4.15 was booting up, while fedora on 4.17 wasn’t. The later kernels being the issue makes more sense now.

Late to the party with gen 2 + MSI MEG

  1. 4.17-5 Centos7 pre-packaged kernel seemed to work reasonably well, but KVM does not function - presumably because of the Secure VM bug?

  2. compiled 4.19-rc2 - screen flooded with PCIe errors - terminal unresponsive

  • added pcie_aspm=off: the flood of errors is gone, but…
    (I got the impression from reading the above that this should be fixed by now??)

EDIT/UPDATE - this worked:
MSI MEG BIOS:

  • SVM = Enabled
  • PSP = Disabled

4.19-rc kernel

  • Disable PSP in the kernel
    Cryptographic API->Hardware crypto devices->Support AMD Secure Processor[ ]

(Don’t forget to handle SELinux issues that will drive you bonkers… I tend to set it to Warn while I debug… Ultimately I usually need to deal with libvirt on NFS mounts. I keep that incantation in a file somewhere…)

So, Linux noob here. Compiled 4.19-rc2 for Ubuntu. Went fine at first until I realized my mistake on graphics support, and installing the proprietary driver fragged it. So I installed the 4.19-rc2 generic. Was going fine until I ran upgrade. Now I’m getting the systemd-udevd:567 error. I did the above from the CLI after starting in safe mode; it changed the number reported after the colon, but gave the same rough result. I was wondering if someone could walk me through troubleshooting.

I also have had issues with snapd errors and dpkg is saying python errors exist, although forcing a reinstall at root line did not resolve the issue. As I said, bit of a noob. I can follow technical directions, but still don’t even have all of the commands in linux committed to memory yet.

I am on an Asrock X399 Taichi with 1950X using beta bios/uefi 3.23b with AGESA 1.1.0.1. Bios 3.30 was seemingly fine enough with AGESA 1.1.0.0. Any assistance would be appreciated. Either way, forcing myself off of Windows 10 Ent. because M$.

That’s your problem. These days unless you have a very specific reason to compile the kernel you should not be doing this at all. What is your reason for building it yourself?

The console reports the errors for a reason, if you do not provide them, nobody can help.

My reason is simple: to learn linux and that includes how to compile and properly set the flags for optimizations with my hardware. Plus, you don’t need a reason to compile a kernel. If someone wants to do it, you shouldn’t discourage it. Take a bit of an issue with the tone, don’t know if that is what you intended, but how it reads.

I prefer ubuntu because of a large community base. But, doesn’t mean you need to go Gentoo or Arch to dive deeper for the purposes.

In any case, the log images from the screen are on my phone because none of the kernels boot. The logs show the PSP error as the reason it is not booting. Unfortunately, after having done the apt upgrade, it now seems to affect all kernels on the system, whether I compiled them or not. Imagine that!

Also, only way I could access the logs was going into the gnome 2 safe mode and pulling them up in vi. At the initial time of my post, I hadn’t gone in to check, but after hearing there was a way and google magic, found that answer. I have a couple other errors, including the PCIe bus running at like 1/4 or 1/2 rate, but I’ll deal with that after I compile another kernel and apply the PSP patch.

Put those plans on hold, generally, because I plan on just wiping the drive and reloading my backup for the October Windows 10 update within the next month. So, if it is going to wipe off the Linux partition as well, which don’t have anything that needs saving on it, might as well wait and start after I’ve googled up each error I found in the logs.

Also, when a person is a noob, and tells you such, maybe you should recommend what information is needed and how to find it instead of assuming. I did not know the directory to pull the logs from at the time of the original post because I couldn’t boot into Ubuntu and have only rudimentary knowledge of the CLI.

I am not discouraging it, I am simply stating the fact that it’s usually overkill and performed as part of a “guide” when it’s usually not required. Good on you for wanting to learn.

Perhaps so, but by using Ubuntu and building the kernel yourself you have to do things “The Debian Way” to do it properly, and it’s likely the reason you’re having issues. If you wish to continue and have not already done so, you should become familiar with make-kpkg.

This is again likely due to your choice to use Ubuntu as Ubuntu and Debian both use and expect an initrd image, and integration with dkms.

By failure to boot what do you mean? blank screen? failure to mount the root filesystem?, kernel panic?

If your goal is to learn the nuts and bolts of how Linux works, I suggest Linux From Scratch (http://www.linuxfromscratch.org/). I would not suggest this for a primary os, but it does make a good side project to learn how Linux operates at a low level. You will also then be learning the pure nuts and bolts and not distro specific methods of how to build/compile and package things.

I don’t mind overkill. I’m doing this as my intro (as well as learning where things are, work, and go before moving on to android for my devices as I have some EOL products that not even LineageOS or the development forums have updated roms for (at least last I checked, and think Android AIO monitors, some android set top boxes, etc.)). I did plan on eventually moving to a guided build, but was starting at a different point, just diving in the fray, so to speak.

As to the debian way, that I already knew. And the problem did not arrive with my freshly compiled Kernel, believe it or not. I used that Kernel with little issue at all for over a week. Then, the graphics card issue of not preparing the Kernel for the proprietary driver came up, which I gathered isn’t so different from what I did to prep the kernel in menuconfig for Zen CPU specifically (I didn’t know about the PSP fix at the time, but had SEV disabled in that menu, cannot remember the setting I had on PSP, but did minimal tweaks after taking the settings from the stock install I was building the kernel on). I planned on looking it back up when I get ready to compile again (I looked at multiple Kernel compile “guides” for Ubuntu, primarily, before starting, and, in fact, bounced between multiple ones to fill in the gaps left by one or another, or to get more information on a step, etc.).

I found make-kpkg while searching for solutions to this issue, actually, and plan on using it in the future. Kind of my learning process, find information, execute, break, examine, research, repeat. It creates conditioned feedback loops and makes me learn the interaction of things through destruction.

And, what I mean is kernel panic, where it repeats the PSP error until it finally sits there, although still runs through a proper shutdown with ctrl+alt+del.

I’ll take a look at Linux from scratch, but cannot guarantee I will not continue my pursuits here. I have Windows 10 Enterprise as my primary OS, have tri-boot with a stripped down win 10 for benching and tooling around, and win 7 for legacy and benching. I haven’t gone through my normal DISM to strip them down because of the upcoming build update (and the new headaches that are sure to come with that), while doing a new data retention and backup scheme. Just planned on a linux distro as a side to tool with, start learning that, and eventually also playing with android builds.

In any case, thank you for your response and pointing me toward additional resources. I do appreciate it!

Hi, does anybody know if ACPI bus segmentation can be enabled in Threadripper? I am planning to build but based on the lspci -vt output of Threadripper that I have seen, the two PCI root complexes are not placed on different segments / domains. Threadripper seems very crippled for what it is. The only vendor with SR-IOV support seems to be Asrock. I extracted a whole heap of IFRs from the BIOS ROMs of X399 motherboards.

I would expect to see devices with addresses in the 0000:00:00.0 and 0001:00:00.0 format. Each segment can only have 256 buses. The new Titan Ridge add-in cards chew through bus numbers, and 256 buses is only enough for two cards. I am aiming for four.

Any ideas would be appreciated. Oh, if Linux can override the BIOS then that is acceptable.

Cheers!
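
As a way to check the segment question on a given machine, here is a small Python sketch that groups bus numbers by PCI domain (segment) from lspci -D style output, where every line starts with a domain:bus:device.function address. The helper names parse_bdf and buses_per_domain are hypothetical, and the sample lines are made up to show the two-segment layout hoped for above; they are not real Threadripper output.

```python
import re

# domain:bus:device.function, all hexadecimal, e.g. 0000:40:00.0
BDF_RE = re.compile(
    r"^(?P<domain>[0-9a-fA-F]{4}):(?P<bus>[0-9a-fA-F]{2}):"
    r"(?P<device>[0-9a-fA-F]{2})\.(?P<function>[0-7])"
)


def parse_bdf(address):
    """Return (domain, bus, device, function) as integers."""
    match = BDF_RE.match(address)
    if match is None:
        raise ValueError(f"not a domain:bus:device.function address: {address!r}")
    return tuple(
        int(match.group(key), 16)
        for key in ("domain", "bus", "device", "function")
    )


def buses_per_domain(lspci_lines):
    """Collect the set of bus numbers seen in each domain."""
    result = {}
    for line in lspci_lines:
        line = line.strip()
        if not line:
            continue
        domain, bus, _device, _function = parse_bdf(line)
        result.setdefault(domain, set()).add(bus)
    return result


# Made-up lines in the shape of `lspci -D` output.
sample_lines = [
    "0000:00:00.0 Host bridge",
    "0000:40:00.0 Host bridge",
    "0001:00:00.0 Host bridge",
]
for domain, buses in sorted(buses_per_domain(sample_lines).items()):
    print(f"domain {domain:04x}: {len(buses)} bus(es) in use, 256 available")
```

If everything reports domain 0000, all devices share a single pool of 256 bus numbers, which is the limitation described above.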