Threadripper & PCIe Bus Errors

Ah ok, I was reading through this thread and thought that F10 was what was making it work for others.

I haven’t tried rolling back the bios before, but I’ll give it a shot and try out Fedora 27. Thanks for the quick response.

If you update to F28, it’s fine to keep booting kernel 4.15 with F28, just not the newer ones yet.

Gotcha, I knew Ryzen/Threadripper updates were rolled into the later kernels, so I was confused when Mint on 4.15 was booting up, while fedora on 4.17 wasn’t. The later kernels being the issue makes more sense now.

Late to the party with gen 2 + MSI MEG

  1. 4.17-5 Centos7 pre-packaged kernel seemed to work reasonably well, but KVM does not function - presumably because of the Secure VM bug?

  2. compiled 4.19-rc2 - screen flooded with PCIe errors - terminal unresponsive

  • added pcie_aspm=off: flood of errors is gone, but…
    (I got the impression from reading the above that this should be fixed by now??)

EDIT/UPDATE - this worked:
MSI MEG BIOS:

  • SVM = Enabled
  • PSP = Disabled

4.19-rc kernel

  • Disable PSP in the kernel
    Cryptographic API->Hardware crypto devices->Support AMD Secure Processor[ ]

(Don’t forget to handle SELinux issues that will drive you bonkers… I tend to set it to Warn while I debug… Ultimately I usually need to deal with libvirt on NFS mounts. I keep that incantation in a file somewhere…)

So, Linux noob here. Compiled 4.19-rc2 for Ubuntu. Went fine at first until I realized my mistake on graphics support and installing the proprietary driver fragged it. So installed the 4.19-rc2 generic. Was going fine until I ran upgrade. Getting the systemd-udevd:567. I did the above in cli started in safe, it changed the number reported after the colon, but same rough result. I was wondering if someone could walk me through troubleshooting.

I also have had issues with snapd errors and dpkg is saying python errors exist, although forcing a reinstall at root line did not resolve the issue. As I said, bit of a noob. I can follow technical directions, but still don’t even have all of the commands in linux committed to memory yet.

I am on an Asrock X399 Taichi with 1950X using beta bios/uefi 3.23b with AGESA 1.1.0.1. Bios 3.30 was seemingly fine enough with AGESA 1.1.0.0. Any assistance would be appreciated. Either way, forcing myself off of Windows 10 Ent. because M$.

That’s your problem. These days unless you have a very specific reason to compile the kernel you should not be doing this at all. What is your reason for building it yourself?

The console reports the errors for a reason, if you do not provide them, nobody can help.

My reason is simple: to learn linux and that includes how to compile and properly set the flags for optimizations with my hardware. Plus, you don’t need a reason to compile a kernel. If someone wants to do it, you shouldn’t discourage it. Take a bit of an issue with the tone, don’t know if that is what you intended, but how it reads.

I prefer ubuntu because of a large community base. But, doesn’t mean you need to go Gentoo or Arch to dive deeper for the purposes.

In any case, the log images from the screen are on my phone because none of the kernels boot. It is the PSP error for the logs on why not booting. Unfortunately, after having done the apt upgrade, it now seems to affect all kernels on the system, whether I compiled them or not. Imagine that!

Also, only way I could access the logs was going into the gnome 2 safe mode and pulling them up in vi. At the initial time of my post, I hadn’t gone in to check, but after hearing there was a way and google magic, found that answer. I have a couple other errors, including the PCIe bus running at like 1/4 or 1/2 rate, but I’ll deal with that after I compile another kernel and apply the PSP patch.

Put those plans on hold, generally, because I plan on just wiping the drive and reloading my backup for the October Windows 10 update within the next month. So, if it is going to wipe off the Linux partition as well, which don’t have anything that needs saving on it, might as well wait and start after I’ve googled up each error I found in the logs.

Also, when a person is a noob, and tells you such, maybe you should recommend what information is needed and how to find it instead of assuming. I did not know the directory at the original posting time to pull up the logs because I couldn’t boot into Ubuntu and rudimentary knowledge of CLI.

I am not discouraging it, I am simply stating the fact that it’s usually overkill and performed as part of a “guide” when it’s usually not required. Good on you for wanting to learn.

Perhaps so, but by using Ubuntu and building the kernel yourself you have to do things “The Debian Way” to do it properly, and it’s likely the reason you’re having issues. If you wish to continue and have not already done so, you should become familiar with make-kpkg.

This is again likely due to your choice to use Ubuntu as Ubuntu and Debian both use and expect an initrd image, and integration with dkms.

By failure to boot, what do you mean? Blank screen? Failure to mount the root filesystem? Kernel panic?

If your goal is to learn the nuts and bolts of how Linux works, I suggest Linux From Scratch (http://www.linuxfromscratch.org/). I would not suggest this for a primary os, but it does make a good side project to learn how Linux operates at a low level. You will also then be learning the pure nuts and bolts and not distro specific methods of how to build/compile and package things.

I don’t mind overkill. I’m doing this as my intro (as well as learning where things are, work, and go before moving on to android for my devices as I have some EOL products that not even LineageOS or the development forums have updated roms for (at least last I checked, and think Android AIO monitors, some android set top boxes, etc.)). I did plan on eventually moving to a guided build, but was starting at a different point, just diving in the fray, so to speak.

As to the debian way, that I already knew. And the problem did not arrive with my freshly compiled Kernel, believe it or not. I used that Kernel with little issue at all for over a week. Then, the graphics card issue of not preparing the Kernel for the proprietary driver came up, which I gathered isn’t so different from what I did to prep the kernel in menuconfig for Zen CPU specifically (I didn’t know about the PSP fix at the time, but had SEV disabled in that menu, cannot remember the setting I had on PSP, but did minimal tweaks after taking the settings from the stock install I was building the kernel on). I planned on looking it back up when I get ready to compile again (I looked at multiple Kernel compile “guides” for Ubuntu, primarily, before starting, and, in fact, bounced between multiple ones to fill in the gaps left by one or another, or to get more information on a step, etc.).

I found make-kpkg while searching for solutions to this issue, actually, and plan on using it in the future. Kind of my learning process, find information, execute, break, examine, research, repeat. It creates conditioned feedback loops and makes me learn the interaction of things through destruction.

And, what I mean is kernel panic, where it repeats the PSP error until it finally sits there, although still runs through a proper shutdown with ctrl+alt+del.

I’ll take a look at Linux from scratch, but cannot guarantee I will not continue my pursuits here. I have Windows 10 Enterprise as my primary OS, have tri-boot with a stripped down win 10 for benching and tooling around, and win 7 for legacy and benching. I haven’t gone through my normal DISM to strip them down because of the upcoming build update (and the new headaches that are sure to come with that), while doing a new data retention and backup scheme. Just planned on a linux distro as a side to tool with, start learning that, and eventually also playing with android builds.

In any case, thank you for your response and pointing me toward additional resources. I do appreciate it!

Hi, does anybody know if ACPI bus segmentation can be enabled in Threadripper? I am planning to build but based on the lspci -vt output of Threadripper that I have seen, the two PCI root complexes are not placed on different segments / domains. Threadripper seems very crippled for what it is. The only vendor with SR-IOV support seems to be Asrock. I extracted a whole heap of IFRs from the BIOS ROMs of X399 motherboards.

I would expect to see devices with 0000:00:00.0 and 0001:00:00.0 format. Each segment can only have 256 busses. The new Titan Ridge add-in cards chew through bus numbers, and 256 busses is only enough for two cards. I am aiming for four.
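
For anyone following along, here is a rough Python sketch (not from the thread) of how an address in segment:bus:device.function form breaks down, and why the 8-bit bus field caps each segment at 256 busses. The names `PciAddress`, `parse_pci_address` and `BUSES_PER_SEGMENT` are made up for illustration.

```python
import re
from typing import NamedTuple


class PciAddress(NamedTuple):
    segment: int   # also called "domain"
    bus: int       # 8-bit field, so at most 256 busses per segment
    device: int    # 5-bit field
    function: int  # 3-bit field


# Hypothetical helper pattern: optional 4-hex-digit segment, then BB:DD.F
_BDF_RE = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)

BUSES_PER_SEGMENT = 2 ** 8  # follows from the 8-bit bus number


def parse_pci_address(text: str) -> PciAddress:
    """Parse 'SSSS:BB:DD.F' (or the short 'BB:DD.F') into its fields."""
    match = _BDF_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a PCI address: {text!r}")
    segment, bus, device, function = match.groups()
    address = PciAddress(
        segment=int(segment, 16) if segment else 0,  # short form implies segment 0
        bus=int(bus, 16),
        device=int(device, 16),
        function=int(function, 16),
    )
    if address.device > 0x1F:
        raise ValueError(f"device number out of range in {text!r}")
    return address
```

Running every address from `lspci -D` through something like this and collecting the distinct `segment` values would show whether the firmware put both root complexes on segment 0.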

Any ideas would be appreciated. Oh, if Linux can override the BIOS then that is acceptable.

Cheers!

Super late to the party, not sure if this warranted a new thread or not.

Just setup a new Ryzen 3700X build on an Asrock x470 taichi ultimate and I’m seeing identical errors to OP on a fresh proxmox 6.0.1 install which is debian buster with proxmox custom kernel 5.0.15-1. For now I set pcie_aspm=off in grub and it seems the errors have gone. Is this still considered a recommended fix?

kernel: [  557.900765] pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:01.3
kernel: [  557.900769] pcieport 0000:00:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
kernel: [  557.900780] pcieport 0000:00:01.3:   device [1022:1483] error status/mask=00000040/00006000
kernel: [  557.900784] pcieport 0000:00:01.3:    [ 6] BadTLP
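
As a reading aid for the `error status/mask=00000040/00006000` field in that log, here is a small Python sketch that decodes the two hex values into error names. The bit positions are those of the PCIe AER Correctable Error Status register and the labels are the short names the kernel prints. The function names are made up for illustration.

```python
# Bit positions in the AER Correctable Error Status / Mask registers,
# labelled with the short names the Linux kernel prints in dmesg.
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(status: int, mask: int = 0) -> dict:
    """Return the error names set in the status word and in the mask word."""
    names = sorted(AER_CORRECTABLE_BITS.items())
    return {
        "reported": [name for bit, name in names if status & (1 << bit)],
        "masked": [name for bit, name in names if mask & (1 << bit)],
    }


def decode_status_field(field: str) -> dict:
    """Decode the pair exactly as printed, e.g. '00000040/00006000'."""
    status_hex, mask_hex = field.strip().split("/")
    return decode_correctable(int(status_hex, 16), int(mask_hex, 16))
```

For the line above, `decode_status_field("00000040/00006000")` reports `BadTLP` (bit 6), which matches the `[ 6] BadTLP` line the kernel printed. The mask shows that `NonFatalErr` and `CorrIntErr` are masked off on that port.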

Yep, I still use this.
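
If you want to confirm the option really made it onto the running kernel's command line after editing grub, a minimal Python sketch like this can check it. The helper names are made up, and it ignores quoted values containing spaces, which the kernel does allow.

```python
from pathlib import Path


def cmdline_params(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def aspm_disabled(cmdline: str) -> bool:
    """True if pcie_aspm=off appears on the given command line."""
    return cmdline_params(cmdline).get("pcie_aspm") == "off"


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")  # command line of the running kernel
    if proc_cmdline.exists():
        print("pcie_aspm=off active:", aspm_disabled(proc_cmdline.read_text()))
```

On Debian-based systems the parameter normally goes into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by `update-grub` and a reboot. If the check still prints `False` after that, the edit did not take effect.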

Glad I’m not the only one super late to the party. Just built my first AMD system with ASRock B450 Mobo, Ryzen 5 3600, RX5700xt and a 500gb NVMe SSD, and I’m getting these same errors but for my networking. I’m up to date on Ubuntu 18.04. I tried installing one of my spare 4 port Intel nics but all it did was move the error to the active port on that nic. I tried pcie_aspm=off in grub as well as setting pcie and promontory to gen2 manually to no avail. Anybody have any ideas?

Hi @Adam_Wilber and welcome to the L1Tech forum.

Can you please show the dmesg output, your kernel version, and your VM configuration?

Mixing older chipsets with ryzen 3000 seems to require manually setting pcie gen3 and not all boards expose that yet. Soon tho

Thank you both for your quick responses, but I think I figured out what my issue was. After scouring the net for different things to try I realized when I set up this new hardware I had to manually set my memory speed, but didn’t realize the board would default to 1.2V for DDR4 regardless of speed input, so I had to go set that to the proper 1.35V which fixed the rest of the issues I was having.

Last time I built a desktop was when Sandy Bridge came out, and I had more money to throw at a toy back then, so I was able to afford a better motherboard that did most of the OC work for me, I guess.

pci=nommconf was just what i needed to shut up my 4.19 (deb buster)

you saved the ssd where the logfiles live :wink:

fyi, on my x399+1900X deb stretch 4.9.0 works totally fine, without modification.
dist-upgrade gave me a good scare :smiley:
[Update]
4.19 may be quiet with that option but it is not stable :frowning:
just got stuck no errors to look at…
4.9.0 has thrown 2 “AER … Corrected” errors but is stable with buster for now. just don’t “apt-get autoremove”
[Update2]
4.9.0 was not stable with buster after all. for whatever reason however,
4.9.0 is rock solid with stretch. had been for weeks b4 upgrade…
What great pcie related changes were made between 4.9 and 4.19 kernel versions?

Hi. I’m not on Threadripper, but I was getting the same BadTLP, BadDLLP, etc. errors on a Ryzen 2700X build, with an Asus Prime B450 Plus motherboard, and Nvidia GTX 1660 graphics.

I tried setting pcie_aspm=off in grub and while this did get rid of these errors, it caused a much worse problem that my M.2 NVMe drive controller would get stuck in a low power state or something, at which point it would quit X and remount my root partition as readonly, causing a whole lot of headache.

I found this problem was triggered for me mostly when running some mprime “P-1” workloads with a large percentage of memory allocated (8 of 16GB). This uses a lot of memory bandwidth, and for whatever reason seemed to be interfering with the NVMe like this.

Here is some dmesg output of what it looks like when the NVMe controller went down for me:

[  989.409598] perf: interrupt took too long (4979 > 4912), lowering kernel.perf_event_max_sample_rate to 40000   
[ 1195.031765] fuse: init (API version 7.31)                                                  
[ 1327.328770] perf: interrupt took too long (6268 > 6223), lowering kernel.perf_event_max_sample_rate to 31750
[ 2238.284260] perf: interrupt took too long (7846 > 7835), lowering kernel.perf_event_max_sample_rate to 25250
[ 9117.462381] perf: interrupt took too long (9831 > 9807), lowering kernel.perf_event_max_sample_rate to 20250
[ 9261.476036] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[ 9261.603999] pci_raw_set_power_state: 19 callbacks suppressed
[ 9261.604009] nvme 0000:01:00.0: Refused to change power state, currently in D3
[ 9261.604430] nvme nvme0: Removing after probe failure status: -19
[ 9261.632241] print_req_error: I/O error, dev nvme0n1, sector 15247304 flags 100001
[ 9261.632255] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 9261.729511] nvme nvme0: failed to set APST feature (-19)
[ 9261.739582] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[ 9261.739591] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[ 9261.739595] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[ 9261.756670] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 1, flush 0, corrupt 0, gen 0
[ 9261.756951] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 2, flush 0, corrupt 0, gen 0
[ 9261.758061] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 3, flush 0, corrupt 0, gen 0
[ 9261.758368] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 4, flush 0, corrupt 0, gen 0
[ 9261.759112] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 5, flush 0, corrupt 0, gen 0
[ 9261.759138] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 6, flush 0, corrupt 0, gen 0
[ 9262.276359] Core dump to |/bin/false pipe failed
[ 9262.336595] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 9262.336817] caller _nv000939rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
[ 9262.975980] snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 62
[ 9263.012987] Core dump to |/bin/false pipe failed
[ 9263.015801] Core dump to |/bin/false pipe failed
[ 9263.035986] snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 1
[ 9263.134288] Core dump to |/bin/false pipe failed
[ 9265.580609] BTRFS: error (device nvme0n1p2) in btrfs_commit_transaction:2234: errno=-5 IO failure (Error while writing out transaction)
[ 9265.580610] BTRFS info (device nvme0n1p2): forced readonly
[ 9265.580611] BTRFS warning (device nvme0n1p2): Skipping commit of aborted transaction.
[ 9265.580612] BTRFS: error (device nvme0n1p2) in cleanup_transaction:1794: errno=-5 IO failure
[ 9265.580613] BTRFS info (device nvme0n1p2): delayed_refs has NO entry
[ 9292.708719] btrfs_dev_stat_print_on_error: 320 callbacks suppressed
[ 9292.708723] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 123, rd 208, flush 0, corrupt 0, gen 0
[ 9368.485780] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 208, flush 0, corrupt 0, gen 0
[ 9577.728458] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 209, flush 0, corrupt 0, gen 0
[ 9577.728508] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 210, flush 0, corrupt 0, gen 0
[ 9577.728715] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 211, flush 0, corrupt 0, gen 0
[ 9577.728768] Core dump to |/bin/false pipe failed
[ 9578.059425] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 212, flush 0, corrupt 0, gen 0
[ 9578.059466] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 213, flush 0, corrupt 0, gen 0
[ 9578.059531] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 214, flush 0, corrupt 0, gen 0
[ 9578.059555] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 215, flush 0, corrupt 0, gen 0
[ 9578.059574] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 216, flush 0, corrupt 0, gen 0
[ 9578.059590] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 217, flush 0, corrupt 0, gen 0
[ 9578.059604] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 218, flush 0, corrupt 0, gen 0
[ 9608.872774] btrfs_dev_stat_print_on_error: 1 callbacks suppressed
[ 9608.872777] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 125, rd 219, flush 0, corrupt 0, gen 0
[ 9608.872797] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 126, rd 219, flush 0, corrupt 0, gen 0
[ 9608.872805] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 127, rd 219, flush 0, corrupt 0, gen 0
[11308.648706] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 127, rd 220, flush 0, corrupt 0, gen 0
[11308.648753] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 127, rd 221, flush 0, corrupt 0, gen 0

This was on OpenSUSE Tumbleweed which I was trying out, but I just switched over to Linux Mint 19.2 today since I’m more familiar with that. (It turns out the BadTLP,etc. errors show up on both distros, but I figured I would try Mint and see if the results were any different)

So yeah I’m definitely not going to try turning off aspm again. I think I’ll just ignore these errors as they don’t seem to be causing any real problems as far as I can tell. I might eventually try setting “noaer” to hide these errors, but for now I’m fine just not thinking about them as long as my system isn’t crashing.

edit: BTW, my SSD is: Crucial P1 500GB 3D NAND NVMe PCIe M.2 SSD - CT500P1SSD8

Hello, I am getting the same errors on my build:
Ryzen 7 1700
Asus Prime X370 Pro
Asus ROG Strix RX Vega 64
I’m using Debian Unstable with 5.2.7 kernel at the moment.
The thing is, i tested everything a while ago with another kernel (probably between 4.17 and 4.19) but lost the file in which i recorded everything. Thus i can’t be sure it’s exactly the same now, but it looks quite similar.
The errors now are:

[   58.175644] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[   58.175650] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[   58.175656] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000080/00006000
[   58.175659] pcieport 0000:00:03.1: AER:    [ 7] BadDLLP               
[   69.769504] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[   69.769511] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[   69.769517] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000040/00006000
[   69.769521] pcieport 0000:00:03.1: AER:    [ 6] BadTLP
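
When the log gets flooded, it can help to tally the error types per port instead of reading line by line. Here is a rough Python sketch that counts the `[ N] Name` detail lines in dmesg-style text. It assumes the two line layouts seen in this thread (with and without the `AER:` prefix), and the function name is made up.

```python
import re
from collections import Counter

# Matches the per-bit detail lines, e.g.
#   pcieport 0000:00:03.1: AER:    [ 6] BadTLP
#   pcieport 0000:00:01.3:    [ 6] BadTLP      (older kernels, no "AER:" prefix)
_AER_BIT_LINE = re.compile(
    r"pcieport (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):"
    r"(?: AER:)?\s+\[\s*(?P<bit>\d+)\]\s+(?P<name>\w+)"
)


def count_aer_errors(log_text: str) -> Counter:
    """Count (device, error name) pairs found in dmesg-style text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = _AER_BIT_LINE.search(line)
        if match:
            counts[(match["dev"], match["name"])] += 1
    return counts
```

Feeding it saved `dmesg` output from different points in an uptime would show whether the mix of error types on a given port is shifting.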

After a few minutes it began to throw more and more “Timeout” errors:

[ 1497.907503] pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
[ 1497.908661] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1497.908664] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=000010c0/00006000
[ 1497.908666] pcieport 0000:00:03.1: AER:    [ 6] BadTLP                
[ 1497.908668] pcieport 0000:00:03.1: AER:    [ 7] BadDLLP               
[ 1497.908670] pcieport 0000:00:03.1: AER:    [12] Timeout               
[ 1497.984646] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 1497.984650] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1497.984653] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[ 1497.984655] pcieport 0000:00:03.1: AER:    [12] Timeout               
[ 1497.995671] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 1497.995675] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[ 1497.995677] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000040/00006000
[ 1497.995679] pcieport 0000:00:03.1: AER:    [ 6] BadTLP                
[ 1498.050775] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 1498.050779] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[ 1498.050783] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000040/00006000
[ 1498.050785] pcieport 0000:00:03.1: AER:    [ 6] BadTLP                
[ 1498.094854] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 1498.094858] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1498.094861] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[ 1498.094863] pcieport 0000:00:03.1: AER:    [12] Timeout               
[ 1498.116898] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 1498.116902] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[ 1498.116906] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00000080/00006000
[ 1498.116908] pcieport 0000:00:03.1: AER:    [ 7] BadDLLP               
[ 1498.172002] pcieport 0000:00:03.1: AER: Corrected error received: 0000:00:00.0
[ 1498.172007] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1498.172010] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=00001000/00006000
[ 1498.172012] pcieport 0000:00:03.1: AER:    [12] Timeout               
[ 1498.381399] pcieport 0000:00:03.1: AER: Multiple Corrected error received: 0000:00:00.0
[ 1498.384093] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1498.384097] pcieport 0000:00:03.1: AER:   device [1022:1453] error status/mask=000010c0/00006000
[ 1498.384100] pcieport 0000:00:03.1: AER:    [ 6] BadTLP                
[ 1498.384101] pcieport 0000:00:03.1: AER:    [ 7] BadDLLP               
[ 1498.384103] pcieport 0000:00:03.1: AER:    [12] Timeout               

(that is the last thing i managed to copy from the log)
From around that moment the performance started dropping visibly until the system gui became completely unusable. The screen still kept slowly updating so i could see each line of pixels drawing. I managed to switch to tty1 and observed the “Timeout” errors with almost no “BadTLP” and “BadDLLP”. The time was around [1600]. The system didn’t respond to any input and i had to reboot it with alt+printscreen+b.
The device 0000:00:03.1 which throws errors is:
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453] (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 29
Bus: primary=00, secondary=09, subordinate=0b, sec-latency=0
I/O behind bridge: 0000d000-0000dfff [size=4K]
Memory behind bridge: fe600000-fe7fffff [size=2M]
Prefetchable memory behind bridge: 00000000e0000000-00000000f01fffff [size=258M]
Capabilities:
Kernel driver in use: pcieport
Back then with the old kernel i also tested different boot options like “nommconf” (i can’t remember the exact results, but it didn’t seem to solve the problem), tried the system with almost everything pulled out (just the bare minimum “cpu+1 stick of ram+this gpu+the hdd with the system”), switched some bios options, and updated the bios to 4207:

PRIME X370-PRO BIOS 4207
2018/12/14, 8.36 MBytes
1. Update AGESA 1006
2. Improve compatibility and performance for Athlon™ with Radeon™ Vega Graphics Processors

The next version was 4406:

PRIME X370-PRO BIOS 4406
2019/03/11, 10.24 MBytes
Update AGESA 0070 for the upcoming processors and improve some CPU compatibility.
ASUS strongly recommends that you update AMD chipset driver 18.50.16 or later before updating BIOS.

But the last chipset driver ASUS ships for Windows 7 is 17.40.2815.1010 so I didn’t upgrade it further.
I also tried Windows 7, which threw BSoD with some error in pci.sys during the boot process or within 1-2 minutes after the startup.

A few times the system didn’t boot and showed nothing on the screen. And If there was another gpu, i could enter the motherboard setup and see that the gpu wasn’t even detected by the motherboard.

Sometimes the boot process stopped just right after the kernel boot and i also could do nothing. When it was detected as the secondary gpu, i tried to pass it to the guest virtual system (using qemu with iommu passthrough and the guest was windows 8.1) and it worked better than the ‘real’ windows 7 but still crashed after a while. The thing I noticed there is that the crashes mostly happened when the gpu changed from 2D to 3D clocks or back from 3D to 2D (when I opened some 3D app for tests or closed it). After that the log was flooded with errors and I couldn’t start the VM again. Sometimes the host system also couldn’t turn off correctly and showed some ‘system calls traceback’ and the CPU register values.

Unfortunately my motherboard’s firmware doesn’t have a switch for pci-e 1/2/3 mode but if i put vega into the last pci-e which is always in x4 2.0 mode, it worked fine with debian for a few days and for a few hours with windows 7 (but still threw a bsod in the end). The problem is, I had to pull everything out of the pc case and it was very inconvenient. The gpu also seems to be fine in an old motherboard with the only pci-e 1.0 port.
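
Even without a firmware switch, Linux exposes the negotiated link speed and width per device in sysfs, so you can at least check which generation a slot actually trained at. Below is a minimal Python sketch. The attribute names are the standard sysfs ones, but the exact text in them varies by kernel version (for example `8 GT/s` versus `8.0 GT/s PCIe`), so the parsing is an assumption, and the function names are made up.

```python
import re
from pathlib import Path
from typing import Optional

# Per-lane transfer rate in GT/s mapped to PCIe generation.
_GTS_TO_GEN = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4}


def pcie_generation(speed_text: str) -> Optional[int]:
    """Map a sysfs link-speed string to a PCIe generation, or None if unknown."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", speed_text)
    if match is None:
        return None
    return _GTS_TO_GEN.get(float(match.group(1)))


def link_report(address: str, sysfs_root: str = "/sys/bus/pci/devices") -> dict:
    """Read the link attributes for one device; missing attributes become None."""
    device_dir = Path(sysfs_root) / address
    report = {}
    for attr in (
        "current_link_speed",
        "max_link_speed",
        "current_link_width",
        "max_link_width",
    ):
        path = device_dir / attr
        report[attr] = path.read_text().strip() if path.exists() else None
    return report
```

The address to query is the device's full segment:bus:device.function address as shown by `lspci -D`. A `current_link_speed` below `max_link_speed` on an idle GPU is not necessarily a fault, since cards can drop the link rate to save power.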

Maybe it’s worth noting that the asus prime motherboard works without any problem with an old radeon hd 5750 (which should be pci-e 2.0 and which I’ve been constantly using instead of the faulty (?) vega for this whole time). I also got an rx 580 (pci-e 3.0) for a short test and it worked without any problem.

TL;DR
After all I assumed this vega 64 has some problems with pci-e 3.0 mode. I was going to try to return and replace the gpu (or get a refund) but I still have some doubts. It works with the same motherboard in pci-e 2.0 mode, works with another motherboard in pci-e 1.0 mode, and the same motherboard works with the other pci-e 2.0 and 3.0 gpus. So it works in some circumstances and I am not sure it can be considered faulty.

The ‘soft’ solutions I found earlier didn’t work for me, but my kernel version was above 4.15 so I probably have to roll back and try it again? Anyway, it doesn’t fully solve the problem, as the gpu sometimes does not POST and isn’t detected by the motherboard. The only thing that might help is the motherboard firmware update but in that case my Windows 7 may become unusable (as i mentioned above, asus recommends to update the chipset driver to a version unavailable for Windows 7) and also it seems that the update doesn’t change anything, just adds support for the new ryzen 3xxx CPUs.

What should I do?