Samsung 980 Pro + Tyan S8030 Kernel Errors and random Reboot

Murariu_Fabian · July 9, 2021, 5:31pm

Hi,

I’m having serious issues with 3 Samsung 980 PRO M.2 NVMe drives on a Tyan S8030 with an AMD EPYC 7443P. I moved the SSD to a laptop and I’m seeing some of the same messages there. I’m a bit surprised it happens with all 3 of them.

[ 4392.683810] nvme 0000:03:00.0: AER: aer_status: 0x000000c1, aer_mask: 0x00000000
[ 4392.683820] nvme 0000:03:00.0: AER: [ 0] RxErr (First)
[ 4392.683826] nvme 0000:03:00.0: AER: [ 6] BadTLP
[ 4392.683831] nvme 0000:03:00.0: AER: [ 7] BadDLLP
[ 4392.683835] nvme 0000:03:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 4392.683860] nvme 0000:03:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 4392.683861] nvme 0000:03:00.0: AER: [ 0] RxErr (First)
[ 4392.683863] nvme 0000:03:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 4392.683878] nvme 0000:03:00.0: AER: aer_status: 0x00000081, aer_mask: 0x00000000
[ 4392.683879] nvme 0000:03:00.0: AER: [ 0] RxErr (First)
[ 4392.683881] nvme 0000:03:00.0: AER: [ 7] BadDLLP
[ 4392.683882] nvme 0000:03:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 4392.683897] nvme 0000:03:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 4392.683898] nvme 0000:03:00.0: AER: [ 0] RxErr (First)
[ 4392.683900] nvme 0000:03:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 4392.683914] nvme 0000:03:00.0: AER: aer_status: 0x00000081, aer_mask: 0x00000000
[ 4392.683916] nvme 0000:03:00.0: AER: [ 0] RxErr (First)
[ 4392.683917] nvme 0000:03:00.0: AER: [ 7] BadDLLP
[ 4392.683918] nvme 0000:03:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 4392.683933] nvme 0000:03:00.0: AER: aer_status: 0x000000c1, aer_mask: 0x00000000
[ 4392.683934] nvme 0000:03:00.0: AER: [ 0] RxErr (First)
[ 4392.683936] nvme 0000:03:00.0: AER: [ 6] BadTLP
[ 4392.683937] nvme 0000:03:00.0: AER: [ 7] BadDLLP
[ 4392.683938] nvme 0000:03:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID

They constantly throw errors in the IPMI interface and the system is unstable.

error.txt (6.4 KB)
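
For anyone reading the log, the aer_status values are bitmasks, and the bracketed numbers on the following lines are the bit positions that are set. A minimal Python sketch that decodes them, using only the three bit names that actually appear in the log above (any other bit is printed generically, since its name is not shown here; `decode_aer_status` is just an illustrative helper name):

```python
from __future__ import annotations

# Bit positions exactly as printed in the log: "[ 0] RxErr", "[ 6] BadTLP", "[ 7] BadDLLP".
KNOWN_BITS = {0: "RxErr", 6: "BadTLP", 7: "BadDLLP"}


def decode_aer_status(status: int) -> list[str]:
    """Return the names of the bits set in an aer_status value, lowest bit first."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            # Bits not seen in the log are reported by position only.
            names.append(KNOWN_BITS.get(bit, f"bit{bit}"))
    return names


if __name__ == "__main__":
    for value in (0x000000C1, 0x00000081, 0x00000001):
        print(f"{value:#010x}: {', '.join(decode_aer_status(value))}")
```

So 0xc1 is RxErr + BadTLP + BadDLLP, 0x81 is RxErr + BadDLLP, and 0x01 is RxErr alone, which matches the per-bit lines the kernel prints.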

Log · July 10, 2021, 4:29am

Here’s a “shoot from the hip” thing to try: manually set those slots to PCIe 3 (instead of 4) in the BIOS and see if that calms it down.
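
To confirm the BIOS change actually took, one option is to read the negotiated link speed from sysfs. This is a rough Python sketch, assuming your kernel exposes `current_link_speed` and `max_link_speed` under `/sys/bus/pci/devices/<address>/` and that the drive is still at `0000:03:00.0` as in the log; `pcie_gen` is just a helper name I made up.

```python
from pathlib import Path

# Per-lane transfer rate (GT/s) for each PCIe generation.
RATE_TO_GEN = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def pcie_gen(link_speed: str):
    """Map a sysfs string such as '8.0 GT/s PCIe' to a generation number, or None."""
    parts = link_speed.split()
    if not parts:
        return None
    try:
        rate = float(parts[0])
    except ValueError:
        return None  # e.g. the kernel reported an unknown speed
    return RATE_TO_GEN.get(rate)


if __name__ == "__main__":
    # Address taken from the AER log; adjust for your own system.
    dev = Path("/sys/bus/pci/devices/0000:03:00.0")
    for attr in ("current_link_speed", "max_link_speed"):
        node = dev / attr
        if node.exists():
            value = node.read_text().strip()
            print(f"{attr}: {value} (Gen {pcie_gen(value)})")
```

If `current_link_speed` still maps to Gen 4 after the change, the BIOS setting did not apply to that slot.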

The boot option pci=noaer can quiet the errors, but doesn’t actually resolve whatever is going on, and also hides non-correctable errors (which for a data disk could mean data corruption). It’s possible the kernel is halting because an uncorrectable error eventually comes up.

More details from https://www.kernel.org/doc/Documentation/PCI/pcieaer-howto.txt

Correctable errors pose no impacts on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

pci=nommconf is also something people use, but I don’t know offhand what that does.

Another thing that commonly generates these sorts of errors is AMD GPUs with the reset bug, like my old RX 460. I’m not sure what the deal is with that either, though turning off ASPM both in the BIOS and with a boot option (pcie_aspm=off) seems to be the trick to fixing the hard GPU crashes I was getting, which needed a reboot to clear. It may help you as well; what it does is disable the automatic PCIe link power saving.
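
Since it’s easy to lose track of which of these boot options are active on a given boot, here is a small Python sketch that checks the kernel command line for the three mentioned above. It assumes `pci=` takes a comma-separated option list (e.g. `pci=noaer,nommconf`); the function name and the keys of the returned dict are made up for illustration.

```python
from pathlib import Path


def pci_boot_options(cmdline: str) -> dict:
    """Report which of pci=noaer, pci=nommconf and pcie_aspm=off appear in cmdline."""
    found = {"noaer": False, "nommconf": False, "aspm_off": False}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "pci":
            options = value.split(",")
            found["noaer"] = found["noaer"] or "noaer" in options
            found["nommconf"] = found["nommconf"] or "nommconf" in options
        elif key == "pcie_aspm" and value == "off":
            found["aspm_off"] = True
    return found


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    if proc.exists():
        print(pci_boot_options(proc.read_text()))
```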

Obviously don’t turn off error reporting when trying other things, so you can tell if it worked or not.
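
One way to tell whether a change worked is to count the AER messages per device before and after. A rough Python sketch, assuming the log lines look like the ones posted at the top of this thread; the regex and the script name `aer_count.py` are my own and may need adjusting for other kernel versions.

```python
import re
import sys
from collections import Counter

# Matches e.g. "0000:03:00.0: AER: [ 6] BadTLP" and captures the device and error name.
AER_LINE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AER:\s+"
    r"\[\s*\d+\]\s+(?P<name>\w+)"
)


def count_aer_errors(log_text: str) -> Counter:
    """Count (device, error name) pairs found in kernel log text."""
    counts = Counter()
    for match in AER_LINE.finditer(log_text):
        counts[(match.group("dev"), match.group("name"))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: dmesg > aer.log && python3 aer_count.py aer.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for (dev, name), n in sorted(count_aer_errors(handle.read()).items()):
            print(f"{dev} {name}: {n}")
```

Run it once before the change and once after a similar amount of uptime and load; if the counts don’t drop, the change didn’t help.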

wendell · July 10, 2021, 4:51am

What Log said. Also, don’t use the onboard M.2 slots; those are PCIe 3 only. You have to use a PCIe add-in card to run those drives at PCIe 4 speeds.

Log · July 10, 2021, 5:10am

There’s actually a MOA revision of the S8030 series boards which supposedly supports PCIe 4 to those slots.

The OOY version does not officially support PCIe 4 to those slots, only 3. I’ve seen reports that some people have gotten it to work fine, but it’s going to be really hit or miss. They may or may not have been checking their logs, and they may not have been aware of the board revisions.

I don’t know what those acronyms stand for, but I believe the OOY version is the original version (which I have), and the MOA is the newest revision.

Found an ancient (2002!) forum post explaining Tyan's weird ass revision naming

https://forums.overclockers.com.au/threads/tyan-mp-revisions.59976/

I don’t have specifics beyond the clock, but here is a Tyan Usenet post that explains revisions and why they happen. The following is not from me, and credit is given at the bottom through the original author’s sig.

Here is a recap of a post I made a while back. There is a lot of stuff
that isn’t really necessary to know, but read it anyway:

The way Tyan mark their motherboards you’ve got to either flip it on
its back or have access to the table below in order to know what
version you’ve got.

1: COT
2: OOY
3: MOA
4: PON
5: UOT
6: TOY
7: EOA
8: RON

On the Tiger MP the version code is silk-screened in the corner right
next to the DIMM4 memory slot. On the board I’ve got in front of me
right now it says: “01OOYB”

Checking the table above I find “OOY” and can see that this is a
Rev:02 motherboard.

If I flip the motherboard on its back I can see the revision number
stamped next to where the last 32-bit PCI socket is located. It’s the
number within a rectangle that’s the revision number.

Now what’s that revision letter everybody is talking about?

Ok here I go again…

The revision number is used to identify the PCB (Printed Circuit
Board) version, while the letter is what is known as an ECN (Engineer
Change Notice) identifier.

Motherboards are (almost) never created perfect. Each time a problem
is identified Tyan tries to solve it by changing the BIOS, changing
/ adding components on the motherboard or by changing the PCB of the
motherboard.

Changing the BIOS is cheap. It can be used to solve problems with
motherboards already in the hands of users. That’s why we see new
BIOS’es every now and then.

But it’s not always possible to solve a problem just by tweaking the
BIOS. In those cases Tyan tries to solve the problem by adding
components to the motherboard during manufacturing. This is more
expensive than issuing a new BIOS. And motherboards that have been
sold will have to be returned to the factory to be reworked. Expensive
stuff.
Every time such a change is made a new ECN is created. These are
identified by the use of both PCB version (02) and the ECN letter (A)
to identify what revision of PCB is affected by this ECN.

When the reworks stack up, there is a really serious problem that
can’t be solved efficiently by either of the methods described above,
or some components are getting hard to get, that’s when it’s time for
a new Revision. Changing the PCB is quite expensive. They have to
reprogram the robots in the factory to accommodate a new layout. The
design and trace layout of the new PCB has to be tested. And even then
there is always the chance that new bugs may be introduced with the
new PCB.
A change of PCB revision is not something that’s done lightly.

A motherboard usually goes through at least three PCB revisions, but
up to five isn’t unusual. The first revision isn’t usually made
available to the public. The second revision may suffer the same fate
as the first, but more commonly this is the first that’s shipped to
end users. A relatively immature technological platform such as the
AMD 760MP usually has a few quirks that it takes some time to work
out.

If the third revision PCB for the Tiger MP is just around the corner
it probably means one of two things. Either Tyan has identified some
incompatibilities that they have been able to resolve with a new PCB
layout, or they have been able to simplify the motherboard so it’s
cheaper to manufacture. In either case it will eventually be good for
the end users. The one thing I hope for is better memory
compatibility. It’s something that would really be welcomed by most
who use this motherboard.

Regards,
B.Slisk

The only other difference I know of is that the MOA version has five 4-pin headers and the OOY version has four.
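
The revision table from the quoted post can be turned into a quick lookup. This is only a sketch in Python: it assumes the three-letter code appears somewhere in the silk-screened marking and that a letter directly after it is the ECN letter, as in the “01OOYB” example. The meaning of the leading “01” isn’t explained in the post, so the code ignores it, and `decode_marking` is a made-up name.

```python
# Revision codes copied from the table in the quoted Usenet post.
REVISION_CODES = {
    "COT": 1, "OOY": 2, "MOA": 3, "PON": 4,
    "UOT": 5, "TOY": 6, "EOA": 7, "RON": 8,
}


def decode_marking(marking: str):
    """Return (pcb_revision, ecn_letter) for a marking like '01OOYB', or None if unknown."""
    text = marking.strip().upper()
    for code, revision in REVISION_CODES.items():
        pos = text.find(code)
        if pos != -1:
            # Assumption: a letter right after the code is the ECN identifier.
            tail = text[pos + 3:pos + 4]
            return revision, (tail if tail.isalpha() else None)
    return None


if __name__ == "__main__":
    print(decode_marking("01OOYB"))
```

For “01OOYB” this gives PCB revision 2 with ECN letter B, matching the Rev:02 reading in the quoted post.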

Murariu_Fabian · July 10, 2021, 8:52am

Any recommendation for a PCIe add-in card with support for 4 Gen 4 NVMe drives that doesn’t break the (already very broken) bank?

I tried my Gen3 NVMe drive on the board and it had no errors, so it seems to point to a Gen3/Gen4 issue.

Murariu_Fabian · July 10, 2021, 8:53am

Also fio shows abysmal performance levels and eventually crashes.

wendell · July 10, 2021, 3:23pm

The Asus one should be around $30, pandemic notwithstanding. eBay might be a good source.

Murariu_Fabian · July 10, 2021, 7:17pm

The ASUS one is out of stock everywhere and about £150 on eBay. I’m going to take my chances with this, which more or less looks like the same thing.