Hello
I just got my ROMED8-2T with an EPYC 7313.
I’m using Proxmox on the system and passing through devices.
I have a Hyper M.2 card in the machine, a 4-port network card, an LSI 9201-16i, and an EVGA 1050 Ti.
The errors that are showing up in the IPMI/BMC interface are Critical Interrupts / PCI PERR - Asserted errors.
From another forum post here, it looks like it might have something to do with the Gen speed of the PCIe lanes.
I’m certain that this error shows up when I start the VM that I pass the GPU through to.
Has anyone else seen this error while running this board and is this a major issue or can it be ignored?
Preemptively thanks for the help.
Did you set the slot to PCIe Gen 3 instead of Auto in the BIOS?
Correct, the GPU slot is set to Gen 3 speeds.
The rest are set to auto and I am wondering if that is an issue.
I was having similar issues with U.2 drives and “auto” mode on a ROMED6U-2L2T. I remedied it by changing all the slots to PCIe 3.0, so that’s something you could try.
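One way to confirm that a forced Gen 3 setting actually took effect is to compare the LnkCap and LnkSta lines that `lspci -vv` prints for the slot. The sketch below is a minimal helper for that, not an official tool: the name `link_speeds` and the regexes are my own, and it assumes the line layout printed by recent pciutils versions (including the optional "(downgraded)" or "(ok)" annotations).

```python
import re

# PCIe generation for each per-lane signalling rate that lspci prints.
GEN_BY_SPEED = {"2.5GT/s": 1, "5GT/s": 2, "8GT/s": 3, "16GT/s": 4, "32GT/s": 5}

# "LnkCap:" is what the port is capable of, "LnkSta:" is what it trained to.
# The trailing colon keeps these from matching the LnkCap2/LnkSta2 lines.
CAP_RE = re.compile(r"LnkCap:.*?Speed ([\d.]+GT/s), Width (x\d+)")
STA_RE = re.compile(r"LnkSta:\s*Speed ([\d.]+GT/s)(?: \([a-z]+\))?, Width (x\d+)")


def link_speeds(lspci_text):
    """Return capability vs. negotiated link speed/width, or None if absent."""
    cap = CAP_RE.search(lspci_text)
    sta = STA_RE.search(lspci_text)
    if not cap or not sta:
        return None
    return {
        "cap_speed": cap.group(1),
        "cap_width": cap.group(2),
        "sta_speed": sta.group(1),
        "sta_width": sta.group(2),
        "gen": GEN_BY_SPEED.get(sta.group(1)),  # None for an unknown rate
    }
```

Feed it the saved output of `lspci -vv -s <slot>`; if the slot was forced to Gen 3 in the BIOS, `gen` should come back as 3 even when the capability line still advertises 16GT/s.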
PCI slot 1 is set to auto in your screenshot
So what fixed this for me was running Proxmox with kernel 6.1
The forum post where I found it is here.
I reset UEFI to default with an update and after boot there was no more error even with passing through the GPU and Hyper M.2 card.
I don’t need to set the Gen Speed or anything.
However there is another issue with the ROMED8-2T and my LSI 9201-16i that I just made a post about yesterday here.
Other than that I think this has been fixed for me now.
I had these until I plugged in the PCIe 6-pin connector located in the top right-hand area of the board.
Do some quick math to make sure your PSU can deliver the power it needs.
I threw in a second PSU, moved a few GPUs over and powered the onboard PCIe 6 Pin connector with it.
As you can see below, it worked immediately, and I’ve had no issues since.
Funnily enough, it was one of Wendell’s videos about this and the Supermicro board, where he talks a bit about board power, that made me aware of it and prompted me to come to this forum.
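The "quick math" on PSU capacity mentioned above can be sketched roughly as follows. This is only an illustration: the function name `psu_headroom`, the 20% safety margin, and the idea of summing one wattage figure per component are my assumptions, and you would need to fill in the ratings for your own parts.

```python
def psu_headroom(psu_watts, loads, margin=0.2):
    """Return the watts left over after all loads, keeping `margin` in reserve.

    psu_watts: rated PSU output in watts.
    loads:     mapping of component name -> expected peak draw in watts.
    margin:    fraction of the PSU rating to leave unused (assumed 20%).
    """
    usable = psu_watts * (1 - margin)
    return usable - sum(loads.values())


def is_sufficient(psu_watts, loads, margin=0.2):
    """True if the PSU covers all loads with the safety margin intact."""
    return psu_headroom(psu_watts, loads, margin) >= 0
```

For example, call `is_sufficient(psu_watts, {"cpu": ..., "gpu": ..., "hba": ...})` with the figures from your own spec sheets; a negative headroom suggests adding capacity, for instance via a second PSU as described above.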
I recently had a crash and had to reset my system from scratch.
Seems there are a few options to control these errors. I’ve used the following to get rid of the PCI and AER errors, leaving the system very stable.
This is from the IPMI access to the BIOS:
Advanced -> AMD CBS -> NBIO Common Options
ACS Enable: Disabled
Enable AER Cap: Disable
Advanced -> AMD CBS -> NBIO Common Options -> NBIO RAS Common Options
PCIe Aer Reporting Mechanism: OS First
Save & Reboot. PCI errors should no longer show up in the IPMI log, and AER errors should not appear under Linux/Proxmox etc.
These settings work for me but I’m by no means an expert, and suggest some experimentation to get this working properly.
Thanks @QuietDevil , this solved the issue for me as well, also using Proxmox (I also have PCI-E 6 pin power to the mobo, but that didn’t change anything).
WOW, you da man!
I was plagued with those errors; now they have finally quieted down.
I assume this just suppresses the errors and there’s still a hardware issue at the core? Going to see if it affects any performance, I’m running latest Unraid.
Thanks again for the info and screenshots!
I actually went back and re-enabled ACS, since disabling it broke IOMMU grouping and prevented many devices from being passed through to VMs.
I am getting errors still, but not as many as before.
Sep 19 19:33:38 SelaNAS kernel: pcieport 0000:80:01.1: AER: Corrected error received: 0000:80:01.1
Sep 19 19:33:38 SelaNAS kernel: pcieport 0000:80:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
Sep 19 19:33:38 SelaNAS kernel: pcieport 0000:80:01.1: device [1022:1483] error status/mask=00000040/00000000
Sep 19 19:33:38 SelaNAS kernel: pcieport 0000:80:01.1: [ 6] BadTLP
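The status value in the log above can be decoded by hand: 00000040 has only bit 6 set, which matches the "[ 6] BadTLP" line. To see which port logs what over a longer period, a small script can tally these lines. The sketch below is a rough example under a few assumptions: the names `decode_status` and `tally` are made up, the regex expects the "device [vvvv:dddd] error status/mask=" layout shown here, and the bit table only applies to lines with severity=Corrected.

```python
import re
from collections import Counter

# Bit positions in the AER Correctable Error Status register, with the
# short names the Linux kernel prints for them.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

# Matches e.g. "0000:80:01.1: device [1022:1483] error status/mask=00000040/00000000"
STATUS_RE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"device \[[0-9a-f]{4}:[0-9a-f]{4}\] "
    r"error status/mask=(?P<status>[0-9a-f]{8})/[0-9a-f]{8}"
)


def decode_status(status):
    """Return the names of all correctable-error bits set in `status`."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if status & (1 << bit)]


def tally(lines):
    """Count (device address, error name) pairs across kernel log lines."""
    counts = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            for name in decode_status(int(m.group("status"), 16)):
                counts[(m.group("bdf"), name)] += 1
    return counts
```

Running `tally(open("/var/log/syslog"))` (the path is an assumption, it depends on the distro) gives a per-port count, which makes it easier to tell whether a BIOS change reduced the error rate.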
Is there a way to determine the offending hardware?
What is 80:01.1 specifically?
This is the lspci output for this device:
lspci -vvnnn -s 80:01.1
80:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483] (prog-if 00 [Normal decode])
Subsystem: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1453]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin ? routed to IRQ 36
IOMMU group: 18
Bus: primary=80, secondary=81, subordinate=81, sec-latency=0
I/O behind bridge: [disabled] [32-bit]
Memory behind bridge: f3900000-f39fffff [size=1M] [32-bit]
Prefetchable memory behind bridge: [disabled] [64-bit]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0
ExtTag+ RBE+
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 512 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #3, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s, Width x4
TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
Slot #23, PowerLimit 75W; Interlock- NoCompl+
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
Changed: MRL- PresDet- LinkState-
RootCap: CRSVisible+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+
AtomicOpsCap: Routing- 32bit+ 64bit+ 128bitCAS-
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, ARIFwd+
AtomicOpsCtl: ReqEn- EgressBlck-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00000 Data: 0000
Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1453]
Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol+
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
RootCmd: CERptEn+ NFERptEn+ FERptEn+
RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
ErrorSrc: ERR_COR: 8009 ERR_FATAL/NONFATAL: 0000
Capabilities: [270 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: LaneErr at lane: 1
Capabilities: [2a0 v1] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
Capabilities: [370 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
L1SubCtl2:
Capabilities: [380 v1] Downstream Port Containment
DpcCap: INT Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+
DpcCtl: Trigger:0 Cmpl- INT- ErrCor- PoisonedTLP- SwTrigger- DL_ActiveErr-
DpcSta: Trigger- Reason:00 INT- RPBusy- TriggerExt:00 RP PIO ErrPtr:1f
Source: 0000
Capabilities: [400 v1] Data Link Feature <?>
Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [440 v1] Lane Margining at the Receiver <?>
Capabilities: [488 v1] Designated Vendor-Specific: Vendor=1002 ID=0001 Rev=1 Len=68 <?>
Kernel driver in use: pcieport
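On the question of what 80:01.1 is: the output above shows it is a root port (GPP bridge), and its "Bus:" line says secondary=81, so whatever is plugged into that x4 link sits on bus 81. The "ErrorSrc: ERR_COR: 8009" field is a 16-bit requester ID that decodes back to 80:01.1, i.e. the port itself reported the bad TLP it received over that link. The sketch below shows both lookups; the function names are made up for illustration, and the ID decoding assumes the classic bus/device/function split (not ARI).

```python
import re


def bridge_downstream_bus(lspci_text):
    """Return the secondary bus (hex string) from `lspci -vv` bridge output.

    Devices attached to the bridge's link are enumerated on this bus.
    Returns None if no "Bus: primary=..., secondary=..." line is found.
    """
    m = re.search(r"Bus: primary=[0-9a-f]{2}, secondary=([0-9a-f]{2})", lspci_text)
    return m.group(1) if m else None


def decode_requester_id(rid):
    """Split a 16-bit requester ID into 'bus:device.function'.

    Layout (non-ARI): bits 15-8 bus, bits 7-3 device, bits 2-0 function.
    """
    bus = (rid >> 8) & 0xFF
    dev = (rid >> 3) & 0x1F
    fn = rid & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"
```

With the secondary bus in hand, `lspci -s 81:` should list the device behind the port, and `lspci -tv` shows the same relationship as a tree. That device, its slot, or any riser or cable in between would be the first things to check.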