Win Server 2016 Reporting PCIe Carnage - Thoughts?

And I’m back - a couple weeks ago I’d posted about doing virtualization on this thing but it never reached that point.

I’ll try to keep it concise. I’m seeing several hundred WHEA Event ID 17 warnings per second in the event log (below), and eventually it runs into something uncorrectable and just BSODs with “WHEA_UNCORRECTABLE_ERROR”.

A corrected hardware error has occurred.

Component: PCI Express Root Port
Error Source: Advanced Error Reporting (PCI Express)

Bus:Device:Function: 0x0:0x2:0x0
Vendor ID:Device ID: 0x8086:0x6F02
Class Code: 0x30400

Now, 0x6F02 points to CPU root port 1, but the log is a mix of 0x6F02, 0x6F04, and 0x6F08, which are all CPU PCIe root ports. Enumerated on these root ports are a Mellanox ConnectX-3 adapter and two GTX 1080 Tis respectively. Removing a card removes its instance of the corrected WHEA error, but the problem doesn’t go away: the other two root ports just keep throwing warnings until something errors out. The X99 PCH root ports aren’t affected by any of this.
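For anyone wanting to check which root ports dominate the log, here is a minimal sketch that tallies events by vendor:device ID. It assumes the event details have been exported as plain text in the same layout as the excerpt above. The function name `tally_by_device` and the sample text are made up for illustration and are not real log data.

```python
import re
from collections import Counter

# Matches lines such as "Vendor ID:Device ID: 0x8086:0x6F02".
ID_RE = re.compile(r"Vendor ID:Device ID:\s*0x([0-9A-Fa-f]{4}):0x([0-9A-Fa-f]{4})")

def tally_by_device(text):
    """Count WHEA event entries per vendor:device ID (hypothetical helper)."""
    return Counter(f"{v.lower()}:{d.lower()}" for v, d in ID_RE.findall(text))

# Illustrative sample only; the counts here are invented.
sample = "\n".join([
    "Component: PCI Express Root Port",
    "Vendor ID:Device ID: 0x8086:0x6F02",
    "Component: PCI Express Root Port",
    "Vendor ID:Device ID: 0x8086:0x6F04",
    "Component: PCI Express Root Port",
    "Vendor ID:Device ID: 0x8086:0x6F02",
])

counts = tally_by_device(sample)
for dev, n in counts.most_common():
    print(dev, n)
```

If one ID accounted for nearly all entries, that would point at a single slot or card; an even spread across all three IDs fits what I’m seeing.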

I haven’t been able to replicate any of these problems on Linux, and I haven’t seen any of the usual “PCIe Bus Error: severity=Corrected” messages that I’d expected from the kernel.

I have no idea what’s going on here. Reinstalled the OS to no effect, but while I burn this Win10 Pro ISO to a boot stick, has anyone seen anything quite like this before?

Relevant specs for those interested: i7-6950X on ASUS X99-WS/IPMI, 2x GTX 1080 Ti, 1x Mellanox MCX354A, 128GB (8x16GB) DDR4-3200 (tested everything down to 2133), Windows Server 2016 Datacenter.

Have you tried backing this off to 2666? 3200 is a “robust” overclock for X99 with 128GB.

Edit: to clarify, “robust” is British for “a bit risky”.


I’d echo the memory settings; also check for a BIOS/UEFI update. Server 2016 is also old now, so make sure you have all updates if it’s an old copy you are installing from.

@Airstripone I’m British, never heard anyone use robust to mean ‘risky’ and I even checked with my kids that I’m not out-of-touch on the latest lingo :rofl:

Yep, it’s pretty spicy, especially with all those DIMMs, but I’ve tried everything down to a single stick at 2133 with no change. Guess I should’ve noted that, huh. Edited the post for future commenters.

Already flashed the BIOS to the latest version from the middle of last year.

EDIT

Just booted Arch Linux again and am seeing the same thing over there -

pcieport 0000:00:01.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
pcieport 0000:00:01.0: AER:   device [8086:6f02] error status/mask=00001000/00000000
pcieport 0000:00:01.0: AER:   [12] Timeout
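For reference, the status word in that log can be decoded by hand. Below is a minimal sketch, assuming the standard bit layout of the PCIe AER Correctable Error Status register. The helper names `decode_correctable` and `decode_log_line` are made up for illustration, and skipping masked bits is my own choice here rather than anything the kernel output dictates.

```python
import re

# Bit positions in the PCIe AER Correctable Error Status register.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

# Matches the "error status/mask=XXXXXXXX/XXXXXXXX" field of an AER log line.
STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")

def decode_correctable(status, mask=0):
    """Return the names of status bits that are set and not masked."""
    active = status & ~mask
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if active & (1 << bit)]

def decode_log_line(line):
    """Pull status/mask out of one log line and decode it (hypothetical helper)."""
    m = STATUS_RE.search(line)
    if m is None:
        return []
    status, mask = (int(g, 16) for g in m.groups())
    return decode_correctable(status, mask)

line = "pcieport 0000:00:01.0: AER:   device [8086:6f02] error status/mask=00001000/00000000"
print(decode_log_line(line))  # ['Replay Timer Timeout']
```

So 00001000 is bit 12 alone, a replay timer timeout, which is a Data Link Layer error and matches the “[12] Timeout” line the kernel printed.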

Leaning towards just replacing the motherboard right now and calling it a day.

Have you reseated the CPU? Maybe a pin isn’t connecting.

But yes, it looks like the mobo is on the fritz.