And I’m back - a couple weeks ago I’d posted about doing virtualization on this thing but it never reached that point.
I’ll try to keep it concise. I’m seeing several hundred WHEA Event ID 17 warnings per second in the event log (below), and eventually it hits something uncorrectable and BSODs with “WHEA_UNCORRECTABLE_ERROR”.
A corrected hardware error has occurred.
Component: PCI Express Root Port
Error Source: Advanced Error Reporting (PCI Express)
Bus:Device:Function: 0x0:0x2:0x0
Vendor ID:Device ID: 0x8086:0x6F02
Class Code: 0x30400
Now, 0x6F02 points to CPU root port 1, but the log is a mix of 0x6F02, 0x6F04, and 0x6F08, which are all CPU PCIe root ports. Enumerated on these root ports are a Mellanox ConnectX-3 HBA and two GTX 1080 Tis, respectively. Removing a card removes its instance of the corrected WHEA error, but the problem doesn’t go away: the other two root ports keep throwing warnings until something errors out. The X99 PCH root ports are not affected by any of this.
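For anyone who wants to check how the warnings split across the three root ports, here is a rough Python sketch that tallies events by Vendor ID:Device ID. It assumes the event messages have already been exported as plain-text strings with the same field labels as the message quoted above; the function names `parse_event` and `tally_by_device` are made up for illustration and are not part of any Windows tooling.

```python
import re
from collections import Counter

# Patterns for the two fields that appear in the WHEA event text quoted above.
BDF_RE = re.compile(
    r"Bus:Device:Function:\s*(0x[0-9A-Fa-f]+):(0x[0-9A-Fa-f]+):(0x[0-9A-Fa-f]+)"
)
ID_RE = re.compile(r"Vendor ID:Device ID:\s*(0x[0-9A-Fa-f]+):(0x[0-9A-Fa-f]+)")


def parse_event(text):
    """Return the BDF and vendor/device IDs from one event message, or None."""
    bdf = BDF_RE.search(text)
    ids = ID_RE.search(text)
    if not (bdf and ids):
        return None
    bus, dev, fn = (int(x, 16) for x in bdf.groups())
    vendor, device = (int(x, 16) for x in ids.groups())
    return {"bdf": (bus, dev, fn), "vendor": vendor, "device": device}


def tally_by_device(events):
    """Count events per 'vendor:device' pair; unparseable messages are skipped."""
    counts = Counter()
    for text in events:
        parsed = parse_event(text)
        if parsed:
            counts[f"{parsed['vendor']:04x}:{parsed['device']:04x}"] += 1
    return counts
```

Feeding it the exported messages should show whether one root port dominates or whether the warnings are spread evenly across all three, which would point away from any single card.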
I haven’t been able to replicate any of these problems on Linux, and I haven’t seen any of the usual “PCIe Bus Error: severity=Corrected” messages that I’d expected from the kernel.
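To rule out simply missing those messages by eye, a small Python sketch like the one below can scan saved kernel log lines (for example, the output of `dmesg` written to a file) and count corrected AER reports per device address. The exact line format varies between kernel versions, so the regular expression is an assumption based on the commonly seen `<domain:bus:dev.fn>: PCIe Bus Error: severity=Corrected` form, and `count_corrected` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed format: "<driver> 0000:00:02.0: PCIe Bus Error: severity=Corrected, ..."
# Kernel versions differ in the surrounding text, so only the address and the
# severity marker are matched.
AER_RE = re.compile(
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):\s*"
    r"PCIe Bus Error:\s*severity=Corrected"
)


def count_corrected(lines):
    """Count corrected PCIe AER reports per PCI address in kernel log lines."""
    counts = Counter()
    for line in lines:
        match = AER_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts
```

An empty result over a full boot log would support the observation that the root ports are quiet under Linux, assuming AER reporting is actually enabled in that kernel.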
I have no idea what’s going on here. Reinstalled the OS to no effect, but while I burn this Win10 Pro ISO to a boot stick, has anyone seen anything quite like this before?
Relevant specs for those interested: i7-6950X on Asus X99-WS/IPMI, 2x GTX 1080 Ti, 1x Mellanox MCX354A, 128GB (8x16GB) DDR4-3200 (tested everything down to 2133), Windows Server 2016 Datacenter.