Continuing the discussion from Escaping the sprawl (rearchitecting homeprod):
Hardware summary:
- Xeon 2630L v4 (10c/20t)
- 4x64GB of 2133 ECC LRDIMMs, 4 rank (vendor / product link)
- Asus Sabertooth x99, bios 4101
  - 2630L-v4 supported since bios 3001
  - 2133 4 rank R-DIMMs have been on the QVL since 2015 (https://www.asus.com/us/supportonly/sabertooth_x99/helpdesk_qvl/), though not these specific modules.
Back history / troubleshooting so far:
Recently swapped this machine from its initial 5820k + 4x4GB DDR4-3000 setup, which it had run from 2015 onwards, to the above. Initially bought the Xeon and 1 stick of the ECC, then ran Memtest86+ for a day. No issues reported. Installed the other 3 sticks, reran Memtest86+ for another day, and again no issues reported.
Went into Windows, started running synthetic workloads to beat up the RAM, and got the following error in Event Viewer:
WHEA-Logger Event ID 47:
A corrected hardware error has occurred.
Component: Memory
Error Source: Unknown Error Source
The details view of this entry contains further information.
Details:
<blah blah blah blah system blah>
...
<Data Name="ErrorSource">0</Data>
<Data Name="FRUId">{00000000-0000-0000-0000-000000000000}</Data>
<Data Name="FRUText" />
<Data Name="ValidBits">0x2</Data>
<Data Name="ErrorStatus">0x0</Data>
<Data Name="PhysicalAddress">0x1ab6636900</Data>
<Data Name="PhysicalAddressMask">0x0</Data>
<Data Name="Node">0x0</Data>
<Data Name="Card">0x0</Data>
<Data Name="Module">0x0</Data>
<Data Name="Bank">0x0</Data>
<Data Name="Device">0x0</Data>
<Data Name="Row">0x0</Data>
<Data Name="Column">0x0</Data>
<Data Name="BitPosition">0x0</Data>
<Data Name="RequesterId">0x0</Data>
<Data Name="ResponderId">0x0</Data>
<Data Name="TargetId">0x0</Data>
<Data Name="ErrorType">0</Data>
<Data Name="Extended">0</Data>
<Data Name="RankNumber">0</Data>
<Data Name="CardHandle">0</Data>
<Data Name="ModuleHandle">0</Data>
<Data Name="Length">888</Data>
Across all instances, the errors were at some 0x1… physical address. There were a few direct repeats of the same address, but not always. The system also intermittently hangs entirely.
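To make the "repeats vs. scattered" pattern easier to see, the addresses can be tallied from the event details. This is a rough sketch only: it assumes the WHEA entries have been saved out of Event Viewer as text containing the same `<Data Name="PhysicalAddress">` lines shown above, the function name `tally_addresses` is made up for illustration, and the 64 GiB window is an assumption chosen to match the size of one stick.

```python
import re
from collections import Counter

# Matches the PhysicalAddress field exactly as it appears in the details view.
ADDR_RE = re.compile(r'<Data Name="PhysicalAddress">(0x[0-9a-fA-F]+)</Data>')

GIB = 1 << 30


def tally_addresses(xml_text, window_gib=64):
    """Count errors per exact address and per address window.

    xml_text   -- text containing one or more WHEA event detail dumps
    window_gib -- window size in GiB (assumed here to be one stick's worth)
    Returns (per_address, per_window) as two Counter objects.
    """
    addresses = [int(value, 16) for value in ADDR_RE.findall(xml_text)]
    per_address = Counter(addresses)
    per_window = Counter(addr // (window_gib * GIB) for addr in addresses)
    return per_address, per_window


def report(xml_text):
    """Print the most frequently hit addresses and windows."""
    per_address, per_window = tally_addresses(xml_text)
    for addr, count in per_address.most_common():
        print(f"{addr:#x}: {count} hit(s)")
    for window, count in sorted(per_window.items()):
        print(f"window {window} ({window * 64}-{(window + 1) * 64} GiB): {count} hit(s)")
```

For the single address shown above, 0x1ab6636900, the window index comes out as 1, i.e. the 64 to 128 GiB range.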
Was advised that Memtest86+ doesn’t do ECC checking, but Memtest86 does (ref: MemTest86 V10 vs MemTest86+ V6 comparison - PassMark Support Forums). So I grabbed Memtest86 and ran it for 4 passes / 48 hrs. It came back clean. But back in Windows, I’m still getting the ECC errors.
Grabbed DmiDecode for Windows to pull details (DmiDecode for Windows @ SourceForge). Open CMD as Admin (it needs that for the hardware access), run the exe, and redirect the output to a text file:
.\dmidecode.exe > dmioutput.txt
Trawl around in there for the following (in DMI every address is going to be printed as 0x0#…; Windows 10 Pro appears to drop the leading 0 when reporting physical addresses):
Handle 0x0066, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x01000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 64 GB
From there, scroll up. The output format is DMI type 17 per slot, then DMI type 20 if the slot is populated. Motherboard was nice enough to tell me which slot:
Handle 0x0065, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0060
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 72 bits
Size: 32767 MB
Form Factor: RIMM
Set: None
Locator: DIMM_B1
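The lookup above can be scripted rather than done by scrolling. A minimal sketch, assuming the dmidecode text output has the layout shown here (each type 20 record directly following the type 17 record for its slot); `mapped_ranges` and `slot_for` are hypothetical helper names, not part of dmidecode.

```python
import re

# Record header line, e.g. "Handle 0x0066, DMI type 20, 35 bytes"
HANDLE_RE = re.compile(r"^Handle (0x[0-9A-Fa-f]+), DMI type (\d+),")


def mapped_ranges(dmi_text):
    """Return a list of (start, end, locator) tuples from dmidecode output.

    Assumes a type 20 record belongs to the most recent type 17 record
    above it, which is how the output in this post is laid out.
    """
    ranges = []
    dmi_type = None
    locator = None  # Locator of the most recent type 17 record
    start = None
    for raw in dmi_text.splitlines():
        line = raw.strip()
        header = HANDLE_RE.match(line)
        if header:
            dmi_type = int(header.group(2))
            start = None
            continue
        if dmi_type == 17 and line.startswith("Locator:"):
            locator = line.split(":", 1)[1].strip()
        elif dmi_type == 20 and line.startswith("Starting Address:"):
            start = int(line.split(":", 1)[1].strip(), 16)
        elif dmi_type == 20 and line.startswith("Ending Address:") and start is not None:
            end = int(line.split(":", 1)[1].strip(), 16)
            ranges.append((start, end, locator))
    return ranges


def slot_for(address, ranges):
    """Return the locator whose mapped range contains address, or None."""
    for start, end, locator in ranges:
        if start <= address <= end:
            return locator
    return None
```

Usage would be along the lines of `slot_for(0x1ab6636900, mapped_ranges(open("dmioutput.txt").read()))`. This only reflects what the firmware reports in SMBIOS; whether those ranges match how the memory controller actually lays out addresses across channels is an assumption.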
For a blessing, the locator actually lined up with what was printed on the board itself. Pulled the offending stick, reran the testing… still memory errors. Still 0x1-something. Reran the above, the C1 slot now owns the 0x1 address space. However now it’s mostly the exact same physical address, repeatedly, and not an address it was complaining about previously. (So maybe not the CPU’s memory controller having a bad time. Maybe. Not ruled out yet.)
Casually poked through the reference spec sheet to get a better idea of expected behaviors (was seeing ‘type: < OUT OF SPEC >’ in the dmidecode output, but the rest was fine). This board reports SMBIOS 3.0, and DDR4 is in the 3.0 spec.
https://www.dmtf.org/dsp/DSP0134
But ‘< OUT OF SPEC >’ seems to be a fallback output in the app itself (line 2834) - dmidecode/dmidecode.c at master · mirror/dmidecode · GitHub. So… that’s weird. But all the sticks are doing that, and not all of them are misbehaving? Probably irrelevant, anyway.
Pulled the RAM out of C1 and reran dmidecode. Slot D1 now owns the 0x01… address space, so it went into the firing line for ‘if the memory controller has problems with the 0x01… address space’. D1 is the primary slot on the board, and held the single stick I had installed previously.
And yet, the errors persist. Still in the 0x01 address space, but it took a lot longer to finally throw one (over an hour, rather than within the first 15 minutes or so). It also only threw one, rather than the double digits of instances C1 had.
Took the stick that was previously in C1 and put it into B1. First boot made it to Windows and then froze up; on the second boot, neither Windows nor dmidecode saw all three RAM sticks. Rebooted to BIOS, and the BIOS didn’t see the stick either. Reseated it, and it still wasn’t seen in slot B1. Moved that stick to slot C1, and BIOS and Windows can see it again, so the RAM stick is fine…?
My question here is this:
At what point is it not the RAM, and instead the CPU and/or motherboard? The errors seem to follow address spaces and not RAM sticks, and also follow address spaces and not slots. I don’t have another Xeon on hand to test with, and the other LGA 2011-3 CPUs I have don’t know what to do with ECC R-DIMMs.
Alternately, have I missed something in the troubleshooting so far?