[Solved] X99 + Xeon v4 ECC troubleshooting

Continuing the discussion from Escaping the sprawl (rearchitecting homeprod):

Hardware summary:

Back history / troubleshooting so far:

Recently swapped this machine from its initial 5820k + 4x4GB DDR4-3000 setup, which it had run since 2015, to the above. Initially bought the Xeon and one stick of the ECC, then ran Memtest86+ for a day: no issues reported. Installed the other 3 sticks and reran Memtest86+ for another day: still no issues reported.

Go into Windows, start running synthetic workloads to beat up the RAM, get the following error in Event Viewer:

WHEA-Logger Event ID 47:

A corrected hardware error has occurred.

Component: Memory
Error Source: Unknown Error Source

The details view of this entry contains further information.

Details:
<blah blah blah blah system blah>
...
<Data Name="ErrorSource">0</Data> 
  <Data Name="FRUId">{00000000-0000-0000-0000-000000000000}</Data> 
  <Data Name="FRUText" /> 
  <Data Name="ValidBits">0x2</Data> 
  <Data Name="ErrorStatus">0x0</Data> 
  <Data Name="PhysicalAddress">0x1ab6636900</Data> 
  <Data Name="PhysicalAddressMask">0x0</Data> 
  <Data Name="Node">0x0</Data> 
  <Data Name="Card">0x0</Data> 
  <Data Name="Module">0x0</Data> 
  <Data Name="Bank">0x0</Data> 
  <Data Name="Device">0x0</Data> 
  <Data Name="Row">0x0</Data> 
  <Data Name="Column">0x0</Data> 
  <Data Name="BitPosition">0x0</Data> 
  <Data Name="RequesterId">0x0</Data> 
  <Data Name="ResponderId">0x0</Data> 
  <Data Name="TargetId">0x0</Data> 
  <Data Name="ErrorType">0</Data> 
  <Data Name="Extended">0</Data> 
  <Data Name="RankNumber">0</Data> 
  <Data Name="CardHandle">0</Data> 
  <Data Name="ModuleHandle">0</Data> 
  <Data Name="Length">888</Data> 

Across all instances the errors were on some 0x1… Physical Address. A few direct repeats of the same address, but not always. System also intermittently hangs entirely.
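To check the "direct repeats vs. scattered addresses" pattern without eyeballing every event, a rough sketch like the one below can tally the PhysicalAddress fields. It assumes you have pasted the details XML of several WHEA events into one string. The second address in the sample is made up purely to show the counting, and `count_addresses` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the PhysicalAddress field as it appears in the WHEA event details view.
ADDRESS_FIELD = re.compile(r'<Data Name="PhysicalAddress">(0x[0-9a-fA-F]+)</Data>')

def count_addresses(event_xml: str) -> Counter:
    """Count how often each physical address appears across the pasted events."""
    return Counter(int(m, 16) for m in ADDRESS_FIELD.findall(event_xml))

# Hypothetical input: the first address is the one from the event above,
# the other is invented only to demonstrate a non-repeating entry.
sample = """
<Data Name="PhysicalAddress">0x1ab6636900</Data>
<Data Name="PhysicalAddress">0x1ab6636900</Data>
<Data Name="PhysicalAddress">0x1c00000040</Data>
"""

for address, hits in count_addresses(sample).most_common():
    print(f"{address:#x}: {hits}")
```

Sorting by count makes it easy to see whether one address dominates or the errors are spread across the whole 0x1… block.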

Was advised that Memtest86+ doesn’t do ECC checking, but Memtest86 does (ref: MemTest86 V10 vs MemTest86+ V6 comparison - PassMark Support Forums). So I grabbed Memtest86 and ran that for 4 passes / 48 hrs. Came back clean. But when I go back into Windows, still getting the ECC errors.

Grabbed DmiDecode for Windows to pull details (DmiDecode for Windows @ SourceForge). Open CMD as Admin (needs it for the hardware access), run the exe, and redirect its output to a text file:
.\dmidecode.exe > dmioutput.txt

Trawl around in there for the following (in the DMI output every address is written as 0x0…, with a leading zero; Windows 10 Pro appears to drop that leading 0 when reporting physical addresses):

Handle 0x0066, DMI type 20, 35 bytes
Memory Device Mapped Address
	Starting Address: 0x01000000000
	Ending Address: 0x01FFFFFFFFF
	Range Size: 64 GB

From there, scroll up. The output format is DMI type 17 per slot, then DMI type 20 if the slot is populated. Motherboard was nice enough to tell me which slot:

Handle 0x0065, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x0060
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 72 bits
	Size: 32767 MB
	Form Factor: RIMM
	Set: None
	Locator: DIMM_B1

For a blessing, the locator actually lined up with what was printed on the board itself. Pulled the offending stick, reran the testing… still memory errors. Still 0x1-something. Reran the above, the C1 slot now owns the 0x1 address space. However now it’s mostly the exact same physical address, repeatedly, and not an address it was complaining about previously. (So maybe not the CPU’s memory controller having a bad time. Maybe. Not ruled out yet.)
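The manual lookup described above (find the type 20 range that contains the WHEA address, then read the Locator from the type 17 block just before it) could be sketched roughly as follows. This leans on the ordering seen in this particular output, type 17 followed by its type 20, which may not hold on every board. If your type 20 blocks carry a handle field pointing back at the memory device, matching on that would likely be more robust. `locate` is a hypothetical helper, and the sample text is trimmed from the excerpts above. Parsing with `int(value, 16)` also sidesteps the leading-zero difference between DMI and Windows.

```python
import re

HANDLE = re.compile(r"^Handle (0x[0-9A-Fa-f]+), DMI type (\d+),")

def locate(dmi_text: str, address: int):
    """Return the Locator of the slot whose mapped range contains address, else None."""
    dmi_type = None
    last_locator = None   # Locator of the most recent type 17 block
    start = None
    for raw in dmi_text.splitlines():
        line = raw.strip()
        header = HANDLE.match(line)
        if header:
            dmi_type = int(header.group(2))
            start = None
            continue
        key, _, value = line.partition(": ")
        if dmi_type == 17 and key == "Locator":
            last_locator = value
        elif dmi_type == 20 and key == "Starting Address":
            start = int(value, 16)
        elif dmi_type == 20 and key == "Ending Address" and start is not None:
            if start <= address <= int(value, 16):
                return last_locator
    return None

# Trimmed copy of the two blocks quoted above.
sample = """\
Handle 0x0065, DMI type 17, 40 bytes
Memory Device
    Size: 32767 MB
    Locator: DIMM_B1

Handle 0x0066, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x01000000000
    Ending Address: 0x01FFFFFFFFF
    Range Size: 64 GB
"""

print(locate(sample, 0x1ab6636900))
```

Treat the result as "the slot the firmware says owns this range". As the rest of this thread shows, that is not necessarily the stick that is actually at fault.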

Casually poked through the reference spec sheet to get a better idea of expected behaviors (was seeing ‘type: < OUT OF SPEC >’ but the rest was fine). SMBIOS 3.0 on this board, and DDR4 is in the 3.0 spec. :thinking:
https://www.dmtf.org/dsp/DSP0134

But that seems to be a fallback output in the app itself (line 2834): dmidecode/dmidecode.c at master · mirror/dmidecode · GitHub. So… that’s weird. But they’re all doing that, and not all of them are misbehaving? :thinking: Probably irrelevant, anyway.

Pulled the RAM out of C1, reran dmidecode, seems slot D1 went into the firing line for ‘if the memory controller has problems with 0x01… address space’. D1 is the primary slot on the board, and was the single stick I had installed previously.

And yet, the errors persist. Still in the 0x01 address space, but it took a lot longer (over an hour, rather than the first 15 minutes or so) to finally throw. It also only threw one, rather than double digits of instances like C1 had.

Took the stick previously in C1 and put it into B1. First boot made it to Windows and then froze up; on the second boot, neither Windows nor dmidecode saw all three RAM sticks. Rebooted to BIOS, and the BIOS doesn’t see the stick either. Reseated it, still not seen in slot B1. Moved that stick to slot C1, BIOS and Windows can see it again, so the RAM stick is fine…? :thinking:


My question here is this:

At what point is it not the RAM, and instead the CPU and/or motherboard? It seems to follow address spaces and not RAM sticks, and also follows address spaces and not slots. I don’t have another Xeon on hand to test with, the other 2011-3 CPUs I have don’t know what to do with ECC R-DIMMs.

Alternately, have I missed something in the troubleshooting so far?

Thoughts:

  • It is unclear why you chose to change your long-lasting and obviously stable system. I can see the desire to upgrade from 6 cores to 10, but why use slower RAM? Have you tried using the DDR4-3000 non-ECC RAM? Maybe downclocked to a certified/supported speed?
  • What temperature do you observe in the RAM? Is it possible that the issues are only occurring after the system has been on or after load has been created?
  • Did you try to reseat the RAM? A loose or insecure connection can cause these issues.

Ah, fair point: System is no longer for gaming, and is being swapped into home server duty. It was a swap from 6c/12t 140W space heater to 10c/20t 65W space heater, and the RAM speed drop was trying to minimize funkiness with x99 and/or the early DDR4 IMCs. 2133 is the only thing on the QVL for ECC RDIMMs on this board - was hoping it’d be the sane path. Alas.

Reseating: Yep, did try with one of the believed problematic sticks, and it dropped out entirely on that slot. I put it in a different slot and it was picked up just fine. I have yet to try dropping the other stick into the problematic slot to see if that was a one off or if the slot went bad.

RAM temps - not entirely sure. But issues have crept in both immediately after boot, as well as 15-20 min after starting the tests. Longest so far was over an hour, with only 2 slots populated. I retested just now and depending on which sensor we’re looking at the RAM is either 56C in front of the CPU, 54C behind the CPU tower, or the RAM is ~67C as reported on the sticks themselves.

Well, generally, the goal of using ECC RAM is to improve stability of a system. Arguably, this is currently not being achieved.

The mobo typically shows optimal slots for memory use. Am I right to assume that the “problematic slot” is one of the optimal slots for a 4 stick configuration?

Can you direct fans / air flow onto the RAM sticks and see if the behavior goes away?

Couldn’t have said it better myself. :joy:


For reference for the rest:

When I had it set up initially with the 4 sticks for both the gaming setup and the ECC setup, it was in A1, B1, C1, D1. Errors have been reported so far on the sticks in B1, D1, and C1. But never at the same time. It’s only ever one stick that reports problems in a specific configuration, so far.

Initially it was slot B1 that was complaining. When I removed that stick, then slot C1 started complaining. When I removed that stick, slot D1 started complaining. A1, in the back, hasn’t complained so far.


Did another test run. One section of the test thrashes the RAM hard enough that the sticks are hitting 80C (and more errors pop while up there). I’ll have to find some cardboard and cannibalize a fan out of another build. Not how I wanted to run this system, with fans on every single panel, but… :yay:

Added and/or rearranged the fans. ‘Idle’ temp with two Arctic P14 140mm as intake on the top of the case, blowing across the ram banks at full speed, was ~34C in front of the CPU, ~40C behind it. Started up the workload, had memory errors pop before the RAM got past 45C. Slot B1, immediately behind the CPU, and the warmest of the 4. Still 0x1… address block.

Starting to think this machine exists only to drive me crazy. :crazy_face: Given that it’s always whatever the BIOS thinks is the second stick of RAM, I might have to try pulling 3 of the 4 again and retesting the slots individually. Or is that crazy talk?

I think the main hypotheses for the cause of your issues are

  • heat buildup in RAM during operation (although I think that has been debunked with your latest test)
  • hw issue in one or more RAM sticks. Yes, use only one stick at a time and try to reproduce the issue. The good news is that the issue seems easy to repro. After confirming (through extended use) that each stick works on its own, proceed to connecting all sticks. Use the recommended slots as documented in your mobo manual. Assuming you can repro the issue, pull out the offending stick. Validate that the system is stable using the remaining 3 sticks. Then replace any one of the three working sticks with the offending stick and try to repro (only 3 sticks in use, same slots as tested good before). If the system is again stable, I’d call all 4 sticks ok.
  • less than ideal connection of RAM in slot. Not sure how to guide you to a solution here other than “reseat RAM stick” and listen to “click” as it enters slot. Watch the latches and make sure all are fully closed. I find RAM seating harder with each generation of DDR memory and experienced issues seating RAM myself.
  • hw issue in mobo (slot or otherwise). Using three RAM sticks that have been confirmed ok in previous steps, move any one into the slot location that threw error message. Try to repro issue. If not possible to repro, I’d call this one debunked.
  • firmware issue in mobo. This would be a sticky one. You won’t be able to convince ASUS to look into the issue. Maybe you can find old online reports of successful ECC use with specific firmware versions. Maybe you can try another firmware version. But honestly, this is not a path I would go down far.
  • CPU issue in memory controller. Possible. Unlikely. Requires different CPU to test. Not sure I would investigate much further. At this point I would strongly consider going back to old non-ECC RAM that has been reliable for many years and call it a day.

Oh. I forgot to ask, for completeness’ sake: you did change your memory configuration in the BIOS from the old DDR4-3000 to whatever JEDEC config the ECC RAM is comfortable with, right?
:wink:

I’ll also recommend saving BIOS profile and manually clearing CMOS before every test/after every hardware change.

Maybe stating the obvious, but did you plug in both cpu power connectors?
I’ve had issues with an x99 Asus board and ecc ram when only the 4+4 pin was plugged in.


While not ideal, if ECC single bit errors are being corrected, then ECC is doing its job and everything is working. I have a machine that has been running for 4+ years of uptime with ~300 single bit errors corrected and no problems.

If you get double bit errors, then you’ll get kernel panics and you’ll need to start changing things.

On older CPUs I’ve had MCs go “weak” and need a 100-200mV bump in DDR voltage, or a smaller tREFI interval to keep within spec.


Yep, both the 8pin and the 4 pin are plugged in.

Yep, I’d reset it to default (2133 15-15-15-36 1T) rather than leaving it at the prior kit’s spec. :stuck_out_tongue_winking_eye:

This is what I think has been happening when the system hangs, but it’s neither logged nor BSOD’ing, just freezes and requires a power cycle. That behavior hasn’t been as easy to reproduce on demand, but so far has only happened while the box is loaded up with this test load (i.e. hasn’t done it just idling on the desktop). At least as observed so far…

You’re right that the ECC errors I’ve listed so far are best described as ‘working as intended’. It’s the intermittent hang that got me to take a deeper look at this.


Currently working on testing the sticks individually in slot D1. Since it used to fail within the first hour in multi-stick, giving it 3 hrs for single stick before moving on to the next. At least for the first pass, anyway. Hopefully know more tomorrow once I’ve had a chance to cycle them all.

Yay, progress.

3 of the 4 sticks have tested out just fine solo, even with temps climbing as high as 65C (fans weren’t on full blast those runs - forgot to for the first one, decided to hard mode the rest as well). Subjectively, the machine hitches and stutters less with a single DIMM, too - both immediately after boot, and just mousing around on the desktop once it’s settled down.

The 4th threw 24 errors in 3.5hrs of work, and was happening within the first hour. So pulled it, reseated it, and retested. Still complaining, and does so within the first 20 minutes. Nothing abnormal about temps vs. the other sticks, voltages seemed fine, etc.

Pulled that stick and put the three known-good ones into slots D1, C1, and B1. Works fine, no errors thrown, no weird hitches, stutters, or hangs. That’s inclusive of the stick in B1 hitting 75C at one point, even with the fans maxed. Something probably wrong with the airflow pattern behind the CPU, but ok.

Threw the known-questionable stick into A1, reran test… and start seeing memory errors again, 19 in the span of 20 minutes or so. But from B1, not A1.
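For a sense of scale, the two counts quoted for the questionable stick can be put on the same per-hour footing. This is plain arithmetic on the figures above (24 errors in 3.5 hrs solo, 19 in roughly 20 minutes with all four installed). The 20-minute figure is approximate, so the second rate is too, and `errors_per_hour` is just an illustrative helper.

```python
def errors_per_hour(errors: int, minutes: float) -> float:
    """Convert an error count over a test window into errors per hour."""
    return errors / (minutes / 60)

solo = errors_per_hour(24, 3.5 * 60)   # questionable stick alone: 24 errors in 3.5 hrs
mixed = errors_per_hour(19, 20)        # same stick in A1 alongside the other three: 19 in ~20 min

print(f"{solo:.1f}/hr solo, {mixed:.1f}/hr mixed")
```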

Despite what the tools are telling me, removed the questionable stick from A1 and reran the test to go back to last known good. And… it all works again, no errors.


I don’t even want to try and crack the ‘why have literally all of the tools lied to me?’ question, at this point. Computers Barely Work, apparently including The Standard Diagnostic Tools that people swear by. Or swear at, in my case. :yay:

Off to RMA some RAM, and figure out a more permanent ducting solution. Thanks, all. :slightly_smiling_face:


Hi @Molly , sorry to hear about your issues but I’m glad that you seem to have gotten to the bottom of it at last!

I hope it’s not too intrusive to ask you a few questions and briefly hijack the post; information on using this motherboard with ECC memory and Xeons seems to be very scarce online.

I’m about to pull the trigger on buying some used ECC RAM and a Xeon CPU myself, and seeing as we share the same motherboard, I was hoping you could share the specifics of your setup, or perhaps any extra findings if you have some.

I’m considering going for a 64GB+ setup with either an e5-2699v3 or some other v4 CPU if I can find one at a comparable price. My concerns are mostly with power draw and support for the ECC memory.

The QVL is pretty limited and probably outdated, but I was wondering if you had any insights into what speeds/capacities/numbers of DIMMs would be supported, or at least wanted to share what you have tried following on from your latest post?

Last, with regards to power draw, I appreciate you might not have tested it yourself, but I was hoping to set my CPU up to go into higher C-states to save some power during inactivity, or potentially even periodically disable some cores if I go with one of the beefier Xeons. Are these features still supported on the motherboard when you install the Xeon CPUs? Or for that matter, did you observe any features not being available after switching from 5820k (which I myself drive at the moment too :stuck_out_tongue: )?


Hi there, welcome to the forum! :slightly_smiling_face:

I went from a 5820k + 4x4GB Corsair LPX DDR4-3000 in the initial build to a Xeon E5 2630L v4, and the config I was troubleshooting here was 256GB (4x64GB) of DDR4-2133 LRDIMMs.

For picking a CPU, the QVL is still current so long as you’re on the latest BIOS: SABERTOOTH X99 - Support

For memory, I took the L on RAM speed as the only thing I could find on the Sabertooth-specific RAM QVL was 2133 and x4 rank ECC being supported. I didn’t want to fuss with finding out if x8 would work, or if faster would work, but that’s because for my uses the machine is going to be doing background and/or not time critical things for me. It’s also sitting across a gigabit NIC, so it’s only really got to move faster than a HDD anyway.

The specific RAM I’d gone with was these: Hynix 64GB 4Rx4 PC4-2133P-L LRDIMM DDR4-17000 ECC Load Reduced Server Memory RAM | eBay. This seemed at the time to be the sweet spot for density, 128GB sticks have a big premium on them, as do the faster 64GB sticks. On the other hand, if you think you might upgrade to another DDR4 server platform sometime in the future, going for DDR4-3200 might make sense – it should run at 2133 or whatever below its max spec in the meantime, too.

Power draw and C-states: Does work with the chip I got, BIOS has to be configured to allow it (i.e. not on ‘performance’ in the EZ mode, the other two options will have it). I forget if I had to force enable using the C states, but the option is present in the BIOS to do so. Ditto for the option to disable cores and/or hyperthreading, too. Otherwise I will say I haven’t tried fussing with any of the overclock settings for either CPU or RAM but I’d be very surprised if those worked beyond setting the appropriate speeds and timings for the RAM sticks. On the other hand, though, my RAM is pulling more wattage than my CPU does, at least per HWinfo64 (~40W on the CPU, ~70W on the RAM when loaded up, off the top of my head). But it’s the L CPU, a 65W part that base clocks around 2GHz once past the boost timer.

Max RAM capacity should be determined by the CPU, since the memory controller’s in there. 8 slots on the board * whatever density of RAM, so long as it’s under the max the CPU’s ARK page says it should probably be fine. The one caveat to ARK being that when the 5000-series and Xeon v3s were new, DDR4 hadn’t hit its max density per stick yet. Going above might work, but ‘here be dragons’ and all that. Though practically speaking, hitting the 1.5TB max on my v4 chip would be breathtakingly expensive (256GB sticks are still like $2k+ each).
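The "slots times density, kept under the CPU's max" rule of thumb above works out like this. The 8 slots and the 1.5TB ceiling are the figures from this post; the stick sizes are the ones mentioned in the thread. Check the ARK page for your own CPU rather than trusting these constants. `fits` is an illustrative helper, not anything from a vendor tool.

```python
SLOTS = 8            # DIMM slots on the board, per the post
CPU_MAX_GB = 1536    # the 1.5TB max quoted for the v4 chip

def fits(stick_gb: int, slots: int = SLOTS, cpu_max_gb: int = CPU_MAX_GB) -> bool:
    """True if filling every slot with this stick size stays within the CPU's max."""
    return stick_gb * slots <= cpu_max_gb

for stick_gb in (64, 128, 256):
    total = stick_gb * SLOTS
    print(f"{SLOTS} x {stick_gb}GB = {total}GB, within CPU max: {fits(stick_gb)}")
```

Passing this check only says the total is under the documented ceiling. Whether the board's firmware actually trains a given stick is a separate question.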

Hope it helps!


Thank you very much! I appreciate it!

That’s very insightful and definitely makes my choice easier. I guess now’s the time for me to start assembling the part list :grinning:


100% closing the loop here. New stick arrived, no issues in the testing since. Glad it wasn’t the CPU. :slightly_smiling_face:
