Memtest86 and ECC RAM

I have an ASRock X99 Taichi motherboard with an Intel Xeon E5-1660 v3 installed. I have installed eight 32GB 2133 MT/s ECC RDIMMs, Micron part number MTA36ASF4G72PZ-2G1A1. This is my first time dealing with ECC memory. I got this much memory to perform a task, and the files it creates have errors in them. I ran Memtest86 for a bit over 13 hours before I decided that was enough downtime. It completed about 1.75 of the 4 passes and reported 89 ECC Correctable Errors, all on the same stick. I am unsure whether this means I should replace that stick, or whether the stick is overheating. I mean, they must put those little vents on the memory in servers for a reason. I saw temperatures in the high 50s to low 60s °C in Windows.
Honestly, I was a bit surprised to see Memtest report ECC Corrected Errors. Based on the language in the manual, I thought ECC would not work on X99 with RDIMMs, but then that manual also says the memory capacity of the board is 128GB.



Those results are definitely indicative of a failing RAM module. You can try putting another fan pointing directly at the modules and run again, but if all of the modules are the same problematic temperature, I would expect all the modules to report errors and not just one. I would just replace the module personally.

I am not a Windows guy, but I know that in a properly configured Linux or FreeBSD environment, the OS can detect ECC errors as long as the motherboard is reporting them properly (which seems to be the case, since Memtest is detecting that they occur and are being corrected). I would assume Windows has similar functionality, so try looking into that to get a better sense of whether they are happening regularly.
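For the Linux side, here is a rough sketch of what I mean. It assumes the kernel's EDAC driver for your memory controller is loaded and exposes the usual sysfs counters under /sys/devices/system/edac/mc (ce_count and ue_count per controller, dimm_ce_count per DIMM). Older kernels lay this out differently (csrow directories), and the function names here are just ones I made up for illustration, so treat it as a starting point rather than a finished tool.

```python
from pathlib import Path

# Usual Linux EDAC sysfs location; only present when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_int(path):
    """Return the integer stored in a sysfs file, or None if it is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def read_label(dimm_dir):
    """Return the DIMM label if the driver provides one, else the directory name."""
    try:
        label = (dimm_dir / "dimm_label").read_text().strip()
    except OSError:
        label = ""
    return label or dimm_dir.name


def read_edac_counts(root=EDAC_ROOT):
    """Collect corrected (ce) and uncorrected (ue) error counts per controller and per DIMM."""
    root = Path(root)
    report = {}
    if not root.is_dir():
        return report  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        dimms = {}
        for dimm in sorted(mc.glob("dimm[0-9]*")):
            dimms[dimm.name] = {
                "label": read_label(dimm),
                "ce": read_int(dimm / "dimm_ce_count"),
            }
        report[mc.name] = {
            "ce": read_int(mc / "ce_count"),
            "ue": read_int(mc / "ue_count"),
            "dimms": dimms,
        }
    return report


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found")
    for mc_name, mc in counts.items():
        print(f"{mc_name}: corrected={mc['ce']} uncorrected={mc['ue']}")
        for dimm_name, dimm in mc["dimms"].items():
            print(f"  {dimm_name} ({dimm['label']}): corrected={dimm['ce']}")
```

Run it before and after a heavy workload. If one DIMM's corrected count keeps climbing while the others stay at zero, that points at the same stick Memtest flagged.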


Shockingly, the Windows Event Viewer shows significantly fewer errors than the 13-hour memory stress test did. I had been fully loading the memory for a day and a half before discovering the errors in the output.

It definitely should be replaced then, if you can afford it. I consider anything more than 0 errors to be too many.


One of the VERY few X99 boards to FULLY support RDIMMs.

And yet this board launched at a price of just $330. I haven’t used any X99 board other than this one, but I have been blown away by it. I really miss HEDT. They took it away at about the same time it would have been worth upgrading my X99 platform. This whole jump from a consumer socket with too many cores straight to a lite workstation platform stinks. I have very little use for more than 16 cores, and frankly I think $1500 is a very reasonable price point for an entry-level HEDT board and CPU (plus 64GB of memory if it’s just 12 cores).


So I zip-tied a small fan so that it hangs pointing at the slots on the 24-pin side of the board. (This was really annoying because even though I set the fan software to 0%, it wouldn’t stop spinning. At least it was slow enough that sticking fingers into it was just annoying, rather than "breaks the fan and makes me bleed" speed.) Temps are down from the mid 60s to the mid 50s °C, and I’m not seeing any errors after hours of running a workload that uses 128GB every 11 minutes and generates a new 128GB. It is really surprising that what seem like low, reasonable temperatures were causing instability at such low speeds.