I upgraded my workstation from 256 GB of DDR4 ECC RAM to 512 GB, and the machine has started to behave wonky.
It always boots successfully right after a config change, so whenever RAM training occurs it boots fine. But if I later just ‘hot boot’ it, it almost always power cycles once during startup and then boots fine after the power cycle. I have managed to reboot it a few times without issues, but that’s quite rare.
I tested the machine and didn’t notice anything extraordinary: the sticks are running at 3200 MHz like they should, so idk what’s going on. I decided to test my real use cases, and while they work fine functionally, I noticed quite a few ECC errors in dmesg. All of them were corrected, but well… kinda sus. About 20 errors in 1 hour. I had 0 ECC errors with the 256 GB config.
So I decided to run memtest86+, but it passed with 0 errors after hammering the RAM for a few hours. Idk what’s going on.
I tried mixing sticks in various configs:
4x64 + 4x32, another 4x64 + 4x32, and just 4x64. Everything works fine until I load up all 8 slots with 8x64. (That’s in terms of booting; I did not test for ECC errors in every config.)
So my question is - should memtest86+ in its default configuration report ECC corrections when they happen? If it returns 0 errors, does that mean the RAM itself is fine and something else is wonky? Is it possible that ECC errors are simply more likely to occur with such high memory density?
Try memtest86 (non-plus). They differ in how well they support different hardware.
Edit: Or even better, see if you can find explicit mention of support for ECC error reporting for your specific platform on memtest86 and memtest86+, respectively.
I believe the UEFI might have settings for RAM scrubbing. See if it’s set to scrub the RAM every 5 minutes. That would at least explain the periodicity of the error.
(I’ve never messed with this myself, but I do remember seeing this setting in a couple of UEFIs on computers with ECC RAM. I don’t know if 5 minutes is a reasonable/typical value here; it may be a red herring, but it’s worth checking out.)
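If you want to check the 5-minute theory against your own log rather than eyeballing it, here is a rough Python sketch. It assumes dmesg's default `[seconds.microseconds]` timestamp prefix and that the relevant lines contain the string "EDAC"; both are assumptions about your setup, and the function names, the 300 s period and the 5 s tolerance are just made up for illustration.

```python
import re

# Default dmesg prefix: "[   123.456789] message" (seconds since boot).
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def edac_timestamps(lines, keyword="EDAC"):
    """Collect timestamps of log lines that mention `keyword` (assumed marker)."""
    stamps = []
    for line in lines:
        match = TS_RE.match(line)
        if match and keyword in line:
            stamps.append(float(match.group(1)))
    return stamps


def looks_periodic(stamps, period=300.0, tolerance=5.0):
    """True if gaps between error bursts are close to multiples of `period`."""
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    # Errors logged within `tolerance` seconds of each other count as one burst.
    gaps = [g for g in gaps if g > tolerance]
    if len(gaps) < 2:
        return False  # too few bursts to say anything

    def distance_to_multiple(gap):
        remainder = gap % period
        return min(remainder, period - remainder)

    return all(distance_to_multiple(g) <= tolerance for g in gaps)
```

To use it, save the dmesg output to a file, pass the file's lines to `edac_timestamps`, and hand the result to `looks_periodic`. A `True` only says the timing is consistent with something running every 5 minutes; it does not prove the scrubber is what finds the errors.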
I found multiple reports where people also had ECC errors exactly every 5 minutes, so it does seem to be the interval for periodic scrubs. I’ve already ordered a replacement stick. Still waiting for a response from ASRock, though. Worst case I’ll have to test the sticks one by one, though that’s tricky since the errors only occur under higher RAM load . _ .
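Before pulling sticks one by one, it may be worth checking whether the kernel already attributes the corrected errors to a particular DIMM. A minimal sketch, assuming a Linux system whose EDAC driver exposes per-DIMM counters under `/sys/devices/system/edac/mc/mc*/dimm*/` (some platforms only expose the older `csrow*` layout, in which case this finds nothing). The helper names are hypothetical.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_text(path, default=""):
    """Read a small sysfs file, returning `default` if it is missing."""
    try:
        return path.read_text().strip()
    except OSError:
        return default


def dimm_error_counts(root=EDAC_ROOT):
    """Return (controller, dimm, label, corrected, uncorrected) per DIMM."""
    rows = []
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        if not dimm.is_dir():
            continue
        label = read_text(dimm / "dimm_label", "?")
        corrected = int(read_text(dimm / "dimm_ce_count") or "0")
        uncorrected = int(read_text(dimm / "dimm_ue_count") or "0")
        rows.append((dimm.parent.name, dimm.name, label, corrected, uncorrected))
    return rows
```

Calling `dimm_error_counts()` after a load run and printing the rows shows which entries have a non-zero corrected count. Whether `dimm_label` maps cleanly to a physical slot depends on the board and driver, so treat the labels as a hint rather than a definitive answer.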