How to validate ECC memory?

barrett7212 · March 20, 2019, 7:06pm

Hey all,

I recently built a Threadripper machine with ECC and have verified that Windows and Linux both report my ECC RAM as ECC enabled. However, I want to verify that ECC is actually working, since I got cheapish RAM off of eBay that I don’t necessarily trust (some no-name brand with Hynix dies).

i.e.)
C:\Users\user>wmic memphysical get memoryerrorcorrection
MemoryErrorCorrection
6 (this is multi-bit detection)

-also-

C:\Users\user>wmic memorychip get datawidth, totalwidth
DataWidth TotalWidth
64 128
64 128
64 128
64 128
(since totalwidth > datawidth, I should be fine, right?)
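For reference, the two wmic checks above can be decoded with a small script. This is only a minimal Python sketch: the value table follows the documented meanings of `MemoryErrorCorrection` in the `Win32_PhysicalMemoryArray` WMI class, while the function names (`describe_ecc`, `parse_width_rows`, `has_check_bits`) are made up for illustration. It interprets what the firmware reports and does not prove that correction actually happens.

```python
from typing import List, Tuple

# Documented meanings of Win32_PhysicalMemoryArray.MemoryErrorCorrection.
ECC_TYPES = {
    0: "Reserved",
    1: "Other",
    2: "Unknown",
    3: "None",
    4: "Parity",
    5: "Single-bit ECC",
    6: "Multi-bit ECC",
    7: "CRC",
}


def describe_ecc(code: int) -> str:
    """Translate the numeric wmic value into its documented name."""
    return ECC_TYPES.get(code, f"Unrecognized value {code}")


def parse_width_rows(wmic_output: str) -> List[Tuple[int, int]]:
    """Extract (DataWidth, TotalWidth) pairs, skipping the header line."""
    rows = []
    for line in wmic_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            rows.append((int(parts[0]), int(parts[1])))
    return rows


def has_check_bits(data_width: int, total_width: int) -> bool:
    """True when the module reports more bits than the data bus carries."""
    return total_width > data_width


if __name__ == "__main__":
    sample = "DataWidth TotalWidth\n64 128\n64 128\n64 128\n64 128\n"
    print(describe_ecc(6))
    print([has_check_bits(d, t) for d, t in parse_width_rows(sample)])
```

A 64-bit ECC DIMM normally carries 8 check bits, for a 72-bit total width, so this comparison only confirms that extra bits are being reported, not what they are used for.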

Also, dmesg indicates ECC is enabled, alongside AIDA64. So it looks to be enabled.

How exactly can I verify this, though? After digging around for hours, it looks like a pro version of memtest86+ is the only way to inject single bit errors into the system, but that costs like $50. Are there any free tools out there for this?

Thanks in advance!

Log · March 20, 2019, 7:46pm

Start overclocking the RAM. Then errors should show up in memtest86+ (the free version, which comes with most Linux boot installers) just fine.

Passmark MemTest86 is the one with a paid Pro edition. Memtest86+ is a separate free version with source available that has been abandoned for several years, but still works. The confusion is common.

barrett7212 · March 20, 2019, 7:49pm

I tried the procedure laid out in this post:

https://www.hardwarecanucks.com/forum/hardware-canucks-reviews/75030-ecc-memory-amds-ryzen-deep-dive-4.html

However, I couldn’t boot into Arch with unstable RAM. It either booted or didn’t, with nothing in between. Do you have any pointers for overclocking RAM to an unstable but still bootable state? I ran 10 hours of testing with the stress tool as described, and got 0 errors.

I overclocked it to the xmp profile for 2800, since the next step up would not boot. I then kept tightening the timings until it would no longer boot, and then left it at that value-1 to get it to boot. Still no errors reported.

Edit: to be more clear, I mean this stress tool: https://linux.die.net/man/1/stress
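On Linux, corrected and uncorrected errors are normally counted by the kernel's EDAC subsystem, so a stress run can be bracketed by reading those counters. The following is a hedged Python sketch, assuming an EDAC driver is loaded and exposes the usual `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc`; the helper names `read_edac_counts` and `new_errors` are hypothetical.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the EDAC sysfs tree."""
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (ce, ue)
    return counts


def new_errors(before, after):
    """Per-controller increase in (corrected, uncorrected) counts."""
    return {
        name: (after[name][0] - ce, after[name][1] - ue)
        for name, (ce, ue) in before.items()
        if name in after
    }
```

Call `read_edac_counts()` before and after the stress run and pass both results to `new_errors`. An empty dictionary from `read_edac_counts()` suggests no EDAC memory controller is registered, in which case a clean run says nothing about whether ECC works.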

Log · March 20, 2019, 8:00pm

Sorry, it’s really trial and error.

If you hit a hard wall without having adjusted the RAM voltage, then try setting the RAM voltage to 1.4 and try again. Don’t worry about high RAM voltage. In fact, creeping it up to 1.5 will likely get it warm enough to start causing instability.

Also, set the SOC voltage to 1.1, which is safe 24/7 and is where you should leave it.
Don’t go higher than 1.15v for SOC (anything lower is still safe); going higher won’t do anything for stability. Above 1.20v can degrade the SOC via high LLC, and above 1.25v WILL degrade it.

ProcODT is also important for stability, and the best setting is specific to the motherboard/CPU/RAM combo and not necessarily found automatically. You can adjust this to get errors too.

mihawk90 · March 20, 2019, 8:08pm

Someone correct me if I’m wrong, but AFAIK you could also use something like rowhammer, which is supposed to introduce RAM errors; those should then be reported and/or corrected.

Log · March 20, 2019, 8:12pm

I personally found that with passmark’s memtest86 the row hammer test didn’t do anything other than waste time, which I take as a good sign.

There may be other tests that are more effective in causing errors.

barrett7212 · March 20, 2019, 8:15pm

I’m not super familiar with overclocking memory. I tried the xmp profile for 2800, and increased the DRAM Voltage to 1.4, and no errors yet using stress. It still won’t boot on 2866.

Log · March 20, 2019, 8:39pm

If you get me screenshots of the primary and memory related bios options, I may be able to help you out, otherwise I’m flying blind. Too many screenshots is better than too few.

I’m a bit surprised that the ECC RAM you have has an XMP profile. With mine it was manual the whole way, which is fine since I changed so much.

I’m phoneposting, but I’ll check back late tonight.

barrett7212 · March 20, 2019, 8:49pm

@Log here is my bios with available settings. It currently boots on 2866 at 15 CL

Log · March 20, 2019, 10:30pm

Oh, must have missed that you have a Taichi, thought you had something else. Or maybe ASRock uses the same scheme for multiple boards.

If you look at the column of numbers that you can’t change, that’s actually what the board is running at. What happened is that it couldn’t make your entered numbers work, so it ignored them. You’re actually running at 2666 20-19-19-19.
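To put the gap between requested and actual settings in perspective, CAS latency can be converted from clock cycles to nanoseconds. This is a small Python sketch of the standard DDR arithmetic (two transfers per clock, so one clock lasts 2000 / data rate in ns); the function name `cas_latency_ns` is hypothetical.

```python
def cas_latency_ns(data_rate_mts: float, cl: int) -> float:
    """First-word CAS latency in nanoseconds for DDR memory."""
    # DDR transfers twice per clock, so one clock period is 2000 / data_rate ns.
    return cl * 2000.0 / data_rate_mts


if __name__ == "__main__":
    # What the board actually fell back to versus what was entered.
    print(f"2666 CL20: {cas_latency_ns(2666, 20):.2f} ns")
    print(f"2866 CL15: {cas_latency_ns(2866, 15):.2f} ns")
```

By this measure the fallback 2666 CL20 works out to roughly 15 ns, against roughly 10.5 ns for the requested 2866 CL15, so the fallback is a much more relaxed configuration and much less likely to produce errors.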

Late tonight I’ll cobble together some of my other helpposts into a guide. We’ll get a good OC foundation first, then it’ll be easier to create small instability.

barrett7212 · March 20, 2019, 10:36pm

Are you sure? If I switch my frequency or RAM timings in the bios (right side), I can detect the changes in the OS (admittedly only frequency, not sure how to check timings in Linux).

Unless Linux is lying to me. I’m running a memtest86+ run at 3200 MHz (memtest detects the higher frequency), with no errors and almost done. What’s going on here?

Edit: thought 3200 was unstable enough since I was able to boot into arch, but it froze.
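For the frequency check on Linux mentioned above, one common source is the SMBIOS data printed by `dmidecode --type memory` (run as root). Below is a hedged Python sketch that pulls the speed fields out of that output. The field names cover both the newer "Configured Memory Speed" and the older "Configured Clock Speed" label; the helper names and the sample text in the usage part are illustrative only. These values come from firmware tables and include no timings, so they are weaker evidence than a direct register readout.

```python
import re
import subprocess

# Matches lines such as "Speed: 2666 MT/s"; "Speed: Unknown" is skipped.
SPEED_RE = re.compile(
    r"^[ \t]*(Speed|Configured Memory Speed|Configured Clock Speed):"
    r"[ \t]*(\d+)[ \t]*(?:MT/s|MHz)",
    re.MULTILINE,
)


def parse_memory_speeds(dmidecode_text):
    """Return (field, value) pairs for every speed line in dmidecode output."""
    return [(m.group(1), int(m.group(2))) for m in SPEED_RE.finditer(dmidecode_text)]


def read_dmidecode():
    """Run dmidecode (requires root) and return the memory section as text."""
    result = subprocess.run(
        ["dmidecode", "--type", "memory"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative sample, not output from the machine in this thread.
SAMPLE = (
    "Memory Device\n"
    "\tTotal Width: 72 bits\n"
    "\tData Width: 64 bits\n"
    "\tSpeed: 2666 MT/s\n"
    "\tConfigured Memory Speed: 2866 MT/s\n"
)
```

`parse_memory_speeds(SAMPLE)` yields one pair per speed line, and `parse_memory_speeds(read_dmidecode())` does the same on a live system.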

Log · March 20, 2019, 10:56pm

If you have Windows, run the Ryzen Timing Checker, which is the only non-bios way to check that I trust. I vaguely recall seeing incorrect readings in other things, but that was early on in the Ryzen compatibility phase, and a long time ago.

www.techpowerup.com/download/ryzen-timing-checker/

Another thing that can happen is that the bios may not be able to work things out the first time, but subsequent tries (with incremental changes to the auto settings) can work.

barrett7212 · March 20, 2019, 11:34pm

Here is what I get after booting into windows with 2866 MHz, 15, 15, 15, 15, 25 timings (no idea if these are real or not)

I can see the MEMCLK correct on top, and I see some 15s and 25s below. I think my overclocking is sticking!

Log · March 20, 2019, 11:41pm

And does the bios still show something different? (Also while you’re there, what version is the bios, just to rule that out)

barrett7212 · March 20, 2019, 11:57pm

Latest version, 3.5.

My system doesn’t want to boot anymore on 2866, so it won’t stick. That’s odd. I’ll play with it more.

barrett7212 · March 21, 2019, 12:02am

@Log I see what you are saying now. TCL isn’t 15 even though I set it to 15. The bios reset it back to 16, and the timings checker is showing that.

barrett7212 · March 21, 2019, 12:05am

@Log I also verified that the bios matches the timings checker, apart from tRFC. Not sure how to explain that one.

Log · March 21, 2019, 12:54am

Nice. The bios was likely still trying to auto adjust settings, meaning it was changing each boot. At least we’ve verified I’m not yet delusional.

Also, the bios hates certain timings being odd or even when they get low enough, and will change your “attempt this” setting, as you saw. You can actually try for 14, even if it hates 15.

barrett7212 · March 21, 2019, 1:00am

Interesting, I’ll have to try that. What’s the next step here? No failures in memtest86+ @ 3200MHz. Now I’m running some tests with Phoronix at the suggestion of @wendell, with no errors yet at 2866MHz.

Not entirely sure which tests to run, but I’ve been running some mbw for about half an hour now.

wendell · March 21, 2019, 1:04am

One thing to watch for is that your bios will set a command rate of 2T instead of 1T on init failure, so set 1T at stock speeds, then crank up.