How to validate ECC memory?

barrett7212 · March 20, 2019, 10:36pm

Are you sure? If I switch my frequency or ram timings in the bios (right side), I can detect the changes in the OS (admittedly only frequency, not sure how to check timings in Linux).

Unless Linux is lying to me. I’m doing a memtest86+ run at 3200 MHz (memtest detects the higher frequency), with no errors and almost done. What’s going on here?

Edit: thought 3200 was unstable enough since I was able to boot into arch, but it froze.

Log · March 20, 2019, 10:56pm

If you have windows, run the ryzen timing checker, which is the only non-bios way to check that I trust. I vaguely recall seeing incorrect readings in other tools, but that was early on in the ryzen compatibility phase, and a long time ago.

https://www.techpowerup.com/download/ryzen-timing-checker/

Another thing that can happen is that the bios may not be able to work things out the first time, but subsequent tries (with incremental changes to the auto settings) can work.

barrett7212 · March 20, 2019, 11:34pm

Here is what I get after booting into windows with 2866 MHz, 15, 15, 15, 15, 25 timings (no idea if these are real or not)

I can see the MEMCLK correct on top, and I see some 15s and 25s below. I think my overclocking is sticking!

Log · March 20, 2019, 11:41pm

And does the bios still show something different? (Also while you’re there, what version is the bios, just to rule that out)

barrett7212 · March 20, 2019, 11:57pm

Latest version, 3.5.

My system doesn’t want to boot anymore on 2866, so it won’t stick. That’s odd. I’ll play with it more.

barrett7212 · March 21, 2019, 12:02am

@Log I see what you are saying now. TCL isn’t 15 even though I set it to 15. The bios reset it back to 16, and the timings checker is showing that.

barrett7212 · March 21, 2019, 12:05am

@Log I also verified the bios is matching the timings checker outside of trfc. Not sure how to explain that one.

Log · March 21, 2019, 12:54am

Nice. The bios was likely still trying to auto adjust settings, meaning it was changing each boot. At least we’ve verified I’m not yet delusional.

Also, the bios hates certain timings being odd or even when they get low enough, and will change your “attempt this” setting, as you saw. You can actually try for 14, even if it hates 15.

barrett7212 · March 21, 2019, 1:00am

Interesting, I’ll have to try that. What’s the next step here? No failures in memtest86+ @ 3200MHz. Now I’m running some tests with Phoronix at the suggestion of @wendell, with no errors yet at 2866MHz.

Not entirely sure which tests to run, but I’ve been running mbw for about half an hour now.

wendell · March 21, 2019, 1:04am

One thing to watch for is that your bios will set a command rate of 2T instead of 1T on init failure, so set 1T at stock speeds, then crank up.

Log · March 21, 2019, 6:40am

I’ve run out of time for today (and if I’m honest, it’ll probably take another day after that too), so I don’t have my guide ready yet. It’s turning into more of a general guide that I’ll probably post as a standalone thread for feedback and then link you to. In the meantime, here is how to deal with ProcODT.

Prior to dealing with ProcODT

Set DRAM Voltage and VDDCR_SOC Voltage as I mentioned
Explicitly set CommandRate=[1T] and leave GearDownMode=[auto]
Find the highest stable or near stable frequency

ProcODT
This is one of the most important settings that most people don’t know to set. Prior to the latest bios, mine was always auto set to some less-than-optimal value, but I noticed that this last time around it was actually fine. This may be luck, or this may be an actual bios improvement (I really wish they’d put out a fucking changelog). I have no idea at the moment, just like I have no idea how to find out what the automatic setting currently is. Notice that even Ryzen Timing Checker doesn’t know what it is.
To know what it is, I have to manually set it and test stability.

Dealing with ProcODT is literally just brute forcing every option from 43.6 ohm to 96 ohm until you find ALL of the settings that work at the easiest maximum frequency you can get to as a starting point. As you up the frequency, you’ll eventually be left with only one ProcODT that works. Note that just because a frequency boots doesn’t mean it’s actually stable. You’re just mapping out the “Boots/It freezes or doesn’t boot” options right now.

You can also use a somewhat stable frequency to figure out what crashes the least. For example, here are some averages from testing how long Passmark Memtest86 could go without freezing/crashing/throwing an ECC error.

ProcODT setting (ohm) = ramtest time (seconds)
43.6 = No boot
48 = 403
53.3 = 86
60 = Boots, crashes soon after

As it turns out, my ram really loves 48 more than anything else, which I later confirmed a few more times.
ProcODT is very specific to individual setups. Generally though, a larger number of sticks does better with lower values. If I cut my 8 sticks to just 4, I’d have to retest.
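
The brute-force mapping is easier to keep straight if you log it somewhere. Here is a small sketch that uses the run times from the averages above, with ohm values following the 43.6/48/53.3/60 steps mentioned in this post. The `Result` type and `rank` function are hypothetical helpers, and the 60 ohm run has no recorded time, so it is stored as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    ohms: float
    booted: bool
    seconds: Optional[float]  # time until freeze/crash/ECC error; None if not recorded

def rank(results):
    """Return settings that booted and have a recorded time, longest-running first."""
    usable = [r for r in results if r.booted and r.seconds is not None]
    return sorted(usable, key=lambda r: r.seconds, reverse=True)

RESULTS = [
    Result(43.6, booted=False, seconds=None),
    Result(48.0, booted=True, seconds=403),
    Result(53.3, booted=True, seconds=86),
    Result(60.0, booted=True, seconds=None),  # booted, crashed soon after; no time recorded
]

for r in rank(RESULTS):
    print(f"{r.ohms} ohm: {r.seconds} s")
```

Re-run the same table at each frequency step; the list of usable settings should shrink as the frequency goes up.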

These settings are all safe, despite a common misunderstanding regarding a comment from an AMD engineer, who had badly communicated about thermal stability, not thermal safety. All you’re doing is looking for the one that best fits your specific setup. You will likely find 53.3-68.6 ohm to be good. It’s possible that the highest settings may generate more heat, but you are unlikely to run into that.

sceps · March 21, 2019, 6:54am

It’s not about ECC specifically, but here’s a great article with lots of info about RAM overclocking on Ryzen/AMD by Yuri Bubly (1usmus), the author of the famous DRAM Calculator for Ryzen: https://www.techpowerup.com/reviews/AMD/Ryzen_Memory_Tweaking_Overclocking_Guide/

barrett7212 · March 24, 2019, 5:31pm

@Log Thanks for the writeup, I’ll start looking into this this evening. Are you saying memtest reports ecc errors or just errors in general? @Wendell seemed to imply that I should be in OS to see the ECC errors reported by dmesg.

Log · March 24, 2019, 11:35pm

Passmark Memtest86 will report ECC errors in a special column, and I believe Memtest86+ does as well.

Linux and Windows have mechanisms for logging ECC errors as well, and I’ve used them. Under Windows you’d look for WHEA errors in the Event Viewer, but I forget what they specifically look like. Note that historically Windows has spammed WHEA EventID 19 for some reason or another:

A corrected hardware error has occurred.
Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: Cache Hierarchy Error
Processor APIC ID: 0
The details view of this entry contains further information.

I only have one sitting in my log. I’m pretty sure you can ignore these.

I completely forget how to check under linux.
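
One place to look under Linux, assuming the kernel’s EDAC driver for the platform is loaded (the amd64_edac driver on AMD systems), is the per-memory-controller counters under `/sys/devices/system/edac/mc/`. A rough sketch follows; `read_edac_counts` is a made-up helper name. An empty result only means no EDAC controller was registered, which does not necessarily mean ECC is off.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": corrected, "ue": uncorrected}} from EDAC sysfs."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # EDAC not available or driver not loaded
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are unreadable
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts
```

Calling `read_edac_counts()` before and after a stress run and comparing the `ce` values shows whether any corrected errors were logged in between.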

barrett7212 · March 27, 2019, 9:03pm

After screwing around a bit with memtest, I got the following error:

This doesn’t look like an ecc error though. Not sure what to think of this. It was frozen when I noticed it.

Log · March 28, 2019, 5:13am

That is definitely a regular error, not an ECC-corrected one.
It should specifically state “ECC” somewhere if it was a 1-bit corrected error, ~~and a 2-bit error should be detected and cause a system halt.~~

So I just discovered something that appears to explain what might be going on. If anyone knows better, please speak up. It seems that ECC ram typically uses what are called “Hamming codes” to store parity and allow for correction and detection of the number of bits they are designed for. Many of these implementations have an issue where they can detect any ODD number of errors, but not an even number. So they can detect 1, 3, 5, etc. errors, but not 2, 4, 6. There seem to be a variety of error correction implementations out there, so I’m not very clear on the specifics. Some ECC ram may also be able to correct 2 bits just fine, but 3 bits would be invisible. Like I said, this is new stuff to me. You might try running this program to see if there is any metadata in the ram that could help us find its specifications.

From your screenshot, it looks like 2 bit errors happened, meaning your ram may not be the kind that can handle an even number of errors.
Expected: EFEFEFEF = 11101111111011111110111111101111
Actual----: AFAFEFEF = 10101111101011111110111111101111
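
The same comparison can be done by XORing the two words, which leaves a 1 only where they differ. A quick sketch (`flipped_bits` is just an illustrative helper name):

```python
def flipped_bits(expected, actual):
    """Return the bit positions (0 = least significant) where the two words differ."""
    diff = expected ^ actual
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]

expected = 0xEFEFEFEF
actual = 0xAFAFEFEF

# XOR mask in hex, then the list of differing bit positions
print(f"{expected ^ actual:08X}", flipped_bits(expected, actual))
# prints: 40400000 [22, 30]
```

So two bits flipped, one in each of the top two bytes. Keep in mind that memtest shows a 32-bit word here, while ECC DIMMs protect 64 data bits at a time, so this alone does not prove both flips were in the same ECC word.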

Relevant links: https://www.crucial.com/usa/en/memory-server-ecc
https://www.atlantic.net/hipaa-compliant-hosting/ecc-vs-non-ecc-memory-critical-financial-medical-businesses/
https://www.vusec.net/projects/eccploit/
https://en.wikipedia.org/wiki/Hamming_code
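
To make the Hamming code idea a bit more concrete, here is a toy extended Hamming (8,4) code of the single-error-correct, double-error-detect kind described on the Wikipedia page above. It is only an illustration. Real ECC DIMMs protect 64 data bits with 8 check bits, and the exact code depends on the memory controller, so this is not a claim about what any particular kit or CPU does. The function names are made up for the example.

```python
def encode(data):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    for p in (1, 2, 4):
        # parity bit p covers every position whose index has bit p set
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity over positions 1..7
    return code

def check(code):
    """Classify a received word as 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of the indices of all set bits
    overall = sum(code) % 2  # 0 if the total parity still holds
    if syndrome == 0 and overall == 0:
        return "ok"
    if overall == 1:
        # odd number of flips; if it was a single flip, it sits at index `syndrome`
        return "corrected"
    # total parity holds but the syndrome is nonzero: two flips detected
    return "uncorrectable"

word = encode([1, 0, 1, 1])
one_flip = word.copy()
one_flip[5] ^= 1
two_flips = one_flip.copy()
two_flips[6] ^= 1
print(check(word), check(one_flip), check(two_flips))
# prints: ok corrected uncorrectable
```

In this toy code a single flip is located and fixable, and two flips are detected but not fixable. Three flips have odd parity, so they look like a single-bit error and would be "corrected" to the wrong value.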

Since it looks like you’ve obtained Passmark Memtest86, this weekend I’ll look at playing around with the error injector. It should work on ryzen when enabled, with potential caveats.

My ram guide is on temporary hold, because I discovered that 1usmus (basically THE ryzen based ram overclocking guy) has posted a guide. So I’ll need some time to parse through that and ~~perhaps fix my shortcomings~~ throw what I thought I knew right out the window. I also work in agriculture, and spring is finally here in full force so I am just plain out of time.

By the way… welcome to RAM testing, where doing fuckall takes weeks.

Marten · March 28, 2019, 5:30am

All this effort into creating an ECC error is making the world a safer place to trust ecc memory

barrett7212 · April 3, 2019, 6:43pm

I appreciate your efforts to help me debug this kit of ram! However, unfortunately, I returned the ram since the return window was closing and the inability to verify the ecc wigged me out.

Someday I will buy a different kit and verify it. When I do that, you might see this thread revived.

Thanks again

barrett7212 · April 3, 2019, 6:44pm

Yeah. Unfortunately I was unable to verify this kit so I returned it. Once I buy a new kit, I will try to verify it

sjlnk · April 15, 2021, 5:05pm

Did you buy a new kit? How did your tests go with that kit?