Ryzen 5700X ECC reporting

Well, fu…

MemTest86 Pro 10.4 states that the 5750G system also doesn’t support error injection (tried with PFEH on and off). It seems that overclocking your memory or using a physical device to provoke memory errors is the only way to check ECC on AM4 at present :frowning:


Well, that is not good news, that is for sure. :slight_smile: But still, thank you for seeing this through and reporting back.

Just one last question. Earlier, you wrote that you experimented with 3000 series CPUs etc. Do you perhaps remember or know if error injection works on those CPUs (i.e. the older ones)? I am only asking this, because I think I could get a 2700X without spending any extra money. Honestly, if error injection were to work on that and I could see the errors being handled then maybe I would just swap the 5700X to that 2700X, even though the 5700X is better for my usecase.

Yes, the physical testing device seems to be the logical solution. However, for my usecase (homelab), I think it is going overboard (i.e. I would be spending the price of 2 CPUs on a 5750G PRO and a physical injector with no guarantees, since ASRock did not manage to confirm just about anything regarding ECC reporting), even though I would like some of my data to be kept safe.

Verified that ECC is working with the 5750G (PFEH OFF) by overclocking the system memory from DDR4-3200 to DDR4-3600 while staying at the default DDR4 memory voltages (1.2 V).

No crashes or non-correctable errors, MemTest86 Pro correctly noticed ECC intervening (hypothesis: Regular operating systems should also be able to properly log ECC errors):

Will look at the 5950X next.
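On the hypothesis above that a regular operating system should also log ECC events: on Linux, the EDAC subsystem exposes per-memory-controller counters under `/sys/devices/system/edac/mc/`. The following is only a minimal sketch, assuming the EDAC driver for the platform is loaded and the usual `ce_count` / `ue_count` files are present; the function name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each EDAC memory controller.

    "ce" = corrected errors, "ue" = uncorrected errors.
    An empty dict means no controller was found (driver not loaded, or not Linux).
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found (driver not loaded?)")
    for name, c in result.items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

A corrected-error count that rises while the memory is deliberately run out of spec would be the Linux-side equivalent of what MemTest86 Pro showed here. A count that stays at zero proves nothing by itself, since the memory may simply be stable.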


Relying on test experience from a few years ago doesn’t seem wise here, since AMD has practically rebuilt the AGESA feature set over the AM4 lifecycle. PFEH was definitely there in non-APU Ryzen CPUs, but you should use the latest AGESA version due to various security fixes.

I’ve gifted all my Ryzen 2000 parts to family members so these parts are no longer available to toy with.

But I still got a few 3700X and a 3900 PRO to check out regarding what the current AGESA version is allowing to be shown in the BIOS settings.

The mentioned 5950X MemTest Pro tests are still running with overclocked settings (Infinity Fabric clock 1,800 MHz/DDR4-3600 with 1.2 V); “unfortunately”, no errors at all have appeared so far. That might be because the hidden PFEH is active and masks errors corrected by ECC, or maybe the memory is of higher quality than in the 5750G system. The tests can run until Monday; then the system is actively needed again.

That 5950X is also very well cooled with custom watercooling, the 5750G on the other hand is passively cooled in a Streacom case.


Wow, thank you for the in-depth report! I honestly could not help but grin in happiness and agony at the same time seeing your results :slight_smile: , as ASRock failed to achieve in two and a half months what you managed to prove in under a day or so: they communicated that they had a 5650G PRO, where they confirmed that PFEH shows up in BIOS but were unable to manifest ECC errors in the logs and could therefore not prove that ECC errors get reported on the MB.

I especially appreciate the screenshots, as those also seem to support my idea that if ECC reporting is working, then ECC reports should start being generated relatively soon if the system is not stable. For comparison, I spent many hours with my 3200 MHz RAM overclocked all the way up to 4000 MHz (which I would deem as a massive overclock, especially on ECC RAM), running tests in MemTest PRO, Linux and Windows, and failed to see any reports being generated on any OS or software.

Your advice regarding the previous-generation CPUs is also logical. Honestly, that was more of a last resort on my side, but I soon realised that it would also introduce some problems (e.g. I would need to downgrade the BIOS for the MB to recognize the older CPU, which would also bring back the noisy heatsink fan, since fan curve optimization was only introduced in later BIOS versions, not to mention the possibility of bricking the MB in the process).

All in all, I think I will just order a 5750G PRO and see if I can see any ECC errors generated that way. This is of course not guaranteed, as I do not know the reason why ASRock support could not see any ECC errors: was the support simply incompetent and/or did they not increase the clock speed enough, or does the MB have some problems regarding ECC reporting? I guess I will have to wait and see.

If the 5950X does not yield any ECC errors, that may also suggest that buying PRO CPUs is the way to go (even though the other CPUs, e.g. the 5700X, are also listed as supporting ECC on their product pages).


That’s a Bingo.

At first I wouldn’t get any errors. I had to undervolt the memory until the system wouldn’t POST, then increased the voltage by one step again, so that it would be reasonably doubtful that the memory could be stable under a stress test.

So, verified that ECC is also kicking in with a 5950X, unknown PFEH status, and the platform is giving the corresponding feedback to the operating system (motherboard: ASUS ProArt X570-CREATOR WIFI, UEFI 1201, AGESA 120A):

Glad this topic could be resolved and that I haven’t been spouting non-facts since I previously tested ECC with Ryzen 3000 stating ECC worked on AM4.

Can’t say anything about the current state of ASRock except in my opinion their motherboards have become worse feature-wise (currently ECC on AM5 seems to be messed up on ASRock motherboards, working on ASUS motherboards - I can’t verify this since I don’t have an AM5 system yet).

I also know the frustrations of being jerked around by incompetent Tech Support from billion dollar corporations, I’ve been pissed at Broadcom for over a year:


PS: Please also send the findings to ASRock’s Tech Support. It’s not about rubbing it in but hopefully they’ll take notes from this exercise to maybe in the future be able to respond to a similar request from another user more usefully.

I might also be able to check ECC on an ASRock X570 Taichi. It was my first X570 motherboard before I moved to the ASUS ProArt X570-CREATOR WIFI because of better IO and more frequent UEFI updates.


Thank you very much! Now not all hope is lost. :slight_smile: This means that the 5700X should not be a problem. Now all that’s left is for me to try to get this working on the Taichi as well.

I will also try my best to check on my Taichi, however that will probably take some time, as I do not have much experience in overclocking/undervolting RAM. Anyway, since looking at your screenshot it seems that it may take half a day or more for a single error to happen, I will try to get back to 4000 MHz and 1.1 V and then leave MemTest PRO running for a while to see if everything is OK. Just out of curiosity, what was the last working voltage in your case?

Yes, I will definitely get back to ASRock’s Tech Support, especially since they did not even bother replying to my last e-mail. Hopefully, they will be able to use this information to improve their service in the future.

Well, I’ll be d*mned :slight_smile: , but it seems that you are even using similar if not the same memory sticks. Even though I cannot see the voltage on the screenshots, I thought that maybe I could get an idea if I had a look at your timings to see how far they were from the ones I tried last time, but they were too similar. I am using a Crucial 32GB RAM as well (TrueNAS shell says 18asf4g72az-3g2f1 under part number, should be the MTA18ASF4G72AZ-3G2R model though).

The memory details shown on MemTest86 are just the default part names and their default SPD profiles, not the actually active settings.

Voltage settings cannot really be applied 1:1 between different motherboard models; manual trial and error is required.

The “winning” configuration with the 5950X was memory at DDR4-3600, Infinity Fabric at 1800 MHz, CPU voltage via offset - 0.01875 V and memory voltage at 1.18 V.
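For anyone wondering why DDR4-3600 and an Infinity Fabric clock of 1800 MHz are mentioned together: DDR memory transfers data twice per clock, so the "DDR4-xxxx" number (in MT/s) is twice the actual memory clock. A small sketch of that arithmetic, with helper names (`memclk_mhz`, `fclk_ratio`) invented purely for illustration:

```python
def memclk_mhz(data_rate_mts):
    """Real memory clock in MHz for a DDR data rate given in MT/s."""
    # Double data rate: two transfers per clock cycle.
    return data_rate_mts / 2


def fclk_ratio(data_rate_mts, fclk_mhz):
    """Ratio of Infinity Fabric clock to memory clock (1.0 means 1:1)."""
    return fclk_mhz / memclk_mhz(data_rate_mts)


if __name__ == "__main__":
    # Data rates that come up in this thread.
    for rate in (3200, 3600, 3733, 3800, 4000):
        print(f"DDR4-{rate}: memory clock {memclk_mhz(rate):.1f} MHz")
    print("FCLK:MEMCLK at DDR4-3600 / 1800 MHz =", fclk_ratio(3600, 1800))
```

So the configuration above runs the fabric and the memory clock at the same 1800 MHz.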

The other way around was not fruitful. DDR4-3600 seemed to be the limit with my parts: even after going to 1.35 V, anything above DDR4-3600 wouldn’t POST, and settings that would POST didn’t produce any memory errors within a few hours.

Memory tests that are somewhat robust and can withstand criticism take many, many hours. In my previous life before ECC, errors in MemTest86 would only appear in pass 3 or 4, after a day or so.


That is true.

Thank you for the values as well. I will try looking around by starting from those values and see what I get. So far I had no luck with 3800 MHz and 1.178V but as you said, we will have to wait and see.

I had memtest report to me that ECC corrected a single bit error on one test once with my 5800X3D on an ASUS Pro WS X570. So I’ve seen it work once, but usually it’s just “no news is good news.” :person_shrugging:


Thank you @GeorgePatches! Honestly, judging from these facts, I would bet that the 5700X should then also report ECC errors (as per AMD’s statement).

The only thing left now is to find out if the Taichi reports the ECC errors.

As an update, I have been running a (sadly a bit short) memtest with the following settings: 1.14 V and 3800 MHz. The system would not boot below 1.14 V, so I thought this might be enough. Sadly, I did not get a single ECC error in MemTest PRO 10.3 in 4.5 hours. I will try to further reduce the voltage and increase the frequency, run a longer (approx. 12 hours) test and report my results.

Instead of playing with voltages on the RAM, just use a heat gun on low, aimed at the RAM from about a foot away.

Pretty sure the RAM will puke errors after 5 minutes.


That’s a nice method; however, I didn’t want to risk anything happening to my systems. I really appreciate an issue counter of 0 in day-to-day usage.

Then why an exception for undervolting/overclocking?

I can almost set all BIOS settings blindly and with BIOS flashback there is practically 0 chance of messing components up permanently.

Thank you @KeithMyers for the idea!

Honestly, I have read somewhere that heat guns (similar to hair dryers) can generate static electricity, which may damage the component and/or the motherboard, so I did not opt for that option. However, following your advice, I found that basically a simple lamp directed at the RAM can do the same (i.e. generate heat & light) while not being dangerous. Therefore, I used a lamp and directed it at the RAM sticks. I could raise the temperatures by 15 degrees, hitting 60 Celsius at the end. However, still no ECC errors so far.

The setup that I am now using (and used for the lamp trick as well) is:
1.098 V (the minimum that the BIOS allowed) and 3733 MHz. With this option, the computer would not POST in 50 to 66% of the cases and even if it did, the loading usually took some time. Therefore, I chose this for running the tests.

So far, I am running my 3rd pass with 6 hours into the testing. Sadly no ECC errors so far (not even with the lamp).

I will run this test to completion (4 passes) and post the results.


I’d try tightening tRCDRD, maybe to 16 or even 15 ticks. Basically only Samsung B-die can run a very low tRCDRD; anything else should fall flat on its face.


Okay, so the results are out: no ECC errors reported so far.

I have run Memtest for a total of 24+ hours combined with various settings that would barely boot (4000, 3800, 3733 and 3600 MHz with various voltages). Still no positive results, even after turning to the “lamp trick” and heating the RAM sticks to 60 °C.

Today, I tested with the previously mentioned 3733 MHz and 1.098 V setting, which would often not even POST (in fact, the PC was not even able to restart after exiting from Memtest and selecting the Reboot option). Yet, zero errors.

Now I am really starting to question if ECC reporting works at all on this motherboard.

EDIT: Since the system would revert to the default settings (3200 MHz and 1.2V) upon failing to boot, I deliberately checked the BIOS settings before running Memtest to guarantee that Memtest is being run with the custom settings and not the default ones.
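For the Linux runs, one possible cross-check of the same thing is `dmidecode -t memory`, whose "Memory Device" entries include "Configured Memory Speed" and "Configured Voltage" fields (older dmidecode versions call the first one "Configured Clock Speed"). Below is a rough sketch of a parser for that output. The function name and the `SAMPLE` text are hypothetical and only illustrate the format; the sample was not captured from any system in this thread. Note that these values are whatever the firmware wrote into the SMBIOS tables, so treat them as a hint rather than proof.

```python
def parse_configured_memory(dmidecode_text):
    """Return one dict per populated DIMM with its configured speed and voltage."""
    dimms = []
    # dmidecode separates SMBIOS structures with blank lines.
    for block in dmidecode_text.split("\n\n"):
        if "Memory Device" not in block:
            continue
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        speed = (fields.get("Configured Memory Speed")
                 or fields.get("Configured Clock Speed"))
        if not speed or speed == "Unknown":
            continue  # empty slot
        dimms.append({
            "locator": fields.get("Locator"),
            "speed": speed,
            "voltage": fields.get("Configured Voltage"),
        })
    return dimms


# Hypothetical, hand-written input in dmidecode's layout (not real output).
SAMPLE = """\
Handle 0x0040, DMI type 17, 40 bytes
Memory Device
\tSize: No Module Installed
\tLocator: DIMM_A1
\tConfigured Memory Speed: Unknown

Handle 0x0041, DMI type 17, 40 bytes
Memory Device
\tSize: 32 GB
\tLocator: DIMM_A2
\tSpeed: 3200 MT/s
\tConfigured Memory Speed: 3733 MT/s
\tConfigured Voltage: 1.1 V
"""

if __name__ == "__main__":
    for dimm in parse_configured_memory(SAMPLE):
        print(dimm)
```

On a real system the text would come from running `dmidecode -t memory` as root and passing its output to the function instead of `SAMPLE`.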


From the various RAM overclocking threads at OCN, the trip point for DDR4 and DDR5 RAM temps is around 65 °C before they start generating errors in Memtest and other RAM testing software.

I don’t think your lamp is getting them to the temps that generate the memtest errors. That is why I am sure that the heat gun would do the trick. I’ve never heard of any example where a heat gun generated static electricity. Leaf blowers, on the other hand . . . yes.


Do you have 2 or 4 DIMMs installed?
