I would like to ask: did anyone manage to get a Ryzen 5700X to report ECC errors? (especially on the X570 platform)
I built a custom NAS that uses TrueNAS Scale for my homelab and wanted to use ECC (I know it is not mandatory, but I wanted that extra security for some of my data). Therefore, I bought some ECC UDIMM RAM sticks, a Ryzen 5700X CPU and an ASRock Taichi X570 motherboard. My reason for choosing this setup was that the CPU was more than enough for my needs and both the CPU and the motherboard are advertised as supporting ECC. Moreover, I saw that ECC reporting should work on the 5000 series CPUs (for example the 5900X) as long as the motherboard supports this functionality.
After booting up the system, I could verify that ECC was supposedly working, as Memtest, Linux, Windows etc. said that the memory was working in ECC mode (Multi-bit ECC correction). However, I am not able to see any ECC reports even after increasing the speed of the RAM to 4000 MHz, using tight timings, etc. and running memtest with these higher values for long periods of time.
In theory, everything should work as far as I know: AMD support confirmed that ECC reporting should work on a 5700X CPU. ASRock support was also extremely helpful as they mimicked my setup with QVL RAM sticks etc, and they confirmed that there is no restriction from the motherboard’s side either. However, even they could not see any errors being reported on their setup. So both sides say that the functionality should work, but for some reason, it does not: neither for them, nor for me.
Did someone have more success with ECC reporting on a 5700X CPU?
Thank you very much for your response! Could you perhaps mention your motherboard as well?
Yes, I am now asking specifically for 5700X (but posts with other CPUs are also welcome), as ASRock support insisted that the errors not being reported must be related to the CPU. However, as far as I saw on the Internet, other people reported that it works on other 5000 CPUs (such as the 5900X), just as AMD support confirmed it. Therefore, now I am trying to find anyone who could maybe confirm that 5700X should report the errors. That way, I could once again ask them to have a look at this issue.
ASRock Rack X570D4U-2L2T. The RAS daemon (rasdaemon) in Proxmox checks for decoded mcelog output.
The question is…what do you expect and what tool do you use to display errors? You need the OS to enable this and provide some way for you to see the messages.
I did see reporting on Core, but in my limited time with Scale I didn’t see anything. It may well be that iXsystems isn’t using the RAS and EDAC stack as much as they do on Core. It’s still a new product after all.
And no reports doesn’t mean it’s not working.
Honestly, I tried every method I could find online on every OS I had access to. This meant that I tried checking the logs for ECC errors:
on Windows (WHEA errors) in the system log
on Linux (multiple flavors) using edac-util, grepping the system logs, etc., plus some newer packages/commands that supposedly replaced edac-util; still no success
on TrueNAS Scale I did the same, as it also runs Linux
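For anyone who wants to repeat the Linux check without edac-util: here is a minimal sketch, assuming the kernel's EDAC driver is loaded and exposes its usual per-controller counters under /sys/devices/system/edac/mc. The function name `read_edac_counts` is just an illustration, not part of any tool mentioned in this thread.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} for each mc* directory."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # EDAC driver not loaded, or no memory controller registered
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / filename).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found (assumption: the EDAC module is not loaded)")
    for name, c in found.items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

An empty result only says that the OS has no EDAC controller registered; counters that stay at zero do not by themselves show whether ECC is correcting anything.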
I also ran Memtest86 PRO after increasing the speed of the RAM to see if any errors get logged.
On Windows and Linux, I ran memory benchmarks that used all available RAM except for 1 GB left for the OS, to drive the system to the brink of instability while running the RAM at 4000 MHz etc. and increase the chance of encountering errors. However, I was not able to get a single one even after hours of testing.
As far as I am aware, ASRock also tested the setup both on Windows and Linux, but they were also unsuccessful and could only suggest that I upgrade my CPU to a PRO CPU, because for some reason, there is no PFEH (Platform First Error Handling) option in the BIOS for non-PRO CPUs. On the other hand, AMD support verified that there is no such limitation between the PRO and non-PRO CPUs in terms of ECC reporting. So now AMD says that I should get help from ASRock, and ASRock says that I should get help from AMD, as both of them are sure that the fault is not with their products.
Since the post garnered a bit of interest in the form of 150+ views as of now, I thought that maybe some other people are also interested in this question. Furthermore, since I also benefited greatly from previous L1 posts, which I am more than thankful for, I will leave the conclusion of my issue here for future readers.
I think that at this point it is safe to assume that there will be no crucial updates on ASRock’s part, as they seem to have completely let this issue go. Not only did my issue not progress at all in the last 2 months, in their last e-mail they flat-out denied that the BIOS layer has anything to do with anything ECC-related (which is not true, as they themselves confirmed that PFEH shows up on some Ryzen CPUs and not on others). They also concluded that as long as ECC reports are not being generated, the functionality is working, which is simply bogus: ECC reports are not generated on non-ECC RAM either, and that is anything but proof of ECC working on that non-ECC RAM.
So all in all, even though this is just my own opinion on this matter, I would advise against choosing the ASRock X570 Taichi as a motherboard for anything ECC-related, because although it is (even now) being marketed as fully supporting everything ECC-related and being a server-grade MB, even the support cannot get ECC reporting to work or confirm that it should work in theory.
To expand on what @aBav.Normie-Pleb said, AGESA is a firmware component supplied to OEMs by AMD. So when ASRock claims something like PFEH is not in the BIOS layer, they likely mean it’s not in the area they control and are able to customize.
Of course that does not let them off the hook for any claims of support — if they say ECC works, they should readily be able to demonstrate it.
Thank you @aBav.Normie-Pleb and @Quension for your helpful answers. Now at least I know the reason for PFEH not appearing in the BIOS. (Although the funny thing is that ASRock said that the PFEH option was shown for the 5650G PRO in the BIOS when they tried that CPU with the X570 Taichi.)
@aBav.Normie-Pleb: Although this may sound like a silly question, did you manage to get ECC reporting to work even without PFEH or did it only work with PFEH enabled? I am asking this because I also considered swapping the 5700X for a 5750G PRO if that would result in the ECC reporting working (the 5650G PRO is not really available in my country and I would also be losing some performance).
No, so far any BIOS that doesn’t show the PFEH option seems to have it automatically enabled (meaning certain correctable errors are masked by the motherboard and remain invisible to the operating system).
I’ve made a quick BIOS comparison between a 5950X and a 5750G (PRO) on the same motherboard model (ASUS ProArt X570-CREATOR WIFI, UEFI 1201) with the latest public AM4 AGESA version as of now (120A).
Both systems use 128 GB Micron ECC memory. To verify that Multi-bit ECC is actually working you can use memory testing software like Passmark Memtest that supports error injection. Of course that’s not optimal but at least you get to use ECC in a way.
Note that Ryzen 5000 CPUs and APUs don’t just differ in the PCIe version (Gen4 vs. Gen3); APUs also have more limited support for PCIe bifurcation (only up to 3 individual PCIe devices can be operated off of the main x16 lanes, so x8/x4/x4 is possible but x4/x4/x4/x4 is impossible).
Danger: The 5500 even without the G at the end is internally also just an APU where the graphics part has been disabled.
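To make the bifurcation limit above concrete, here is a small sketch that checks a lane split against a device limit. The function name, the set of allowed link widths and the limit of 4 devices for non-APU CPUs are assumptions for illustration; only the APU limit of 3 devices and the two example layouts come from the post.

```python
def layout_allowed(widths, max_devices, total_lanes=16):
    """Check whether a split of the main x16 lanes, e.g. [8, 4, 4], fits the limits."""
    allowed_widths = {4, 8, 16}  # assumption: only these link widths are offered
    if not widths or any(w not in allowed_widths for w in widths):
        return False
    if sum(widths) > total_lanes:
        return False
    return len(widths) <= max_devices


# APU limit from the post: at most 3 devices on the main x16 lanes
print(layout_allowed([8, 4, 4], max_devices=3))     # True
print(layout_allowed([4, 4, 4, 4], max_devices=3))  # False
# Assumed limit of 4 devices for a non-APU CPU
print(layout_allowed([4, 4, 4, 4], max_devices=4))  # True
```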
Thank you very much for the detailed comparison and your invested time! I really appreciate it. It is ironic though how much more information I got on this forum in a matter of days as compared to what has now been about two and a half months of constant e-mailing with ASRock… (For example, they could not even tell me what happens if PFEH is not visible in the BIOS: as in is it ON or OFF by default.)
Yes, I actually tried Passmark Memtest Pro to check error injection, but error injection is disabled on the 5700X. Do you perhaps have any information on whether it is available at all on the 5750G PRO? As far as I know, it is disabled on all 5000 CPUs.
Sorry for the barrage of questions, I am just asking this because if push comes to shove I am thinking of buying a 5750G PRO to get ECC reporting to work, and therefore I am trying to make sure that I do not end up with another CPU without the ECC reporting working yet again.
Yes, I read about this bifurcation limit, which led me to lean more towards the X CPUs instead of the PROs when I bought the CPU, but I think I would appreciate fully working ECC more now than bifurcation, as I am not using it at all currently.
I shall look at that, have to check if my PassMark Memtest Pro license is still valid.
To be honest I’ve become quite lazy regarding actually checking ECC. With Ryzen 3000 I did intensive testing, verified it to be working, and then I just went on “trusting” that AMD wouldn’t mess things up with AGESA updates.
The only check I nowadays do is if the OS is correctly reporting Multi-bit ECC and I’ve been using S3 suspend-to-RAM extensively (not the hybrid stuff) on a daily basis with 0 instability issues since Ryzen 2000 (when I moved over from Xeons).
@Log
Yes, honestly I have been wondering if I should actually invest in one of those. For now, I have postponed as it costs the same amount as a PRO CPU and since ASRock could not verify ECC working with a 5650G PRO either, it may as well just be the MB’s fault. I have even seen some workarounds on this forum, like using an extra ECC stick with some soldering and essentially shorting the stick to generate ECC errors. While this seems to work, I would much rather avoid using that method to avoid any potential problems and damaging any of my components.
@aBav.Normie-Pleb
Yes, honestly I have been thinking about just letting it go as is, since all this time looking at forum posts, and especially the time spent trying to get any non-standard answers from ASRock (apart from if it is on the product page, it should work in theory) is starting to seem like just extra time spent in vain. Maybe I should just “trust” it as it is, but had I known this, I would definitely not have spent this much on this specific MB just because it is being advertised to support ECC RAM and instead would have opted for other options that were literally half the price of this one.
MemTest86 Pro 10.4 states that the 5750G system also doesn’t support error injection (tried with PFEH on and off). It seems that overclocking your memory or using a physical device to provoke memory errors is the only way to check ECC on AM4 at present.
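If you go the overclocking route on Linux, one way to read the result is to snapshot the corrected/uncorrected error counters before and after a stress run and compare them. The sketch below assumes you already have such snapshots (for example from the EDAC counters in sysfs); `classify_ecc_run` and the snapshot format are hypothetical names made up for this example.

```python
def classify_ecc_run(before, after):
    """Compare two counter snapshots taken around a stress test.

    Each snapshot maps a memory controller name to a tuple of
    (corrected_count, uncorrected_count).
    """
    new_ce = new_ue = 0
    for name, (ce_after, ue_after) in after.items():
        ce_before, ue_before = before.get(name, (0, 0))
        new_ce += max(ce_after - ce_before, 0)
        new_ue += max(ue_after - ue_before, 0)
    if new_ue:
        return "uncorrected errors reported: back off the overclock"
    if new_ce:
        return "corrected errors reported: ECC reporting reaches the OS"
    # Zero new errors proves nothing: the memory may simply have been stable,
    # or corrected errors may have been masked before reaching the OS.
    return "inconclusive: no new errors were reported"


print(classify_ecc_run({"mc0": (0, 0)}, {"mc0": (5, 0)}))
print(classify_ecc_run({"mc0": (0, 0)}, {"mc0": (0, 0)}))
```

The "inconclusive" branch is deliberate: as discussed in this thread, an absence of reports is not proof that ECC works, and it is not proof that it doesn't.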
Well, that is not good news, that is for sure. But still, thank you for seeing this through and reporting back.
Just one last question. Earlier, you wrote that you experimented with 3000 series CPUs etc. Do you perhaps remember or know if error injection works on those CPUs (i.e. the older ones)? I am only asking this because I think I could get a 2700X without spending any extra money. Honestly, if error injection were to work on that and I could see the errors being handled, then maybe I would just swap the 5700X for that 2700X, even though the 5700X is better for my use case.
Yes, the physical testing device seems to be the logical solution. However, for my use case (homelab), I think it is going overboard (i.e. I would be spending the price of 2 CPUs on a 5750G PRO and a physical injector with no guarantees, since ASRock did not manage to confirm just about anything regarding ECC reporting), even though I would like some of my data to be kept safe.
Verified that ECC is working with the 5750G (PFEH OFF) by overclocking the system memory from DDR4-3200 to DDR4-3600 while staying at the default DDR4 memory voltages (1.2 V).
No crashes or non-correctable errors, MemTest86 Pro correctly noticed ECC intervening (hypothesis: Regular operating systems should also be able to properly log ECC errors):
Going by experience from tests a few years ago doesn’t seem wise here, since the AGESA feature set has been practically completely rebuilt by AMD over the AM4 lifecycle. PFEH was definitely there for non-APU Ryzen CPUs, but you should use the latest AGESA version due to various security fixes.
I’ve gifted all my Ryzen 2000 parts to family members so these parts are no longer available to toy with.
But I still got a few 3700X and a 3900 PRO to check out regarding what the current AGESA version is allowing to be shown in the BIOS settings.
The mentioned 5950X MemTest Pro tests are still running with overclocked settings (Infinity Fabric clock 1,800 MHz/DDR4-3600 with 1.2 V); “unfortunately” so far no errors at all have appeared. That might be due to the hidden PFEH being active and masking errors corrected by ECC, or maybe the memory is of higher quality compared to the 5750G system. The tests can run until Monday, then the system is actively needed again.
That 5950X is also very well cooled with custom watercooling, the 5750G on the other hand is passively cooled in a Streacom case.
Wow, thank you for the in-depth report! I honestly could not help but grin in happiness and agony at the same time seeing your results, as ASRock failed to achieve in two and a half months what you managed to prove in under a day or so: they communicated that they had a 5650G PRO, where they confirmed that PFEH shows up in the BIOS, but they were unable to get ECC errors to appear in the logs and could therefore not prove that ECC errors get reported on the MB.
I especially appreciate the screenshots, as those also seem to support my idea that if ECC reporting is working, then ECC reports should start being generated relatively soon if the system is not stable. For comparison, I spent many hours with my 3200 MHz RAM overclocked all the way up to 4000 MHz (which I would deem as a massive overclock, especially on ECC RAM), running tests in MemTest PRO, Linux and Windows, and failed to see any reports being generated on any OS or software.
Your advice regarding the previous generation CPUs is also logical. Honestly, that was more like a last resort on my side, but I soon realised that it would also introduce some problems (e.g. I would need to downgrade the BIOS for the MB to recognize the previous CPU, which would also bring back the noisy heatsink fan since fan curve optimization was introduced in later BIOS versions, not to mention the possibility of bricking the MB in the process).
All in all, I think I will just order a 5750G PRO and see if I can see any ECC errors generated that way. This is of course not guaranteed, as I do not know the reason why ASRock support could not see any ECC errors: was the support simply incompetent and/or did not increase the clock speed enough, or does the MB have some problems regarding ECC reporting? I guess I will have to wait and see.
If the 5950X does not yield any ECC errors, that may also suggest that buying PRO CPUs should be the way to go (even though the other CPUs (e.g. the 5700X) are also listed as supporting ECC on their product pages).
At first I wasn’t getting any errors. I had to undervolt the memory until the system wouldn’t POST, then increase the voltage again by one step, so that it would be reasonably doubtful that the memory could be stable under a stress test.
So, verified that ECC is also kicking in with a 5950X, unknown PFEH status, and the platform is giving the corresponding feedback to the operating system (motherboard: ASUS ProArt X570-CREATOR WIFI, UEFI 1201, AGESA 120A):
Glad this topic could be resolved and that I haven’t been spouting non-facts since I previously tested ECC with Ryzen 3000 stating ECC worked on AM4.
Can’t say anything about the current state of ASRock except in my opinion their motherboards have become worse feature-wise (currently ECC on AM5 seems to be messed up on ASRock motherboards, working on ASUS motherboards - I can’t verify this since I don’t have an AM5 system yet).
I also know the frustrations of being jerked around by incompetent Tech Support from billion dollar corporations, I’ve been pissed at Broadcom for over a year: