I feel like this issue is brought up a lot but somehow I still don’t quite know the answer to this. I am building a new system with a 12 or 16 core ryzen 9 and would like ecc memory.
Because the memory will be on the “slower” side (5200 MT/s or less from what I have seen), can I use all 4 DIMM slots? Assume the mobo is configured for stability and efficiency and that I have a preference for $/GB savings. And, if so, can I mix kits?
I assume that selecting for ddr5 and ecc on newegg’s filter is just returning all ddr5 because of the on-die ECC required by the DDR5 standard? When I just search the site I get totally different results. Yes, this is my first time trying to purchase ecc. How did you know?
Where do you purchase your ecc and what brand is typical for non professional use?
Will I even notice a difference in the real world (just a guy running jellyfin, family backup, old man tier gaming, numerous low use containers, and aspirations for some AI tinkering) if I save some coin on the slower stuff?
This will refer to ddr5. (I had to research it so others may not know)
There are UDIMMs and RDIMMs. The “R” stands for registered: RDIMMs carry a register clock driver (RCD) chip that is visible on the module, and they are also referred to as buffered DIMMs. UDIMM and RDIMM are keyed differently and are not interchangeable. The “U” in UDIMM stands for unbuffered: it does not have this clock driver.
RDIMMs are for server use. UDIMM for everything else.
There is “on die” ecc on both varieties of module. When data is written, the module calculates ECC bits and stores them alongside the data. On a read it uses those ECC bits to verify that no bits flipped while in storage. Single-bit errors can be corrected, but multi-bit errors are beyond its ability to fix.
And here is where things got a bit confusing for me. “True ECC” requires that the payload travels between the module and the cpu intact, and this is done by the cpu. It generates ecc bits, stores them alongside the payload sent to the memory module, and checks them when it reads data back. “But”, you ask, “where are these additional bits stored?” Well, now we know what differentiates “ecc” and “non-ecc” UDIMMs. The ecc UDIMMs have additional DRAM chips to hold the ecc bits. The module itself is oblivious to their meaning and simply calculates its “on-die” ecc over a slightly larger collection of bits.
This process requires the cpu’s memory controller and the mobo to be able to do the job. I don’t know about intel, but newer ryzen chips can, and many (but not all) of their mobos do as well.
How to identify true ecc UDIMM? They will have 10 DRAM packages, not 8.
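If you want to double-check what you actually ended up with, one Linux sanity check is comparing the “Total Width” and “Data Width” fields that `sudo dmidecode -t memory` reports per DIMM; a total width larger than the data width means extra DRAM packages holding sideband ECC bits are present and wired up. A rough sketch (the sample text below is made up to mimic dmidecode’s output format):

```python
# Hedged sketch: spot sideband ECC by comparing "Total Width" vs
# "Data Width" per DIMM in `dmidecode -t memory` output. Total > data
# means extra DRAM for ECC bits (e.g. 72 vs 64 on DDR4 ECC UDIMMs,
# 80 vs 64 on DDR5 EC4 UDIMMs).
import re

def dimms_with_ecc(dmidecode_text: str) -> list[tuple[int, int]]:
    """Return (total_width, data_width) pairs where total > data."""
    widths = []
    total = None
    for line in dmidecode_text.splitlines():
        line = line.strip()
        m = re.match(r"Total Width:\s*(\d+)\s*bits", line)
        if m:
            total = int(m.group(1))
        m = re.match(r"Data Width:\s*(\d+)\s*bits", line)
        if m and total is not None:
            widths.append((total, int(m.group(1))))
            total = None
    return [(t, d) for t, d in widths if t > d]

# Made-up dmidecode-style output: one ECC DIMM, one non-ECC DIMM.
sample = """
Memory Device
        Total Width: 80 bits
        Data Width: 64 bits
Memory Device
        Total Width: 64 bits
        Data Width: 64 bits
"""
print(dimms_with_ecc(sample))  # → [(80, 64)]
```

If the two widths come back equal everywhere, the extra bits either aren’t there or aren’t enabled in the BIOS.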
Newegg’s filter only returns Kingbank branded memory when “ecc” is selected. I am guessing someone entered that data incorrectly. Searching instead for “unbuffered ecc ddr5 udimm” does kind of work.
Also, now I am not entirely convinced that true ecc is actually needed. Perhaps just buying / running ddr5 a bit slower is enough to ensure the memory controller does not make mistakes? Of course you would never know if it did make a mistake unless it crashed something. Anyone have any data on this?
I needed to look up a lot of this so others may find it useful (and I may have misunderstandings… sorry).
ODEC2 = On-die ECC (ODECC) ?
SEC = Single Error Correction. That is, a single bit flip will be found and corrected.
DED = Double Error Detection. Multiple bit flips will be detected but not corrected. (see SECDED)
EC4 / EC8 = The number of correction bits per subchannel. EC4 appears to be common in UDIMMs. A subchannel is a division of the DIMM stick itself for increased performance.
chipkill = Detection / correction of failing DRAM chips along with failure tolerance. I have no idea if true ECC memory typically has this feature or it is even supported by any ryzen boards.
Trying to summarize @lemma’s post (best to read it yourself), true ECC is probably not worth it because:
Mobo memory controllers seem very reliable and are not likely to make a mistake.
The DDR5 standard requires additional CRC. The CRC ensures that the operation (read, write, etc.) and the memory addresses issued by the MC arrive intact. It does not verify the payload data (that’s the true ECC job). So bit flips in the “what” and “where” of each transaction are already checked.
Keeping a system cool and set to speeds defined in the standard keeps bit flip chances low.
Unless you are operating in a harsh environment (increased radiation causing flips) or cannot stand down time for memory checks / swaps, the benefit of true ECC may not be worth it.
True ECC adds SEC and DED both for data at rest in the memory itself and for data transferred on the memory bus. But I don’t know that it always reports these errors on ryzen. Probably it reports errors on the transferred data, but I don’t know about errors on the module itself. (Running Ubuntu.)
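For what it’s worth, on Ubuntu the place these reports would surface is the kernel’s EDAC subsystem, which exposes per-controller corrected / uncorrected counters in sysfs. A hedged sketch of polling them (this is the standard EDAC sysfs layout; on Ryzen it only works if the amd64_edac driver actually binds to the IMC, which on many consumer boards it never does):

```python
# Hedged sketch: read corrected (ce_count) and uncorrected (ue_count)
# error counters from the kernel EDAC sysfs tree. If no EDAC driver has
# bound to the memory controller, the directory simply won't exist.
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def edac_counts(root: Path = EDAC_ROOT) -> dict[str, dict[str, int]]:
    """Map each memory controller (mc0, mc1, ...) to its error counts."""
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC driver bound -> nothing to report
    for mc in sorted(root.glob("mc[0-9]*")):
        counts[mc.name] = {
            name: int((mc / name).read_text())
            for name in ("ce_count", "ue_count")
            if (mc / name).exists()
        }
    return counts

if __name__ == "__main__":
    mcs = edac_counts()
    if not mcs:
        print("No EDAC memory controllers found (driver not loaded?)")
    for name, c in mcs.items():
        print(name, c)
```

A nonzero `ce_count` that creeps up over time is exactly the kind of early warning true ECC is supposed to buy you.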
Practically I probably don’t need true ecc. But I will probably buy it anyway just to be able to play around with it.
EC2 supports SEC. EC4 supports SECDED. EC8 chipkill is also true ECC.
A more useful way of looking at it is probably that there isn’t fake ECC. Not that that’ll stop forum doomers from going on about how ODEC2+CRC must be terrible because it’s not identical to EC4. Which I guess would make EC8 god tier if they’d heard of it.
What I do find weird is there doesn’t seem to be a way to get SEC logging from ODEC2 ECS (Error Check and Scrub). DRAM datasheets I’ve looked at indicate correction counters but I don’t know of an OS that hooks up to them.
Um, no, DDR5 CRCs are eight bits covering 64 bits of data. Implementations I’ve been able to access documentation for send ATM HECs as transfers 17 and 18 in a burst and therefore detect odd-count, double, and quite a few other even-count errors.
As best I’ve been able to tell you need to be under NDA to find out what exactly AMD (or Intel) does on the IMC.
All MSI boards, last I checked, and most Asus and Gigabyte boards don’t support bus EC4. Asus also seems characteristically flaky, including reports here of on-QVL ECC DIMMs not POSTing. ¯\_(ツ)_/¯
The other thing is that UDIMMs should run without errors, so in a well-configured and tested system there’s normally nothing to report.
Both. An IMC has no way of differentiating transmission errors from on-DRAM errors.
As chipkill requires EC8, searching for DDR5 RDIMMs that aren’t EC4 is currently needed.
So CRC is calculated on everything? Then on a read the memory controller performs a CRC check on all the bits sent back from the memory? Sorry for not getting this, but then, aside from the reporting, what is the value of ECC? Isn’t it redundant?
Definitely DQ[31:0]_A or DQ[31:0]_B. If CB[3:0]_A|B or CB[7:0]_A|B are in use I presume those are also CRCed but I haven’t been able to find documentation outside of JEDEC’s paywall specifically confirming that. Yes, the IMC performs CRC checking on reads.
CRCs perform error detection. If you want correction ECC is required but, since ODEC2 and EC4 are both SEC, what EC4 offers is DED, SECing DRAM bit flips which happened since the most recent ECS, and the likely negligible performance advantage of SECing transmission rather than retrying a read burst. Also, since OSes don’t seem to report CRC errors or ODEC2 SECs, ECC offers a monitoring workaround. But, as transmission errors are rare and ODEC2 SEC preempts most bus ECC SECs of DRAM content, bus EC4 is primarily a mechanism for DDR5 DRAM DED reporting.
Not entirely, but if the value proposition for DDR5 EC4 is as strong as it was for DDR4 EC4, I’m failing to figure out how, even allowing for the current documentation gaps and the software misses in monitoring.
There aren’t many recent, open analyses of DRAM reliability at datacenter scale but, historically, whole-chip failure (what chipkill addresses) was like the second most common failure mode. So EC8 RDIMMs seem to make sense.