I’m researching Registered vs Unbuffered RAM. Some sources suggest the only difference is that registered RAM allows more of it. I found this however:
where it says:
“ECC UDIMMs provide a limited reliability and are known to cause data corruptions and system crashes due to single bit errors and single DRAM failures. RDIMM modules offer a comprehensive RAS solution including parity and availability of extended ECC, which minimize these issues.”
So it seems like there’s more to the story than just capacity. But after a while searching for additional info/explanation, I’ve come up empty.
Anyone know more on this than that article? This article doesn’t make sense to me because the whole point of ECC is for single bit error correction, not crashing!
But then there’s this article:
which basically talks about additional RAS features on their Scalable chips which only seemingly support RDIMMs.
That EE Times article (great magazine btw.) is 10 years old. And ECC UDIMMs have single bit error correction (nowadays). Don’t know what DDR3 was like.
RDIMMs have the capacity and bandwidth advantage. That’s why you see 8x 64GB DIMMs in servers all running at 3200MT/s without any problems. EPYC specs even list 2DPC config with 2933MT/s for 16x DIMMs. That’s insane compared to non-ECC consumer memory. That’s what “higher performance” means in marketing brochures. And the option to get 128GB DIMMs. 256GB modules are too special and cost as much as a normal server. They still make good marketing material for max RAM size on boards.
If you can get RDIMMs, get RDIMMs. Support is limited to server platforms however, so any other boards will have to work with UDIMMs.
There is no real choice other than choosing another board/CPU. And if you want ECC, you get the ECC that’s compatible.
The biggest problem with going RDIMM is that it’s only supported on high end CPUs. I need something much less fire breathing. Xeon Silvers for example… 150W 10 core. That’s double to almost triple the computation/power that I need.
In a UDIMM, the CPU’s memory controller communicates directly with each memory bank on the DRAM module. This can put a lot of load on the memory controller, especially on the electrical side, leading to signal integrity issues at higher clock speeds or with a higher number of memory chips. Generally, there are two kinds of chips on these UDIMM modules: the memory chips themselves, and an SPD (Serial Presence Detect) chip.
An RDIMM adds another kind of chip to the DRAM module, called an RCD (Register Clock Driver). The CPU’s memory controller connects its control lines to the RCD, and the RCD is responsible for communicating with each memory bank on behalf of the memory controller. This frees up the memory controller, allowing for more DRAM modules, and higher capacity per module, without sacrificing signal integrity.
The downside of adding an RCD is that it takes a few clock cycles to relay the command (thus the “Buffered”/“Unbuffered” distinction). In both UDIMM and RDIMM, data lines are still connected directly to the memory controller. LRDIMM, on the other hand, introduces a Data Buffer (DB) chip that sits between the data lines and the memory chips.
DDR5 RDIMM introduced an on-module PMIC to supply a stable voltage (JEDEC only allows up to 33mV of fluctuation) for even better stability and more granular power management.
Unfortunately I’m none the wiser. I’m not sure what it all means. If UDIMMs can’t detect parity, then how do they detect memory errors? How are they still ECC?
I don’t think ECC is ever perfect. AFAIK, it can correct 1 bit per byte on the bus, whether with UDIMM, RDIMM, or LRDIMM, and correctly detect 2 bits of error per byte, informing the system that the data is no longer reliable. Past that, isn’t everything more or less screwed? If your memory is failing more than a quarter of bits in a word, you’re in for a bad time, but this basically should only happen if you’re looking at hardware failure or you’re overclocking.
Memory corrupts your data, ECC just corrupts slower. Good enough for most stuff. There are systems out there covering more bits, but that gets ludicrously expensive really quick. You can always run a RAMDisk in RAID1 for cheapskate memory paranoia.
Given how rare it is for single bit errors to occur with stable modules, an ECC scheme that requires 3 errors to be present for data corruption provides so much statistical protection that it’s not really worth concern. And if a system is so unstable that it can realistically generate 3 errors at once in a group, it’s not going to be generating only those 3 errors (unless you are trying to deliberately engineer errors with rowhammer attacks or something), it’s going to be spewing out all sorts of errors that will fill the logs and likely quickly crash the system on 2 errors detected.
Basically the worry is an academic one, especially when the only available alternative is to choose no protection at all or to implement some kind of slow, expensive, and power hungry mirroring solution (which apparently do exist).
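To put rough numbers on that argument, here is a minimal sketch that computes the chance of seeing at least k flipped bits in one 72-bit word. The per-bit flip probability `p_bit` is a made-up illustrative value, not a measured rate, and the model assumes flips are independent, which real failing hardware does not obey (as noted above, a genuinely unstable system throws correlated errors).

```python
from math import comb

def p_at_least(k, n=72, p_bit=1e-9):
    """Probability of k or more flipped bits in an n-bit word.

    Assumes independent flips with probability p_bit per bit
    (p_bit is a hypothetical value chosen only for illustration).
    """
    return sum(
        comb(n, i) * p_bit**i * (1 - p_bit) ** (n - i)
        for i in range(k, n + 1)
    )

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"P(at least {k} flips in a 72-bit word) = {p_at_least(k):.3e}")
```

Under these assumptions each extra simultaneous flip costs many orders of magnitude in probability, which is the point being made: needing 3 errors in one word is a very high bar for a stable module.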
Do note, there are multiple ECC schemes floating around, which have different levels of capability, and if I have understood properly, some schemes may even vary in how well they can actually detect/correct errors based on the exact circumstances like the locations of the errors.
It is 1 or 2 bits per 72-bit stride (9 chips of 8 bits each), of which the processor uses 64 bits, i.e. errors in either the data or the parity bits can be detected and fixed.
Some time ago I wrote a program to test all possible bitflip results for Hamming 72/64 SECDED:

| Bits corrupted | Detect 0 bits | Detect 1 bit | Detect 2 bits | Explanation |
|---|---|---|---|---|
| 1 bit | 0 | 100% | 0 | All 1-bit errors detected and corrected |
| 2 bits | 0 | 0 | 100% | All 2-bit errors detected, but not corrected |
| 3 bits | 0 | 56.4% | 43.6% | All 3-bit errors detected, but most are incorrectly detected as correctable and badly corrected |
| 4 bits | 0.0082% | 0 | 99.18% | Most errors detected, none incorrectly fixed, but some errors missed |
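For anyone who wants to reproduce this kind of test, here is a minimal sketch of an extended Hamming (72,64) SECDED code with an exhaustive bit-flip tally. It is not the program used for the table above: the bit layout (overall parity at bit 0, Hamming parity at the power-of-two positions) and all function names are my own assumptions, and real memory controllers may use a different code construction, so the 3-bit and 4-bit percentages need not match.

```python
from collections import Counter
from itertools import combinations

N = 72                                    # bit 0: overall parity, bits 1..71: Hamming code
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]     # 7 Hamming parity bits
DATA_POS = [p for p in range(1, N) if p not in PARITY_POS]  # 64 data bits

def syndrome(word):
    """XOR of the positions (1..71) of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, N):
        if (word >> pos) & 1:
            s ^= pos
    return s

def extract(word):
    """Pull the 64 data bits back out of a 72-bit codeword."""
    return sum(((word >> pos) & 1) << i for i, pos in enumerate(DATA_POS))

def encode(data):
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            word |= 1 << pos
    s = syndrome(word)
    for p in PARITY_POS:                  # set parity bits so the syndrome becomes 0
        if s & p:
            word |= 1 << p
    if bin(word).count("1") & 1:          # overall parity makes the total weight even
        word |= 1
    return word

def decode(word):
    s = syndrome(word)
    odd = bin(word).count("1") & 1
    if s == 0 and not odd:
        return "clean", extract(word)
    if odd:                               # odd weight: treated as a single-bit error
        if s == 0:
            word ^= 1                     # the overall parity bit itself flipped
        elif s < N:
            word ^= 1 << s                # flip the bit the syndrome points at
        else:
            return "uncorrectable", None  # syndrome points outside the word
        return "corrected", extract(word)
    return "uncorrectable", None          # even weight, nonzero syndrome: double error

def tally(data, n_flips):
    """Classify the decoder's verdict for every possible n_flips-bit corruption."""
    cw, counts = encode(data), Counter()
    for bits in combinations(range(N), n_flips):
        bad = cw
        for b in bits:
            bad ^= 1 << b
        status, out = decode(bad)
        if status == "clean":
            status = "missed"
        elif status == "corrected" and out != data:
            status = "miscorrected"
        counts[status] += 1
    return counts

if __name__ == "__main__":
    word = 0x0123456789ABCDEF
    for n in (1, 2):
        print(n, dict(tally(word, n)))
```

Calling `tally(word, 3)` or `tally(word, 4)` works the same way, though 4 flips means about a million combinations, so expect it to take a while in pure Python.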
Adding to the replies you’ve already gotten, SDRAM ECC is a lot simpler than people tend to expect: the DIMM does nothing, it just has a little extra storage. The host memory controller is what uses that storage to perform ECC (or not).
A non-ECC DDR4 DIMM has 64 bits (8 bytes) stored at each address. An ECC one has 72 bits (9 bytes). The memory controller (which is on the CPU package these days) tells the rest of the system it’s still only 64 bits of data, and takes the extra 8 bits itself to store and execute ECC.
The exact sizes change across SDRAM generations and memory layouts, but the principle is the same. UDIMM, RDIMM, LRDIMM etc are about the communications interface as a whole, and don’t change anything for ECC.
Not to complicate this too much further, but if you have a few hours to spare this paper clearly explains the state of the various ECC implementations (there are many now) with DDR5. ECC is very different now than it used to be in the past; much more logic and mitigation techniques are within the DIMM now and not just the memory controller.
One of the more interesting snippets from the paper is the observed failure modes and their respective occurrence rates:
The paper is a proposal for a modified DIMM format based on the DDR5 standard. At a high level, what DDR5 adds is an evolution of self-test and diagnostics for the DIMM itself, focused on managing hardware defects. It happens to be using ECC (which is a general technique), and is definitely an advancement, but it’s internal to the module. It’s not related to the system’s use of ECC to manage data integrity (what the paper refers to as “rank-level ECC”). The paper’s argument is essentially that the current DDR5 standard is inefficient due to separation from the memory controller, and that closer integration would reduce that inefficiency.
The table quoted is data from a study of 149,504 2GB DDR2 DIMMs in the Jaguar HPC cluster at Oak Ridge National Laboratory, conducted 2009-11 to 2010-10. The paper used this data as input to simulations for its proposal.
Today a DDR5 “ECC DIMM” is still just extra storage for use by the memory controller.
The on-die ECC that bog standard DDR5 employs is true ECC that can detect and correct single bit flips in a codeword due to cosmic rays, neutron radiation, etc.; it is just not as robust as the previous generation’s rank-level ECC. If I recall, on-die ECC uses 8 parity bits for every 128 bits of data, while “real”/legacy ECC on DDR5 uses either 4 or 8 parity bits per 32 data bits transferred (EC4 and EC8). Another disadvantage of on-die ECC compared to rank-level ECC is that it only corrects errors within the DIMM and can’t account for errors that occur on the memory bus like rank-level ECC can.
There is a lot of misconception about what DDR5 on-die ECC does; a bunch of the tech press incorrectly claimed it was only there to improve DDR5 yields, and a bunch of product marketing misleadingly claims that their memory is ECC now, when that term should probably only be used for ECC that involves the CPU memory controller and reports events.
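As a rough way to compare those layouts, here is a small sketch that works out the storage overhead of each scheme and the textbook minimum number of check bits a Hamming-style code needs for a given data width. The layout figures are the ones quoted above (from memory, so treat them as such); the function names are mine, and real controllers can use other code families (for example symbol-based codes spread over a whole burst), so this says nothing definitive about any specific product.

```python
def sec_check_bits(k):
    """Minimum check bits r for single-error correction over k data bits.

    Hamming bound: the syndrome must name every bit position plus
    'no error', so 2**r >= k + r + 1.
    """
    r = 1
    while 2**r < k + r + 1:
        r += 1
    return r

def secded_check_bits(k):
    """SECDED adds one overall parity bit on top of SEC."""
    return sec_check_bits(k) + 1

# (data bits, check bits) as quoted in this thread
LAYOUTS = {
    "DDR4 rank-level ECC": (64, 8),
    "DDR5 on-die ECC": (128, 8),
    "DDR5 EC4": (32, 4),
    "DDR5 EC8": (32, 8),
}

if __name__ == "__main__":
    for name, (data, check) in LAYOUTS.items():
        print(f"{name}: {check}/{data} bits = {100 * check / data:.2f}% overhead")
    for k in (64, 128):
        print(f"{k} data bits: SEC needs {sec_check_bits(k)}, "
              f"SECDED needs {secded_check_bits(k)} check bits")
```

By this bound, 8 check bits over 64 data bits is enough for SECDED, while 8 check bits over 128 data bits is only enough for single-error correction in a Hamming-style code, which fits the point that on-die ECC is less robust than rank-level ECC.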