From reading reports from sysadmins dealing with multiple terabytes of RAM across many systems, they’ll see a handful of errors a year, maybe.
Decades ago there were bad batches of RAM whose materials contained a bit too much trace radioactive content, and those would produce occasional errors.
All in all, as long as the RAM is properly cooled, a consumer shouldn’t expect to see RAM errors regularly, though you may have a few over the lifetime of the system.
I was overclocking 64 GB of ECC RAM in a Threadripper system for years. Once I got it truly stable: no errors.
I don’t see a link to the study the memory seller was referring to, but,
Last I heard, it was a lot less frequent than that.
A large-scale study of Google servers found that roughly 32 percent of all servers (and 8 percent of all DIMMs) in Google’s fleet experience at least one memory error per year.
I guess we are now in the era where manufacturing/material defects are mostly caught before the memory gets to a customer, so bit flips will be caused by external events. The old ‘must have been cosmic rays’ excuse might finally have some merit.
On a more serious note, for workstations the ‘must have ECC RAM’ requirement feels like it might now be optional… I’d still want it in my servers that are up 24/7, though!
Like the others have said, I’ve heard quotes of maybe 2 or 3 a year.
The thing is, to us they’re not a problem and the volume is tolerable. For a high-accuracy scientific project, it’s absolutely critical that it never happens.
Edit: there will also be scale to consider. In 16 or 32 GB of RAM that’s a pretty small “surface area”, but in servers you can have terabytes across multiple rooms/buildings/locations that may be dependent on each other’s data, and the room for error increases.
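To put rough numbers on the “surface area” point: if each DIMM independently has some yearly chance p of seeing at least one error (the ~8% per-DIMM figure from the Google study quoted above is one plausible value, but treat it as an assumption), then the chance that at least one DIMM in a fleet of N gets hit grows as 1 - (1 - p)^N. A quick illustrative calculation:

```c
/* Back-of-the-envelope scaling of memory-error risk with fleet size.
 * The per-DIMM probability is an assumption, not a measurement.
 * Build: cc -O2 dimm_risk.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Assumed chance that one DIMM sees >= 1 error per year
     * (roughly the 8% figure from the Google study). */
    double p = 0.08;

    /* From a small desktop up to a multi-rack fleet. */
    int fleet_sizes[] = {2, 8, 100, 1000, 10000};
    int count = (int)(sizeof fleet_sizes / sizeof fleet_sizes[0]);

    for (int i = 0; i < count; i++) {
        int n = fleet_sizes[i];
        /* P(at least one DIMM affected) = 1 - P(none affected) */
        double p_any = 1.0 - pow(1.0 - p, n);
        printf("%5d DIMMs -> %5.1f%% chance of at least one error per year\n",
               n, 100.0 * p_any);
    }
    return 0;
}
```

With those assumed numbers, a two-DIMM desktop sits around 15% per year, while a fleet of a thousand DIMMs is essentially guaranteed to see errors.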
I’ve seen certain environmental conditions cause memory errors (both correctable and uncorrectable by ECC) – for example a server located on a site with an AM transmitter (the absolute WORST is an AM transmitter!) or a TV transmitter, etc. Anything that transmits with a ton of power. Any strong EM field can cause issues as well. Grounding usually mitigates this, with the exception of the AM tower scenario, which required an “exotic” coordinated setup with the RF engineer to get things all happy.
Beyond external factors, what about just unstable hardware? You can have a “slightly worse” motherboard, PSU, memory chips, etc., which is more likely (though not guaranteed) to produce an error.
I’ve had a single bit irrecoverably flip in 16 GB of DDR3 RAM after ~7 years of desktop usage (and I had run memtest86+ before, so I know the fault appeared later).
@merlino867 you’re waiting for a miracle with that approach: hoping the DRAM refresh somehow messes up and flips a bit to the LOW state. How about trying an approach like RowHammer?
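If you do want to try hammering, the core access pattern is roughly the sketch below (an illustrative C sketch, assuming x86 with clflush available; the buffer size and the two offsets are arbitrary guesses at “different rows, same bank” – a real attack reverse-engineers the DRAM address mapping, tries many address pairs, and modern DIMMs with TRR often won’t flip at all):

```c
/* Minimal RowHammer-style access loop (illustrative only).
 * Build: cc -O2 hammer_sketch.c */
#include <emmintrin.h>   /* _mm_clflush */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS 1000000L

/* Hammer two addresses that hopefully map to different rows in the
 * same DRAM bank. Flushing the cache lines each round forces every
 * access out to DRAM instead of being served from the CPU cache. */
static void hammer(volatile uint8_t *a, volatile uint8_t *b)
{
    for (long i = 0; i < ITERATIONS; i++) {
        (void)*a;
        (void)*b;
        _mm_clflush((const void *)a);
        _mm_clflush((const void *)b);
    }
}

int main(void)
{
    size_t len = 64 * 1024 * 1024;          /* 64 MiB scratch buffer */
    uint8_t *buf = malloc(len);
    if (!buf) return 1;

    for (size_t i = 0; i < len; i++)
        buf[i] = 0xFF;                      /* all bits set to 1 */

    /* Offsets a few MiB apart are a crude guess at "different rows,
     * same bank"; picking them properly is most of the real work. */
    hammer(buf + 1 * 1024 * 1024, buf + 9 * 1024 * 1024);

    /* Any byte that is no longer 0xFF had a bit flip from 1 to 0. */
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0xFF)
            printf("bit flip at offset %zu: 0x%02x\n", i, buf[i]);

    free(buf);
    return 0;
}
```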
That’s interesting, so you lost a byte of memory out of the 16 GB forever? I’m assuming it can’t just stop using 1 bit but has to section off more than that.
A single bit leads to a single byte, which leads to a single page (4K), yes. So I lost 4K of memory (yet most people would throw away the entire DRAM stick). There are little-known workarounds for faulty memory: BadRAM on Linux and Windows’ bcdedit page blacklisting - these let you keep using “unusable” RAM perfectly fine.
Slightly off topic, but a set of PCs is about to be sent to the ISS to test the stability of average hardware against the dangers of space, like “cosmic rays”, and to see whether software can deal with or fix the errors.
So that should hopefully show up some worst-case scenarios when it concludes… in 2-3 years.
Try to warm it up (reduce airflow over the RAM; try to get it to 60-80 °C if you can).
Keep reading and writing over and over again for a few hours (see the sketch below).
If your RAM is good, it should probably survive memtest just fine; if either the controller or the RAM is flaky, you’ll see some bits flipped.
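For the read/write part, the inner loop can be as simple as this sketch: fill a big buffer with a pattern, read it back, and repeat for hours while the modules are warm. Purely illustrative (buffer size, patterns, and run time are arbitrary choices here); a real tester like memtest86+ exercises addresses and access patterns far more thoroughly:

```c
/* Crude RAM soak test: write a pattern, verify it, repeat forever.
 * Stop it with Ctrl-C after a few hours. Build: cc -O2 soak.c
 * Assumes a 64-bit system; shrink len to fit your free RAM. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Use as much RAM as you can spare so the data actually lives in
     * DRAM rather than in CPU caches. */
    size_t len = (size_t)4 * 1024 * 1024 * 1024;   /* 4 GiB */

    /* volatile so the compiler really performs every load and store. */
    volatile uint64_t *buf = malloc(len);
    if (!buf) { perror("malloc"); return 1; }

    size_t words = len / sizeof(uint64_t);
    unsigned long flips = 0;

    for (unsigned pass = 0; ; pass++) {
        /* Alternate patterns so every bit is exercised in both states. */
        uint64_t pattern = (pass & 1) ? 0xAAAAAAAAAAAAAAAAULL
                                      : 0x5555555555555555ULL;

        for (size_t i = 0; i < words; i++)
            buf[i] = pattern;

        for (size_t i = 0; i < words; i++)
            if (buf[i] != pattern) {
                flips++;
                printf("pass %u: mismatch at word %zu: 0x%016llx\n",
                       pass, i, (unsigned long long)buf[i]);
            }

        printf("pass %u done, %lu mismatches so far\n", pass, flips);
    }
}
```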