Cannot find any bit flips in memory, how often do they really happen?

I have been trying to find a bit flip in memory and have failed to do so in over 72 hours.

The process I used was to create a 10GB block device in memory:

modprobe brd rd_nr=1 rd_size=10000000

fill it with random bits:

pv /dev/urandom > /dev/ram0

and monitor it for changes by occasionally checksumming it with a cryptographic hash:

pv /dev/ram0 | b3sum

I chose b3sum because it’s fast and makes some use of multi-threading. pv is just a nice visual pipe-progress tool.
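
The occasional checksumming above could be automated along these lines. This is only a minimal sketch: the function names `digest` and `watch_for_flips` and the `HASH_CMD` variable are made up for illustration, and the interval and round count in the example are arbitrary.

```shell
#!/bin/sh
# Hash tool to use; defaults to b3sum as in the commands above.
HASH_CMD="${HASH_CMD:-b3sum}"

# Print only the hex digest of a file or block device.
digest() {
    "$HASH_CMD" < "$1" | awk '{print $1}'
}

# Hash DEVICE once as a baseline, then re-hash it every INTERVAL seconds
# for ROUNDS rounds. Returns 1 on the first digest that differs.
watch_for_flips() {
    wf_device="$1"; wf_interval="$2"; wf_rounds="$3"
    wf_baseline="$(digest "$wf_device")"
    wf_i=1
    while [ "$wf_i" -le "$wf_rounds" ]; do
        sleep "$wf_interval"
        wf_current="$(digest "$wf_device")"
        if [ "$wf_current" != "$wf_baseline" ]; then
            echo "round $wf_i: MISMATCH baseline=$wf_baseline current=$wf_current"
            return 1
        fi
        echo "round $wf_i: ok"
        wf_i=$((wf_i + 1))
    done
    return 0
}

# Example (as root): one check per hour for 72 hours.
# watch_for_flips /dev/ram0 3600 72
```

A mismatch only says that something in the device changed, not which bit, so a follow-up comparison against a saved copy would be needed to locate it.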

According to this article, “How often do ECC-correctable single-bit errors occur and how about double/multi-bit errors?” (Intelligent Memory):

“This converts into an average of one single-bit-error every 14 to 40 hours per Gigabit of DRAM.”

This seemed very concerning but I have been unable to find a single bit error in 72 hours in 80 Gbits of RAM.
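
To put numbers on that mismatch, here is a rough calculation of how many errors the quoted rate would predict for this test. It assumes the rate scales linearly with both memory size and time; the function name `expected_errors` is made up for illustration.

```shell
#!/bin/sh
# Expected single-bit errors for a given amount of memory and observation time.
# $1 = Gbit monitored, $2 = hours observed, $3 = hours per error per Gbit
expected_errors() {
    awk -v gbit="$1" -v hours="$2" -v per="$3" \
        'BEGIN { printf "%.0f\n", gbit * hours / per }'
}

expected_errors 80 72 40   # slow end of the quoted range, prints 144
expected_errors 80 72 14   # fast end of the quoted range, prints 411
```

So the quoted figure predicts somewhere between roughly 144 and 411 errors over this run, against zero observed.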

Since this study seems highly unreliable, does anyone have more accurate numbers on how often these single bit errors occur?

From reading reports from sysadmins dealing with multiple terabytes of RAM across multiple systems, they’ll see a handful a year, maybe.

Decades ago there were bad batches of RAM made from materials containing slightly too much of some trace radioactive elements, and those would produce occasional errors.

All in all, as long as RAM is properly cooled, a consumer shouldn’t expect to see RAM errors regularly, though you may have a few over the lifetime of the system.

I was overclocking 64GB of ECC RAM in a Threadripper system for years. Once I got it truly stable, no errors.

I don’t see a link to the study the memory seller was referring to, but last I heard, errors were a lot less frequent than that.

A large-scale study of Google servers found that roughly 32 percent of all servers (and 8 percent of all DIMMs) in Google’s fleet experience at least one memory error per year.

Newish Ars piece

Quoting this study

Modern memory is really reliable, so it doesn’t surprise me that you haven’t had any errors yet.

Puget Systems has a really old blog post (published in 2013) about how memory reliability had improved: Advantages of ECC Memory

There was this comment with more up-to-date stats:

I guess we are now in the era where manufacturing and material defects are mostly caught before the memory gets to a customer, so bit flips will be caused by external events. The old ‘must have been cosmic rays’ excuse might finally have some merit :)

On a more serious note, for workstations the ‘must have ECC RAM’ requirement feels like it might now be optional… I’d still want it in my servers that are up 24/7 though!

Like the others, I have heard figures of maybe 2 or 3 a year.

The thing is, to us they are not a problem and the volume is tolerable. To a highly accurate scientific project, it is absolutely critical that it never happens.

Edit: there will also be scale to consider. 16 or 32GB of RAM is a pretty small “surface area”, but in servers you can have terabytes across multiple rooms/buildings/locations that may be dependent on each other’s data, and the room for error increases.

I’ve seen certain environmental conditions cause memory errors (both correctable and uncorrectable by ECC). For example, a server located on site with an AM transmitter (the absolute WORST is an AM transmitter!) or a TV transmitter: something that transmits with a ton of power. Any strong EM field can cause an issue as well. Grounding usually mitigates this, with the exception of the AM tower scenario, which required an “exotic” setup coordinated with the RF engineer to get things all happy.

I’m glad the Google study was mentioned.

Beyond external factors, what about just unstable hardware? You can have a “slightly worse” motherboard, PSU, memory chips, etc., any of which is more likely (though not certain) to produce an error.

I’ve had a single bit irrecoverably flip in 16GB of DDR3 RAM after ~7 years of desktop usage (and I’ve run memtest86+ before to know it happened at a later date).

@merlino867 you’re waiting for a miracle with your approach: that the DRAM refresh somehow messes up and flips a bit to the LOW state. How about trying an approach like RowHammer?

That’s interesting, so you lost a byte of memory out of the 16GB forever? I am assuming it can’t just not use 1 bit but has to section off more than that.

// offtopic:

Single bit leads to a single byte which leads to a single page (4K), yes. I lost 4K of memory (yet most people would throw away the entire DRAM stick).
There are little-known workarounds for faulty memory: BadRAM on Linux and bcdedit page blacklisting on Windows. These allow you to continue using “unusable” RAM perfectly fine.
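
As a rough illustration of the bit-to-page step, this sketch turns a faulty physical address into the base of its 4K page and its page frame number, which is the kind of value page blacklisting works with. The address `0x12345678` and the function name `page_of` are made up for the example; the exact input format each tool expects should be checked in its own documentation.

```shell
#!/bin/sh
# $1 = physical address of the bad bit (hypothetical, e.g. from a memory tester)
# Prints the 4 KiB page base address and the page frame number.
page_of() {
    printf '0x%x 0x%x\n' "$(( $1 & ~0xfff ))" "$(( $1 >> 12 ))"
}

page_of 0x12345678   # prints: 0x12345000 0x12345
```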

I would say you would see 1 to 15 bit flips in that amount of memory in a year.
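
For a sense of what that estimate means for a 72-hour test, here is a rough sketch. It assumes flips arrive independently at a constant rate (a Poisson model, which is a simplification), and the function name `chance_pct` is made up for illustration.

```shell
#!/bin/sh
# Chance (in percent) of seeing at least one flip during a test window.
# $1 = flips per year, $2 = hours observed
chance_pct() {
    awk -v per_year="$1" -v hours="$2" 'BEGIN {
        lambda = per_year * hours / 8760          # expected flips in the window
        printf "%.1f\n", (1 - exp(-lambda)) * 100
    }'
}

chance_pct 1 72    # prints 0.8
chance_pct 15 72   # prints 11.6
```

Under those assumptions, a clean 72-hour run is the expected outcome even at the high end of the estimate.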

Slightly off topic but there is about to be a set of PCs sent to the ISS to test the stability of average hardware against the dangers of space like “cosmic rays” and see if software can deal with it or fix the errors.

So that should hopefully show up some worst-case scenarios when it is concluded, in 2-3 years.

Edit: forgot the link.

Try to warm it up (reduce airflow over the RAM, try to get it to 60-80 °C if you can).
Keep reading and writing over and over again for a few hours.
If your RAM is good it should probably survive memtest just fine; if either the controller or the RAM is flaky, you’ll see some bits flipped.
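
The repeated read/write idea could be sketched like this against the RAM disk from the first post. It is only an illustration and not a replacement for memtest: the function names `write_verify_pass` and `stress` and the `HASH_CMD` variable are made up, and the size in the example assumes the `rd_size=10000000` (KiB) device from above.

```shell
#!/bin/sh
# Hash tool to use; defaults to b3sum as earlier in the thread.
HASH_CMD="${HASH_CMD:-b3sum}"

# One pass: write SIZE bytes of fresh random data to TARGET while hashing
# the stream being written, then read the same bytes back and compare digests.
write_verify_pass() {
    wv_target="$1"; wv_size="$2"
    wv_written="$(head -c "$wv_size" /dev/urandom | tee "$wv_target" | "$HASH_CMD" | awk '{print $1}')"
    wv_read="$(head -c "$wv_size" "$wv_target" | "$HASH_CMD" | awk '{print $1}')"
    [ "$wv_written" = "$wv_read" ]
}

# Repeat the pass PASSES times and stop at the first mismatch.
stress() {
    st_target="$1"; st_size="$2"; st_passes="$3"
    st_i=1
    while [ "$st_i" -le "$st_passes" ]; do
        if ! write_verify_pass "$st_target" "$st_size"; then
            echo "pass $st_i: MISMATCH"
            return 1
        fi
        echo "pass $st_i: ok"
        st_i=$((st_i + 1))
    done
    return 0
}

# Example (as root): 100 passes over the whole 10000000 KiB RAM disk.
# stress /dev/ram0 $((10000000 * 1024)) 100
```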
