Anyone have a bad story from not having ECC RAM?

The worst I've had or heard of is mostly just things crashing. The family photos weren't wiped, the online bank transfer to pay the plumber didn't get an exponent bit flipped and leave you homeless, and the fall-through return false in your auth code didn't become a return true.

I’ve had a singular work incident where a bit flipped and bad things happened, but it wasn’t that bad.

Our phones, consumer routers, laptops, personal computers, and almost all work computers lack ECC.

But the world ticks on mostly fine, the hardware mostly works.

It is my experience that far greater concern should be paid to software. The software I use has so many more bugs than my RAM has bit flips that I don't even think about my RAM when a program does something weird.

The tech giants of the world, from Amazon to Google, develop and operate their giant money-printing clouds from… crappy ECC-less laptops.

I do wish I could get an ECC system for work, as it would likely come with a Threadripper CPU. But there's no way my boss is buying that; they'd think I had a screw loose if I started talking about bit errors in RAM impacting our business.

I got to thinking about this topic from the numerous ZFS + ECC threads. And about how it seems funny to care about ECC at that point, when nothing else in the house creating or moving the data about has ECC.

7 Likes

Let me start by acknowledging that the risk of bit flips in RAM is real. ECC is the technology to guard against it.

That said, I think your thoughts about the need for ECC are appropriate. It's a matter of assessing the risk and the need for mitigation appropriately.

Yes, but I am almost certain that all of their cloud infrastructure is using ECC memory. The reputational risk of documented failure would be too big to skimp on ECC here.

I am very skeptical that the use of ECC in home setups has a positive risk/reward ratio, even when running a home lab that operates 24/7. But on the other hand, if the technology is available, why not use it?
In my case, I cannot imagine a memory bit flip causing a failure that would significantly impact my setup. It could conceivably lead to inconvenience, but not enough to justify designing a home setup around the requirement of ECC memory.

7 Likes

A company I did some consulting work for had a pretty neat (for its time) access control system that was responsible for both physical access to the building and remote access to the servers. Folks had IDs connected to keycards that they could swipe/scan to get through doors. The IT guys used their ID numbers for remote access.

Everyone went home one Friday afternoon. Nothing unusual seemed to happen over the weekend. Staff couldn't enter the building on Monday morning. All keycards failed to validate. IT staff couldn't log in remotely. IDs failed to authenticate. The two physical keys that could be used to override the locks were hanging on a lanyard in the reception area… INSIDE the building.

Locksmith arrived after lunch. Half a day of productivity lost for about 20 staff in total. Financial cost was about $2,000 in wages plus probably $2,500 in lost production… so around $4,500 in total. (Inflation-adjusted, probably $7k in 2024.) The always-running process that validated IDs had simply hung. IT killed it, the process respawned, and everything went back to normal.

Not financially catastrophic by any means, but far more expensive than it would have cost to outfit that server with ECC memory.

Personally, I’ve had complex and long-running simulations abort, or produce clearly corrupt results at the end of the run. No financial losses, but it sucks to have days of compute time flushed down the toilet and be forced to run it all again. I think I probably lost three solid weeks of compute time, over multiple incidents, before I worked out what was going on, switched compute to hardware with ECC support, and never had a similar problem since.

10 Likes

Thanks for sharing!

I personally have never worked with data critical enough to warrant ECC. That does not mean ECC is useless, just that if a platform with ECC support costs 30-50% more than one without, it becomes a question of whether ECC is worth it. In most cases, no, it is not worth paying a premium for, but it is really nice to have, and if you do anything data-heavy (scientific computation, bitcoin mining, AI training and so on), it helps more than you think.

And by "worth it" I mean that what it saves you is not really going to offset what it costs, but that does not make the tech worthless; if you want the peace of mind it brings you, then yes, do get it. A bit like whether you should lock in a higher fixed rate on a mortgage or stay on a potentially lower variable rate: most of the time locking in is not worth it, but I for one like the peace of mind it gives. Neither option is wrong.

All things being equal, you should always have ECC over non-ECC, and yes, I do support ECC becoming standard across all modern computing platforms. It is 2024, the tech is cheap. If anyone from AMD or Intel is listening: please, could you PLEASE start mandating it on Ryzen and Core CPUs?

This could be a real selling point between, say, X670 and B650, or Z790 vs B760. If I could be certain that any motherboard with an X670 or Z690 chipset supports ECC, that would be great.

Intel is kinda trying with their W680 series, but that has a bunch of other compromises that make it less than ideal.

3 Likes

I second that.

If Intel were really trying, they would enable ECC support across all platforms. I don't buy it. Same for AMD.
Both see ECC as a differentiating feature, and both require their customers to pay way more than the added technology cost (yes, I understand that ECC is just one of many features that differentiate, e.g., a W680-based system from a consumer one).
Today, ECC is seen as a feature that clearly separates market segments. Products are packaged and priced accordingly.
Nothing that couldn't change, but it's unlikely to happen.

2 Likes

What's the worst that can happen? Can you imagine Trump/Biden winning instead of the other one because of a bit flip?

It literally happened in an election in the EU.

4 Likes

Yeah, but isn't the takeaway of that video essentially that most of the time these bit flips occur without anybody noticing? It's just the odd time that it hits the wrong bit.

Love me some Veritasium :canada:

2 Likes

Sanity checks in software seem to be the exception rather than the norm.
Spawning a little watchdog process that periodically checks whether "main.process" is still alive and well must be absolutely impossible at any scale larger than me tinkering about at home…
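For what it's worth, a toy version of that watchdog is only a dozen lines of Python. The worker command below (main_process.py) is made up for illustration; a real one would also want logging, restart back-off, and an actual health probe rather than just "is the PID still running":

```python
#!/usr/bin/env python3
"""Toy watchdog: keep a worker process alive and respawn it if it dies."""
import subprocess
import time

WORKER_CMD = ["python3", "main_process.py"]  # hypothetical long-running service
CHECK_INTERVAL = 30                          # seconds between liveness checks


def spawn_worker() -> subprocess.Popen:
    print("watchdog: starting worker")
    return subprocess.Popen(WORKER_CMD)


def main() -> None:
    worker = spawn_worker()
    while True:
        time.sleep(CHECK_INTERVAL)
        # poll() returns None while the process is still running,
        # and its exit code once it has terminated.
        if worker.poll() is not None:
            print(f"watchdog: worker exited with code {worker.returncode}, respawning")
            worker = spawn_worker()


if __name__ == "__main__":
    main()
```

Note this only catches a worker that has actually exited. A hung-but-alive process, like the ID validator in the story above, would need the watchdog to exercise the service (say, attempt a test validation) instead of just polling the PID.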

This.
Without ECC, you have no idea how bad things are.
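With ECC you at least get counters. On a Linux box with an EDAC driver loaded for the memory controller (an assumption; not every platform exposes this), a minimal sketch like this will show how many errors have been corrected behind your back:

```python
from pathlib import Path

# EDAC exposes per-memory-controller error counters under sysfs
# (requires a Linux kernel with the appropriate EDAC driver loaded).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

for mc in sorted(EDAC_ROOT.glob("mc*")):
    ce = (mc / "ce_count").read_text().strip()  # corrected errors
    ue = (mc / "ue_count").read_text().strip()  # uncorrectable errors
    print(f"{mc.name}: corrected={ce} uncorrectable={ue}")
```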

3 Likes

It is not readily available on consumer motherboards.
I think everyone agrees we need ECC memory on the server stuff. On the consumer stuff, it is another story.

2 Likes

For me, those examples are more likely bit flipping being used to cover up bad coding.
For example, allocating a variable without initializing it to 0, or accessing an array element out of bounds. In the old days, that kind of stuff was pretty common.

ECC used to be standard for PCs due to the poor reliability of early commodity RAM chips.

Say you have a few TB of backed-up DVDs, home movies, and images stored on a "server" built out of old desktop gear. You have assembled a few low-power computers running XBMC at all the entertainment centers with TVs around your house, and you have fancy WMC-compatible remotes so you can sit on your couch and select entertainment.

Say you are watching said entertainment and notice an odd stutter in some media that you don't recall seeing before, but it is 2006 and, all things considered, it could be pretty much anything.

After about a year, these weird glitches have made most media on said system very unreliable, sometimes even freezing playback. But ZFS scrubs do not report any actual problems and SMART shows no issues. Then you wake up one day to discover ALL the media shows as online but is not viewable, and suddenly ZFS reports millions of file inconsistencies out of nowhere. "WHAT COULD HAVE HAPPENED?" you scream, while all the voices in the house come asking why their TVs don't work.

Yeah, so I am in the "if it runs ZFS it will have ECC" camp. A SLOWLY failing stick of RAM without ECC CAN cause havoc with ANY data, even on ZFS.
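To spell out why ECC still matters under ZFS: the checksum is computed over whatever is sitting in RAM at write time, so a bit that flips before the checksum is calculated gets written out with a perfectly "valid" checksum, and scrubs will never flag it. A toy simulation of that failure mode (plain Python and SHA-256 standing in for ZFS's real pipeline, which this obviously is not):

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Pretend this is a block of your media file sitting in RAM on its way to the pool.
original = bytearray(b"frame data from your home movie ...")

# A failing, non-ECC DIMM flips one bit *before* the filesystem
# checksums and writes the block.
corrupted = bytearray(original)
corrupted[5] ^= 0x08  # single bit flip

stored_checksum = checksum(bytes(corrupted))  # checksum of the already-bad data
written_block = bytes(corrupted)              # ...which is what hits the disk

# Later, a scrub re-reads the block and verifies it:
assert checksum(written_block) == stored_checksum
print("scrub says: block is fine")            # but the frame is still corrupt
```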

3 Likes

The "best C++ random number generator" in the late 90s was a function that would declare a variable and read from it before ever assigning it anything. Basically, it just read a section of RAM at the requested length and said "OK, here is that random number you requested." Oh, the days before protected application memory space. You say buffer overflow, I say application reset feature.

1 Like

While I agree in principle that silent bitrot like this is possible, it is exceedingly unlikely that all your files get corrupted but your system otherwise runs fine. Especially on cold storage that doesn’t get written to much.

This system would be crashing all the time.

More likely is some stuff gets corrupted, and the system crashes every now and then.

1 Like

The hundreds of variables and extra pieces of information I could have put into that story to answer all your questions would have made it just shy of a standard novel. The point of the story is: yes, bad RAM can slowly corrupt a data drive.

I wonder if that would be a good-selling sub-genre of horror: tech stories and things that keep admins up at night…

I am going to guess no, probably not.

How, if the data drive is mounted read-only?

Perhaps we will never know how many errors throughout history were caused by memory bit flips and just attributed to other issues.

4 Likes

I missed how that relates to anything. Sure, mount something with write access disabled and you avoid having things write bad data back to it.

There are other forms of degradation that could occur, but per this topic, it is about ECC RAM, and thus the potential for writing corrupted data.

Having ECC, and knowing it was something else that led to undefined behavior in your program, is tremendously valuable in an engineering context. In huge codebases, I think some stuff gets chalked up to bit flipping that was actually some other super-rare edge case that is solvable.

2 Likes

What you've said really puzzles me. How could a catastrophic event like that happen if the system that's serving the files is not crashing?
Accessing a file doesn't mean the program accessing it can corrupt it, as long as the system doesn't have to write anything back to the file itself. I think you're mixing up RAM failures, which lead to a general system failure, with storage failures in this case.

Ideally, opening a file just copies chunks at a time into the RAM the OS has allocated to the software that needs to access the file, and that's it. There's no writing to the disk that could corrupt the files themselves. What could happen is corruption of the file system underneath the files, because that is constantly accessed by the OS, and multiple uncorrected errors can propagate into it.
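A minimal sketch of that point, with a made-up path: a read-only open copies bytes into memory, a bit flip there only damages the in-memory copy, and the process has no way to push the corrupted copy back to disk through that descriptor:

```python
import os

fd = os.open("/tank/media/movie.mkv", os.O_RDONLY)  # hypothetical media file
chunk = bytearray(os.read(fd, 1 << 20))             # copy a 1 MiB chunk into RAM

chunk[0] ^= 0x01  # a bit flip in RAM corrupts only this in-memory copy

try:
    os.write(fd, chunk)          # rejected: the descriptor is read-only
except OSError as exc:
    print("write rejected:", exc)

os.close(fd)
```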

Also, in my experience, a slowly failing stick of RAM is not gonna just make the system stutter and then turn your media into a bunch of scrambled data. The OS is always gonna find out first that a component is failing, because it "fucks around" with it constantly.

1 Like