ECC Everywhere

If I'm interpreting ECC server memory correctly:

At this point I would rather have ECC than not. I guess it comes down to cost, and how the ECC works. And speed, which has stagnated thanks to GFX cards not getting updated.

Having your machine stop and hard-lock because a bit flipped and could not be corrected while playing Skyrim, and some graphics asset was no longer CRC-correct, could get annoying if it happened a lot.

While losing data to the same thing once a quarter would also be annoying.

Hence we advocate FreeNAS and ECC memory for machines with barely any home-network load because, data loss. At the same time we happily crunch massive amounts of data on quad-core 5 GHz+ overclocked PCs burning several hundred watts of power with no ECC.

I’d like ECC memory to catch bit errors, but not have the server hard-stop when an unfixable error occurs. Only record it so I can monitor it. But when your RAM is known to be corrupt, should the kernel keep going?
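For the “only record it” part, a minimal sketch of what that monitoring can look like today, assuming a Linux box with the EDAC driver loaded (the kernel already exposes per-memory-controller error counters in sysfs):

```python
# Read the corrected/uncorrected ECC error counters the Linux EDAC driver
# exposes per memory controller. Assumes EDAC is loaded; paths won't exist
# otherwise.
import glob
import pathlib

for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
    mc_path = pathlib.Path(mc)
    ce = (mc_path / "ce_count").read_text().strip()  # corrected errors
    ue = (mc_path / "ue_count").read_text().strip()  # uncorrected errors
    print(f"{mc_path.name}: corrected={ce} uncorrected={ue}")

# A rising corrected-error count is the early warning; an uncorrected error
# is the case being argued about here (panic vs. kill the affected app).
```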

I’m led to believe some of the latest attacks (Rowhammer, was it?) aggressively hammer memory to flip bits and gain access.

Bitflips really don’t happen all that often.

The rate of bitflips rises directly with higher temperatures and more reads/writes.

Yes, not all RAM gets corrupted at the same time, and it makes sense for the kernel to try to save as much of the rest of the system as possible, e.g. kill the affected app.

If the corruption is in the kernel, no, it should not keep going.

If the corruption is in the code that’s supposed to handle corruption :slight_smile: … well, too bad. (This is because there’s no way to “mlock” pages into the CPU cache on x86_64… coreboot people would love to have that feature, since CAR is kind of annoying, and I’m sure memory manufacturers would love it too.)

Worth reading if you want to understand the types of ECC RAM and what it does.

As far as consumer kit goes, the manufacturing processes and material quality for decent branded RAM have improved so much that the likelihood of bitflips or RAM failures/errors in general has diminished massively since the 1990s. We’re almost at the stage where bitflips are caused by cosmic radiation rather than by poor-quality RAM chips, motherboard, or CPU components…

Puget Systems has some good blog posts on this with data:


What’s the advantage over just doing ECC in hardware on the memory? Why should I care about or want to manage this “per app” rather than just relying on the hardware to do its job properly?

Having the kernel plow ahead even when non-kernel memory is corrupted is IMHO a bad idea, as non-kernel memory could hold parameters that will be passed to kernel functions to do potentially destructive things. E.g., a command line for file deletion gets corrupted and then passed to the kernel… etc.

2c, but any uncorrectable memory corruption should cause a hard panic. It’s the only way to be safe.

edit:
Whether bit-flips are caused by cosmic radiation or manufacturing problems or whatever doesn’t really matter. If they happen at all and you care about integrity (enough to spend the money to deal with it), then you want to detect and ideally correct them. The why doesn’t matter.

I am actually suggesting that this happens in hardware, but that it happens exclusively in the cache/memory controller on the CPU, and that it be controllable by the user/OS/developer. Specifically, my idea was that all RAM should become the same (i.e. non-ECC RAM), and you only care about the particulars of your CPU implementation.

Here’s my thinking why: typically ECC uses about 10% extra silicon to detect and correct bitflips that happen once in ~never (a standard ECC DIMM stores 72 bits for every 64 data bits, i.e. 12.5% overhead). It’s a flat cost regardless of how RAM is used or whether it’s necessary in all cases.
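To show where that overhead figure comes from, here’s a small sketch (my own illustration, not from anyone’s post) of the check-bit count for a Hamming SECDED code at different data-word widths:

```python
# Check bits for Hamming SECDED: smallest r with 2**r >= data_bits + r + 1
# (single-error correction), plus one extra parity bit for double-error
# detection.
def secded_check_bits(data_bits: int) -> int:
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

for width in (8, 32, 64, 128, 256):
    c = secded_check_bits(width)
    print(f"{width:>3} data bits -> {c} check bits ({c / width:.1%} overhead)")

# 64 data bits -> 8 check bits (12.5% overhead), the familiar 72-bit ECC DIMM.
# Wider protection units amortize the cost, which is part of the argument for
# doing this on the controller rather than per DIMM.
```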

At the moment, a large amount of RAM worldwide is spent on very few apps that serve as various caches: things like memcached, non-persistent Redis deployments, and non-executable page caches that databases use to fit their indices. They can easily do their own error detection much more efficiently, and can usually afford to throw away a cache entry if they detect corruption. Many databases already compute checksums of pages when reading/writing data to disk, for example; the checksum values already exist.
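A minimal sketch of what “do your own error detection and throw away the entry” means for a cache (my own illustration, not how memcached or Redis actually do it): keep a CRC next to each value and treat a mismatch on read as a miss.

```python
# Application-level error detection for a cache: store a CRC32 alongside each
# value; if it no longer matches on read, evict the entry and report a miss
# instead of hard-stopping anything.
import zlib
from typing import Optional


class ChecksummedCache:
    def __init__(self) -> None:
        self._store: dict[str, tuple[int, bytes]] = {}

    def put(self, key: str, value: bytes) -> None:
        self._store[key] = (zlib.crc32(value), value)

    def get(self, key: str) -> Optional[bytes]:
        entry = self._store.get(key)
        if entry is None:
            return None
        crc, value = entry
        if zlib.crc32(value) != crc:
            # Corruption detected: drop the entry and let the caller refetch.
            del self._store[key]
            return None
        return value


cache = ChecksummedCache()
cache.put("user:42", b"some cached payload")
print(cache.get("user:42"))
```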

From the perspective of who could benefit: imagine you’re already deploying exclusively ECC RAM in your datacenter or cloud; what if you got ~5% extra RAM for “free”?

… or in a residential setting, what if you could get all the benefits of ECC at 1% of the cost instead of 10%?

Bottom line: commodity ECC hasn’t fundamentally changed in years. I think there’s unexplored potential worth thinking about here; even if it only saved 0.5% of global RAM in perpetuity, it would probably be worth exploring the options.

To be clear, in this scenario the app should be kill -9-ed (or whatever the OS equivalent is), and this randomness in a parameter should only affect that particular user; other users on the system could still be fine. Also, there are still going to be software bugs, and ECC is not a silver bullet: a process doing something undesirable will happen anyway with or without ECC, and some isolation would be nice.

ECC memory is also of much higher quality than what’s available to consumers: not for overclocking, but for reliability. You can be 100% sure that any you get will work just fine, even used.
