ECC Everywhere

Okay, I don’t use it at home, because at the end of the day, pics of my non-existent cat aren’t that valuable.
But surely it should be used throughout an enterprise environment, from desktops through storage systems to that one machine controlling the Big Industrial Plant machine?

So it would be better for me to have it on my main PC, backup machine, TV recording NUC etc., but like I posted, I don’t really have anything I can’t lose.
I would be sad if lightning fried everything, but I’d get over it more easily than a company would get over losing an employee to a random robot flipping bits and going all Terminator (until the singularity causes all robots to start culling us worthless meat bags).

If you’re concerned about that, get a surge protecting UPS or make offline backups.


ECC is not bad to have, but I don’t think it’s beneficial to anyone who’s not going to lose tons of money for a bit flip. Keep in mind that ECC-correctable errors are only bit flips. If memory is damaged, ECC can’t protect you.

3 Likes

Okay, bad example.
Should have kept it simple, none of the data I have is very critical…

1 Like

For consumer stuff, I wouldn’t recommend it. You won’t notice a difference with or without.

Notice there are almost no consumer grade offerings for ECC.

1 Like

Actually I only have a surge protector- I should really look into getting one of those UPS batteries that send a kill signal over USB when the power goes.
Never bothered because the power here in south London has surprisingly been really good the last decade.
Now I have to go knock on wood for jinxing myself…

1 Like

They’re nice. If your concern is lightning, the surge protector will do its job. Just replace it after the strike.

RE: a UPS, they actually only provide an interface to read the device’s stats over USB. It’s up to your computer to shut itself down from software when it reads that the UPS is on battery.
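
A rough sketch of that software side, assuming Network UPS Tools style status flags (`OL` for on line power, `OB` for on battery, `LB` for low battery). The function name and the shutdown policy are made up for illustration, and actually reading the status from the device is left out:

```python
def should_shut_down(status: str, shutdown_on_battery: bool = False) -> bool:
    """Decide whether the host should power off, given a UPS status string.

    Assumes space-separated NUT-style flags such as "OL", "OB" or "OB LB".
    """
    flags = set(status.split())
    if "OB" not in flags:
        return False              # still on mains power
    if "LB" in flags:
        return True               # on battery and the battery is nearly empty
    return shutdown_on_battery    # on battery, charge remains: policy decision


for s in ("OL", "OB", "OB LB"):
    print(s, should_shut_down(s))
```

A monitoring daemon would poll the UPS every few seconds and run the OS shutdown command once this returns true.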

1 Like

Oh, well that sounds a lot less useful.
Thanks for dashing my grand dreams of a PC killing Uber battery… :slight_smile:

1 Like

There is an excellent article on the subject:

https://blog.codinghorror.com/to-ecc-or-not-to-ecc/

You should obviously use ECC memory in critical systems, but aside from that I think it’s overvalued. A webserver dropping a few HTTP connections once a year just isn’t worth the trouble.

If you have lots of machines and relatively few humans to maintain them, ECC is totally worth it. It doesn’t add much to the cost of each machine.


Someone should invent ECC as a CPU feature rather than a RAM feature, make all RAM the same, and give you the choice to sacrifice some RAM capacity for error correction done by the CPU.

Depends on the specific circumstances. When running a single webserver that does all the serving, sure, go install ECC. But that qualifies as a critical system.

If, however, you are running a service which starts and stops dozens of identical servers depending on system load, every single one is replaceable. If one server goes down, the next HTTP requests will just be handled by another node.

A UPS is probably much more useful in a non-commercial environment than ECC.

One example. If I state this wrong, please correct me. If a write to disk is in progress, or is buffered in memory, and the power to the PC is cut, the data in memory is lost. The only information written to disk is what occurred before the power was lost. This can cause huge problems with RAID arrays and ZFS.

With a UPS, these write operations can be finished before the OS shuts the machine down.

ECC is not needed outside of mission critical operations. My .02.

EDIT: clarity and spelling

Modern file systems take power outages into account. ext4 keeps track of planned operations in a journal so it can recover after crashes. ZFS uses copy-on-write and thus never overwrites valid data. In case of a power outage, writes that were in flight may be lost, but no existing data is corrupted.

No protection is 100% of course.
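
The "never overwrite valid data" idea can be imitated at the application level. This is only a minimal sketch, not how ZFS itself is implemented: write the new version to a temporary file, flush it to disk, then swap it in with an atomic rename. The helper name `atomic_write` is hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` with `data` without touching the old copy
    until the new one is fully written."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())    # ask the OS to push the bytes to disk
        os.replace(tmp_path, path)    # atomic swap: readers see old or new, never a mix
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)       # clean up the half-written temp file
        raise
```

If the power dies before the rename, the old file is still intact; if it dies after, the new file is complete.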

2 Likes

Very good to know. Thank you for the information.

You still need the extra memory chip though, to store the check bits for the RAM. ECC effectively uses 9 bits per byte (a typical ECC DIMM is 72 bits wide for every 64 bits of data).
There’s no simple way to do that in software via the CPU, because it’s a physical storage constraint of ECC and of how the data is read out with the signalling scheme RAM uses. The extra bits have no address of their own, which makes it really hard to target them from software.

That’s because RAM is byte-addressable. Each address identifies a single byte (eight bits) of storage. With ECC the check bits are verified and stripped off in hardware when the address is read (on standard ECC DIMMs by the memory controller), before software ever sees the data. So it’s a bit tricky.
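
To show what the extra bits buy you, here is a toy Hamming(7,4) code in Python. Real ECC memory uses a wider code over 64-bit words, so treat this only as a small-scale illustration of the principle; the function names are made up.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into 7 bits (positions 1..7, parity at 1, 2 and 4)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4    # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4    # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4    # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code_bits):
    """Return (data_bits, syndrome); a single flipped bit is corrected.

    A syndrome of 0 means no error was seen, otherwise it is the
    1-based position of the bit that was flipped back.
    """
    c = list(code_bits)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1    # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[5] ^= 1                   # simulate a bit flip in "memory"
recovered, position = hamming74_decode(stored)
print(recovered == word, position)
```

Note that three check bits protect only four data bits here; the overhead shrinks as the word gets wider, which is why the 64-bit case gets away with 8 extra bits.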

CPU cache SRAM is typically ECC or parity protected, btw :wink:

But you cannot do error checking in software unless you just store values 2 or 3 or x times in different locations.
Which is basically RAID for RAM also known as mirrored mode RAM, often used on ultra fault tolerant multiprocessor systems.
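
A minimal sketch of the "store it x times" approach, assuming three copies and a bitwise majority vote. This is plain triple redundancy for illustration, not how any particular mirrored-memory hardware works, and `majority3` is a made-up name:

```python
def majority3(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: each result bit is whatever at least two copies agree on."""
    return (a & b) | (a & c) | (b & c)


original = 0b10110010
copy_a = original ^ 0b00000001   # one copy takes a bit flip in bit 0
copy_b = original ^ 0b00001000   # another copy takes a flip in bit 3
copy_c = original                # third copy is intact

print(majority3(copy_a, copy_b, copy_c) == original)
```

The vote survives flips in different bit positions of different copies, at the price of using three times the memory.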

2 Likes

So like credit card real-time fraud detection?

ECC RAM is undervalued by those who don’t ‘need’ it, and overvalued by Suits practicing market segmentation. It’s like asking if you ‘need’ tire pressure monitors in your car. Sure, we lived without them for decades, but now that they are here the value they provide is pretty meaningful (as long as you aren’t paying the Dealership tax when you need to replace the things). In my opinion ECC should be something that is default and taken for granted, rather than having its true value endlessly debated.
I love having it on my personal machine, but then again I’m a baby datahoarder with a ZFS array, and I do stuff in VMs rather than on bare metal. In my case, due to the current RAM price bullshit, used basic ECC RAM was basically the same price as basic non-ECC stuff, so I went with it.

Sure there are a few downsides.
-A now small cost premium. It used to be a lot more, but over the years the premium has fallen so that it is typically not much larger than the cost of the extra chip they put on the stick.
-No super extreme gaming bins available.
-Need a motherboard and CPU that makes use of it (Thanks AMD and asrock!)

Funny thing is, when used with X370 and X399 motherboards, you can actually overclock it pretty nicely if you get B-die based stuff. My RAM is Super Talent F24EA8GS 8GB single rank, rated for 2400 MHz. It (and the 16GB dual rank version) easily overclocks rock solid to 2966 with timings around 14-15-15-14. Because I’m a masochist and hate weekends, I was even recently able to squeeze 3200 out of it after fucking around with procODT (higher isn’t better!) and moving the memory hole with CLDO_VDDP on my Taichi board. Being able to see corrected errors roll in makes the whole process a lot easier. Expect to see goofy ECC RAM with heatsinks and LEDs in the future.

My point is, ECC ram is literally just ram with an extra chip on it. It really shouldn’t be this big stupid deal.

As far as enterprise stuff goes, the answer to your question is another question: “Is anything relying on this computer to not crash?” If the answer is yes, then you need ECC. You also need to remember that ECC only solves one potential failure point.
You should be asking about your automated backup and recovery plan first and foremost.

1 Like

ECC is a RAM feature and will always be a RAM feature - because to work as ECC memory, the RAM needs additional cells in it for parity.

This isn’t something the CPU can do by itself.

Nah, more likely stuff like pornhub.

(only half joking)

To say it would be “a bit tricky” would be an understatement.

I am aware of how ecc works, that’s why I’ve been saying it needs to evolve.

Imagine you had 1GB of RAM. Imagine just having a CPU register that you could flip at runtime, which all of a sudden caused L3 cache evictions to addresses 1K-2K to also cause parity writes to 128 bytes around the 910th megabyte in memory.

Your OS kernel would just know not to use that area.

Point reads would require 2 reads (data + parity block) but potentially no extra latency; point writes would require lots of reads.

You’d lose some throughput, and gain some latency, but check this out.

Imagine being able to set this per app, and being able to manage error correction per process by manipulating PMTs.
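
As a software toy version of that layout (not the hardware mechanism itself), here is a sketch where 1024 data bytes are backed by a reserved 128-byte parity region, one parity bit per data byte. It only detects single-bit flips and does not correct them, and every name in it is hypothetical:

```python
class ParityProtectedMemory:
    """Flat byte array with a reserved parity region after the data area."""

    def __init__(self, data_size: int = 1024):
        self.data_size = data_size
        self.parity_base = data_size                   # parity region starts after the data
        self.mem = bytearray(data_size + data_size // 8)

    @staticmethod
    def _parity(byte: int) -> int:
        return bin(byte).count("1") & 1                # 1 if the byte has an odd number of set bits

    def _slot(self, addr: int):
        if not 0 <= addr < self.data_size:
            raise IndexError("address outside the data area")
        return self.parity_base + addr // 8, addr % 8  # parity byte and bit within it

    def write(self, addr: int, value: int) -> None:
        slot, bit = self._slot(addr)
        self.mem[addr] = value
        # Read-modify-write of the parity byte: the extra traffic a point write costs.
        cleared = self.mem[slot] & ~(1 << bit) & 0xFF
        self.mem[slot] = cleared | (self._parity(value) << bit)

    def read(self, addr: int) -> int:
        slot, bit = self._slot(addr)
        value = self.mem[addr]                         # first read: the data
        stored = (self.mem[slot] >> bit) & 1           # second read: the parity
        if stored != self._parity(value):
            raise RuntimeError(f"parity error at address {addr}")
        return value


ram = ParityProtectedMemory()
ram.write(100, 0xA5)
print(ram.read(100) == 0xA5, len(ram.mem))
```

Swapping the parity bit for a proper correcting code would cost more reserved space but would let the read path repair the data instead of just raising.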

2 Likes