NAS - ECC worth 2x the cost?

Hi guys, so I have done some pricing for my new NAS and now it comes down to whether or not I want ECC.

Intel Build:
i5 13400 - £135
Motherboard - £86 (don’t really care, m-ATX)
32GB DDR5 (non-ECC) - £80

AMD Build:
7400F - £107
Asus B650 Board - £120
16GB DDR5 ECC Kingston RAM - £96
Arc A310 - £96 (Plex transcoding)

Now the NAS will literally just be a file store for Plex and maybe Nextcloud on UnRAID (XFS, because I have random-sized drives kicking around).

It comes to an almost £200 delta just for ECC (considering I am also buying an Intel GPU for better transcoding). In the UK the second-hand market for parts isn’t as good as in the States, so deals on server parts are rare.

The rest of the system will cost me around £778 for case, PSU, SSD and 2x 16TB drives.

Any words of wisdom from the community? I don’t think I will care too much if some video files get corrupted, but if I run Nextcloud with photos I might. Could it be worth exploring cloud backups instead of ECC?

Bearing in mind that another forum member noted to me that the AM5 platform isn’t as power-efficient as the Intel platform as it stands (unless you pick an APU, which needs to be a PRO variant in order to get ECC).

I’m willing to bet that you likely don’t want to go bottom of the barrel when it comes to motherboards.


AMD build motherboard pricing was just to pick something known to be compatible with ECC. From the rabbit hole of AM5 ECC compatibility - it’s worrying to buy a motherboard and RAM combo and have it not be compatible.

From my personal experience of running a 3rd gen i5 with 16GB of DDR3 as a home server for 6 years without ECC:

I can’t think of an issue that was ever caused by a bit flip in the RAM

So you can probably buy the non-ECC RAM now, and then 4 years from now, when prices come down for this type of memory, just buy a new set; I’m pretty sure there will be a lot of decommissioned servers by then.

How do you know? You can’t, because everything on a computer relies on the CPU for data; everything downstream just treats it as validated and correct and applies its own error correction on top of already-corrupt bits. Most stuff has some kind of ECC, but people skip the top of the chain, namely the CPU.

There are already a lot of decommissioned servers now. DDR4 is rather cheap as last-gen memory. Pointing to “in 4 years” is just kicking the can down the road.


DDR5 ODEC2 + rwCRC mostly exceeds EC4 SECDED. So it’s mainly about how you’d monitor and action EC4’s edge case SEC increase and DED logging as file data’s moving on and off the server.
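In practice that mostly means watching the corrected/uncorrected counters. A minimal sketch of what that could look like on a Linux box, assuming the kernel EDAC driver is loaded and exposes the usual sysfs counters (the alert threshold is just an illustrative number, not a recommendation):

```python
# Minimal sketch (not a turnkey tool): poll the Linux EDAC corrected/uncorrected
# error counters for each memory controller. Assumes the kernel EDAC driver is
# loaded; the alert threshold below is an arbitrary illustration.
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read_counts() -> dict[str, tuple[int, int]]:
    counts = {}
    for mc in sorted(EDAC_ROOT.glob("mc*")):
        if not (mc / "ce_count").exists():
            continue
        ce = int((mc / "ce_count").read_text())  # corrected (single-bit) errors
        ue = int((mc / "ue_count").read_text())  # uncorrected (detected-only) errors
        counts[mc.name] = (ce, ue)
    return counts

if __name__ == "__main__":
    for mc, (ce, ue) in read_counts().items():
        status = "OK"
        if ue > 0:
            status = "uncorrected errors - investigate / replace the DIMM"
        elif ce > 10:  # arbitrary example threshold
            status = "corrected errors accumulating - keep an eye on this DIMM"
        print(f"{mc}: ce={ce} ue={ue} -> {status}")
```

Run something like that from a cronjob and alert on it; the point is just that side-band ECC only pays off if someone is actually looking at the counters.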

This seems quite low in size and high in price; buying used off eBay will probably net you cheaper and/or more RAM. If you do go the ECC route, the one benefit I’ve heard of is that it’s easier to tell if you have a RAM stick dying. Not sure of the specifics, as I’ve never dealt with those OSes.

Depends how much you care about your data. Since you’re using XFS, your filesystem won’t be protecting your data from corruption. Probably don’t worry about ECC though; there are better ways of protecting your data. You don’t want to rely exclusively on ECC to save you - there are lots of ways you can lose data without cosmic rays (or a local madman with a radar) flipping your bits.

Personally, my order of priorities for data safety:

  1. real offsite backup - cloud, someone else’s house, whatever. if you don’t have proper backups, everything else is moot and it’s not data you care about.
  2. ZFS (or Btrfs, I guess) - you want a filesystem that knows if the data got broken (a rough file-level sketch of the same idea follows this list)
  3. local redundancy - e.g. a zpool with parity (raidz or mirror), so you can rebuild broken data easily and it’s not a big pain in the ass when you’re on holiday/crunch time at work/personally busy and suddenly your drive shits itself.
  4. ECC - nice to have, comfort in knowing the bits in memory have integrity as well as the bits on disk
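For illustration only (nothing in the list above requires it): a rough Python sketch of point 2 done by hand at the file level, for data sitting on a filesystem like XFS that doesn’t checksum for you. The share path is just an example.

```python
# Illustrative only: a file-level checksum manifest, i.e. "know if the data got
# broken" done by hand for a filesystem (like XFS) that doesn't checksum for you.
# The share path is an example, not a recommendation.
import hashlib
import json
from pathlib import Path

MANIFEST = Path("manifest.json")

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build(root: Path) -> None:
    # Record a digest for every file under root.
    manifest = {str(p): sha256(p) for p in root.rglob("*") if p.is_file()}
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def verify() -> None:
    # Re-hash everything and flag anything that changed or disappeared.
    manifest = json.loads(MANIFEST.read_text())
    for name, digest in manifest.items():
        p = Path(name)
        if not p.exists():
            print(f"MISSING  {name}")
        elif sha256(p) != digest:
            print(f"CHANGED  {name}")  # silent corruption, or a legitimate edit

if __name__ == "__main__":
    build(Path("/mnt/user/photos"))  # example Unraid share path
    verify()
```

It won’t fix anything for you (that’s what the backups and redundancy above are for), but it will at least tell you something broke.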

Either there is empirical evidence for something or there isn’t.

The home server works fine, I restart it every 8 months or something like that when doing proxmox updates.

The one time a ZFS scrub reported errors was when I first built this server, and it was because one of the disks dropped out due to a bad SATA cable (I had bought a pack of used ones for dirt cheap).

From this I infer that running a NAS without ECC memory on DDR3 is just fine. We can calculate the odds of a bit flip and then the odds of it occurring in a memory region that actually matters for the function of the device, but I think it’s pointless because the odds seem so low as to be of no concern, and we have at least one example of a system running non-ECC memory that didn’t experience any problems.
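To be concrete, here’s a back-of-envelope version of that calculation. The FIT rate and the “fraction of RAM that matters” are placeholder assumptions, not measured values; published field studies disagree by orders of magnitude, so treat the output as illustrative only.

```python
# Back-of-envelope only. The FIT rate (errors per 10^9 device-hours, here per
# Mbit) and the "fraction of RAM that matters" are placeholder assumptions;
# published field studies disagree by orders of magnitude.
import math

FIT_PER_MBIT = 100        # assumed error rate per Mbit
RAM_GB = 16               # memory in the box
CRITICAL_FRACTION = 0.25  # assumed share of RAM where a flip would actually matter
HOURS = 6 * 365 * 24      # six years of uptime, as in the anecdote above

mbits = RAM_GB * 8 * 1024
errors_per_hour = FIT_PER_MBIT * mbits / 1e9
expected_flips = errors_per_hour * HOURS
expected_harmful = expected_flips * CRITICAL_FRACTION

# Poisson: probability of at least one harmful flip over the whole period.
p_any_harmful = 1 - math.exp(-expected_harmful)

print(f"expected flips over {HOURS} h: {expected_flips:.2f}")
print(f"P(at least one flip that matters): {p_any_harmful:.2%}")
```

Plug in whatever FIT rate you believe in; the conclusion swings entirely on that assumption, which is rather the point.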

I find this fear of bit flips a bit ridiculous when it comes to NAS servers, which have their components shielded by a steel/aluminium case.

If you really believe RAM is fragile to the point that you cannot trust non-ECC memory to function correctly, then I advise you to:

  • Stop doing banking on your phone
  • Restart your phone every 4 hours to make sure there are no bit flips in critical areas

And the same can be said about using your laptop/desktop.

Also, I think ZFS has a debug mode where the in-memory data is checksummed as well, so if you don’t want to waste money but you’re feeling a bit paranoid, you can try that instead.

3-2-1 rule never gets old!

Storage software in general: block checksums, either on the physical disks (520-byte sectors, old school) or done in software. Basically all modern storage systems and CoW filesystems have had this since software-defined storage took off, so for ~10-12 years now, but legacy stuff is strong in people. We solved that decades ago.

Parity, redundancy, scrubs, snapshots… all there so you don’t need to halt operations and can continue without reaching for the backup.

In computing, it’s really a chain of data integrity, as data gets passed through many hands and ultimately ends up on storage or in /dev/null. If the CPU is sending wrong data downstream, that can’t be fixed by firmware or the filesystem. ECC memory ensures that you can trust the CPU on data integrity. It doesn’t help if the filesystem (or anything downstream) writes the correct data and a drive, cable or firmware messes it up. That’s why stuff like checksums is important, even though there are CRC checksums (ECC) with SATA/SAS and on NAND in play.
A full chain of trust without weak links. Going without ECC memory or without checksums is just introducing a weak link.
There was mirrored memory at some point. Today this isn’t really necessary, but I know mainframes have a really insane paranoia mode going, where 1-bit correction just isn’t good enough to ship to customers. Because a 1-in-10^22 chance will eventually happen given enough time.
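Just to put that in perspective, a rough expected-time calculation with assumed round numbers for the per-bit probability and the sustained memory throughput:

```python
# Assumed round numbers: per-bit undetected-error probability and a sustained
# memory throughput. The point is only that tiny per-bit odds stop being tiny
# once enough bits (or machines) are involved.
P_PER_BIT = 1e-22          # assumed chance of an undetected error per bit moved
BANDWIDTH_BYTES_S = 20e9   # assumed sustained throughput: 20 GB/s

bits_per_year = BANDWIDTH_BYTES_S * 8 * 3600 * 24 * 365
errors_per_year = P_PER_BIT * bits_per_year
years_to_one_error = 1 / errors_per_year

print(f"bits moved per machine per year: {bits_per_year:.2e}")
print(f"expected errors per machine per year: {errors_per_year:.2e}")
print(f"expected wait for one error on one machine: {years_to_one_error:.0f} years")
print(f"expected wait across a 100,000-machine fleet: "
      f"{years_to_one_error / 100_000 * 365:.1f} days")
```

On one box you’d wait millennia; across a fleet it’s a matter of days. That’s the difference in risk profile the mainframe people are designing for.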

You presented observational evidence, not empirical.

And the cases where ZFS didn’t report an error on faulty data because that was the data it got from the CPU before it was written? ZFS doesn’t know. It can only assure you that the data you read back is in fact the data you wrote previously. It doesn’t say anything about what happened earlier.
If the data was corrupt in the first place, ZFS isn’t responsible, won’t throw I/O errors, and nothing shows up on scrubs.

Observational. “I can’t see anything, therefore it doesn’t exist”

Every 10 days for me, with an auto-update script and a cronjob. Even before running a cluster, nothing was so important that I couldn’t run updates and reboot at night; no need to rack up uptime figures for no reason. I’ve seen too many servers at work that didn’t boot properly after running for months or years.

The primary causes of memory bitflips are 1) SDRAM density, hence DDR5 ODEC2, 2) DIMM failures, and 3) unavoidable trace radioactivity in DRAM packages. Putting a build in a metal case doesn’t protect against any of these.

To be a bit more specific than @Exard3k: all CPU cache levels are ECC-protected, PCIe is CRCed, and SATA is CRCed. DDR didn’t get write CRC until DDR4 or read CRC until DDR5, and didn’t have SEC by default until DDR5.

In general I agree with the point you’re making, but it also feels like you’re assuming DDR3 offers integrity comparable to non-EC4 DDR5. DDR4, DDR5, and in-band ECC are, in part, responses to observed DDR2 and DDR3 behavior at datacenter scale.

Seems to me the point here is mainly that homelabbing lacks datacenter failure-intercept probabilities and QoS requirements, and thus has a different risk-reward profile.

+1

The dataset of occurrences is certainly smaller. But that also translates to MTTF figures and AFR for drives. Stuff is less likely to show up if your dataset has just 3 entries instead of 100,000. The chance of it happening on a given machine is the same regardless; that’s just maths.

But people use mirrors, RAID5/parity schemes, erasure coding, the usual Reed-Solomon maths… for a rather unlikely thing to happen: a drive failure, on drives with 2,000,000-hour MTTF figures. With that risk profile, bit errors in memory aren’t that many orders of magnitude off. OK, a drive failure is a lot more bits going down, potentially all at once.
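For reference, this is the usual way an MTTF figure maps to an annualized failure rate under an exponential failure model, using the 2,000,000-hour number from above:

```python
# The usual MTTF -> AFR conversion under an exponential failure model,
# using the 2,000,000-hour figure quoted above.
import math

MTTF_HOURS = 2_000_000
HOURS_PER_YEAR = 8760

afr = 1 - math.exp(-HOURS_PER_YEAR / MTTF_HOURS)
print(f"AFR for a {MTTF_HOURS:,} h MTTF drive: {afr:.2%}")  # ~0.44%

# Chance that at least one of N drives fails within a year:
for n in (2, 4, 8):
    print(f"{n} drives: {1 - (1 - afr) ** n:.2%}")
```

So the ~0.4% per-drive, per-year risk people happily buy parity for isn’t infinitely far from the memory-error risk people wave away.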

It very much depends on your personal standards of a secured and trusted chain of data movement. I personally don’t buy stuff without ECC memory anymore. And I gladly pay for these guarantees, be it ECC memory, storage performance (checksumming) or PLP (get those sync writes out fast).

Non-ECC is an Intel thing that just doesn’t die (although AMD is, slowly, breaking a lance for ECC). Otherwise no one would ever use non-ECC; the freaking extra die doesn’t cost shit.

The reason why EU-DIMMs are more expensive is low sales volume (you can count the number of DIMM SKUs on both hands) and Intel/AMD wanting you to buy their more expensive stuff (“validated ECC, RDIMMs, go buy the fancy stuff”), either locking ECC out of budget platforms like Core for decades now (Intel), or being ambiguous and/or silent about support (AMD).

A few to start with:

edit: Another one from today, these things happen all the time, continuously: https://old.reddit.com/r/zfs/comments/1luffcq/ram_failed_borked_my_pool_on_mirrors/


That quote on Intel…priceless and so true. He doesn’t give a shit and isn’t diplomatic, because he doesn’t have to be nice to corporations, they have to be nice to him :wink:

I read The Register regularly (awesome quick IT news for professionals), but I had certainly missed that one.

I am not sure if I’m feeding the trolls or not; I think what I wrote was taken out of context, and I’m going to give the benefit of the doubt.

I was talking about this system in particular, please consider the context of a discussion.

I have no idea how many PBs of RAM there are out in the world; if you roll a die enough times, you’ll get your number eventually. With this attitude you might as well conclude that buying a lot of lottery tickets is a sure way to become rich.
If you disagree with this statement, then please enlighten me as to why insurance companies (assuming they are fair) can actually function and be profitable.

I am not saying that if you have a warehouse’s worth of servers, where you are very likely to be sued for damages if something goes wrong, you should use non-ECC DDR3.

We are talking about a single NAS server.

Yes, I am aware of it.

Like you said, density matters. I’m making the assumption that DDR3 is made on a larger lithography node, so there are more atoms representing a single bit, meaning a higher energy level is needed to flip its state.
So under this assumption, non-ECC DDR3 is less vulnerable than non-ECC DDR4.

Also, AFAIK error correction (on-die ECC) was (finally) made a standard feature in DDR5 out of necessity - newer process nodes made the cells too unreliable to run without any error detection and correction.

Yes, I am talking in the context of homelabbing, because this thread is in the context of homelabbing - building a single NAS server for home use. Or so I think; correct me if I am wrong and OP actually wants to build a warehouse-spanning Ceph cluster to rival Backblaze, in which case they should probably invest the extra £200 in that RAM.

I don’t think the Higgs boson is real; it was just observational evidence with a computational prediction.

Do you see how fast this particular line of arguing can get ridiculous?

There are unicorns at the North Pole. Just because you, and everyone I asked about them, didn’t see them is not conclusive proof that they don’t exist.

You’re correct on that one. Well reasoned. I’m not convinced on the other things you mentioned.

Having some experience with Ceph, I can tell you that Ceph is probably one of the least ECC-memory-dependent storage systems you can have, because high availability and clustering come with basically redundant memory as well. Ceph not only doesn’t trust the drives, it doesn’t trust entire nodes or entire racks (in case you really want to rival Backblaze).
If there is a memory error and corresponding writes on node A, it doesn’t matter, because it gets picked up by scrub and the data+checksum from the other nodes. So a 3-node cluster is basically a 3-way memory mirror for practical purposes, certainly not by a technical definition.

I know you wanted to emphasize that a single error is much more harmful and potentially costly on a big machine than on your average homelab NAS+stuff. I think this is true in absolute terms, but any amount of lost revenue (probably covered by some insurance, thanks to a thousand certifications) can’t be compared to very personal and potentially irreplaceable data at home.

I know we’re far away from the actual OP and the use case. But the thread is also called “ECC worth 2x the cost?” It is for me. It isn’t 2x the cost of the system, just 2x for the modules (comparing 5600MT/s 32GB DIMMs), though DRAM pricing varies a lot. It was close to ~1.25x at the end of DDR4. DDR5 volumes and maturity certainly brought this back to obscene EU-DIMM boutique pricing :confused:
But with actual OEM-built EPYC 4005 and W880 systems (which are still niche platforms), I see ECC UDIMM volumes picking up and pricing improving, much more than in DDR4 days.
So the price argument should get smaller and smaller in the coming years. And it’s basically the only argument against ECC - one ultimately based on consumer marketing, certainly not on production cost, which is probably a matter of cents.