Anyone have a bad story from not having ECC RAM?

Cool, sure, you guys sure can assume a lot.

Did I say the host was perfectly stable and ONLY ZFS was affected?

I think you are assuming the ONLY thing happening is that the server was handling a read request.

ideally

No, but the system running for 1 year while doing tons of WRITING and reading to that file system would have. What you should have asked is “well, how often was the system configured to do ZFS scrubs?” or some such ZFS question.

Or even a question about what else the server was doing would offer some more clues, if you were really curious, as opposed to creating an entire scenario based on a pinhole view of the data.

It was an example of how ECC would have saved some headache. NOT that it could not have been discovered in any other way or that there were NO other signs.


The read-only-ness is only a flag in the mount options which tells the filesystem driver what to do, which in turn tells the block interface what to do, which in turn tells the storage interface what to do.

A memory error can affect any part of that chain.

The ACS-5 command set used for SATA/AHCI has read and write commands specified by byte codes (search for “INCITS 558-202x - Information technology - ATA Command Set - 5”).

| Opcode | Command name  |
|--------|---------------|
| 20h    | READ SECTORS  |
| 22h    | READ LONG     |
| 30h    | WRITE SECTORS |
| 32h    | WRITE LONG    |

Guess what the difference is between a read command and a write command? Exactly 1 bit.

1 stray bit in a buffer sent to the storage interface can change a read into a write.
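
A quick Python illustration of how thin that margin is, using the two SECTORS opcodes from the table above:

```python
# READ SECTORS (20h) vs WRITE SECTORS (30h): exactly one bit apart.
READ_SECTORS, WRITE_SECTORS = 0x20, 0x30
print(f"{READ_SECTORS:08b} vs {WRITE_SECTORS:08b}")                  # 00100000 vs 00110000
print(bin(READ_SECTORS ^ WRITE_SECTORS).count("1"), "bit differs")   # 1 bit differs
```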

It happens often enough that projects are starting to implement mitigations for systems that don’t use ECC, like using command IDs or enum values with a large Hamming-distance from each other (number of bit flips required to turn one value into another).
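
A toy sketch of that idea (not sudo’s actual code, and the command IDs below are made up): pick values whose minimum pairwise Hamming distance is large, so a single flipped bit can never turn one valid command into another.

```python
from itertools import combinations

def hamming(a: int, b: int) -> int:
    """Number of bit flips needed to turn one value into the other."""
    return bin(a ^ b).count("1")

# Hypothetical command IDs; no single (or even double) bit flip of one
# value lands on another valid value.
COMMANDS = {"READ": 0x0F, "WRITE": 0xF0, "FLUSH": 0x3C}

min_dist = min(hamming(x, y) for x, y in combinations(COMMANDS.values(), 2))
print("minimum pairwise Hamming distance:", min_dist)  # 4 for this toy set
```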

Check out this example from sudo: Try to make sudo less vulnerable to ROWHAMMER attacks. · sudo-project/sudo@7873f83 · GitHub


Of course it can, but we’re talking about threading the needle of one or two kilobytes of memory… out of many gigabytes. Let’s say the code protecting the data is 8 KB and the memory is 16 GB. That is a probability of roughly 0.5 × 10⁻⁶, or about 0.00005%. And this is AFTER the bitflip happens, which in and of itself is somewhere around once every 3 billion writes or so.
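
For what it’s worth, the arithmetic checks out (taking the 8 KB / 16 GB figures above at face value):

```python
# Chance that a single random bit flip lands inside an 8 KiB window of 16 GiB of RAM.
window = 8 * 1024          # bytes we actually care about
total = 16 * 1024**3       # total bytes of RAM
p = window / total
print(f"{p:.2e}")          # ~4.77e-07, i.e. roughly 0.5 x 10^-6
print(f"{p * 100:.5f} %")  # ~0.00005 %
```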

The error rate is not zero, for sure. But you are much more likely to die in your bathtub than to have a bitflip corrupt a write-protected drive…

True, but servers have ECC, and some crazy people using workstations have it too.


FWIW the CPU itself has error detection and correction for its L3 cache, which gets pretty large on modern CPUs and is likely to hold whatever ZFS data needs to be checksummed in its entirety before it’s sent over PCIe (probably not through RAM) to disk.

IIUC, ZFS verifies the block checksums when pulling blocks from ARC too, right?


Be a large cloud company and have buildings full of servers running at high load 24/7 and you might start to notice the “incredibly unlikely” happen quite often :slight_smile:


Was going to say, one bit flip in billions may sound unlikely, until you notice Giga is already billions…
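
Back-of-the-envelope only (the per-DIMM rate below is a made-up assumption, not a measurement), but it shows how “rare” stops being rare at fleet scale:

```python
# Assumed, illustrative numbers - not vendor data.
flips_per_dimm_per_year = 1      # one correctable error per DIMM per year
dimms_per_server = 8
servers = 100_000                # "buildings full of servers"

flips_per_day = flips_per_dimm_per_year * dimms_per_server * servers / 365
print(f"~{flips_per_day:.0f} correctable flips per day across the fleet")  # ~2192
```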

Sure, I have not said anything else.

I thought this thread was about home servers though? My stance comes from running ECC in a home setting on a low-traffic server. Of course ECC is a necessity in the high-performance server market, but buying a high-performance server motherboard + a high-performance server drawing 300 W of power in a home setting… That’s like buying a Lamborghini for the occasional church trip and grocery shopping. It is a free country and if that makes you happy then sure, but most people would just get a $5k fourth-hand Honda Civic for that. Or a $1k e-scooter. :slight_smile:

Again, all things being equal, ECC is the better solution. Consider this though: budget of $700-$800, I just want ECC support, nothing else, and I can choose between:

| Part | With ECC | Price | No ECC | Price |
|------|----------|-------|--------|-------|
| CPU | Core i3 14100 | $139 | Core i5 14500 | $235 |
| Motherboard | ASUS Pro WS W680M-ACE SE | €369 | ASRock Z690M-ITX/ax | $135 |
| Memory | Generic 2x8 ECC kit | $50 | Generic 2x8 non-ECC kit | $35 |
| Total | | $558 | | $405 |

In this example ECC sets you back a whopping 150 bucks and you get a worse CPU due to the exorbitant price of the W680 board. The core components get 37%-50% more expensive and the total system maybe 20% more expensive.
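
Quick sanity check of that premium, using just the totals from the table above (the ~20% whole-system figure obviously depends on what the rest of the build costs):

```python
with_ecc, without_ecc = 558, 405
print(f"core premium: {(with_ecc - without_ecc) / without_ecc:.0%}")  # ~38%
```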

Is it worth it? Only you can answer that for your use case. I can only point out potential drawbacks. Of course you do get other features like IPMI by going with the W680, so this is not a complete apples-to-apples comparison, but ECC in the consumer space still sucks.

If you already have the platform, ECC is a no-brainer; we’re usually talking a difference of $5 per RAM stick. Unfortunately the consumer space requires significantly more effort.


That’s the only answer, I think - personal or organizational requirements and what you’re willing to risk. ECC is an insurance policy. You might pay for it and have perfectly working hardware for years without any detected errors. You might not pay for it and spend hours/days/weeks diagnosing intermittent segfaults and running memtest tools.

Sure, here’s my home server, where ECC has prevented something bad from happening 308 times since it has been powered on:

# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 308 Corrected Errors
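
If you’d rather not depend on edac-util, the same counters are exposed by the kernel’s EDAC sysfs interface; a minimal sketch, assuming the standard /sys/devices/system/edac layout is present:

```python
from pathlib import Path

# Read corrected/uncorrected error counts per memory controller from sysfs.
for mc in sorted(Path("/sys/devices/system/edac/mc").glob("mc[0-9]*")):
    ce = (mc / "ce_count").read_text().strip()   # corrected errors
    ue = (mc / "ue_count").read_text().strip()   # uncorrected errors
    print(f"{mc.name}: {ce} corrected, {ue} uncorrected")
```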

Personally, I’d rather pay a bit more on ECC RAM and at least have an idea of what is broken when it happens, rather than spending time and effort finding the issue.


Yep :slight_smile: I do get that, I just disagree that it is “a bit more” money; it is significantly more, plus you can’t just take any old computer, plop in ECC RAM, and expect it to work, either.

The consumer space needs a cheap way to do ECC servers. AMD is unofficially kinda supporting it if it’s Wednesday and Saturn is in alignment and you happened to buy a motherboard that supports it; Intel is supporting it, but with an eye-watering price gouge; and we consumers are left footing the bill as usual. -_-


Yeah, like when you cross 18 exabytes of storage and your monitoring system, which was using int64, stops working, and it ends up having various other knock-on effects. (True story - somehow we forgot to test for this.)
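
For anyone wondering where that figure comes from: an unsigned 64-bit byte counter tops out right around 18 EB (a signed one at about 9 EB).

```python
# Where 64-bit counters run out when counting bytes.
EB = 10**18
print(f"signed int64 max:   {2**63 / EB:.1f} EB")   # ~9.2 EB
print(f"unsigned int64 max: {2**64 / EB:.1f} EB")   # ~18.4 EB
```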

Which brings me to a point I wanted to make: it’s a scaling mechanism. Without ECC you’ll have a problem at some point; with ECC you’ll also have a problem at some point. You can pick which problems you want to solve in life, and I’d say that ECC just isn’t worth it for most people; most people should invest energy into working (and tested) offsite backups before they spend more time and money on ECC.

Spend the money on a second bunch of HDDs running ZFS or something else, elsewhere; it’s more likely you’ll need them and use them than it is that you’ll run into ZFS issues because of a bit flip at home.

(That said, I wish non-ECC just didn’t exist. If everyone was simply not given a choice, it’d end up being the same price; the silicon factories would just run their stuff a bit longer and use a bit more water and energy making RAM chips.)


Not disagreeing with you, but to add to this… 20 years ago only super advanced servers used 16 GB of RAM. Today that is borderline usable, and the consumer space is starting to knock on the TB range.

So yes, it is a scaling problem, but I also fear it will soon be a visible problem on all >128 GB systems (heck, it already is on many 32 GB systems). So hopefully this will resolve itself soon out of necessity.

Not mentioning it made it sound like the only affected part of the system was storage. In that case, having a system crash multiple times over and over again is already a telltale sign that something is letting go, and no human being that self-hosts movies would let that slide.

Which is fair in my eyes, since a movie has no reason to be written to just to access it.

I thought the focus of the topic was ECC, so the mitigations that other mechanisms bring to keep your data safe didn’t matter.

You gave me a pinhole, I used it. I’m not one to build castles in the air.

Agree to disagree. I don’t think your example was effective in conveying the importance of ECC. I think ECC helps in situations in which the issue is much more difficult to spot because everything seems to be behaving the way it should before and after anything more or less catastrophic happens to the system.


Also, to the point of ECC and errors being difficult to spot: even ECC can HIDE issues if your board or OS does not report and correctly log all ECC errors. This is still hit or miss in the world of desktop gear with ECC UDIMMs.

There is no magic bullet. There is also the ability to miss the flags and bells and warning sirens, and cruise along until everything is engulfed in fire.

That happened in a Belgian election if I recall - one candidate got exactly 4096 votes too many and it was traced to a single bit flip. Of course that bit flip was caused by a deep state sponsored UFO being guided by 5G towers and the voting booth staff were rendered temporarily frozen by vaccine-implanted chips activated by those same towers.
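
(Bit 12 is worth exactly 4096, which is why the anomaly was such a suspiciously round power of two.)

```python
# Flipping bit 12 (value 4096) in a tally where that bit is clear adds exactly 4096 votes.
tally = 1_204
corrupted = tally ^ (1 << 12)
print(corrupted - tally)   # 4096
```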


How long have you left the system on?

# uptime -p
up 4 years, 41 weeks, 4 days, 20 hours, 9 minutes

I have two servers set up with btrfs. One of them started having corruption in its filesystem storage container and I couldn’t find any reason for this to be happening. Both have since been upgraded to ECC memory and the corruption errors seem to have been resolved for now.


Here’s my favorite ECC hack: squatting bit-flip domains
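
For the curious, generating bit-flip (“bitsquat”) candidates is trivial; a toy sketch, with example.com as a placeholder and no validation of the resulting hostnames:

```python
import string

ALLOWED = set(string.ascii_lowercase + string.digits + "-.")

def bitsquats(domain: str):
    """Yield every domain one flipped bit away, keeping only hostname-safe characters."""
    for i, ch in enumerate(domain):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped != ch and flipped in ALLOWED:
                yield domain[:i] + flipped + domain[i + 1:]

print(sorted(set(bitsquats("example.com"))))
```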

  • Imaging A Hard Drive With non-ECC Memory - What Could Go Wrong
    • There are a lot of mini-lessons and learnings, but the most significant idea that I want to impart to the reader is the scale of how much time I wasted on debugging this. With cheap consumer non-ECC memory, your bits can flip here and there, and you’ll have no idea!

  • Toyota Case: Single Bit Flip That Killed
    • And it turns out that the crux of the issue was these memory corruptions, which acted “like ricocheting bullets.”

  • Bitflip in Nessie 2024 entry 65051339 - Certificate Authority (CA) Certificate Transparency (CT) logs corrupted by single bit-flip, invalidating the log for new certificates
  • Yeti 2022 not furnishing entries for STH 65569149 - Certificate Transparency (CT) logs corrupted by single bit-flip
  • Mozilla: Some data appears with incorrect labels, with a 1 character variation compared to the label that exists - unsolved telemetry bitflips
  • Practical Hardening of Crash-Tolerant Systems
    • Global Amazon S3 outage for 8 hours:

      A handful of messages had a single bit corrupted such that the message was
      still intelligible, but the system state information was incorrect. We used
      MD5 checksums throughout the system (but not) for this particular internal
      state information. (…) When the corruption occurred, we did not detect
      it and it spread throughout the system causing the
      symptoms described above

  • Silent Data Corruptions at Scale - based on data from Facebook
    • This meant for some random scenarios, when the file size was non-zero,
      the decompression activity was never performed. As a result, the database
      had missing files. The missing files subsequently propagate to the
      application. An application keeping a list of key value store mappings
      for compressed files immediately observes that files that were compressed
      are no longer recoverable. This chain of dependencies causes the
      application to fail.

And on a lighter note:

  • See cosmic ray ionization streams happen in real time on a science exhibit

Question: all DDR5 comes with on-die ECC. It should prevent corruption while the data is sitting on the die; is that true? I know the bus between the CPU and RAM is not ECC-protected if the DIMM is not an ECC DIMM.