My experience:
The RAM module must have begun failing within the last three months, since I configured the data scrub to run every three months. On the day the NAS became unreachable, the power LED was solid blue and the disk LEDs were blinking; everything appeared normal. I tried to shut it down with the power button, but nothing happened. After a hard reset, the system would no longer boot. I removed all the disks, with no luck. Then I took out the Synology RAM module, leaving only the aftermarket Kingston module installed, and the NAS booted successfully.
I don’t have many details because Synology’s monitoring and logging suck. There was only one log entry, stating that the volume had gone into read-only mode.
Examining the kernel logs, I found numerous Btrfs errors and integrity-check failures on both mirrored disks, with files marked as unrecoverable. I tested both drives and they’re fine, and the controller seems healthy, since it still serves other disks.
After some Googling, my best guess is that corrupt data or checksums were written from the faulty RAM back to the disks for weeks. As a result, anything modified in the last three months might be corrupted. Fortunately, most of the data is either backed up or consists of backups itself.
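To illustrate why the timing of a RAM error matters, here is a minimal sketch, assuming a filesystem that stores a checksum next to each block. Python's `zlib.crc32` stands in for the filesystem's checksum (Btrfs defaults to crc32c); the helper names are hypothetical. A bit that flips before the checksum is computed produces corrupt data that still verifies, while a flip after checksumming shows up as a checksum error like the ones in the kernel logs.

```python
import zlib

def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (simulating a faulty RAM cell)."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)

def write_block(data: bytes) -> tuple[bytes, int]:
    """Hypothetical helper: store a block together with its checksum."""
    return data, zlib.crc32(data)

def read_ok(block: bytes, checksum: int) -> bool:
    """Hypothetical helper: verify a block on read or during a scrub."""
    return zlib.crc32(block) == checksum

original = b"important file contents" * 4

# Case 1: the bit flips in RAM *before* the checksum is computed.
block1, csum1 = write_block(flip_bit(original, 13))
case1 = read_ok(block1, csum1)  # corrupt data, but it matches its own checksum

# Case 2: the bit flips *after* the checksum is computed.
block2, csum2 = write_block(original)
block2 = flip_bit(block2, 13)
case2 = read_ok(block2, csum2)  # mismatch, reported as a checksum error

print("flip before checksum, verifies:", case1)
print("flip after checksum, verifies:", case2)
print("data intact in case 1:", block1 == original)
```

In the first case nothing downstream can tell that the data is wrong; only ECC on the memory itself would have caught the flipped bit.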
With proper logging, monitoring, and alerting, this issue could have been detected earlier. But if the system had used ECC RAM, it would have detected that something was wrong as soon as the module started to fail.
Once bitten, twice shy… my next NAS will use ECC RAM.
My experience confirms this:
Exard3k:
How do you know? You can’t. Everything on a computer relies on the CPU for data, and everything downstream just treats that data as validated and correct and applies its own error correction to already-corrupt bits. Most components have some kind of ECC, but people skip the top of the chain, namely the CPU.
You will simply not know if something is wrong until it’s too late.
And don’t think of ECC only as protection against cosmic rays; in my opinion, failing memory modules are the bigger risk.
I went with a last-gen Ryzen Pro and ECC DDR4.