That makes total sense. After a few hours of research this morning I think I have a plan, and I will proceed with VM testing. I might post if something comes up (hopefully not!). You have been a great help, thank you very much!
I will attempt to add some value here for anyone stumbling upon this in the future by logging my findings and thought process on this last subject (keep in mind I’m a noob).
So, root partition: redundancy + ease/speed of recovery.
The gold standard (for my needs) seems to be ZFS in a RAID1-style mirror (not RAID-Z1). Most importantly because it checksums the data, letting me know which disk is bad in a silent-failure situation.
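To make that concrete, here's roughly what the ZFS side looks like. This is only a sketch: `tank`, `/dev/sdb` and `/dev/sdc` are placeholder names, and these commands are destructive to the disks involved.

```shell
# Create a mirrored pool from two spare disks (placeholders!).
# Data checksumming is on by default in ZFS.
zpool create tank mirror /dev/sdb /dev/sdc

# Start a scrub: ZFS re-reads every block, verifies it against its
# checksum, and repairs bad copies from the healthy mirror side.
zpool scrub tank

# Status shows per-device READ/WRITE/CKSUM error counters, which is
# exactly what identifies the silently failing disk by name.
zpool status tank
```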
This is a no-go, though, since it would take some meddling around with RootOnZFS, which was recently removed from the Ubuntu installer, and would leave me with ZFS but a limited set of features.
Next up, RAID1 and XFS. This one was easy: XFS's scrub (self-heal) is still experimental, and as far as I can tell it is aimed at metadata rather than file data. I didn't go further to see how it would play with RAID1.
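For reference, the tool in question is `xfs_scrub` from xfsprogs, which still warns that it is experimental. A harmless way to poke at it (the mountpoint is a placeholder):

```shell
# -n = check only, don't attempt any repairs.
# Run against a mounted XFS filesystem.
xfs_scrub -n /mnt/xfs
```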
Btrfs, the string theory of file systems. This took some time, because on paper it's perfect: checksums, scrubs, built-in RAID, etc. But then I found that article by Ars Technica… and suffice to say that between the scrub weirdness, the clunkiness of booting a degraded array, and the seeming lack of ease of use when it's most needed (during failures), this isn't it.
dm-integrity. A promising middle ground. It sits below mdadm and adds checksumming to the data; what's more, it reports a disk error when it detects silently corrupted data. NICE! Not so nice is the setup process described in the answer to this question. In addition, the previous article points out that dm-integrity uses crc32, which is collision-prone (I didn't follow up on the chance percentage and its relation to the size of the data being hashed), and that it needs to write over the whole disk on initialisation or disk replacement.
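The dm-integrity-under-mdadm idea sketches out roughly like this (placeholder disks, destructive commands, and a simplification of the full procedure from that answer):

```shell
# Format each disk with per-sector checksums. This is the step that
# writes over the whole device, hence the slow initialisation.
integritysetup format /dev/sdb
integritysetup format /dev/sdc

# Open the integrity layer; these mapper devices are what md sees.
integritysetup open /dev/sdb int-sdb
integritysetup open /dev/sdc int-sdc

# Mirror the integrity devices. On a checksum mismatch dm-integrity
# returns an I/O error for that sector, so md knows which leg is bad
# and serves the read from the other disk.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/mapper/int-sdb /dev/mapper/int-sdc
```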
So!
I can make one of two trades.
- dm-integrity. Yields a chance of knowing which drive is reporting bad data in a silent-failure situation. This saves time because it lets me swap only the failing drive and carry on. The price is increased complexity/time during installation.
- Expendable OS. As discussed above with @MadMatt, I can go with the far simpler stack of mdadm RAID1 → LVM → ext4 and consider my OS a possibly expendable part. This keeps the installation simpler overall and faster to redo from scratch, which is good for other failure scenarios as well. On the flip side, silent bad data means more downtime, because if I don't know which disk is bad, then both have to go. There's also the cost of buying two replacement disks, but for dedicated boot drives it shouldn't be much.
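The simple stack sketches out as follows (placeholder disk and volume names, destructive commands, and of course the installer would normally do most of this for you):

```shell
# Plain mdadm RAID1 mirror over two disks (placeholders!).
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

# LVM on top of the mirror.
pvcreate /dev/md0
vgcreate vg0 /dev/md0
lvcreate -L 50G -n root vg0

# Plain ext4 on the logical volume for the root filesystem.
mkfs.ext4 /dev/vg0/root
```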
For my particular situation the expendable-OS trade-off seems good, especially when you add into the mix a backup image of the installation or a snapshot using LVM (haven't read much on how feasible that is, though!).
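On the LVM snapshot idea, the basic mechanics are roughly this (assuming a volume group `vg0` with a `root` LV, both placeholder names, and enough free space left in the VG to hold the snapshot's copy-on-write data):

```shell
# Take a snapshot of the root LV; 10G here is the CoW budget for
# changes made after the snapshot, not a full copy of the LV.
lvcreate -s -L 10G -n root-snap vg0/root

# Roll the root LV back to the snapshot state. For an in-use root
# filesystem the merge is deferred until the next activation,
# i.e. it takes effect after a reboot.
lvconvert --merge vg0/root-snap
```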