Best FS/RAID configuration for a home NAS?

I have an RPi 5 8GB, a 6 bay HDD case (USB3.2 gen 2) and an assortment of old HDDs of varying sizes (1x1TB, 1x2TB, 2x3TB and 1x4TB) that I want to assemble into a NAS. I also have an old spare 128GB SATA SSD that if possible I would like to use for caching. I do plan to eventually upgrade all the drives to 4 or 8TB ones, but only as budget becomes available, which in practice means that I’m stuck with mismatched drives for a year or longer. And right now I’m weighing my options for the configuration. For the system I think I’m best served with the default RPi distro (basically Debian) with openmediavault. And what I’m stuck at is what file system and RAID configuration to use. Here are the features that I want, in the order of priority:

  1. At least 50% storage efficiency.
  2. Protection against bit rot.
  3. Ability to lose 2 disks without data loss (the HDDs are old, so I wouldn’t feel safe with just 1 disk of redundancy).
  4. Ability to add/remove/replace disks with as little headache as possible.
  5. Ability to efficiently use mismatched disks.
  6. Ease of software maintenance.
  7. Improve performance via caching with the spare SSD.

I’ve basically narrowed it down to 3 options that I think would be the best, but all of them have downsides that bother me a lot, so I’m wondering which you would pick if you had the same hardware to work with for a NAS…

I. Btrfs RAID1c3 for metadata, and RAID6 for data.
This would be the ideal solution for my use case if it actually works, because it covers points 1-6 without reservation, and for caching there are also some solutions (using bcache or LVM cache). However, I’ve read that RAID6 on btrfs is unstable, and I don’t know how unstable it remains if I use the recommended approach of RAID1c3 for metadata. Would I then be as safe as with ZFS RAID-Z2, or is btrfs RAID1c3 for metadata not sufficient to ensure data safety with RAID6 for data?
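For reference, the profile split in option I can be expressed at mkfs time. The device names below are placeholders, and btrfs-progs still flags the RAID5/6 data profiles as experimental:

```shell
# Sketch of option I: RAID1c3 for metadata, RAID6 for data (placeholder devices)
mkfs.btrfs -m raid1c3 -d raid6 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde

# An existing filesystem can also be converted between profiles online:
btrfs balance start -mconvert=raid1c3 -dconvert=raid6 /mnt/nas
```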

II. ZFS RAID-Z2.
This would give me points 1-3 and easier access to caching, though I will probably have less than 50% storage efficiency until I upgrade all the drives. But I will completely give up the ability to resize my array (beyond just upgrading to bigger disks), mismatched disks would just mean that storage on the bigger ones is completely wasted and I would have to be careful with system updates for the NAS because the ZFS driver is not part of the kernel.

III. Btrfs RAID1.
I would give up the 3rd requirement but keep all of the other advantages of option I, and it is also considered stable and safe to use. As a bonus, restoring the array after a disk failure would go faster: while the array would be at greater risk while it’s being restored with a new disk, it would be back to full strength quicker.

So I guess my main question is, how safe is option I. And if it’s not as safe as option II, what would you pick for this project? Maybe you also have a different suggestion that I hadn’t considered or dismissed too early (bcachefs?).

I recently had an evening of COVID and “reminiscing” on ZFS approaches, so I thought I may as well write it down here as you asked!

So my first, very personal and opinionated view is that ZFS is a good move. While there are some issues (nothing being perfect), I even use it on RPis. The disk management commands take a little getting used to, but they are very consistent, so you learn them once… (that sorts points 1, 2ish, 3, 4, 6, 7?)

Next, I like the fact that mismatched disks just work. Pick your lowest capacity first, then just add equal or greater disks. Again, not perfect, but nothing is!

Cache? Yeah, sure… just configure away.

Finally, when your weakest link (the Pi) falls apart, you can move the array to another machine pretty easily.

Do investigate more to confirm for your exact needs, but Linux boot can be awkward (BSD handles it better), and I find that ZFS, even on a single disk, just works, is simple to manage, and is a joy to use.

The downside - just remember your backup strategies and so on for your NAS - none of your options fix that issue! And the alternatives will be interesting…

I’d opt for “II. ZFS RAID-Z2.” for simplicity, and if you are intending to upgrade the disks later anyway. But note that this alone will give you only 3T of usable space, since every member is clamped to the 1T disk: of your 13T raw, roughly 23% would be usable, 15% used by parity, and over 60% simply wasted.
You could partition the larger disks to make the leftover space usable, but if you run parallel filesystems on a single disk, your IOPS degrade by more than 50% due to head seek latency.
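Assuming the pool clamps every member to the smallest disk (1T) and spans all five disks, the arithmetic works out like this:

```shell
#!/bin/sh
# Rough RAID-Z2 capacity breakdown for disks of 1, 2, 3, 3 and 4 TB.
min=1; n=0; total=0
for d in 1 2 3 3 4; do
  total=$((total + d))
  n=$((n + 1))
done
alloc=$((n * min))           # pool allocates min-size from every member: 5T
parity=$((2 * min))          # two members' worth of parity: 2T
usable=$((alloc - parity))   # (N-2) x smallest
wasted=$((total - alloc))    # space the pool never touches
echo "raw=${total}T usable=${usable}T parity=${parity}T wasted=${wasted}T"
echo "usable $((100 * usable / total))%, parity $((100 * parity / total))%, wasted $((100 * wasted / total))%"
```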

It’s a simple matter with ZFS to use an SSD as L2ARC - just add it to the pool, and you can remove it later if you want.
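For completeness, attaching and detaching the cache device is a one-liner each. Pool and device names here are placeholders:

```shell
# add the spare 128GB SSD as L2ARC to an existing pool (placeholder names)
zpool add naspool cache /dev/disk/by-id/ata-SPARE_SSD

# cache devices can be removed again at any time without risk to the data
zpool remove naspool /dev/disk/by-id/ata-SPARE_SSD
```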

> I would have to be careful with system updates for the NAS because the ZFS driver is not part of the kernel.

That is a good concern. Although, if you don’t have the root FS on ZFS, then it’s less of an issue because you can always fix it by a downgrade. Anecdotally, all my machines use ZFS root FS with Arch + linux-lts kernel and I can’t remember having an issue in the last 5 years. If you do use root FS on ZFS, just make sure your initramfs builds correctly, and zfs.ko is present in /lib/modules/$NEW_KERNEL before you reboot :slight_smile:

If you’re feeling adventurous, you can use ZFS RAIDZ2, waste less of your current disks, and still have storage that you can upgrade later by replacing disks with larger ones. How about this idea:

LVM2 to join the 1T and 2T into a virtual 3T.

ZFS RAIDZ2 over 4 members: the 2x 3T disks, the 3T virtual volume, and a 3T partition of the 4T disk.

You’d end up with 6T effective, and can lose any two of the four 3T members; since the 1T and 2T disks together form a single member, losing both of them only counts as one failure, so you could still lose another disk on top.

1T of the 4T is wasted, but can be used as a spare partition for anything.
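The arithmetic behind that layout, under the assumption that the pool treats the LVM volume as one ordinary 3T member:

```shell
#!/bin/sh
# Hybrid layout: four 3T members (3T, 3T, LVM(1T+2T), 3T slice of the 4T disk)
members=4
size=3                               # every member is 3T
usable=$(( (members - 2) * size ))   # RAID-Z2 keeps two members' worth for parity
raw=$((1 + 2 + 3 + 3 + 4))
leftover=$((raw - members * size))   # the unused 1T tail of the 4T disk
echo "usable=${usable}T leftover=${leftover}T of ${raw}T raw"
```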

Visually, raw disks:

1T: A
2T: BB
3T: CCC
3T: DDD
4T: EEEE

Become:

3T: ABB
3T: CCC
3T: DDD
3T: EEE
1T: E (unused / extra)

Example commands:

# create a single partition using all space, starting at 1MB
sgdisk --zap-all --new=1:1M:0 --typecode=0:8300 /dev/A
sgdisk --zap-all --new=1:1M:0 --typecode=0:8300 /dev/B
sgdisk --zap-all --new=1:1M:0 --typecode=0:8300 /dev/C
sgdisk --zap-all --new=1:1M:0 --typecode=0:8300 /dev/D

# create a 3T partition at 1MB, rest of the disk as a second partition
sgdisk --zap-all --new=1:1M:+3T --typecode=1:8300 --new=2:0:0 --typecode=2:8300 /dev/E

# setup LVM2 to join disk A (1T) partition 1 and disk B (2T) partition 1
pvcreate /dev/A1
pvcreate /dev/B1
vgcreate vg.ab /dev/A1 /dev/B1
lvcreate -n merged -l 100%FREE vg.ab
blockdev --getsize64 /dev/mapper/vg.ab-merged
# -> should report ~3T (in bytes)

# setup ZFS pool
zpool create naspool raidz2 /dev/mapper/vg.ab-merged /dev/C1 /dev/D1 /dev/E1

Then at boot time, just make sure the LVM2 init script (which really only needs to run a vgchange -ay) comes up before the ZFS init script (zpool import).
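On a systemd distro (Raspberry Pi OS / Debian), one way to sketch that ordering is a drop-in on the ZFS import unit. `zfs-import-cache.service` is the standard OpenZFS unit name, but the LVM activation unit varies with how lvm2 is configured, so treat `lvm2-activation.service` below as a placeholder and verify the name with `systemctl` first:

```ini
# /etc/systemd/system/zfs-import-cache.service.d/after-lvm.conf
# (hypothetical drop-in; verify the LVM unit name on your system)
[Unit]
After=lvm2-activation.service
Wants=lvm2-activation.service
```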

Go with Btrfs RAID1 - it’s your best bet here. Set up your four biggest drives (2x3TB, 1x4TB, 1x2TB) as the main array and use the 1TB as backup or a hot spare. Yeah, you lose the two-disk failure protection, but you get 50% storage efficiency (about 6TB usable out of 12TB raw), rock-solid stability, and all the flexibility you need for those mismatched drives. Btrfs RAID6 still has write-hole issues and sketchy recovery even with RAID1c3 metadata, while ZFS RAID-Z2 would kill your efficiency at roughly 23% because of the smallest-disk limitation - that’s just wasting storage. RAID1 rebuilds fast (hours, not days) so you’re not exposed for long, you keep snapshots and compression, and you can expand without headaches.
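Assuming btrfs can pair chunks freely because no single drive is bigger than the rest combined, the RAID1 usable space works out like this:

```shell
#!/bin/sh
# btrfs RAID1 usable space for 2T, 3T, 3T and 4T drives
total=0; max=0
for d in 2 3 3 4; do
  total=$((total + d))
  [ "$d" -gt "$max" ] && max=$d
done
rest=$((total - max))
# two copies of everything: half of raw, as long as the biggest
# drive can always be paired against the remaining ones
if [ "$max" -le "$rest" ]; then
  usable=$((total / 2))
else
  usable=$rest
fi
echo "usable=${usable}T of ${total}T raw"
```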

So if anyone is wondering how the project ended - I decided to go with btrfs raid1c3. Upgrading disks will take a while, and after looking at the prices I’m honestly not sure I will ever have matching disks in this array, because by the time I replace half of them, it will be cheaper to go bigger. So I think I’m better off with something more flexible than ZFS. I’m down to 33% efficiency, but I found another 5TB disk to add to the array, which means I’ll have 6TB usable space there ((1+2+3+3+4+5)/3), which will be enough for me for now. And considering that 4 of the 6 disks in the array had at some point shown signs of degradation, I am really not comfortable with just 1 disk of redundancy, even though running badblocks for the past 3 days on all of them only showed intermittent issues on one of them.
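For anyone checking those final numbers: with three copies of every chunk, usable space is a third of raw, provided no single disk exceeds that third. A rough model of the raid1c3 allocation:

```shell
#!/bin/sh
# raid1c3 usable space for 1, 2, 3, 3, 4 and 5 TB drives (3 copies of each chunk)
total=0; max=0
for d in 1 2 3 3 4 5; do
  total=$((total + d))
  [ "$d" -gt "$max" ] && max=$d
done
usable=$((total / 3))
# the estimate only holds while the biggest disk fits within the usable third
[ "$max" -le "$usable" ] && echo "usable=${usable}T of ${total}T raw (~$((100 * usable / total))%)"
```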

I still haven’t figured out a caching solution that works for me. I thought I would use bcache, but after looking more into it, it seems that bcache has to manage the devices rather than btrfs, with btrfs RAID running on top of them, which is too complex a setup for my taste. But I ran KDiskMark on the mounted NFS share, and iperf3, and the speeds from the two are identical, so my bottleneck is the network anyway. Even if I can’t find a caching solution that works for me, I don’t think I’d notice a difference, so SSD caching would be a “just for science” thing for me.

Thanks everyone for your input.