(Yet Another) ZFS Pool Layout Sanity Check and Question

TL;DR: Currently replicating my 6-wide raidz2 single-vdev pool over to a 3-wide raidz1 so I can burn my soon-to-be-old configuration in a fire… or find a virtual lego for it to step on.

My (current) use case is mostly large files (my movie collection), plus some random stuff I really don't wanna lose (that is already stored in multiple places and on multiple mediums). There's also a little bit of dabbling in block storage territory (VMs, containers, containers in VMs, etc.), but I have a single M.2 drive where that usually goes. So mostly large files for now.

So, my 6-wide raidz2 of 10TB HDDs… I like the extra security raidz2 gives me, but 6 wide makes it just that little bit more expensive to expand the pool, and expansions (and the benefit they bring: more data vdevs) will happen less often. So I'm starting from scratch, but this time, Once More with Feeling. Sorry. I meant with feedback. :wink:

I am moving my data onto some 16TB HDDs. Those will end up being my second-copy backup, probably as a pool of mirrors. Easy and cheap to upgrade, plus they will just sit there doing nothing except the weekly replication.
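If it does end up as mirrors, the appeal is that the pool grows a pair at a time; a rough sketch, with made-up device names:

```
# Start the backup pool with one mirrored pair (device names are placeholders)
zpool create backup mirror /dev/disk/by-id/ata-16tb-A /dev/disk/by-id/ata-16tb-B

# Later, add capacity two drives at a time
zpool add backup mirror /dev/disk/by-id/ata-16tb-C /dev/disk/by-id/ata-16tb-D
```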

The real question is how to configure my new pool. I now have eleven 10TB HDDs. I was originally thinking 2 x 5 raidz2, plus one hot spare. It's only one drive less wide, but drive by drive that adds up as my data hoarding only gets worse. Plus I like the idea of needing 3 drives to fail in a single vdev before I lose everything.

But now I'm considering either a 4 or 5 wide raidz1. Two more drives and I could have 3 x 4 raidz1 with a hot spare. So the options are (rough capacity math below):
2 x 5 raidz2,
or 2 x 5 raidz1,
or 2 x 4 raidz1 (soon to be 3 x 4 raidz1).
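For comparison, the raw usable space works out roughly like this with 10TB drives (ignoring ZFS overhead, padding, and the keep-some-free-space rule):

```
2 x 5-wide raidz2:              2 * (5 - 2) * 10TB = 60TB usable, up to 2 drives per vdev can fail
2 x 5-wide raidz1 (+1 spare):   2 * (5 - 1) * 10TB = 80TB usable, 1 drive per vdev can fail
2 x 4-wide raidz1 (+ spares):   2 * (4 - 1) * 10TB = 60TB usable, 1 drive per vdev can fail
3 x 4-wide raidz1 (13 drives):  3 * (4 - 1) * 10TB = 90TB usable, 1 drive per vdev can fail
```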
And I will be using some Optane drives SOMEONE mentioned being a really good deal as a metadata (special) vdev. I don't see any benefit from either a log or cache device on this pool given my current use case.
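For what it's worth, here's the rough shape of attaching a special vdev at pool creation; a minimal sketch with placeholder device names, with the Optanes mirrored since losing the special vdev means losing the pool:

```
# Placeholder names throughout -- mirror the special vdev, it holds pool-critical metadata
zpool create tank \
    raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde \
    special mirror /dev/nvme0n1 /dev/nvme1n1

# Optionally let small records spill onto the Optanes as well
zfs set special_small_blocks=64K tank
```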

And I’d be happy with any of them, but more brains==moreBetter. What would you do? Do those options sound good, and I just need to pick one, or is there something I’m not thinking about?

Happy MLK day my state side people, everyone else, thanks and happy Monday I guess :slight_smile:

Well, I like mirrors so I'd always consider that. Even if you are sure you don't need the performance and this is just for file storage, mirrors give you a much better ability to recover from disk failures quickly, without being hampered by the slow resilver times of a parity-based RAID setup.

If you don’t buy that, here’s an article from someone much more knowledgeable that is worth the read before you decide: ZFS: You should use mirror vdevs, not RAIDZ. – JRS Systems: the blog

If you are dead set on using a parity-based configuration, I would probably consider either 2x RAIDZ2s or a single big RAIDZ3. Stay away from Z1 - it’s just too risky, even with a global spare.


After years and years I've given up on storing movies on ZFS/btrfs, and use snapraid and mergerfs for the majority of bytes I have in this large archival / mostly read-only storage.

I can't really keep track of machines and disks and various sizes anymore; there are 4 hosts with large disks of varying sizes at this point.

But, I have converged on a similar setup across the disk hosts.


On each disk I do a stack of LVM (a 2-way split), with cryptsetup on top.

The smaller part goes into btrfs raid-whatever (depends on the number of drives on each host); the larger part goes into archival.

The archival part is just a single-drive btrfs partition… and there's mergerfs and samba stitching all of it together across disks and across machines, … and a cronjob that runs snapraid to add 1 disk of redundancy for the archival bits.
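For anyone who hasn't seen these tools, the glue is roughly an fstab entry for mergerfs plus a small snapraid config and a cron'd sync; the paths below are made-up placeholders rather than my real layout:

```
# /etc/fstab -- mergerfs pools the per-disk archive mounts into one tree
/mnt/archive/disk*  /srv/archive  fuse.mergerfs  allow_other,category.create=mfs,moveonenospc=true  0 0

# /etc/snapraid.conf -- one parity disk protecting the data disks
parity  /mnt/parity1/snapraid.parity
content /var/snapraid/snapraid.content
content /mnt/archive/disk1/snapraid.content
data d1 /mnt/archive/disk1
data d2 /mnt/archive/disk2

# crontab entry -- recompute parity for whatever changed, once a week
0 3 * * 0  snapraid sync
```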


OS, home, VMs, containers and “scratch space”, which are stored in traditional software raid, now make up probably <10% of total bytes after raid replication.


Typically, when I get a new disk, it takes about a week to burn it in.

Then I start it off with a 1TiB LVM partition for regular raid and a couple-of-TiB partition for archival media, plus a shell script (forked off from the mergerfs docs) that starts rebalancing mergerfs storage.
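In shell terms, the per-disk setup looks roughly like this (device, VG and LV names are placeholders):

```
# Placeholder names; split the disk into a small "raid" LV and an archival LV
pvcreate /dev/sdX
vgcreate vg_sdX /dev/sdX
lvcreate -L 1T -n raid    vg_sdX
lvcreate -L 2T -n archive vg_sdX

# cryptsetup on top of each LV (raid LV gets the same treatment before joining btrfs raid)
cryptsetup luksFormat /dev/vg_sdX/archive
cryptsetup open       /dev/vg_sdX/archive archive_sdX

# the archival part is a plain single-disk btrfs, later pooled by mergerfs
mkfs.btrfs /dev/mapper/archive_sdX
```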

When I move a disk across machines, I only need to worry about the non-archival storage actually being copied. The archival storage just moves with the disk, and I go and update samba configs and fstabs.

It’s super manual relative to poking some buttons in a web ui, but it’s very efficient in terms of bytes.

Generally I’m at about 60% bytes used in various "home"s, which is basically hot storage, and around 90% bytes used in archival after snapraid.


(yes I realize Ceph fixes this, but it needs more powerful hosts, and my storage network is a bunch of potatoes with comically large drives).


Thanks @Ghan, that article was a great read, favorite line:

One last note on fault tolerance

No matter what your ZFS pool topology looks like, you still need regular backup.

Say it again with me: I must back up my pool!

I'm giving mirrors a good second look now; I'm actually watching a video from the OpenZFS YouTube channel on resilvering impact. https://youtu.be/SZFwv8BdBj4?si=NgmkYX7royxYU05x

The reasons I didn't put a pool built out of mirrors on my list were partly some other random articles, but mostly that twice I've had mirrors fail. As in both drives failed, and I had to attempt a data recovery process. Now granted, the first pair was back in '99 I think, and the second failed in about '04ish.

But we have come a long way in the quality of our drives, and this time I do have a plan to have a backup copy of all the data. But I’m just nervous of the idea that if two drives fail in the same vdev, I’m borked.

But I’m putting a pool of mirrors back on my list of options while I read a bit more.

Thanks for chiming in @risk. I’ll admit I’ve never heard of snapraid or mergerfs. Something new for me to look into for sure, but I think I’ll put that into the “learn about it, and if it’s worth the benefit, try it later” category.

Also, it sounds like you have some problems to solve that I'm not having to deal with right now. I'm currently planning to stick with one network device that holds all of my storage (some type of NAND device for block storage is next on my list) and hands out, well, files or storage devices as needed. Soon there will be a second device in a separate location, where my backup drives will move to and my weekly replication will be sent, but that is the plan for the next 2-5 years I think. Only two devices to keep track of.

I am curious what the primary benefits of snapraid and mergerfs are, since I've never come across them. Is it primarily a better way to handle data when you have so many hosts?

It's not so much the hosts as it is the drives being of varying sizes and types, …

…and me buying them when I need them or when I want to pay for them.

e.g. I'm sitting at about 5T of empty archival space now, out of about 200T total, and I'm just looking at maybe adding a single 18T or 20T drive, or maybe replacing one of the older, smaller 8T ones in one of the hosts and sending that 8T into another host to finally replace and recycle one of the 3T drives that's part of a raid setup.


Most people would use mergerfs on a single host, with individual disks.

But at the end of the day, it doesn’t care where files come from, whether it’s from a raid volume, from single disks, or from a network. If you have space in some directory somewhere, it can probably work with it.


This is definitely a valid concern. You could also consider dRAID, though this ZFS vdev type is typically aimed at larger disk pools; 11 disks is on the lower end of where it would typically make sense. dRAID trades a fixed stripe width for much faster resilvering times compared to RAIDZ. Otherwise it operates much the same, with the ability to specify your parity level and whether you want any built-in (distributed) spares. Could be worth a look.

https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html
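If you want to play with it, an 11-disk dRAID2 with one distributed spare would be declared something like this (device names are placeholders; here each redundancy group is 8 data + 2 parity):

```
# draid<parity>:<data>d:<children>c:<spares>s
# 11 children, 1 distributed spare, redundancy groups of 8 data + 2 parity
zpool create tank draid2:8d:11c:1s \
    /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
    /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk
```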


Ehhh, 11 disks is too small for dRAID. dRAID doesn't really make sense until you are “bigger” than will fit in 4U of rack space. Not for any technical reason; it just seems to work out that way for the break-even point.


When using rsync to move data back from my temporary holding place once I rebuild my pool (I have not actually killed the old pool just yet), does anyone know if the metadata gets moved from the spinners, where it currently lives, over to the metadata vdevs in my new pool as part of the rsync?

It may impact my decision on going with a bunch of 2-disk mirrors or raidz2. Two speakers at a couple of BSD conventions were discussing how the resilvering process works, the various iterations of improvements, and how large a role metadata plays in it.

Which also makes me wonder if having metadata vdevs in a pool makes the resilvering process faster, since the topology of the vdev being resilvered is a very important part deep in the guts of the process. However, I have not seen much on this, and have yet to find any good testing or benchmarking on it either.

But if anyone knows, please chime in! I'd love to hear.

edit: this is one of the videos I was referring to, by Allan Jude at an OpenZFS webinar about a year ago. Only 425 views, which is a shame; it's got some good stuff in there. https://youtu.be/pAzkgn_gKkw?t=862