Best practice for raidz on different sized disks in 2023?

If one has three 8TB and four 16TB disks, is it best to divide those into two different vdevs?

When googling that’s what most people say, but since zfs is a bit of a moving target I thought it would be better to ask the people who are up to date.

Edit: Changed terminology to match what I was thinking.
Edit: Changed units since I’m old.

1 Like

To help think about this: if they were all 16GB, what would you do, and why?
For any pool with the 8GB disks, every disk will only use 8GB. If you replace the 8GB disks with 16GB ones, it will then use 16GB on all of them. So if you are going to upgrade the disks soon, it won’t remain an issue.
In the longer term you may be able to expand pools without trashing everything, but that has been a while coming, so best not to rely on that option.
Hope that helps!

I assumed this was still the case but I thought it was better to ask in case something had changed recently.

Well, the same-size constraints are true for vdevs, but not entire pools. You can do something like 3x8GB raidz1 + 4x16GB raidz1 in the same pool just fine.
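For reference, that’s a single zpool create with two raidz1 vdevs; a rough sketch, with made-up device paths you’d swap for your actual drives:

    zpool create tank \
      raidz1 /dev/disk/by-id/ata-8TB-A  /dev/disk/by-id/ata-8TB-B  /dev/disk/by-id/ata-8TB-C \
      raidz1 /dev/disk/by-id/ata-16TB-A /dev/disk/by-id/ata-16TB-B /dev/disk/by-id/ata-16TB-C /dev/disk/by-id/ata-16TB-D

ZFS then stripes new writes across both vdevs on its own.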

What are your end goals for the data on this? That would help inform online redundancy, backup, and performance targets, and therefore what ZFS layouts might make sense.

2 Likes

A bit of everything, but mostly bulk storage, plus a bunch of servers and VMs. Small files up to a certain size will live on some faster disks.

The question to ask is: what are the benefits of having multiple pools? The very concept of a pool is that you only have one and feed it redundant block devices.

more vdevs = better
fewer vdevs = worse
wider RAIDZ = worse

Thanks for clarifying - I meant vdevs (even the Q asked that!)…

Correction: different sized disks in a raidz vdev. I’m not fluent in zfs yet.

1 Like

Why would you want to do this? You’ve got the disks to make multiple vdevs. A wider vdev just slows things down even more, with no apparent benefit.

Relevant…?

Don't actually do this...

Technically the vdevs may not be optimal, with chunk sizes and whatnot, but realistically it’s just testing and playing with files, right? Then go ahead.
If you actually have such small physical devices, they’re probably flash sticks/thumb drives, and there’s no harm in playing around.

I would say, make a raidz of the 3x8GB, and a raidz of the 4x16GB.
Technically the 4-wide raidz should perform better, but I doubt the USB bus will show much difference.
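And if it really is just for playing, you don’t even need thumb drives; sparse files work fine as throwaway vdevs (sizes and paths here are arbitrary):

    truncate -s 1G /tmp/zd1 /tmp/zd2 /tmp/zd3
    zpool create playpool raidz1 /tmp/zd1 /tmp/zd2 /tmp/zd3
    zpool status playpool
    zpool destroy playpool   # toss it when you're done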

1 Like

I know his videos and you can do this. But the drawbacks are significant: performance will tank, and small records will result in more space inflation, negating some of the benefits.
If you absolutely love 4K random reads/writes on your HDD, I can totally recommend doing this. Nothing will cut large blocks into 4K blocks better than this method.

And storage efficiency on RAIDZ is a lie. You never get the space you expect unless you exclusively store very large files.

OP mentioned drives at the GB level. I presumed it was simply for testing/messing/playing.

But, serious suggestions for actual best practices are good.
Perhaps link him to the other thread you are involved in (best practice and tips etc.)? I think this would be a good fit.

Best practice is buying drives of equal size, and using as many vdevs as possible until you hit your personal storage-efficiency threshold.

Well, I’d use mirrors tbh, but this is a RAIDZ thread, so…

Go for two vdevs. I don’t see any benefit in merging everything and making things slower.

Storage efficiency won’t be better, because you probably won’t run a 7-wide RAIDZ1, so two disks for redundancy stay the same either way.
In practice, space inflation on RAIDZ will net you less effective space.
And that’s with equal sized drives, so the situation here is even worse than that.
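Napkin math, nominal capacities before any padding losses:

    Two vdevs: 3x8TB RAIDZ1 + 4x16TB RAIDZ1 -> (3-1)*8TB + (4-1)*16TB = 64TB
    One 7-wide RAIDZ2, mixed sizes (every disk only counts as 8TB) -> (7-2)*8TB = 40TB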

1 Like

Honestly a cool concept.

Let’s play the puzzle game - how would you chunk those?

Allez optimizer!

I would do 12x8TB. The excess of the 12TB drives goes into a vdev with the 4TB solo drives. The remainder becomes a third vdev. That’s 104 TB. No vdev has less than 6 devices, so you’re getting space efficiency. Importantly no drive is part of more than 2 vdevs. In the video, you lose 1 of the 12 TB drives and you’ve degraded ALL 4 vdevs.
I’m sure it’s some named NP-complete problem or another. :grin:
I dare you to come up with a more optimal solution!

Also, why not tho?

It is. But it’s also very specific in which situations it’s actually useful.

Well, yes and no. RAIDZ capacity is a lie because a MB of data can and often will use more than a MB of space. Parity + padding are the buzzwords here. Depending on what data is stored on the pool and how you handle compression, you may well end up at 100% capacity after 65TB of user data. All hail the joys of RAIDZ!
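If you want to put numbers on the parity + padding thing, here’s a quick back-of-the-envelope script based on my understanding of how RAIDZ allocates (data sectors, plus one parity sector per stripe row, with the total rounded up to a multiple of parity+1). Treat it as an estimate, not gospel:

    # 4-wide RAIDZ1, ashift=12 (4 KiB sectors)
    sector=4096; width=4; parity=1
    for bs in 131072 16384 4096; do
      data=$(( bs / sector ))
      rows=$(( (data + width - parity - 1) / (width - parity) ))
      total=$(( data + rows * parity ))
      total=$(( (total + parity) / (parity + 1) * (parity + 1) ))   # pad to a multiple of parity+1
      echo "$bs B block -> $total sectors = $(( total * sector )) B on disk"
    done

That works out to roughly 1.38x raw for 128KiB blocks, 1.5x for 16KiB, and a full 2x for 4KiB, i.e. small blocks pay mirror-level overhead for RAIDZ-level performance.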

You trade performance for gross space efficiency. Nothing new. You can use 1TB partitions and get even more gross space efficiency by inflating the RAIDZ width even further. The reason you don’t do this is the same reason people don’t use this method in the first place.

A drive being part of more than a single vdev is reason enough to be very cautious. Not only for redundancy, but also for performance concerns.

I did it once for science, because the concept is fascinating and brings software-defined storage to a new level. In practice you just add complexity and weaken the pool for marginal gains.

The thread is named “best practice”. This certainly isn’t best practice anyone would recommend. This is more an experimental thing for desperate people.

Okay, first of all, parity is not a lie or wasted space. You either choose safety or space. It’s like the uncertainty principle in physics. The storage uncertainty principle?
On the other hand padding (and recordsize) is esoteric bullshit that nobody without a procurement department ever has a reason to care about.

How’s that less performant? That makes no sense.

Same question.
How much load balancing does ZFS itself do, compared to the OS and the drives themselves?
Take a look at the part of the stack that filesystems occupy.

Getting rid of some of the inflexibility of ZFS is not a marginal gain.
Although I will put money on fusion coming before RAID-Z expansion.
Coming soon™

Ok, so, Aspie ZFS rant time and I don’t have an English degree, so buckle up.

So, for the most part, for drives over 2TB just don’t use RaidZ1; move on to RaidZ2, unless you like to live dangerously. I have done it, and I have access to a good data recovery lab, but it still nagged at the back of my mind for years until I spun up a new pool in RaidZ2 and migrated all the data from the old pool to the new one.

Generic advice for vdev sizing gets a bit fun, but for RaidZ2 I usually stick to 6 disks per vdev and stripe vdevs as I need them. This means that if I want more space down the road and drives are way bigger by then (think 8TB → 18TB), I “only” have to replace 6 drives at a time if I do not have empty HDD bays to just add more vdevs to. Example gsheets below.

(Example capacity spreadsheets: one laid out for 4K-sector drives, one for 512-byte-sector drives.)
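Growing that way is just a zpool add of another 6-disk RaidZ2 whenever the bays and budget line up; a sketch with placeholder device names:

    zpool add tank raidz2 \
      /dev/disk/by-id/ata-18TB-1 /dev/disk/by-id/ata-18TB-2 /dev/disk/by-id/ata-18TB-3 \
      /dev/disk/by-id/ata-18TB-4 /dev/disk/by-id/ata-18TB-5 /dev/disk/by-id/ata-18TB-6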

NOW, the fun part: if you are using compression (AND YOU SHOULD, 99% OF THE TIME) you will have allocations smaller than your recordsize, which means you may have some small holes to fill later in the pool’s life. E.g., you write a text file that compresses to 32KiB of your 128KiB recordsize and you don’t have anything else to fill that record, then you come right behind it and write a video file or a large picture or something; that leftover space gets skipped because ZFS wants to write full 128KiB records sequentially, so you “lose” that 96KiB for now, since ZFS is trying to avoid fragmentation. Now let’s say your pool is older, it’s been running for a while, stuff has landed all over the place over the years, and you’re 80+% full. ZFS now has to “find” places to cram data into the pool, and it is smart enough to use whatever it finds in the vdev, so it starts filling all those little holes everywhere, and that data is slower to retrieve than older data that was written contiguously. This is ZFS fragmentation. There are ways to fix it, it just takes time. (You have to make a new pool or dataset and move the data from one to the other; then you can move it back if you want, after clearing snaps and waiting for ZFS to free the old blocks.)
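The “move it and move it back” part is usually just snapshot + send/receive into a fresh dataset; a sketch, with made-up dataset names:

    zfs snapshot tank/stuff@rewrite
    zfs send tank/stuff@rewrite | zfs receive tank/stuff_new
    # verify, destroy the old dataset and its snaps, then rename tank/stuff_new back into place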

Now, the way ZFS load balances data across vdevs is that whichever vdev is fastest gets more of the transaction groups to commit to disk. So if you have a 6-disk 8TB vdev that is 50%+ full and you add a new vdev of 6 18TB disks, the new disks will be faster until they fill up to ~50%, where spinning rust tends to get slower. So maybe the 8TB vdev gets 33-40% of the writes and the 18TB vdev gets 60-67% of the writes; over time it will balance out.
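You can watch that imbalance happen with the per-vdev stats:

    zpool list -v tank       # capacity and allocation per vdev
    zpool iostat -v tank 5   # live read/write distribution per vdev, every 5 seconds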

Now, if you have a crapton of disks you can play with something called dRAID, a newer ZFS pool layout/type that more or less chunks up the disks and gives you distributed virtual hot spares and all kinds of neat crap.
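For flavor, a dRAID vdev is declared as one string (parity, data disks per group, total children, distributed spares); something like this if I have the syntax right, with placeholder disk names:

    # 14 disks: double parity, 4 data disks per group, 2 distributed spares
    zpool create bigtank draid2:4d:14c:2s /dev/disk/by-id/ata-BIGDISK-{01..14}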

Ok, now that we have some of the basics out of the way: if you have 3 8TB and 4 16TB disks and you want to hermit crab it up at a later date, aka have more space later, the cheapest way would be to make a single RaidZ2 of the 4 16TB + 2 8TB, then keep that last 8TB handy because it is now your cold spare in case a drive dies. Then when a good deal comes up for 2 16TB drives (hopefully same make/model within the same vdev, but that’s mostly just for OCD) you snatch them up, replace one 8TB drive at a time, and hope the rebuild process does not kill a second disk from the load. But it is a RaidZ2, so you can lose 2 drives and still rebuild that vdev. Once that process is done, if the property “autoexpand=on” is set, the pool will swell to its new size all on its own after the final resilver.
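The swap-and-grow part is just (placeholder disk names, and one drive at a time):

    zpool set autoexpand=on tank
    zpool replace tank ata-8TB-OLD-1 ata-16TB-NEW-1
    zpool status tank   # let the resilver finish before touching the next drive
    zpool replace tank ata-8TB-OLD-2 ata-16TB-NEW-2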

Also, if you really want a pool for running VMs and stuff off of, grab a bunch of “cheap” 1-2TB SATA SSDs and make a pool; best practice would be 2x2x2x2 etc. (striped mirrors). But SATA is garbage half-duplex trash, so if you have the cash to burn, grab some SAS2/SAS3 SSDs for the VM pool, or some U.2/U.3 drives that are oddly cheap on eBay, or even something like the cheap ASUS Hyper M.2 x16 Gen 4 card crammed full of cheap NVMe drives to make a pool; just make sure your motherboard supports PCIe bifurcation where you plan on plugging it in.
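For the record, 2x2x2x2 just means striped mirrors, i.e. something along these lines (placeholder devices), and you can keep bolting on more mirror pairs later with zpool add:

    zpool create vmpool \
      mirror /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B \
      mirror /dev/disk/by-id/ssd-C /dev/disk/by-id/ssd-D \
      mirror /dev/disk/by-id/ssd-E /dev/disk/by-id/ssd-F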

I think that is most of what I wanted to autism about, sorry for the wall of random train of thought. Let me know if I glossed over something.

2 Likes

If my writing is too crap, here is one of the better resources out there; it’s more or less what I said, just written way better.

1 Like

I’ve got a question for ya @gcs8 if you have some time. I’ve got an older HP ProLiant DL380p Gen8. I’ve upgraded it to a 10-core Xeon and have room for a second, but it is just the one for now. It’s got 64GB of RAM and 8 x 1.2TB SAS drives. I’m also cramming 3 Optane M.2s onto a PCIe 3.0 riser in the back.

Currently it is in a hardware RAID 5 and I’ve been using it to study for cybersec certs, so I make new VMs and delete them all the time. I plan on rebuilding Proxmox and the VMs on it when the Optane drives get here.

In your opinion, is it better to ditch the hardware RAID 5 and configure Proxmox to do a ZFS RAID1 (mirror) for the install, with the rest in RAIDZ1? I’m going to keep the Optane drives in a JBOD config.

So:
RAID1 - Proxmox root
RAIDZ1 - VM disk pool (6 drives)
3 Optanes - JBOD

The Optanes are going to be used for AC Hunter CE for log ingestion… so a lot of reads/writes. Would this setup be optimal for VM performance?

RAIDZ, or any parity RAID, isn’t optimal for VM disks. With VMs you usually deal with small block sizes. Parity RAID cuts the data into small pieces and distributes them across the disks. This makes small blocks even smaller → more random I/O → bad performance. Use RAID 0, 1, or 10 for the best random I/O.
But usually you cache most of the important small stuff, so ZFS can bypass this problem for the most part. DRAM can solve a lot of deficiencies. So can a SLOG for your (sync) writes. Hardware RAID can’t do these tricks.
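With those Optanes on hand, a mirrored SLOG is a one-liner (placeholder device names); whether it actually helps depends on how much sync write traffic the VMs generate:

    zpool add vmpool log mirror /dev/disk/by-id/nvme-OPTANE-1 /dev/disk/by-id/nvme-OPTANE-2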

mirrored boot is never a bad choice.