A (*totally unofficial*) Conversation about Relative ZPool Performance


DISCLAIMER:

I am not an authority on ZFS. I’ve made several generalizations and assumptions here that may not be accurate. Please, if you find discrepancies, join the conversation and help correct them. This is a collaborative effort for the benefit of all.


Introduction:

I’ve been a member of this community for a while, and I often notice newcomers seeking digestible insights into ZFS. So, here’s my attempt to shed some light. Feedback from forum regulars and ZFS experts would be greatly appreciated! I’ve standardized everything around 8 vdevs for simplicity, but there’s no particular reason I chose that number.


Assumptions:

Baseline HDD Performance: An 8-drive stripe equals 100% performance, or 1,200 MiBps for both read and write. These figures assume sequential operations, averaging ~150 MiBps from a single hard drive.

Baseline HDD IOPS: An 8-drive stripe equals a baseline of 800 IOPS for both read and write, averaging ~100 IOPS from a single hard drive.

Baseline SSD Performance: An 8-drive stripe equals 100% performance, or 4,000 MiBps for both read and write. These figures assume sequential operations, averaging ~500 MiBps from a single solid state drive.

Baseline SSD IOPS: An 8-drive stripe equals a baseline of 400,000 IOPS for both read and write, averaging ~50,000 IOPS from a single solid state drive.

RAIDZ3 Assumptions:

  • Sequential Read: 90% of an 8-drive stripe.
  • Sequential Write: 60% due to three parity calculations.
  • Random Write: 50%.
  • Random IOPS: 50%.

RAIDZ2 Assumptions:

  • Sequential Read: 90% of stripe (rounded down from 92%).
  • Sequential Write: 65%.
  • Random Write: 55%.
  • Random IOPS: 55%.

RAIDZ1 Assumptions:

  • Sequential Read: 95% of stripe.
  • Sequential Write: 75%.
  • Random Write: 65%.
  • Random IOPS: 65%.

Mirroring Assumptions:

  • Read: 100% (same as stripe).
  • Sequential Write: 90%.
  • Random Write: 80%.
  • Random IOPS: 80%.
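
To make the arithmetic behind these percentages explicit, here’s a rough shell sketch (my assumptions plugged into math, not a benchmark). The per-drive number and scaling factors are just the assumptions above, and I’ve used 4 vdevs so the output lines up with the charts below; adjust to taste.

#!/usr/bin/env bash
# Back-of-the-envelope only: multiply per-drive throughput by vdev count and
# by the assumed topology scaling factor (sequential write, from the lists above).
PER_DRIVE_MIBPS=150   # assumed sequential MiBps for one HDD
VDEVS=4               # pool performance scales with vdev count, not disk count

declare -A SEQ_WRITE_PCT=( [stripe]=100 [mirror]=90 [raidz1]=75 [raidz2]=65 [raidz3]=60 )

for topo in stripe mirror raidz1 raidz2 raidz3; do
    mibps=$(( VDEVS * PER_DRIVE_MIBPS * ${SEQ_WRITE_PCT[$topo]} / 100 ))
    printf '%-7s ~%d MiBps assumed sequential write across %d vdevs\n' "$topo" "$mibps" "$VDEVS"
done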

ZFS HDD Performance Comparison:
ZFS performance and capacity are dependent on each VDEV, not on each disk.
This representation is designed to more clearly show that the number of disks required to maintain baseline performance grows quickly.

Read Performance (MiBps) - Total Disks:
|------------------ 100% (600 MiBps)
|█████████████████████████ 4 Drive Stripe (4 disks)
|███████████████████████ 4 vdev Mirror (8 disks)
|██████████████████████ 4 vdev RAIDZ1 (12 disks)
|█████████████████████ 4 vdev RAIDZ2 (16 disks)
|████████████████████ 4 vdev RAIDZ3 (20 disks)

Sequential Write Performance (MiBps) - Total Disks:
|------------------ 100% (600 MiBps)
|█████████████████████████ 4 Drive Stripe (4 disks)
|███████████████████████ 4 vdev Mirror (8 disks)
|██████████████████████ 4 vdev RAIDZ1 (12 disks)
|█████████████████████ 4 vdev RAIDZ2 (16 disks)
|████████████████████ 4 vdev RAIDZ3 (20 disks)

Random Write Performance (MiBps) - Total Disks:
|------------------ 100% (600 MiBps)
|█████████████████████████ 4 Drive Stripe (4 disks)
|███████████████████████ 4 vdev Mirror (8 disks)
|██████████████████████ 4 vdev RAIDZ1 (12 disks)
|█████████████████████ 4 vdev RAIDZ2 (16 disks)
|████████████████████ 4 vdev RAIDZ3 (20 disks)

Read IOPS - Total Disks:
|------------------ 100% (400 IOPS)
|█████████████████████████ 4 Drive Stripe (4 disks)
|███████████████████████ 4 vdev Mirror (8 disks)
|██████████████████████ 4 vdev RAIDZ1 (12 disks)
|█████████████████████ 4 vdev RAIDZ2 (16 disks)
|████████████████████ 4 vdev RAIDZ3 (20 disks)

Sequential Write IOPS - Total Disks:
|------------------ 100% (400 IOPS)
|█████████████████████████ 4 Drive Stripe (4 disks)
|███████████████████████ 4 vdev Mirror (8 disks)
|██████████████████████ 4 vdev RAIDZ1 (12 disks)
|█████████████████████ 4 vdev RAIDZ2 (16 disks)
|████████████████████ 4 vdev RAIDZ3 (20 disks)

Random Write IOPS - Total Disks:
|------------------ 100% (400 IOPS)
|█████████████████████████ 4 Drive Stripe (4 disks)
|███████████████████████ 4 vdev Mirror (8 disks)
|██████████████████████ 4 vdev RAIDZ1 (12 disks)
|█████████████████████ 4 vdev RAIDZ2 (16 disks)
|████████████████████ 4 vdev RAIDZ3 (20 disks)

ZFS HDD Performance Comparison for 12 Disks (Logarithmic Scale):
Note: normalized to 12 disks instead of to a VDEV layout.
Using a logarithmic scale to emphasize the performance drop-off for different topologies.

Read Performance (Logarithmic Scale):

|------------------ 100% Baseline
|█████████████████████████ 12 Drive Stripe
|███████████████████████▒ 6 vdev Mirror
|█████████████████████▒▒ RAIDZ1 (3 vdevs of 4 drives each)
|███████████████████▒▒▒ RAIDZ2 (3 vdevs of 4 drives each)
|█████████████████▒▒▒▒ RAIDZ3 (3 vdevs of 4 drives each)

Sequential Write Performance (Logarithmic Scale):

|------------------ 100% Baseline
|█████████████████████████ 12 Drive Stripe
|███████████████████████▒ 6 vdev Mirror
|████████████████████▒▒▒ RAIDZ1 (3 vdevs of 4 drives each)
|██████████████████▒▒▒▒ RAIDZ2 (3 vdevs of 4 drives each)
|████████████████▒▒▒▒▒ RAIDZ3 (3 vdevs of 4 drives each)

Random Write Performance (Logarithmic Scale):

|------------------ 100% Baseline
|█████████████████████████ 12 Drive Stripe
|████████████████████▒▒▒ 6 vdev Mirror
|██████████████████▒▒▒▒ RAIDZ1 (3 vdevs of 4 drives each)
|████████████████▒▒▒▒▒ RAIDZ2 (3 vdevs of 4 drives each)
|██████████████▒▒▒▒▒▒ RAIDZ3 (3 vdevs of 4 drives each)

IOPS (Logarithmic Scale):

|------------------ 100% Baseline
|█████████████████████████ 12 Drive Stripe
|█████████████████████▒▒ 6 vdev Mirror
|███████████████████▒▒▒ RAIDZ1 (3 vdevs of 4 drives each)
|██████████████████▒▒▒▒ RAIDZ2 (3 vdevs of 4 drives each)
|████████████████▒▒▒▒▒ RAIDZ3 (3 vdevs of 4 drives each)

ZFS Topologies Visualization :bar_chart:


Legend (assuming 1TB disks):
:dvd::dvd: = Disk
:link::link: = Parity/Mirror
─────────────────────────────────────────────────────────────────────────────

Striping (4 Drives)
| :dvd: | :dvd: | :dvd: | :dvd: |
Total Drives: 4
Pool Size: 4TB
Raw Size: 4TB
─────────────────────────────────────────────────────────────────────────────

Mirroring (4 vdevs of 2 drives each)
| :dvd::link: | :dvd::link: | :dvd::link: | :dvd::link: |
Total Drives: 8
Pool Size: 4TB
Raw Size: 8TB
─────────────────────────────────────────────────────────────────────────────

RAIDZ1 (3 Disks per vdev, 4 vdevs wide)
| :dvd: :dvd: :link: | :dvd: :dvd: :link: | :dvd: :dvd: :link: | :dvd: :dvd: :link: |
Total Drives: 12
Pool Size: 8TB
Raw Size: 12TB
─────────────────────────────────────────────────────────────────────────────

RAIDZ2 (4 Disks per vdev, 4 vdevs wide)
| :dvd::dvd: :link::link: | :dvd::dvd: :link::link: | :dvd::dvd: :link::link: | :dvd::dvd: :link::link: |
Total Drives: 16
Pool Size: 8TB
Raw Size: 16TB
─────────────────────────────────────────────────────────────────────────────

RAIDZ3 (5 Disks per vdev, 4 vdevs wide)
| :dvd::dvd::dvd: :link::link::link: | :dvd::dvd::dvd: :link::link::link: | :dvd::dvd::dvd: :link::link::link: | :dvd::dvd::dvd: :link::link::link: |
Total Drives: 20
Pool Size: 8TB
Raw Size: 20TB
─────────────────────────────────────────────────────────────────────────────
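
For anyone who wants to sanity-check the pool sizes above, here’s a quick shell sketch of the capacity math. It assumes 1TB disks (as the legend implies) and ignores slop space and metadata overhead.

#!/usr/bin/env bash
# usable = vdevs * (disks_per_vdev - parity_per_vdev) * disk_size
DISK_TB=1
calc() {  # calc <label> <vdevs> <disks_per_vdev> <parity_per_vdev>
    local raw=$(( $2 * $3 * DISK_TB ))
    local usable=$(( $2 * ($3 - $4) * DISK_TB ))
    printf '%-16s raw %2d TB, usable %2d TB\n' "$1" "$raw" "$usable"
}
calc "Stripe (4x1)"  4 1 0
calc "Mirror (4x2)"  4 2 1   # a 2-way mirror yields one disk's worth per vdev
calc "RAIDZ1 (4x3)"  4 3 1
calc "RAIDZ2 (4x4)"  4 4 2
calc "RAIDZ3 (4x5)"  4 5 3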

4 Likes

Edited for better formatting.

I like what you’ve done, but I think the average user that’s looking to make a ZFS pool is probably more limited in the number of drives they can use (40 drives is a big server). For me, the number of drive bays I have is my second limiting factor after the budget of actually buying drives. Might I suggest a second section where you take a fixed number of drives and carve them into various vdev/parity configs.

For example: 12 drives
12 vdevs of 1 disk YOLOs
6 vdevs of 2 disk Mirrors
4 vdevs of 3 disk RAIDZ1s
3 vdevs of 4 disk RAIDZ2s (EDIT: actually this is just a worse version of a pool of mirrors)
2 vdevs of 6 disk RAIDZ3s (EDIT: and this one is even worse)
1 vdev of 12-disk RAIDZ1-3

Your math here isn’t adding up. RAIDZ2 5 disks X 8 vdevs = 40 disks, and illustrated 5 drives + 2 parity, which sounds like a 7 disk vdev. Same problems with RAIDZ3, 7 X 8 = 56 and you illustrated a 10 disk vdev.

Yeah I’ll fix that…there are errors in my info-graphic my bad! :slight_smile:

But the reason this was normalized to 8 VDEVs and not a specific number of drives was an attempt to illustrate the negative performance impact of parity, which I’d imagine is something new users don’t quite understand. It also illustrates the point that ZFS performance is DRIVEN by vdevs, not drives. In other words… you need a lot more drives than a newcomer might think to make a RAIDZ2 come close to the performance of 8 striped drives…

So 8 was a number I thought would really show that… where 40 disks perform worse than 8 just given the topology and no other factors. Although maybe I will change it and normalize to 4 for simplification.

I do also think I should add another set of graphs normalized to something like 12 disks instead of being normalized to vdevs, so I hear where you are coming from.

2 Likes

Yeah, just trying to help make the info-graphic work, it’s a good visual.

I understand the idea with the parity, but I thought that was more of a CPU overhead problem? We live in an age where CPU power is nearly infinite, at least as far as parity calculations for spinning rust go. :person_shrugging: Like, where does the reduced bandwidth actually come from in these 8-vdev examples?

1 Like

Yeah - my assumptions may be wrong because of relative CPU power - I’m looking for recommendations on the relative %s between the types in that format, so if you have any suggestions I’m happy to hear ’em.

Should look better now :slight_smile:

Would raidz/2/3 fare better or worse with fewer drives per vdev? Or does it not matter?

As drives become larger, the expectation of more than one drive failing during a rebuild/heavy workload increases, so z2 and z3 are more important.
But with so few drives in a vdev, it seems like, in the real world, one would rather choose wider mirrors?

I understand you want multiple vdevs, and originally proposed a ridiculous 40 drives in a z3 setup.

Not sure what access to drives you have, but have you done any numbers yet? Even on 12 drives?

I liked back when the calomel guy tried out different vdev sizes, but IIRC, only 1 vdev per pool?

I like the idea of multiple vdevs

I feel many storage builders are preaching the wrong solution to their target audience. I mean, their solution may not be wrong as a tech demo, but whether it meets the target audience’s needs is very questionable.

Take this forum as an example. The need varies widely but is mostly for home or personal use. My home data archive is less than 20TB, which is very easy for me to manage. Hence, I don’t fall into the category of needing a storage system, nor would I desire to run one 24/7. But I realize some people may have hoarded lots more video/movies etc. and need a storage system to house them.

For such a problem and for these people, the priorities for them IMO in descending order are:

  • easy storage expansion
  • low fixed cost to start with; low additional cost to expand storage
  • reliable data integrity protection
  • somewhat good data availability; if you have to take the storage system offline for a few hours or a day or two, not a big deal.

Given these criteria, what will be your (not just OP but readers of this thread) recommended ZFS layouts?

1 Like

It depends. But generally, more drives per VDEV make the VDEV faster. The pool is only as fast as the slowest VDEV.

I have a system with a lot of drives I will be using to try and bring these numbers from the theoretical into reality.

I like Mirrors. You just add two drives at a time. They perform the best and it’s easy to grow. They also open up some doors and are easier to manage in some lower-level tasks. Obviously the downside is that usable capacity is capped at 50%.

3-disk RAIDZ1 is another good choice for the use case. You have to buy 3 disks at a time, but you get 67% usable space, and the performance may be fine for what you are doing.

In general, I see a lot of folks in the homelab community just make one big vdev and call it a day. That’s not good systems engineering practice, given how we know ZFS works. It also makes it more difficult to expand. In fact, until RAIDZ expansion drops, it’s likely impossible for many, given the physical constraints of their chassis.

Using my example, 12 drives should fit in most used server chassis and full-size ATX cases. These are both pretty good/valid choices IMO:

Mirroring (4 vdevs of 2 drives each)
| :dvd::link: | :dvd::link: | :dvd::link: | :dvd::link: |
Total Drives: 8
Pool Size: 4TB
Raw Size: 8TB
─────────────────────────────────────────────────────────────────────────────

RAIDZ1 (3 Disks per vdev, 4 vdevs wide)
| :dvd: :dvd: :link: | :dvd: :dvd: :link: | :dvd: :dvd: :link: | :dvd: :dvd: :link: |
Total Drives: 12
Pool Size: 8TB
Raw Size: 12TB
─────────────────────────────────────────────────────────────────────────────

1 Like

Better or worse in what regard? In terms of performance, you’ll get greater throughput with more drives for a sequential workload, but your IOPS will remain fixed at roughly 1 drive. In a rebuild scenario, a wider vdev might have an edge in that the greater number of drives will have an easier time saturating the write bandwidth of the replacement drive.

Well really it’s dealer’s choice. One of the wonderful things about ZFS is you can lay it out however works best for you. You want the most space possible, 1 massive RAIDZ1 will do ya. You want the most IOPS with some redundancy, pool of mirrors is for you. Need maximum data redundancy, you can make a 5 drive mirror if you want. The recipe is up to you.

My recommended layout is a pool of mirrors. Reason being, I want an array that can do anything (host VMs and store files), and this gives me more spindles for IOPS. Also, if I need more space, I just add more mirror vdevs. You can’t change the width of a RAIDZ vdev once it’s created.
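
For illustration only, growing a pool of mirrors two drives at a time looks roughly like this (pool name and device names are placeholders, check them against your own system):

# Create a pool with two 2-way mirror vdevs.
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

# Later, grow the pool by adding another mirror vdev -- just two more drives.
zpool add tank mirror /dev/sde /dev/sdf

# Confirm the new vdev shows up.
zpool status tank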

3 Likes

Re: vdev size/width; I was thinking there was a neat multiplier per z2/z3 for the optimal number of drives.

IIRC, the calomel blog tried fewer and more drives, to see if there is a “best” number for a kind of optimal stripe size.

So if recordsize is like 4k and you have 4 data + parity drives, then each data drive gets a 1k chunk, and each parity drive gets parity.

I was suspecting that a z3 of 4 disks would not be a representative number of providers, as 3 drives receive a parity chunk and only 1 drive receives the data chunk.

Does that make any sense?

Whereas with, like, 6 data + parity drives, the block is broken up and each drive receives a smaller chunk (so quicker).

Obviously multiple vdevs give parallel speed enhancements

Again, with the recordsize set appropriately

(parity mixed among drives, so no dedicated actual parity drives)
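
If I’m reading you right, the underlying math is something like the sketch below: each record is split across the data disks of a vdev, so with very few data disks each one receives a large chunk (and in the 4-wide z3 case, the single data disk gets the whole record). Numbers are illustrative only, ignoring padding and compression.

#!/usr/bin/env bash
RECORDSIZE=$(( 128 * 1024 ))   # default 128 KiB recordsize
for layout in "raidz1 4" "raidz2 6" "raidz3 4" "raidz3 7"; do
    set -- $layout             # $1 = raidz level, $2 = total disks in the vdev
    parity=${1#raidz}          # raidz1 -> 1, raidz2 -> 2, raidz3 -> 3
    data=$(( $2 - parity ))
    echo "$1 with $2 disks: $data data disk(s), ~$(( RECORDSIZE / data / 1024 )) KiB of data per disk per record"
done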

1 Like

Blog with single vdev pools, various sized vdevs:

https://calomel.org/zfs_raid_speed_capacity.html

That was made before NVMe really took off, iirc

And I only mean that it might be unfair on z3 to try with too few disks, where the pain point of a more complex algo is exacerbated by fewer providers.

1 Like

The popularity of NAS at home actually isn’t that high. Among those homes with a NAS, the most common choice is an off-the-shelf 4-bay NAS. For a custom NAS, HDD slots are more flexible; however, more slots increase the initial fixed cost. A bigger box also occupies more floor or room space, which could be very costly to some people. So while not explicitly mentioned in the problem statement, perhaps we shall limit the HDD slots to 4 or fewer.

Mirror VDEVs seem to be a popular suggestion. A user can start with two HDDs and grow storage capacity by adding another mirror VDEV. I think it’s easy and low cost because the two HDDs in the existing VDEV can continue to serve.

Though the 50% HDD utilization perhaps raises the per-GB effective cost.

When both mirror VDEVs need storage expansion, the user can replace the two HDDs in the older VDEV with higher-capacity HDDs. I think ZFS supports, or will soon support, VDEV capacity expansion.

The cycle repeats itself as the user needs more space. Does it sound like a good strategy? For use cases that outgrow a single high-capacity HDD but hold less than 50TB of archive data (as of 2023)? Assume the user’s “natural” data growth rate falls below the growth rate of the industry’s leading HDD capacity. I think this storage system could last a very long time.

Is a 4-disk raidz1 VDEV a thing? If it fits under the four-HDD-slot limit, it sounds like a more attractive alternative for users with a larger archive size or a faster projected data growth rate than two mirror VDEVs could satisfy.

Also the HDD utilization is now 75% (?). Cheaper per-GB effective cost.

The expansion strategy will depend on the same new ZFS feature. The downside is that four HDDs (instead of two in a mirror VDEV) have to be replaced in a single space upgrade cycle. I anticipate the upgrade cycle would be longer, which offsets the expansion cost a bit.

Does it even make some sense?

Right. At the other end of the spectrum, over engineering. Perhaps a big NAS with many HDDs spinning. Otherwise a 99.9% idle storage system. That doesn’t sound nice and doesn’t serve a meaningful purpose.

What’s actually the “Best Practice” in terms of the ratio of occupied storage size to left-over available storage size when building a new storage system?

ZFS supports this today, and has supported it for some time. You can replace any vdev drive with a larger one, and once you’ve replaced every drive in the vdev, the vdev (and pool) capacity will grow to the new drive size. This takes a lot of time, as you effectively have to resilver after every drive change, but it totally works (mirrors or RAIDZ). What you can’t do today is change the width of a RAIDZ vdev after you create it. A 4-disk RAIDZ1, for example, cannot have a disk added to it to make it a 5-disk RAIDZ1.
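
For reference, that grow-by-replacing workflow is roughly the following (pool and device names are placeholders, and each replace triggers a full resilver):

# Let the vdev grow automatically once every disk in it is larger.
zpool set autoexpand=on tank

# Replace one drive at a time and wait for the resilver to finish.
zpool replace tank /dev/sdc /dev/sdx
zpool status tank     # watch resilver progress before touching the next drive

# Repeat for each remaining drive in the vdev; when the last one finishes,
# the vdev (and the pool) grows to the new drive size.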

Yes, absolutely a thing. And in your 4 slot limit scenario, it’s a fine option. You’re giving up IOPS and rebuild speed for capacity, but at 4 drives that’s fine.

Some of us use our home labs as a place to practice and learn our professional skills. I don’t like idle hardware wasting power as much as the next guy, but sometimes you gotta pay to play. I’ve mitigated some of this issue in my personal setup by consolidating all my servers that previously were a separate box into VMs on my storage server. Instead of a storage server, a router, and a couple of raspberry pis I have just the storage server running. The “Forbidden Router” as it is known around here. :stuck_out_tongue_closed_eyes:

2 Likes

“It depends”
4-disk raidz1 is a thing. RAIDZ is more flexible than traditional RAID in its disk count requirements. I would say, though, that a 3-disk RAIDZ1 should be the approach when/if you have a chassis that supports drive configurations in multiples of 3… commonly you’ll see 12-drive servers these days for pretty cheap. A lot of full-size ATX cases too.

You’ll notice that 4 also divides evenly into 12. So you can choose the extra capacity you’ll gain by using 4-wide Z1, or you can choose the additional performance and safety of 3-wide Z1. Heck, you can even choose MORE safety with 2x 6-wide Z2s… at the cost of even more performance. You’ll also notice that the more disks we allocate per VDEV, the higher the upfront cost each time we go to grow our system.

The increased flexibility of 3-drive vdevs, and the performance gains of additional vdevs as you grow, make 3-drive vdevs more preferable in those situations. If you are using a system that only has 4 drive bays, like a lot of small NASes do… a 4-drive RAIDZ1 is fine.

This is another “it depends”.

Typically (and if you are shopping as an enterprise customer you will hear this), you should assume you want double the amount of space you KNOW you need, so you have room to grow. You also should choose a topology that makes it easy enough to grow, like we talked about above.
That logic trickles down to us plebs also, IMO. ZFS really does not like to be full; it’s copy-on-write. Your pool fragmentation goes bananas depending on how full it is and what record size you have. Plan to not let your pool ever go over 80%… but you can probably eke out 85-90% before you see any real issues. It depends on your risk aversion how full you want to let it get. So if you KNOW you need 10 TiB, plan for 20… because you can only really use ~7TiB before you start seeing alarm bells.
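
A quick way to keep an eye on that threshold (the pool name is a placeholder):

# Size, allocated, free, percent used, and fragmentation for the pool.
zpool list -o name,size,allocated,free,capacity,fragmentation tank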

A Fun Sidebar on Sizes and space: :laughing:
A 10 TB Hard Drive is really only a bit over 9 TiB. Pay attention to the additional i in TiB

The base-10 (decimal) system, where:
  • 1 kilobyte (KB) = 1,000 bytes
  • 1 megabyte (MB) = 1,000 KB = 1,000,000 bytes
  • 1 gigabyte (GB) = 1,000 MB = 1,000,000,000 bytes

The base-2 (binary) system, where:
  • 1 kibibyte (KiB) = 1,024 bytes
  • 1 mebibyte (MiB) = 1,024 KiB = 1,048,576 bytes
  • 1 gibibyte (GiB) = 1,024 MiB = 1,073,741,824 bytes
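
The gap compounds at every prefix step, which is why it looks so large at the terabyte scale. A quick shell check:

echo $(( 10 * 1000 ** 4 ))   # 10 TB  = 10,000,000,000,000 bytes (drive marketing)
echo $(( 10 * 1024 ** 4 ))   # 10 TiB = 10,995,116,277,760 bytes (what most tools report)
# A "10 TB" drive expressed in TiB: 10 * 1000^4 / 1024^4 ~= 9.09 TiB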

Because of history, and fun, we’ve ended up with multiple definitions of what a “gigabyte” (and every other size prefix) means.


SSDs (and flash storage in general) make this worse by having weird sizes in the other direction. They are sometimes marketed as 128GB but they actually mean 128GiB… In other cases they kinda lie.

A fun way to prove this? Take an 8GB flash drive from two or three different manufacturers. Try to dd exactly 8GB to them. Then try to dd 8GiB to them. On top of the weird problems with GB vs GiB, you’ll likely find those flash drives are neither 8GB nor 8GiB, but somewhere in between. Let’s refer to this as the fudge factor.
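
If you want to reproduce that experiment, the rough idea is below. This is destructive to whatever is on the target, and /dev/sdX is a placeholder for your sacrificial flash drive.

# DESTRUCTIVE: overwrites the target device.
# Write exactly 8 GB (decimal) = 8,000,000,000 bytes:
dd if=/dev/zero of=/dev/sdX bs=1000000 count=8000 status=progress

# Write exactly 8 GiB (binary) = 8,589,934,592 bytes:
dd if=/dev/zero of=/dev/sdX bs=1M count=8192 status=progress

# If dd stops early with "No space left on device", the byte count it reports
# is the drive's real size -- usually somewhere in between the two.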

How do I know all of this? I had to dd thousands of flash drives and SSDs over the last few years. Don’t ask. But also don’t take my word for it; feel free to audit my claim. I didn’t know if this same phenomenon existed with HDDs until much more recently. But it does, and probably for different reasons.

I have a hodge-podge of shucked drives and used SAS drives in my pool at home. TrueNAS says this:

All the drives seem the same at first glance:

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Ultrastar He10/12
Device Model:     WDC WD100EMAZ-00WJTA0
Serial Number:    JEKRH59Z
LU WWN Device Id: 5 000cca 267f47fa0
Firmware Version: 83.H0A83
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH721010AL4200
Revision:             A9G0
Compliance:           SPC-4
User Capacity:        10,000,831,348,736 bytes [10.0 TB]
Logical block size:   4096 bytes
LU is fully provisioned

:melting_face: The keen-eyed among you may already see where this is going.
But if you dig just a bit deeper:

Disk /dev/sdk: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdj: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdn: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdt: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
**Disk /dev/sdb: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors**
Disk /dev/sdl: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdm: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdg: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdi: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdu: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdv: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
**Disk /dev/sdy: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors**
Disk /dev/sdp: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdc: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdq: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdh: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdo: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sds: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdr: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdx: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sde: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdw: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdf: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors
Disk /dev/sdd: 9.1 TiB, 10000831348736 bytes, 2441609216 sectors

Because the quantity of sectors on 4Kn and 512-byte-logical drives is different, there is even more to the story than one might first think. TrueNAS is picking up on something I’m not… because the math checks out in that update. So in this case TrueNAS seems to be alarming just because of the different logical sector sizes, and I should figure out if I can change those two disks to 4K. Or there’s an alien conspiracy. :alien:

Then, add the filesystem. In this case ZFS will inherently use some space.
Let me provide a hypothetical example:

  • In RAIDZ1, one of the disks’ worth of space is reserved for parity. So, with 4 x 1TB drives, you’ll effectively have 3TB of usable space.
  • ZFS reserves a small portion of the pool (1/32 by default, controlled by the spa_slop_shift tunable) as its “slop space”. This is used to ensure that ZFS doesn’t run out of space for its administrative tasks.

For a 3TB setup, this would be around 93,750,000,000 bytes.
Converting this to GiB (using the base-2 measurement): 93,750,000,000 bytes ÷ 1,073,741,824 bytes/GiB ≈ 87 GiB of additional overhead.
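
A rough shell version of that estimate (the 1/32 ratio is the default spa_slop_shift=5; treat the exact figure as approximate):

POOL_BYTES=$(( 3 * 1000 ** 4 ))   # 3 TB of usable space
SLOP=$(( POOL_BYTES / 32 ))       # default slop reservation
echo "$SLOP bytes ~= $(( SLOP / 1024 / 1024 / 1024 )) GiB held back as slop space"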

3 Likes

Expansion inside a VDEV will become a reality real soon [0].

Perhaps with a higher IOPS requirement, the preferred choice is to go with mirror VDEVs as you suggested. Since the storage system (as described in my problem statement) is for data archive & bulk data in a typical home environment, high IOPS is not a priority.


This brings up an interesting question: should ZFS be used as a system partition where the OS and applications are stored?

I would say why not? But I would also question why one has to go this way?

One common reason people quote to support putting the system partition on ZFS is the snapshot feature: when new updates break the installation, they say, simply revert back to a previous snapshot.

Sounds so convenient, but it seems to also add a bunch of unnecessary complexity and uncertainty. No?

Once upon a time, Apple wanted to use ZFS to replace its ageing HFS+. The story ended not with ZFS; instead Apple developed its own APFS. APFS has a snapshot feature just like ZFS. It seems that, up until today, Apple has been very cautious in using its snapshot feature.

I could see it being used for quick and temporary backup purposes, before permanent backup storage is online and able to receive those snapshots.

I see Apple is also using its APFS snapshot capability for system updates. But it seems to me it’s used for the purpose of sealing and signing off the system partition rather than for reverting back in case of update failure.

I get your point. But a 12-HDD-bay ATX case is pretty big already… that deviates from my clarified problem requirement.


To refresh my memory, I searched and looked up the VDEV expansion ticket on GitHub. Now, thinking about it, I have a revised strategy for a 4-bay ZFS-based NAS:

It depends on the storage capacity requirement. A user could start with a 3-disk raidz1. When the need grows, the user could either 1) use the “swap disks & resilver” method to expand the capacity and maintain the 3-disk VDEV, or 2) add one more disk to the VDEV to make it a 4-disk VDEV using the new ZFS feature [0].

I haven’t read the VDEV expansion in detail. I would guess there are pros & cons either way.

When all 4 bays are occupied, future capacity expansion will go by way of “swap disks and resilver.” With some planning ahead, this should happen only once every several years, so it shouldn’t be much of a hassle. I expect such a storage system could also last a very long time for home/personal use.

Doubling what you need is such a human instinct when facing the unknown or uncertainty. If a user has a 12TB data archive now and fancies a new storage system, should he go with 24TB effective capacity in his new system, or 36TB? Assume he wants to put every dollar spent to efficient use. What’s your take?

The drives with 19,532,873,728 sectors have a 512-byte logical sector size; the ones with 2,441,609,216 sectors have a 4096-byte logical sector size. Hence the difference in the number of sectors. The underlying physical sector size doesn’t matter to this report, it seems.

[0] raidz expansion feature by don-brady · Pull Request #15022 · openzfs/zfs · GitHub

I am deliberately being vague here FWIW… because I’m trying to dedicate this thread to generalized information rather than specific use-cases.

I concur that given a smaller chassis a single vdev 4 drive RAIDZ-1 is fine.

“It depends”
You haven’t defined a use case. If you have a static-ish dataset, you’re throwing money out the window. If you expect growth, maybe. What’s your expected growth? How long are you trying to “set-and-forget”?

Yup. But notice the sector counts, and then do the math. The RAW value comes out to be the same, as reported by fdisk, but TrueNAS in that case didn’t like it. Why is there an alert generated if they are the same size?
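
Doing that math explicitly, with the sector counts straight from the fdisk output above:

echo $(( 2441609216  * 4096 ))   # 4K-logical drives  -> 10000831348736 bytes
echo $(( 19532873728 * 512  ))   # 512-logical drives -> 10000831348736 bytes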

My assumption here is that there’s likely more to that part of the story… whereby the math does not always line up quite so neatly. 10,000,831,348,736 is a very specific number.

Particularly, the ‘831348736’ part.
Using the definitional understanding of TB and TiB, we should expect:
10,000,000,000,000 bytes.
-or-
10,995,116,277,760 bytes.

But, once again, we have found it somewhere in between. My assumption here is that the physical platters between these two models are the same, which means they have a similar layout for spare sectors when/if problems go awry on the surface of the platter. My fudge factor still exists.

Given that we know they are all the same size, albeit neither 10TB nor 10TiB… the question remains: is TN just erroring because of the mixed logical sector sizes? The fact that they are both 4K physical is irrelevant. I think the TN developers assumed that different drive models or different logical sector sizes may really be different sizes, and I just happened to have found an exception to that general rule.

You can assume I got your intention from very early on, and I’ve been trying to dance along. lol. In the hope that home/personal users will take away a bit of useful info for building a ‘custom home NAS’… perhaps with an old mATX compact tower collecting dust in the garage.

I haven’t used FreeNAS/TrueNAS (no plan to use it either), so you lost me. Now that I’ve gone back to the screenshot, I see the orange exclamation mark on its GUI (?)

Thinking about it now, perhaps it’s just a silly assumption in its GUI or lower-layer application logic. Perhaps it doesn’t carry any crucial information.

1 Like

Sorry for any misunderstanding :slight_smile:
<----I can be quite dense