Zpool config recommendation

Hi there
I am in need of some recommendations.
I have an existing zpool consisting of 24 x 8TB SAS disks in a NetApp DS4246, linked up with 6Gb SAS. The pool is configured as one big 24-disk raidz3, plus a special metadata vdev on mirrored NVMe.
It has been working great, and with a full zpool a resilver typically took about two days…

I have now invested in 24 x 18TB SAS disks and installed them into a DS3246 (but replaced the IOM module with an IOM12), so link speeds are 12Gb and I can even link up two connections if I like.

I have been doing some testing with draid, and I think this is the way to go here… but I am not sure. Hence the request for recommendations :slight_smile:

One obvious setup would be a draid3 with one “built-in” spare, so that I get about 20 disks’ worth of data space.

Another approach would be two draid2 vdevs with no “built-in” spares (I always have a cold spare ready).
I think that 5 disks out of 24 is a little much :slight_smile:
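
Roughly, the two candidates would look something like this (pool name and device names are placeholders, and the data-group widths are just my first guess):

Option 1, one draid3 vdev with a distributed spare:
# zpool create tank draid3:20d:24c:1s /dev/sd{a..x}    # placeholder device names

Option 2, two draid2 vdevs with no distributed spares:
# zpool create tank draid2:10d:12c /dev/sd{a..l} draid2:10d:12c /dev/sd{m..x}    # placeholder device names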

Just to compare, if I were to set up 24 disks in a NetApp controller, I would use their RAID-TEC (3 parity disks) and that would be it… so why should I do more with ZFS? :wink:
Well, there are always the resilver times, which would be horrific with a raidz3 of 24 x 18TB (about a week).
But draid kinda “solved” this… although if you use the built-in spare the resilvering is a two-step process: the first resilver reads from and writes to all devices, and is fast, but the second part of replacing the physical disk takes longer because we need to read everything and write it to that single disk at a maximum of 250MB/sec.
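
In command terms, if I read the dRAID docs right, it would be roughly this (disk names made up, with sdq as the failed disk and sdy as its physical replacement):

Step 1, fast sequential rebuild onto the distributed spare (draid3-0-0 being its name in a draid3 vdev):
# zpool replace tank sdq draid3-0-0

Step 2, after swapping in the new physical disk, rebalance onto it at single-disk speed:
# zpool replace tank sdq sdy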

Any suggestions are welcome :wink:

/B

How is the data backed up, and how critical is the data?

You are rocking nearly half a petabyte of storage, and at that size I would say the risks of spinning rust really start to go up. I would be more inclined to sacrifice more drives for higher performance and redundancy before worrying about maximizing pool size.

Bruh.

Just be aware of the downsides of draid. By having a fixed stripe width you reduce your compression efficiency and increase the overhead of 4k block writes, which may hurt 4k write performance. Zvols will also need a higher volblocksize to match. So have a think about your use case, and whether this is a serious performance issue for you.

However, with 24 x 18 TB hard drives you’re definitely in the territory where long rebuilds are a fact of life - and draid makes sense as a way to reduce that.

Without using the built-in spare you aren’t going to get faster rebuilds - in which case there is no reason to run draid and you should just use 2-3 RAIDZ vdevs.
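
For example, something along these lines (hypothetical device names):

# zpool create tank raidz2 /dev/sd{a..h} raidz2 /dev/sd{i..p} raidz2 /dev/sd{q..x}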

I know that someone would pick up on that :wink:
But seriously, two days of resilvering is OK for me…
And I did this because if I were to set up a shelf of 24 disks on a professional NetApp system, I would have no problem configuring it like this… (RAID-TEC, which is also 3 parity disks)…
But I know that some of the ZFS hardcore guys don’t like this :wink:

Well, the important data is backed up using the znapzend script, which works fine.
I just cannot get past the fact that a resilver will eventually come down to the speed of the drive you replace… i.e. 250MB/sec, and if most of your pool is full, that is about a week’s worth of resilvering :wink: I cannot see anything other than draid that can “fix” this… Or will it? Because if your whole pool is full, you will have to read and write everything in order to rebuild onto the virtual spare. And while this might be faster than a “normal” resilver, you then still have to replace the disk, at which point you are at it again, another week’s worth of replacing… :slight_smile:
But I guess there is no easy way around the fact that 250MB/sec just isn’t fast when we are talking 18TB… and the only way to guard against that is to add more redundancy, like more parity or more vdevs…

draid3 with one virtual spare is about 360TB
2 vdevs with draid2 (no spares) is also about 360TB
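
Back-of-the-envelope, in raw (decimal) TB and ignoring dRAID padding and metadata overhead:

draid3:20d:24c:1s     -> 24 - 1 spare - 3 parity  = 20 data disks x 18TB ≈ 360TB
2 x draid2:10d:12c    -> 2 x (12 - 2 parity)      = 20 data disks x 18TB ≈ 360TB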

I think it will be one of the above… I cannot estimate which one is the “safest”.

Another possibility that just came to mind was 2 x draid2 with one virtual spare, but I am not sure if that is even possible? I guess the virtual spare is “bound” to the vdev and not the pool? I could of course just have a hot spare…

cowphrase: Are you sure that draid isn’t faster at resilvering even without a virtual spare?

Compression and performance are not an issue for me, as this is for large files that are already compressed…

If you had eight shelves in a rack you would maybe do RAIDZ3 over 24 disks using three disks from each shelf. You only have one shelf! There are more usable TB than IOPS in that pool.

Link to the draid documentation for reference: dRAID — OpenZFS documentation

This does work (as shown in the image in the above article). The logical spare can be a spare for two RAID groups in draid.

Good catch there. I missed that the fixed stripe width makes sequential rebuilds faster, unlike the block-tree traversal a classic resilver has to do. Still, draid without the built-in spares makes no sense - it’s literally a hotspare, but better.

With a hotspare, you need to wait for the rebuild to finish before you’ve restored redundancy. So if a second disk fails during the rebuild, you’ve now lost two disks’ worth of redundancy. Too many failures during a rebuild can destroy your entire array.

With draid the parity for the first disk will be rebuilt very fast, so redundancy is restored very quickly. If a second disk failure happens after the rebuild, but before you’ve replaced the disk, you’ll only have lost one disk’s worth of redundancy. Hence your array can survive many disks failing within a short period of each other. Though it’s still a long road to actually copy the data back after replacing the drives…

On a traditional RAID it’s all too real to have all your redundant disks fail and be left hoping that the first rebuild finishes before you can even start the second or third. Also, having the (distributed) spare as part of the pool provides a small boost in IOPS.

Well… as I mentioned, I do NetApp setups professionally, and even though the SAS disks are (normally) used for backups, we have always set up RAID groups of over 20 disks with three parity disks… (the maximum is 29).
And in my way too long career in storage, I have yet to see one of these RAID-TEC break…

Right now I am leaning towards 2 x draid2 with a common distributed spare - that is, 5 disks’ worth of parity and spare… I think that should be plenty of protection for me… :slight_smile:

24 disks can be fine, depending. The nice thing about ZFS is tuning for the workload. Some systems will benefit from z2/z3 with lots of disks in the vdev because they’re being optimized for raw storage. Others will nest vdevs because they need the IOPS. Some will do stripes of vdevs because they want sequential read/write speeds…

I think the reason you get so many questions and concerns from people is that we don’t know your workload and we don’t know your dataset, so we are concerned that you are optimizing for the wrong reason and putting data at risk.

Losing data makes many a sysop sweat at night.

infinitevalence: You are right… I tried to point out in an earlier post that the important datasets are replicated to another datacenter.
But of course losing 300-odd TB is never great :slight_smile:

I think I need a bit of help configuring this setup:
Two vdevs of draid2 and one common distributed spare…

Here is the command:

# zpool create <pool> draid[<parity>][:<data>d][:<children>c][:<spares>s] <devices...>

Do I create the first vdev with the spare option and then the other vdev without? Or can someone explain this? :wink: (Maybe it’s not even possible…)
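
As far as I can tell from the syntax, the distributed spare is declared per draid vdev, so the closest thing I can write is one spare inside each vdev (hypothetical device names, and 9d is just my guess at the data width):

# zpool create tank \
    draid2:9d:12c:1s /dev/sd{a..l} \
    draid2:9d:12c:1s /dev/sd{m..x}

…but that gives me two distributed spares rather than one shared one, which is why I am asking.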

What is a nested vdev?

More vdevs give you more IOPS. Usually. Assuming a balanced pool.

# zpool create tank mirror sde sdf mirror sdg sdh
# zpool status
  pool: tank
 state: ONLINE
 scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sde     ONLINE       0     0     0
	    sdf     ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    sdg     ONLINE       0     0     0
	    sdh     ONLINE       0     0     0

errors: No known data errors

OK, semantic differences, I guess. I wouldn’t consider that a nested vdev; that’s just two mirror vdevs. Cheers.

Yeah, very basic example, but you can go deeper :stuck_out_tongue:

Regardless, my point was more related to understanding the dataset and requirements. It’s hard to recommend a setup if we don’t know the dataset.

I’m also a professional ZFS and erstwhile GPFS admin, so hopefully we can come to a consensus here. I haven’t worked with NetApp much (other than as a dumb disk shelf) and I’m not familiar with RAID-TEC, so I have no idea how it works vis-à-vis ZFS. After a five-minute google it looks like RAID-TEC has more in common with dRAID than RAIDZ.

My concern for you isn’t related to breakage, it’s related to performance. The theoretical maximum IOPS* of a zpool is the sum of theoretical maximum IOPS of all data vdevs†. The theoretical maximum IOPS of a vdev is constrained by the slowest device in the vdev‡. Jim Salter aka mercenary_sysadmin has written about this extensively on his blog and on reddit. Here’s a note on this subject that I have bookmarked.

Since your pool only has a single vdev, the theoretical maximum IOPS of the pool will be more-or-less equal to the theoretical maximum IOPS of your slowest disk. Let’s call it 150 for the sake of discussion. If your use-case only requires 150 IOPS for writing 168TB of capacity, that’s… well, I guess I’d like to know more about your use-case. Your special vdev is probably helping a lot, but even so, 150 IOPS for data blocks is pretty anemic.

* – I’m mostly referring to write IOPS here. Read IOPS are similarly constrained but somewhat less so because most of the time ZFS doesn’t need to read the parity data. The magnitude of the constraint correlates to the ratio of non-parity disks to parity disks.
† – Assuming a perfectly balanced pool, since writes aren’t distributed evenly across all vdevs. I’m also conveniently ignoring special vdevs and other edge cases.
‡ – Assuming the recordsize is large enough to hit all the non-parity disks in a vdev when it’s chunked out. RAIDZ’s variable stripe width makes it not quite as straightforward as I’m presenting it here. If you have a small recordsize relative to the number of non-parity disks, you might be getting two or three disks worth of IOPS.
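
To put rough numbers on it (purely illustrative, assuming ~150 write IOPS per spinning disk and a balanced pool):

1 x 24-wide raidz3/draid3     ->  1 vdev  x ~150 ≈  ~150 write IOPS
2 x 12-wide raidz2/draid2     ->  2 vdevs x ~150 ≈  ~300 write IOPS
12 x 2-way mirrors            -> 12 vdevs x ~150 ≈ ~1800 write IOPS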

Man, I may need to pick your brain at some point because I can’t seem to get my 12 x 4TB SSD array to be worth a shit regardless of how I config it!

I will defer to you since you clearly have more hands-on experience with this type of setup, and I’m off to transition to CEPH anyway… because why not.

Unfortunately ZFS on SSD is just kind of bad right now no matter how you configure it. The devs are working on optimizing the internals to work better when the underlying devices can do millions of IOPS instead of hundreds. I reach for ZFS when I need data integrity, flexible snapshots, inline compression, etc., not when I need performance.

That being said, there are some things you can do:

  • You want ashift=13, usually, which might not give you great performance up front but should lead to fewer P/E cycles and better sustained write speeds (and drive longevity).
  • You absolutely have to dial in the recordsize/volblocksize for each use-case. That can lead to orders (plural!) of magnitude IOPS differences.
  • Your topology will determine your theoretical max IOPS as I described above, so for more IOPS prefer many mirror vdevs over a smaller number of RAIDZ vdevs.
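
A minimal sketch of those knobs (pool and dataset names are made up):

# zpool create -o ashift=13 flash mirror sda sdb mirror sdc sdd    # 8 KiB sectors, set at creation
# zfs create -o recordsize=16K flash/db
# zfs create -V 100G -o volblocksize=16K flash/vm01

The 16K values are only examples - the right recordsize/volblocksize depends entirely on the workload’s I/O size.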

So… my “dream” of a zpool config of two draid2’s with one common distributed spare isn’t possible? :slight_smile:

Not sure I follow bambinone’s statement that one vdev’s performance is equal to that of the slowest disk in the vdev…
I have just tested this on my 24-disk-wide raidz3: I copied some data and watched zpool iostat, and I could see each disk doing 100+ IOPS, which combined into the vdev’s total IOPS… so I don’t get your statement :wink:

                    capacity     operations     bandwidth 
pool              alloc   free   read  write   read  write
----------------  -----  -----  -----  -----  -----  -----
aggr0              173T  2.33T  8.90K  5.96K  71.6M  82.8M
  raidz3-0         173T  1.85T  8.90K  5.96K  71.6M  82.7M
    0a-0              -      -    326    293  2.28M  3.52M
    0a-1              -      -    327    295  2.28M  3.52M
    0a-2              -      -    314    294  2.32M  3.36M
    0a-3              -      -    471    305  3.55M  3.36M
    0a-4              -      -    473    303  3.55M  3.52M
    0a-5              -      -    466    289  3.55M  3.52M
    0a-6              -      -    494    293  2.91M  3.34M
    0a-7              -      -    335    176  3.12M  3.38M
    0a-8              -      -    241    190  2.61M  3.50M
    0a-9              -      -    204    182  2.75M  3.50M
    0a-10             -      -    264    190  2.51M  3.43M
    0a-11             -      -    375    185  3.55M  3.43M
    0a-12             -      -    410    185  3.55M  3.52M
    0a-13             -      -    288    128  3.41M  3.38M
    0a-14             -      -    494    305  2.91M  3.35M
    0a-15             -      -    375    187  3.06M  3.41M
    0a-16             -      -    337    322  2.23M  3.53M
    0a-17             -      -    339    304  2.22M  3.53M
    0a-18             -      -    261    191  2.53M  3.43M
    0a-19             -      -    469    298  3.55M  3.36M
    0a-20             -      -    473    293  3.55M  3.51M
    0a-21             -      -    458    295  3.55M  3.51M
    0a-22             -      -    482    304  2.93M  3.34M
    0a-23             -      -    425    278  3.14M  3.50M
special               -      -      -      -      -      -
  mirror-1         109G   487G      0      1  29.2K  58.4K
    nvme01-part1      -      -      0      0  29.2K  29.2K
    nvme02-part1      -      -      0      0      0  29.2K
----------------  -----  -----  -----  -----  -----  -----