Zpool config recommendation

I am a bit confused… and sorry if this is turning into a ZFS tutorial…

I did the following:

zpool create tst draid1:3d:7c:1s sde sdf sdg sdh sdm sdn sdo

Thinking that it would create two draid1 vdevs with three disks each, and leave one for a spare…
But I am not sure that is what actually happened :wink:
Here is what the pool looks like…

  pool: tst
 state: ONLINE
config:

        NAME                 STATE     READ WRITE CKSUM
        tst                  ONLINE       0     0     0
          draid1:3d:7c:1s-0  ONLINE       0     0     0
            sde              ONLINE       0     0     0
            sdf              ONLINE       0     0     0
            sdg              ONLINE       0     0     0
            sdh              ONLINE       0     0     0
            sdm              ONLINE       0     0     0
            sdn              ONLINE       0     0     0
            sdo              ONLINE       0     0     0
        spares
          draid1-0-0         AVAIL   

I do have the correct avail…

NAME   USED  AVAIL     REFER  MOUNTPOINT
tst   1.09M  71.2T      279K  /tst

Is it just me, or is the view a bit strange…
I would have liked it like this…

  pool: tst
 state: ONLINE
config:

        NAME                 STATE     READ WRITE CKSUM
        tst                  ONLINE       0     0     0
          draid1:3d:7c:1s-0  ONLINE       0     0     0
            sde              ONLINE       0     0     0
            sdf              ONLINE       0     0     0
            sdg              ONLINE       0     0     0
          draid1:3d:7c:1s-1  ONLINE       0     0     0
            sdh              ONLINE       0     0     0
            sdm              ONLINE       0     0     0
            sdn              ONLINE       0     0     0
            sdo              ONLINE       0     0     0
        spares
          draid1-0-0         AVAIL   

But I can also see the “issue” that the number of disks doesn’t split evenly… and you also cannot list one physical disk under the spares, because the spare capacity is distributed… so I kinda get it, the more I think about it… :wink:
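
If I really wanted the view above (two small three-disk groups plus one ordinary physical spare), I guess plain raidz1 vdevs with a shared hot spare would give exactly that layout; just a sketch, reusing the same device names after destroying the draid test pool:

# two ordinary raidz1 vdevs plus one shared hot spare (not distributed)
zpool create tst raidz1 sde sdf sdg raidz1 sdh sdm sdn spare sdo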

So for my large pool it would be something like

zpool create toolarge draid2:12d:24c:1s

But I think I am running into the issue that the “vdevs”, or rather the redundancy
groups, have to have an equal number of disks, and that will not fly if I would like just one distributed spare…
So I think I may be forced to have either two distributed spares or one hot spare…

zpool create toolarge draid2:11d:24c:2s

Redundancy is the enemy of capacity I guess :slight_smile:

Haven’t looked into draid, but draid is a vdev, so you would use something like:
zpool create poolname draidN:Xd:Xc:Xs sdX sdX sdX sdX draidN:Xd:Xc:Xs sdX sdX sdX sdX

Which would mean the spare is tied to each vdev?

Edit: auto-Uncorrected a little

What recordsize? How much data? What speed did it copy at according to whatever copy command you ran?

This was just a simple “cp” of 4TB large files from one dataset to another on the same zpool…
The pool was created with no special options…
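
For anyone wanting to check the same thing, something like this should show the relevant dataset settings and the per-vdev throughput while the copy runs (the pool name here is just a placeholder):

# default recordsize/compression, since the pool was created with no special options
zfs get recordsize,compression tst
# live per-vdev read/write bandwidth, sampled every 5 seconds
zpool iostat -v tst 5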

Something I do not understand, and which I think isn’t that well described in the manual page, is the fact that I can do this:

zpool create -f test draid2:14d:24c:1s 0a-0 0a-1 0a-2 0a-3 0a-4 0a-5 0a-6 0a-7 0a-8 0a-9 0a-10 0a-11 0a-12 0a-13 0a-14 0a-15 0a-16 0a-17 0a-18 0a-19 0a-20 0a-21 0a-22 0a-23

Which gives me this:

  pool: test
 state: ONLINE
config:

        NAME                   STATE     READ WRITE CKSUM
        test                   ONLINE       0     0     0
          draid2:12d:24c:1s-0  ONLINE       0     0     0
            0a-0               ONLINE       0     0     0
            0a-1               ONLINE       0     0     0
            0a-2               ONLINE       0     0     0
            0a-3               ONLINE       0     0     0
            0a-4               ONLINE       0     0     0
            0a-5               ONLINE       0     0     0
            0a-6               ONLINE       0     0     0
            0a-7               ONLINE       0     0     0
            0a-8               ONLINE       0     0     0
            0a-9               ONLINE       0     0     0
            0a-10              ONLINE       0     0     0
            0a-11              ONLINE       0     0     0
            0a-12              ONLINE       0     0     0
            0a-13              ONLINE       0     0     0
            0a-14              ONLINE       0     0     0
            0a-15              ONLINE       0     0     0
            0a-16              ONLINE       0     0     0
            0a-17              ONLINE       0     0     0
            0a-18              ONLINE       0     0     0
            0a-19              ONLINE       0     0     0
            0a-20              ONLINE       0     0     0
            0a-21              ONLINE       0     0     0
            0a-22              ONLINE       0     0     0
            0a-23              ONLINE       0     0     0
        spares
          draid2-0-0           AVAIL   

How is that possible? Does it create one redundancy group at 12+2 and then one at 7+2? Or what is going on?
Is there a way to see this?
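
As far as I can tell, the spec string zpool status prints in the vdev name (draid2:12d:24c:1s-0) already tells you what was actually created, and dumping the pool configuration shows the stored dRAID parameters; a sketch, and the exact zdb output fields may differ:

# raw per-vdev size (parity included) vs. usable space
zpool list -v test
# dump the cached pool configuration, which records the dRAID parameters
# (parity, data, spares) the vdev was actually created with
zdb -C test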

Update…
I think I “finally” got it…
This command:

zpool create -f test draid2:9d:24c:2s 0a-0 0a-1 0a-2 0a-3 0a-4 0a-5 0a-6 0a-7 0a-8 0a-9 0a-10 0a-11 0a-12 0a-13 0a-14 0a-15 0a-16 0a-17 0a-18 0a-19 0a-20 0a-21 0a-22 0a-23

Yields a filesystem 262T large. With a disk size of 16.4 TB that’s 16 data disks, so if this is to make sense with 24 disks, it would have to create two redundancy groups, each with 9 data disks and two parity disks, as well as two spares each.
So we have four spares in total here…
So it is not possible to have one common distributed spare after all… :frowning:

I think the manual page on dRAID could do with some more examples to explain this… :slight_smile:

I’m so sorry to keep going here, but I just cannot figure out what is happening :slight_smile:
I am trying to create a pool with the following command:

zpool create -f test draid2:9d:12c:1s 0a-0 0a-1 0a-2 0a-3 0a-4 0a-5 0a-6 0a-7 0a-8 0a-9 0a-10 0a-11 draid2:9d:12c:1s 0a-12 0a-13 0a-14 0a-15 0a-16 0a-17 0a-18 0a-19 0a-20 0a-21 0a-22 0a-23

But the size is too small…

zfs list test
NAME   USED  AVAIL     REFER  MOUNTPOINT
test  3.09M   262T      767K  /test

The disks are 16.4 TB:

lsblk /dev/disk/by-vdev/0a-0
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sde      8:64   0 16.4T  0 disk 
├─sde1   8:65   0 16.4T  0 part 
└─sde9   8:73   0    8M  0 part

262 TB divided by 16.4 is 16 data disks.
I would expect four parity disks and two spares, which together with the 16 data disks is just 22 disks used… what happened to the last two disks???

Here is what the pool looks like…

zpool status test
  pool: test
 state: ONLINE
config:

        NAME                  STATE     READ WRITE CKSUM
        test                  ONLINE       0     0     0
          draid2:9d:12c:1s-0  ONLINE       0     0     0
            0a-0              ONLINE       0     0     0
            0a-1              ONLINE       0     0     0
            0a-2              ONLINE       0     0     0
            0a-3              ONLINE       0     0     0
            0a-4              ONLINE       0     0     0
            0a-5              ONLINE       0     0     0
            0a-6              ONLINE       0     0     0
            0a-7              ONLINE       0     0     0
            0a-8              ONLINE       0     0     0
            0a-9              ONLINE       0     0     0
            0a-10             ONLINE       0     0     0
            0a-11             ONLINE       0     0     0
          draid2:9d:12c:1s-1  ONLINE       0     0     0
            0a-12             ONLINE       0     0     0
            0a-13             ONLINE       0     0     0
            0a-14             ONLINE       0     0     0
            0a-15             ONLINE       0     0     0
            0a-16             ONLINE       0     0     0
            0a-17             ONLINE       0     0     0
            0a-18             ONLINE       0     0     0
            0a-19             ONLINE       0     0     0
            0a-20             ONLINE       0     0     0
            0a-21             ONLINE       0     0     0
            0a-22             ONLINE       0     0     0
            0a-23             ONLINE       0     0     0
        spares
          draid2-0-0          AVAIL   
          draid2-1-0          AVAIL   

zpool list shows the correct size but includes parity as well…
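
For anyone comparing the two views, this is roughly what I mean (pool name as above):

# SIZE/FREE here are raw, i.e. parity and distributed spares are included
zpool list -v test
# AVAIL here is usable space after parity
zfs list -o space test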

I have been playing around to try to figure this out… and I simply cannot understand it…
If I create a raidz2 with 24 x 18 TB disks (lsblk shows 16.4 TB) I get 349 TB (15.8 TB per data disk if I divide it by 22).
If I then try to create a draid2 (with no spares) I only get 262 TB (11.9 TB per data disk)… that is not just a little overhead, that’s 87 TB gone :slight_smile:
Can someone explain this, because I sure cannot :slight_smile:
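
In case anyone wants to reproduce the comparison, something like this should do it; just a sketch, using a throwaway pool name, the by-vdev names from earlier (the shell expands 0a-{0..23} to the 24 names), and assuming a single 24-wide redundancy group for the draid2 case (a different data-disk count would change the numbers):

# 24-wide raidz2
zpool create -f cmp raidz2 0a-{0..23}
zfs list cmp
zpool destroy cmp
# draid2 with no spares, one 22+2 redundancy group across all 24 disks
zpool create -f cmp draid2:22d:24c:0s 0a-{0..23}
zfs list cmp
zpool destroy cmp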

Guess I’m not the only one…

I always hear that ashift=13 has minimal drawbacks even in the case when the drive actually has 4k sectors (i.e. correct ashift=12). Here you seem to say there are benefits too (not only lack of drawbacks) - am I understanding correctly? If so, could you explain the benefits a bit more?

It’s clear to me that the fact that we often don’t know the sector size for SSDs makes a higher ashift a safer bet. But what about the hypothetical case where we know we are dealing with a 4k drive, is ashift=13 still better?

Honestly, every time I think I’ve got my head wrapped around NAND flash and ZFS ashift and recordsize/volblocksize, I learn something new that blows my old understanding out of the water. So let me preface this by saying that I am not the final authority on these subjects and all of this is to the best of my current understanding.

Almost no “4Kn” SSD is actually 4Kn. Even the 4K sector size is emulated, even after changing LBA formats. NAND flash is divided up into blocks and pages. Page sizes range between 4KiB and 16KiB, and block sizes are some multiple of that number, e.g. 16×4KiB. Reads and programs (writes) happen at the page level, but erases only happen at the block level, and a page cannot simply be overwritten in place. So when you rewrite less than a full block of data, the SSD controller 1. marks the original block as invalid, 2. copies the original block to an available block, incorporating the modified page(s), 3. moves the block pointer, 4. erases the original block and marks it as available. That’s the program–erase or P/E cycle in a nutshell.
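
As a toy worst-case illustration of the write amplification this causes, assuming 4KiB pages and 16×4KiB (64KiB) blocks and the simplified copy-the-whole-block model above (real controllers remap pages and amortize this during garbage collection, so the actual figure is usually lower):

# NAND bytes actually written (one whole 64 KiB block) per 4 KiB the host wrote
echo $(( 65536 / 4096 ))   # 16, i.e. up to 16x write amplification in this toy case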

So at first blush you might want your ashift to match the page size of the underlying NAND flash and your recordsize/volblocksize to match the block size. In practice, however, it’s very difficult to come by this information. Also, there’s all kinds of “weird shit” happening inside the SSD controller “black box” and that makes it challenging to suss out these characteristics. You might have QLC with 128×8KiB blocks nominally that’s being used as SLC write cache with 32×8KiB blocks. So your initial benchmark with ashift=13 and recordsize=256K looks great, but when the SLC write cache fills up halfway through a sustained sequential write, it starts writing directly to QLC, throughput drops off a cliff, and write amplification shoots through the roof. And you have to consider when and how the SSD controller flushes its write cache so you don’t get inconsistent results between benchmarks.
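
If you want to see that cliff for yourself, a long sequential write with bandwidth logging will usually expose where the SLC cache runs out; a sketch, assuming a scratch NVMe device whose contents you can afford to destroy:

# WARNING: writes raw data to /dev/nvme0n1 and destroys whatever is on it
fio --name=slc-probe --filename=/dev/nvme0n1 --rw=write --bs=1M --size=300G \
    --direct=1 --ioengine=libaio --iodepth=32 \
    --log_avg_msec=1000 --write_bw_log=slc-probe
# plot the resulting slc-probe_bw.*.log files and look for the step down in throughput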

What I’ve learned from experience across a wide variety of drives and implementations and from reading the experiences of others is:

  • the penalty for setting the ashift too low seems higher than the penalty for setting the ashift too high (even though the SSD is ostensibly doing the same amount of work)
  • 8KiB seems to be the predominant page size today and for the foreseeable future, especially on Samsung drives
  • RAIDZ seems to make write amplification dramatically worse when everything doesn’t line up perfectly, so prefer mirrors over RAIDZ

Finally, if your application is doing e.g. 4K random writes then it doesn’t matter what your recordsize/volblocksize is; you’re going to have a ton of write amplification no matter what. So you might as well tune recordsize/volblocksize to the application and/or its consumers per best practices. And if you can tune the application’s block size, go as big as your use-case allows to mitigate write amplification.
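
To make that concrete, this is roughly what such tuning could look like; a sketch with made-up pool, dataset, and device names, ashift=13 on the assumption of 8KiB pages, and block sizes matched to a hypothetical database and VM workload:

# mirror of two NVMe devices, 8KiB allocation size
zpool create -o ashift=13 fast mirror nvme0n1 nvme1n1
# dataset for a database that does 16K-page I/O
zfs create -o recordsize=16K fast/db
# zvol for a VM; volblocksize must be set at creation time
zfs create -V 100G -o volblocksize=16K fast/vm-disk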

That’s just the tip of the iceberg. I’m conveniently pretending that async I/O and transaction groups don’t exist, conflating sync and indirect sync I/O with and without a SLOG, etc. Anyway, corrections welcome.

I’ve got 100+ enterprise SAS drives in one of my home lab ZFS pools using RAIDZ.
I have another large pool similar to above using dRAID.

I would never consider having a pool made from one vdev that is 24 disks wide, regardless of whether it uses dRAID or RAIDZ. You’re giving up a lot of potential performance in doing so, and accepting much longer resilver times, and that just isn’t justified.

I don’t recall you mentioning what type of data you are storing on this pool, but that should definitely be taken into consideration in the design you use. Databases and VMs, for example, could have far better performance using multiple vdevs, as you will have higher IOPS. Multiple vdevs can also service I/O at the same time, so overall latency should be lower.

Something else, and far more important, is resilver time. The wider your RAIDZx vdev is, the longer the resilver will take. The higher the fragmentation level of the vdev, the longer it’s going to take. For RAIDZx, the maximum speed it can resilver at is the write speed of the drive being written to, which will likely be in the 150 to 200 MB/s range. If the pool is made from one vdev, then every read or write being done to the pool while resilvering is slowing it down. If you had 12-drive vdevs instead of a 24-drive vdev, then roughly half the reads/writes being done by the pool would hit the degraded vdev. With 6-drive vdevs, roughly 75% of the IO would be handled by the healthy vdevs. Also, with multiple vdevs, ZFS is able to write any new data to the non-degraded vdevs.
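
If you want to watch how that load actually spreads across vdevs, during a resilver or otherwise, per-vdev iostat makes it visible (pool name is just a placeholder):

# per-vdev read/write ops and bandwidth, sampled every 10 seconds
zpool iostat -v tank 10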

Having only 3 parity drives for 24 drives is pushing it IMHO. Even for 12 drives (18TB each), 3 parity drives is not enough IMHO. To me, for drives of that size, the widest vdev I would target using RAIDZ3 is 8 wide, giving you essentially 5 data drives and 3 parity drives per vdev. That would be 9 parity drives in total out of the 24 drives, leaving 15 data drives. This assumes the data is precious, which we do not know, as you haven’t mentioned what kind of data is being stored or what the impact would be if you had to reload everything from backup.
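
Roughly like this, as a sketch with placeholder device names:

# three 8-wide raidz3 vdevs: 15 data drives + 9 parity drives in total
zpool create tank \
  raidz3 d1 d2 d3 d4 d5 d6 d7 d8 \
  raidz3 d9 d10 d11 d12 d13 d14 d15 d16 \
  raidz3 d17 d18 d19 d20 d21 d22 d23 d24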

With dRAID I’d probably do draid2 redundancy groups of 6 disks each. That would use 8 parity disks, leaving 16 for data, and would give you roughly 4 times the IO compared to what you have now. In both the RAIDZ and dRAID cases I’m assuming you have a couple of free bays in the host computer you can use for spares. To maximize performance with dRAID you want to have 4 spares available, so this assumes you have those drive bays available internally.
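
As a sketch of what I mean, with placeholder device names (the shell expands d{1..24}): four 4+2 redundancy groups, no distributed spares here since I’m assuming the spares are physical drives in the free bays:

# groups of 4 data + 2 parity across 24 children, 0 distributed spares,
# plus two ordinary hot spares in the extra bays (placeholder names)
zpool create tank draid2:4d:24c:0s d{1..24} spare s1 s2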

dRAID really doesn’t come into its own until you get up to around 50 drives, or 75 (even better). dRAID is of no use on small pools. I’ve not tried it with 20 to 30 drives, but would think that’s on the lower side of being useful. If you plan to keep the DS4246 online as well as the DS3246, you would be close to 50 drives, making dRAID look like a much better option. With some disk shuffling, such as putting the 18TB drives in the first 12 slots of each shelf and the smaller drives in slots 13-24, you would gain additional IO, as all reads and writes would be split between both disk shelves. You would alternate the drive bays when creating the vdevs so you’re using an equal number of drives from each shelf. It’s ok that you will have two different sizes of disks. Don’t mix sizes within a vdev, however.

What you would do is create the new pool and then copy over all the data from the old pool. You can then decommission the old pool and add its disks to the new pool, essentially doubling your vdev count. As new files are written or updated, ZFS will balance things as best it can for you.
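
Whether you copy with rsync or with ZFS replication doesn’t matter much; a ZFS-native sketch with placeholder pool and device names might look like:

# snapshot everything on the old pool and replicate it to the new pool
zfs snapshot -r oldpool@migrate
zfs send -R oldpool@migrate | zfs receive -F newpool/migrated
# once you're happy with the copy, free up the old disks and add them as new vdevs
zpool destroy oldpool
zpool add newpool raidz3 o1 o2 o3 o4 o5 o6 o7 o8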

I recommend that you watch Tom’s video about: Playing with ZFS Pools & TrueNAS Scale

Thanks, this is very useful information. Lots of details that make a lot of sense and that I did not think about.

Interesting. I had no idea that read and write sizes were different. Might explain some unexpected results I got in the past.

Assuming we get the internal data sizes right for the write scenario (for the sake of argument), how would ZFS compression factor in? It seems like block-size alignment would be difficult, as written blocks may shrink to some lower multiple of the page size, causing amplification?

Right. In my case I’m using some 1TB WD SN850s, and they are supposed to be TLC with some SLC write cache. They are formatted as 4K, which was also the largest LBA size that could be selected (and which I apparently can’t trust as being the actual page size). I’m not sure about the size of the cache, or to what extent I can keep my regular writes within its range. I would assume that when data is moved internally to the TLC, additional aggregation will take place, so that as long as writes land in the SLC cache, the SLC block size would be the important one?

On a related note: do you (or anyone) know whether there is a way to investigate these sizes by writing known chunks to a drive while watching the SMART field that logs TBW?

The last question seems to depend on whether the TBW-related SMART fields record what is written to the flash, rather than simply what is sent to the controller. I’m asking partly because I’ve tried measurements along those lines (results forthcoming); however, I don’t know whether I would pick up on internal write amplification from within the drive, or only write amplification due to the OS and software layers.
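
For what it’s worth, on NVMe the generic counter is “Data Units Written” (host-side writes, in units of 1000 × 512 bytes); a rough sketch of the kind of measurement I mean, with a made-up mount point and device name, though as far as I know the internal NAND writes usually only show up in vendor-specific logs:

# host-side write counter before the test
smartctl -a /dev/nvme0 | grep -i 'data units written'
# write a known 10 GiB, bypassing the page cache; urandom so compression can't shrink it
dd if=/dev/urandom of=/mnt/ssd/testfile bs=1M count=10240 oflag=direct conv=fsync
# read the counter again and compare the delta against 10 GiB
smartctl -a /dev/nvme0 | grep -i 'data units written'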

Great to hear aggregated experience like the above; really useful!