How should I arrange my 100+ drive zpools and vdevs?

SKIP TO How should I arrange my 100+ drive zpools and vdevs? - #28 by Sawtaytoes for the final configuration and the journey to get there with pictures.

I have 100 SSDs in my Main NAS, and I think I’ve been doing zpools and vdevs wrong this whole time.

I have an offsite all-hard-drive NAS which I use for an extra backup.

Hardware arrangement

  1. 58 of them are directly attached to SFF8643 SAS cards.
  2. 4 are connected to the board’s NVMe. They’re Optane drives (for metadata).
  3. Connected to the board’s SATA, I have a 2-drive mirror for boot drives and 2-drive mirror for TrueNAS apps.
  4. I have an external SAS card with 4 SFF8643 ports that hold the remaining 34 drives. They’re in a second chassis with 4 SAS expander cards.
  5. Each of the 4 SAS expanders is connected to 15 drive ports, but only 3 of those cards are in use. In the future, I plan to use the 4th row of 15 drives for surveillance hard drives.

zpools

Because of this arrangement, with some drives in a different chassis, I have one zpool per chassis.

Every vdev is a 2-drive mirror. I currently have no other vdevs.

In both zpools, I have 2 drives assigned as hot spares.
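
For reference, each pool’s layout boils down to a single command. This is only a rough sketch; the pool and device names below are placeholders, not my real ones:

# Two 2-drive mirror vdevs plus two pool-wide hot spares
zpool create tank \
  mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B \
  mirror /dev/disk/by-id/ata-SSD_C /dev/disk/by-id/ata-SSD_D \
  spare  /dev/disk/by-id/ata-SSD_X /dev/disk/by-id/ata-SSD_Y

# More mirrors can be added later, two drives at a time
zpool add tank mirror /dev/disk/by-id/ata-SSD_E /dev/disk/by-id/ata-SSD_F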

Issues

I thought about this more and realized there are a few issues:

  1. My main chassis only has 2 free drive slots, so if I need more capacity, I’ll need to buy larger SSDs, but the size gain is minimal.
  2. Because I have 2 zpools of mirrors, I have 4 copies of the same data. I’m wondering if it’d make more sense to have only 3 copies of that data and get more overall storage capacity. Just remember that my hardware is split between dedicated SAS cards and SAS expanders.
  3. While it’s safer to have 2 zpools, with only 2-drive mirror vdevs, my ability to correct checksum errors is lessened versus having 3-drive mirrors.
  4. It’s also safer to have two zpools because a pool failure is easier to deal with.
  5. A second zpool can have different snapshots from the main zpool.
  6. Since my fastest uplink is 25Gb, using SAS Expanders shouldn’t affect zpool performance.

Question

I’m wondering what to do.

Should I:

  1. Keep what I have.
  2. Merge both zpools into a single zpool with 2-drive mirrors?
  3. Merge both zpools into a single zpool with 3-drive mirrors (resulting in greater storage capacity)?
  4. Merge both zpools into a single zpool with multiple copies of the same data (ZFS has a setting for this; see the sketch after this list)?
  5. Something better (please respond).
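
For option 4, the setting I mean is the ZFS copies property. A minimal sketch, with a made-up dataset name:

# Keep 2 copies of every block in this dataset (only affects newly written data)
zfs set copies=2 tank/important
zfs get copies tank/important

As far as I understand, copies=2 mainly protects against corrupt blocks; it doesn’t reliably protect against a whole-drive failure the way a mirror does, since both copies can land on the same disk.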

Are these SAS SSDs? If not, consider using interposers to improve availability if you haven’t already. Linus did a video fairly recently on a purchase he made where he uses those interposers to run regular SATA drives in a SAS environment:

This is part one, how he got it:

Linus’ storage solution from a year back, there’s a 2nd part too:

I’m aware it doesn’t really answer your questions, but hopefully gives you food for thought for an alternative solution suitable for your situation.

Depends on what your goals are.

How much capacity do you need?
How many users work with it?
Are there concurrent workloads and does everything have to go through a 25Gb link?

Adapt your vdevs to the requirements of your applications: raidz for capacity and throughput, mirrors for IOPS, and a ZFS recordsize/stripe width that matches the application.
Build your own “application-aware” storage.

https://blog.programster.org/allan-jude-interview-with-wendell-zfs-talk-more

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html#raid-z-stripe-width
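
A minimal sketch of the per-dataset tuning I mean (the dataset names are only examples):

# Large records for big sequential files (media, backups)
zfs set recordsize=1M tank/media
# Smaller records for random-I/O data (VM images, databases)
zfs set recordsize=16K tank/vms
zfs get recordsize tank/media tank/vms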

How much capacity do you need?

I’m fine with getting only half of each drive’s capacity as usable space. 100TB is fine for now. With the ~67TB I have right now, I feel limited since it’s not recommended to go over 70% capacity with ZFS, and I’ve got about 45TB on it (snapshots make this higher than the actual data).

How many users work with it?

Just me.

All I do is store files on there like photos, videos, documents, etc.

All my active projects like Source Code and Games are stored on their respective machines’ NVMe drives and backed up to the NAS.

Are there concurrent workloads and does everything have to go through a 25Gb link?

Not everything needs to be 25Gb, but it’s more convenient that way. This is a TrueNAS SCALE system, and they only let you DHCP a single interface :person_shrugging:.

My wife and kids access the NAS through a Roku or NVIDIA SHIELD when using Plex, but they’re usually only around 50-90Mb/s.

When using the NAS, downloading files, unzipping things, etc. I want it to feel like it’s natively connected to my PC.

I used to have these drives all in my PCs and then backed up to the NAS, but I had no way to sync the data. It was easier to just put everything on the NAS instead.

I have 3 PCs cable-connected which I use to access data from this NAS, but the other two are only 10Gb because I haven’t wired fiber to them.

Adapt your vdevs to the requirements of your applications: raidz for capacity and throughput, mirrors for IOPS, and a ZFS recordsize/stripe width that matches the application.

I want speed and reliability. That’s why I’ve chosen mirrors. Since I can fit 60 drives in each Storinator XL60, I have a lot of room to add more drives.

I can add or remove vdevs at will, and replacing a single drive takes minutes, not days. I’ve heard too many horror stories related to RAID-Z, and RAID-Z is also less versatile.
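
For context, swapping a single drive in one of these mirrors is roughly this (the GUID and device name are placeholders):

# Replace one member of a mirror; only that one drive resilvers
zpool replace tank <old-disk-guid> /dev/disk/by-id/ata-CT4000MX500SSD1_NEW
zpool status tank   # watch the resilver progress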

I’ll take a look at those articles you linked.

I also don’t wanna take this thread in a different direction, but I can at least answer your questions.

Are these SAS SSDs?

All my SSDs are SATA because they’re simply more affordable, and I already had 8 of them when I started building this thing. I use Crucial MX500 drives.

Consider using interposers to improve availability if you haven’t already.

I’ve seen those videos, but I didn’t think I needed SAS interposers. From a quick look at some ServerFault responses, they definitely have a use case.

Aside from the drives connected via SAS expanders, all other drives are directly connected to each port on those SAS cards.

Please provide a “zpool status”; I don’t get the 4 copies. And are we talking about the MX500 2TB?

# zpool status
  pool: Bunnies
 state: ONLINE
  scan: resilvered 65.9G in 00:23:44 with 0 errors on Thu Jul 27 05:38:05 2023
remove: Removal of vdev 30 copied 2.51G in 0h1m, completed on Tue Jul 18 22:54:35 2023
        2.73M memory used for removed device mappings
config:

        NAME                                      STATE     READ WRITE CKSUM
        Bunnies                                   ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            ed5edcb7-8763-49f0-bc00-f780a3b6f409  ONLINE       0     0     0
            e30dc3cd-cc08-4bf1-9b84-f9419c83b536  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            d0a69595-7e13-11ed-a976-a8a159c2849a  ONLINE       0     0     0
            62035c92-5940-432b-a6f4-d18b3f12fe20  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            d0b4495b-7e13-11ed-a976-a8a159c2849a  ONLINE       0     0     0
            d0ba8170-7e13-11ed-a976-a8a159c2849a  ONLINE       0     0     0
          mirror-3                                ONLINE       0     0     0
            fafb2e96-d1a5-45a5-8fa4-c79cbe9439dd  ONLINE       0     0     0
            b9790124-e168-49d7-bf15-4ef5d92c9727  ONLINE       0     0     0
          mirror-4                                ONLINE       0     0     0
            81d41751-fe27-4b1a-a45f-9ec11efd0ec2  ONLINE       0     0     0
            da8385bc-ce10-43f4-b1d6-b75ac22fc886  ONLINE       0     0     0
          mirror-5                                ONLINE       0     0     0
            070917b1-2043-4c93-8c46-34c6ec1f0a8a  ONLINE       0     0     0
            26bad8a4-7d6f-4542-8c17-d1c5aec98440  ONLINE       0     0     0
          mirror-6                                ONLINE       0     0     0
            d10956de-72eb-441f-864f-b1b810e3ed37  ONLINE       0     0     0
            46d58c6f-cf6c-4e87-80b4-0382b642a472  ONLINE       0     0     0
          mirror-7                                ONLINE       0     0     0
            a6620c29-00ef-491e-a76f-21a9cfb0d5fb  ONLINE       0     0     0
            12288a27-cd2a-48b7-923c-cc5897a10ee1  ONLINE       0     0     0
          mirror-8                                ONLINE       0     0     0
            fe6db0d0-45ca-4a9e-96d3-0c6d9ff235c3  ONLINE       0     0     0
            e6e0c7e6-b380-40ca-87e1-25292cad6215  ONLINE       0     0     0
          mirror-9                                ONLINE       0     0     0
            0fb792da-4f62-4b7b-8e7c-734e185ce37b  ONLINE       0     0     0
            485016cb-069a-4c4b-be78-05f472396179  ONLINE       0     0     0
          mirror-11                               ONLINE       0     0     0
            aad8d68e-1763-4c44-aa42-d5b95521bc12  ONLINE       0     0     0
            0c34ca9a-787d-4486-bab1-49edffee1432  ONLINE       0     0     0
          mirror-15                               ONLINE       0     0     0
            d8569602-18d1-441e-a97b-0c15c73c5776  ONLINE       0     0     0
            e7167f98-c5c0-4a5d-bda2-4348579bf98d  ONLINE       0     0     0
          mirror-16                               ONLINE       0     0     0
            dbc073ca-54e6-4391-8086-95f61a7d97bd  ONLINE       0     0     0
            1771fac6-ce5e-4ed9-afb1-cd840cad7cf4  ONLINE       0     0     0
          mirror-17                               ONLINE       0     0     0
            c2ef0edd-8450-4874-a238-8a89c8ef0f90  ONLINE       0     0     0
            a30ec5fa-b252-48c3-87f1-187146911fe5  ONLINE       0     0     0
          mirror-18                               ONLINE       0     0     0
            83cddd5d-5778-4a87-890d-885134f5d897  ONLINE       0     0     0
            db183522-817f-489e-b1bd-078867d52cd6  ONLINE       0     0     0
          mirror-19                               ONLINE       0     0     0
            ba02825f-88df-49a6-96c8-347a25a4ea6e  ONLINE       0     0     0
            fe16ba6b-22a7-48fd-b2d6-81bbb5e1c22f  ONLINE       0     0     0
          mirror-20                               ONLINE       0     0     0
            d0548c12-6f97-42b4-986d-5f53cb79d682  ONLINE       0     0     0
            f4394834-efa9-45e3-8553-133c3e73091f  ONLINE       0     0     0
          mirror-23                               ONLINE       0     0     0
            db236920-8547-42ec-b1df-03cba6fa1745  ONLINE       0     0     0
            e89f8abd-0807-46ea-8bb7-2e56ce53e51d  ONLINE       0     0     0
          mirror-24                               ONLINE       0     0     0
            e7c12f73-a06c-446a-8145-ca2fa121c326  ONLINE       0     0     0
            3b5a83a8-acc5-439c-ac58-59ee2277ae4b  ONLINE       0     0     0
          mirror-25                               ONLINE       0     0     0
            bf8ed2e1-6e43-4095-adb4-a98c69b50964  ONLINE       0     0     0
            0c953574-673e-458d-8595-2c2f5168b8de  ONLINE       0     0     0
          mirror-26                               ONLINE       0     0     0
            355b0cd1-11f8-4369-bba3-7988f921730c  ONLINE       0     0     0
            f653c5f5-a4a6-46ff-a4f9-aef5626b7330  ONLINE       0     0     0
          mirror-27                               ONLINE       0     0     0
            b3253c5f-7b4b-4d06-bdea-18f57e349394  ONLINE       0     0     0
            45dca8cf-752b-43ac-b698-7abb80506f08  ONLINE       0     0     0
          mirror-28                               ONLINE       0     0     0
            5f8c7fb5-e2fd-4740-a47e-886632fb27cd  ONLINE       0     0     0
            d8563a9f-0a41-43ec-9649-b6858cc8b2e1  ONLINE       0     0     0
          mirror-29                               ONLINE       0     0     0
            635785c4-202e-418f-b190-3ca68cb0fa55  ONLINE       0     0     0
            e44144a7-8f9f-470c-96f4-e72e55484fbe  ONLINE       0     0     0
          mirror-31                               ONLINE       0     0     0
            fd63daf7-4c04-4f52-ba01-b41ed627b406  ONLINE       0     0     0
            b4863955-3c1a-4d37-a41e-930137c90432  ONLINE       0     0     0
          mirror-32                               ONLINE       0     0     0
            5e00ea87-03a0-4915-82ea-907ed4cdabe1  ONLINE       0     0     0
            7f31f244-1133-4076-b961-463f28a1f2b6  ONLINE       0     0     0
          mirror-33                               ONLINE       0     0     0
            9940c16a-920e-4bb0-9a41-dac0182dd00f  ONLINE       0     0     0
            482e06b1-7d2c-4238-b917-82c8b621a773  ONLINE       0     0     0
          mirror-34                               ONLINE       0     0     0
            c6b6b2ea-6aa7-4b63-881c-f1e9131e4d3d  ONLINE       0     0     0
            ac596471-f702-40c9-8fc9-0fd90acf7b7d  ONLINE       0     0     0
        special
          mirror-13                               ONLINE       0     0     0
            4f961158-125c-474c-a90b-fdc6f9833f4a  ONLINE       0     0     0
            ac5c6fe2-0b3b-4941-9337-28649600136e  ONLINE       0     0     0
          mirror-14                               ONLINE       0     0     0
            25e755ed-cc1c-438c-890c-d9ff5be64191  ONLINE       0     0     0
            548a3fcd-4aba-4b70-814c-079e388e41dc  ONLINE       0     0     0
        spares
          85ce9c9a-0527-402a-94c4-649ecd78d93d    AVAIL   
          d9af6488-edd1-4d5d-a48e-379c974a5712    AVAIL   

errors: No known data errors

  pool: TrueNAS-Apps
 state: ONLINE
  scan: scrub repaired 0B in 00:05:35 with 0 errors on Wed Jul 19 04:27:36 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        TrueNAS-Apps                              ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            53c7f825-cb94-4237-8c79-728becb6c753  ONLINE       0     0     0
            b531e1a4-3069-4c90-9437-77eeb333eaac  ONLINE       0     0     0

errors: No known data errors

  pool: Wolves
 state: ONLINE
  scan: resilvered 36.4G in 00:08:39 with 0 errors on Thu Jul 27 15:51:01 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        Wolves                                    ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            7ea93cff-4798-4451-8914-8bd8475340b3  ONLINE       0     0     0
            2da54a3c-8725-478e-adaf-1a7a53c0890b  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            a9e3750e-7381-436d-a4de-8d68a8136bdd  ONLINE       0     0     0
            e631eb85-1ab7-40b5-8161-727e8d1a9a3a  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            f287f68c-2293-4d4b-abde-e6e019faf73b  ONLINE       0     0     0
            e471104b-c2ab-4e2b-9c7c-14aa187aef32  ONLINE       0     0     0
          mirror-3                                ONLINE       0     0     0
            33dcaafb-6eae-4931-8743-20f9b32d44fd  ONLINE       0     0     0
            4fa1a66c-effc-4531-b5c0-8c514293055d  ONLINE       0     0     0
          mirror-4                                ONLINE       0     0     0
            d2edc92c-4250-40c4-b382-0beb39ab8d6d  ONLINE       0     0     0
            0da118f2-7df8-42ee-8a94-fa02dcca1b96  ONLINE       0     0     0
          mirror-5                                ONLINE       0     0     0
            11fcc837-6d7c-466e-b2c0-a0d5e8c82fa5  ONLINE       0     0     0
            435d8322-9143-42fe-803b-7b6abbe75baa  ONLINE       0     0     0
          mirror-6                                ONLINE       0     0     0
            dc9cafbd-4c82-4ae0-a3be-b47ed18fa6ac  ONLINE       0     0     0
            3bb29f56-8158-4ac9-8d25-8aff78bd1f7f  ONLINE       0     0     0
          mirror-7                                ONLINE       0     0     0
            6084d12e-e15c-4f89-8bd6-396bc723fb84  ONLINE       0     0     0
            72fa67ef-c81d-4b98-a7b7-aa192e1625ea  ONLINE       0     0     0
          mirror-8                                ONLINE       0     0     0
            e8bc4881-36f2-45f3-9795-597ce62e0e05  ONLINE       0     0     0
            2b55a8d1-a442-45b8-8431-853ca7b10e9b  ONLINE       0     0     0
          mirror-9                                ONLINE       0     0     0
            0c8790ab-d731-4ae2-8b4d-f134b02929d2  ONLINE       0     0     0
            ec03bf0d-4168-418f-b83b-7e21e457ca83  ONLINE       0     0     0
          mirror-10                               ONLINE       0     0     0
            8fbf70fd-c1ad-4a22-8440-5ceba8d08651  ONLINE       0     0     0
            b313dd9b-febb-4920-b63b-0df0d6ac1e72  ONLINE       0     0     0
          mirror-11                               ONLINE       0     0     0
            89bd80f0-a350-4405-a4c8-8fc40d137b9b  ONLINE       0     0     0
            9af510b2-dcb8-4ecc-b449-a63607024c5d  ONLINE       0     0     0
          mirror-12                               ONLINE       0     0     0
            80bf4f91-48ce-47d5-a985-bb706b231d0f  ONLINE       0     0     0
            78cb889e-ab63-48d4-80e9-fce901a6d7bc  ONLINE       0     0     0
          mirror-13                               ONLINE       0     0     0
            5f0cff31-6a7b-43de-a050-a0c2873fdc65  ONLINE       0     0     0
            6723932b-5de0-4adc-ae62-83fdf243143c  ONLINE       0     0     0
          mirror-14                               ONLINE       0     0     0
            a23b7a21-2ad4-4904-a952-b45af3bda968  ONLINE       0     0     0
            57a2d473-8bd4-491e-b66d-950e0bc33bdf  ONLINE       0     0     0
          mirror-15                               ONLINE       0     0     0
            f3f66d81-5ad0-45d9-8784-a83de883bbb8  ONLINE       0     0     0
            f49cd42c-5b9d-4cdb-80a5-4cb4cabe3cd3  ONLINE       0     0     0
          mirror-16                               ONLINE       0     0     0
            8511c931-a917-4e70-95ac-b746a1d9b305  ONLINE       0     0     0
            021f1f63-42d9-4264-996b-e938310327f5  ONLINE       0     0     0
        spares
          226b101d-c20c-4f72-a844-84ee2848125a    AVAIL   

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:08 with 0 errors on Wed Aug  2 03:46:09 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdbt3   ONLINE       0     0     0
            sdcd3   ONLINE       0     0     0

errors: No known data errors

No. It’s a mix of 2TB and 4TB; at least, for the main pool.

I had eight of the 2TB drives prior to setting up this NAS, and when I went to buy more drives for this NAS, there were deals, but there was a limit on how many 4TB I could get, so I bought both 2TB and 4TB.


And you replicate the data from Bunnies to Wolves, hence the 4 copies and 67TB of usable space?
I mean, if you absolutely don’t want raidz and you want the data replicated to a second pool, then your config is the way to go.
Two separate pools are always better than just one pool with more redundancy.
Although I don’t understand why you replicate within the same system and additionally to a second backup NAS, or don’t you have an extra backup NAS?
If not, think about a backup NAS. I lost four disks at once 4-5 years ago due to a short in my SAS backplane; without a second system, all data would have been gone.

I mean, I wouldn’t have gone with 100 SSDs for your use case; 25Gb sequential with large files is easy to get.
SAS HDDs would do the job perfectly and you would have far more capacity, so the topic of expansion would be off the table for the lifetime of this server.
With less hardware, less can break!

Are you sure? I can’t think of any reason why there should be a restriction…
At the same time, I can’t think of any reason why you would need a DHCP address with TrueNAS.
My TrueNAS has 1Gb, 10Gb and 2 x 56Gb interfaces, each with a fixed IP address; it’s a server after all.

You can create partitions so the entire capacity of the 4TB SSDs is used and build a second zpool from the leftover partitions.
I did this with my system because I have a mix of 12TB and 16TB drives.
This is not recommended on FreeBSD, but I haven’t had any problems with it on Linux.

sdb           8:16   0  10.9T  0 disk  
└─sdb1        8:17   0  10.8T  0 part  
sdc           8:32   0  10.9T  0 disk  
└─sdc1        8:33   0  10.8T  0 part  
sdd           8:48   0  10.9T  0 disk  
└─sdd1        8:49   0  10.8T  0 part  
sde           8:64   0  10.9T  0 disk  
└─sde1        8:65   0  10.8T  0 part  
sdf           8:80   0  14.6T  0 disk  
├─sdf1        8:81   0  10.8T  0 part  
└─sdf2        8:82   0   3.7T  0 part  
sdg           8:96   0  10.9T  0 disk  
└─sdg1        8:97   0  10.8T  0 part  
sdh           8:112  0  10.9T  0 disk  
├─sdh1        8:113  0  10.8T  0 part  
└─sdh2        8:114  0 101.8G  0 part  
sdi           8:128  0  10.9T  0 disk  
└─sdi1        8:129  0  10.8T  0 part  
sdj           8:144  0  14.6T  0 disk  
├─sdj1        8:145  0  10.8T  0 part  
└─sdj2        8:146  0   3.7T  0 part  
sdk           8:160  0  14.6T  0 disk  
├─sdk1        8:161  0  10.8T  0 part  
└─sdk2        8:162  0   3.7T  0 part
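
As a rough illustration of the approach (device names and sizes are placeholders, not my actual commands):

# Carve a 4TB drive into a slice matching the smaller drives plus a leftover slice
sgdisk -n 1:0:+2000G -n 2:0:0 /dev/sdX
# Pool the leftover second partitions from several such drives
zpool create leftovers mirror /dev/sdX2 /dev/sdY2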

edit: Building a NAS and getting it right
https://jro.io/nas/#overhead


2 in the same machine because I don’t wanna run 2 servers.

And then yes, I have an offsite NAS using HDDs. I would’ve had HDDs all-around, but 9 out of the 66 I bought didn’t work, so I kept 24 and sent the rest back.

Wasn’t worth the hassle for my system at home.

Also, the SSDs I already had held all my data, so I didn’t expect to need to buy 100+ more at the time.

Snapshots and disc rips don’t help. The majority of my own personal data, like documents, app installers, etc., takes up a reasonable amount of space.

I set this up for my YouTube videos because some of them are ~150GB each. Although, disc-rips are growing much faster than YouTube videos.

I tested with a bare-metal Windows install last week to see how much performance is lost due to virtualization.
I also added a P4618 NVMe to my NAS to see the difference between HDD and NVMe for my workload.
I configured the two partitions of the P4618 as a stripe and did some tests with SMB and iSCSI.

To make a long story short: for SMB it makes no difference whether it is a bare-metal Windows installation or a VM, at least with a sequential workload; with 4K random, bare metal is twice as fast.
The limiting factor is the CPU of my NAS, at least that’s what it looks like at the moment.

Bare-metal iSCSI performance doubles compared to virtualization, and the HDD zpool wins with files larger than 15GB.
That actually shouldn’t be possible, or rather, it’s currently unclear to me why, because the Intel P4618 is rated for sequential read/write up to 6650/5350 MB/s.

12 GB file from NAS HDD Zpool to local NVME

18GB file from local NVME to NAS HDD zpool

Local NVME to NAS NVME zpool

With both pools, the performance goes down after about 15GB, but the HDD pool holds up better overall. In my opinion, this has more to do with the ZFS settings of TrueNAS SCALE than with the capabilities of the hardware.

Disk5 is the NVME pool, Disk6 HDD pool

Switching to SSDs for my NAS doesn’t make sense in my case, but SR-IOV would be useful. I think I’ll swap the CX-3 for a CX-4 on the client side and leave it at that.


Hey, sorry to respond so many months later.

I’ve figured out what I want, and it requires I rethink my data storage strategy on this NAS.

Secondary NAS is fine (for now)

On my secondary NAS with an HDD cluster, it’s all mirrors, but it has 100TB of space. So far, that’s enough:

[image: secondary NAS pool capacity]

Data filling up

But my main NAS is the problem, and I need a solution quick:

[image: main NAS pool usage, filling up]

While it might look like “wow, 5TB will take some time”, I’ve been ripping every movie, show, and CD disc I buy (into MKVs, not the whole ISO :P), and as you can see, it’s quite a bit of space; only ~7TB of it is snapshots.

Bunnies is my main pool and Wolves is the every-6-hours backup.

Rethinking my strategy

Instead of buying another Storinator XL60 or continually buying more 4TB SSDs, I think a new strategy is in order.

I went with mirrors because:

  1. They’re more flexible. You can add and remove drives for redundancy at will.
  2. You can easily upgrade the capacity of a mirror by purchasing only 2 drives (see the sketch after this list). Adding or removing a RAID-Z vdev requires buying 10-15+ drives at a time, and I typically only buy drives in sets of 2-4.
  3. You don’t even need to remove any existing drives from the mirror until after resilvering is complete.
  4. You can add more mirrors with a minimum of 2 drive purchases.
  5. You can remove mirror vdevs very quickly. Is this the case with RAID-Z?
  6. Mirrors get the most IOPS and read throughput out of ZFS.
  7. 2-drive mirrors can only sustain 1 drive death, but they’re also much faster to resilver, so the risk of data loss is minimized.
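
As promised in point 2, here’s a rough sketch of that flexibility (pool, vdev, and device names are placeholders):

# Upgrade a mirror in place: attach bigger drives, let them resilver, then detach the old ones
zpool set autoexpand=on tank
zpool attach tank ata-2TB_OLD1 /dev/disk/by-id/ata-4TB_NEW1
zpool attach tank ata-2TB_OLD2 /dev/disk/by-id/ata-4TB_NEW2
# ...after the resilvers complete:
zpool detach tank ata-2TB_OLD1
zpool detach tank ata-2TB_OLD2

# Remove an entire mirror vdev; its data gets evacuated onto the remaining vdevs
zpool remove tank mirror-5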

Still, mirrors are only good if you have a ton of money and enough drive slots.

I have more physical ports on my SAS expanders and more SAS controllers, but I don’t have enough slots in my case.

Mirrors simply don’t scale for home use. I shouldn’t have to buy another Storinator XL60 this soon after buying the other two. I really wanna rethink my strategy.

Using what I have

RAID-Z seems like the best option. Without even buying new drives, I can go from 50% usable capacity from mirrors to 80% with RAID-Z!

Which should I choose?

I’m thinking either of these:

  1. 10-drives in RAID-Z2
  2. 15-drives in RAID-Z3

vdevs are nice because they don’t need to match in size or capacity, so I can always mix 'n match based on your suggestions.

:question: Should I do more or less in either case? :question:
:question: Which RAID-Z should I go with? :question:

My current drives

  • I have 38 2TB drives and 22 4TB drives in the top case using all 60 slots.
  • I have 34 4TB drives in the bottom case. That leaves me with 26 slots for this transition.
  • I also have 15 4TB drives in shipping to my house right now. Exactly enough for a single RAID-Z3 vdev.

Rethinking hotspares

I dedicate 2 drives to hotspares in both zpools. Those need to be 4TB drives. I’m wondering if I even need hotspares if I go RAID-Z3.

At that point, cold spares make more sense and allow more total capacity and full utilization of each slot in the case.
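
For what it’s worth, hot spares are cheap to add and pull back out, so the choice isn’t permanent (names are placeholders):

# Add a pool-wide hot spare
zpool add tank spare /dev/disk/by-id/ata-4TB_SPARE
# Later, pull it back out to keep as a cold spare or to use in a new vdev
zpool remove tank ata-4TB_SPARE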

Steps to upgrade

I have two ideas on how to upgrade.

  1. Add vdevs to the main pool first and eventually move all 2TB drives to the backup pool.
  2. Add vdevs to the backup pool first and once that’s complete, start adding vdevs to the main pool; slowly removing all mirrors.

:question: Which should I do? :question:

I’m thinking the first one is better because it has the possibility of removing all 2TB drives this way. Eventually, I’d like to have all 4TB drives, but maybe that won’t matter if I move to RAID-Z3.

Steps for the first idea

  1. With the 15 4TB drives on the way, I will make a single RAID-Z3 vdev and add it to the main pool.
  2. I’ll start removing 4TB mirrors from the pool until I have enough to make another 15-drive RAID-Z3. Then add that back to the main zpool. I’ll probably need to take my four 4TB hotspares to have enough drives.
  3. Just two of these 4TB vdevs are enough to keep this pool happy for a while at 96TB.
  4. I can start removing every 2TB mirror from the main zpool now.
  5. Once I’ve removed all those 2TB mirrors, I can work on upgrading the backup pool.
  6. I can add a bunch of 2TB 10-drive RAID-Z2 vdevs to the backup pool temporarily (I’ll need to borrow two 4TB drives since I only have 38 2TB drives). That’s 64TB of space, which is more than the backup pool’s total capacity today.
  7. Once those are created, I can remove all 4TB mirrors from the backup pool.
  8. With only 30 of those 34 4TB drives, I’ll make two new 15-drive 4TB RAID-Z3 vdevs in the backup pool. That’s another 96TB.
  9. Now, I can remove all four 2TB RAID-Z2 vdevs from the backup pool, and I’m done for now.

All in all, this will result in 96TB of usable capacity in each zpool with 30 slots remaining in each case. I’ll have 11 4TB and 38 2TB drives just lying around at this point, ready to be used in case I need more capacity. Having extra 4TB drives is nice for cold storage.

With the 38 2TB drives, I can also add two more 15-drive RAID-Z3 vdevs. Not bad considering that’s another 24TB of usable capacity per vdev (48TB total).

:question: What do you think of this new strategy? :question:

I did a lot of reading on this, and made the decision to go RAID-Z2 in this configuration:

In both zpools, I’m gonna make it 3 x 4TB 10-drive RAID-Z2 and 1 x 2TB 10-drive RAID-Z (a sketch of adding one such vdev follows the list below).

  1. Hot spares are still useful because they cover the whole zpool. In the event of a failure, ZFS swaps the drive in for you, so all you have to do is pull the bad drive and replace the hot spare.
  2. 15 drives in RAID-Z3 means I will have fewer vdevs, 2 or 3 depending on how I handle it. That’s not good either because I will lose a ton of speed compared to my mirrors. I have 25Gb NICs I’d like to be capable of saturating.
  3. With smaller RAID-Z2, it will be much cheaper and easier to create a new vdev and also easier to manipulate disks in the chassis should I need to do some Tetris.
  4. With the 38 2TB drives, I technically only need two more drives for four RAID-Z2 vdevs. This will give me emergency expandability in the future for the cost of two 4TB drives.
  5. With the amount of Tetris required, I don’t think I have enough 4TB drives and space available to actually create two 15-drive vdevs. I might wind up having to use a 10-drive RAID-Z2 vdev anyway, so why not keep to it then?
  6. Since I have a 6-hour backup pool and an offsite backup NAS, RAID-Z2 is probably good enough. Still better than a mirror in terms of failure compensation.
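
As mentioned above, each new 10-wide RAID-Z2 vdev is a single add against the existing pool. A hedged sketch with placeholder device names:

# Add one 10-drive RAID-Z2 vdev of 4TB disks
# (-f is needed while the pool still contains mirror vdevs, since the replication levels differ)
zpool add -f tank raidz2 \
  /dev/disk/by-id/ata-4TB_01 /dev/disk/by-id/ata-4TB_02 \
  /dev/disk/by-id/ata-4TB_03 /dev/disk/by-id/ata-4TB_04 \
  /dev/disk/by-id/ata-4TB_05 /dev/disk/by-id/ata-4TB_06 \
  /dev/disk/by-id/ata-4TB_07 /dev/disk/by-id/ata-4TB_08 \
  /dev/disk/by-id/ata-4TB_09 /dev/disk/by-id/ata-4TB_10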

The question is how big do I want my RAID-Z2 vdevs.

I think 10 drives is fine. 15 is perfect for this chassis, but 10 gives the same 80% usable capacity, and I’ve read that 10 drives in RAID-Z2 is a good number for parity:

If I’m wrong, please gimme that info :). I’ll be working on this over the next few days.


RAID-Z2 with a hot spare is IMO plenty for a home NAS, given that you have a well-maintained backup system.
Personally I don’t use RAID-Z2/3 at home because I don’t need the availability. I destroy my pools every other year anyway to keep fragmentation in check, so if a resilvering goes sideways, I see it as maintenance :slight_smile:

Which, by the way, has never happened to me, and I’ve used ZFS since the release of Solaris 11, a bit over 11 years now.
But a few years ago I lost four disks at once due to a short circuit in the SAS backplane; RAID-Z3 wouldn’t have helped either. Since then I’ve been checking my backup system regularly.

As to your question of how wide your vdevs should be: since I’ve always used compression, I never worried about it; other criteria were always more important. But here’s a good article about it, if you don’t already know it.

Number of drives in a RAID-Z doesn’t matter

Thanks! That’s a great explanation.

I use compression, but it’s clear it shouldn’t matter how many drives you have.

I have too many datasets to wanna recreate them, but I can fix fragmentation by adding new datasets and removing old ones.

I wish ZFS had a defragger.
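
One hedged way to get most of that effect with plain ZFS tooling (dataset names are examples): send a dataset to a new name, which rewrites its blocks sequentially, then swap the names.

zfs snapshot -r tank/media@rewrite
zfs send -R tank/media@rewrite | zfs receive tank/media-new
# after verifying the copy:
zfs destroy -r tank/media
zfs rename tank/media-new tank/media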

How many drives?

Should I do more than 10 drives for RAID-Z2?

I wanna use as many of the 60 slots as possible. I thought 10 made sense because I can have 6 vdevs that way, but I will have drives as hot spares, so I should leave room for them; fewer than 10 slots though.

11-drive vdevs

Using 11-drive vdevs, I get 5 vdevs with 5 slots left over. That seems like a better idea now. I’ll probably go with that.

For now, I’m only gonna have 4 vdevs. Using 11 drives lets me have 4 vdevs with 16 remaining slots. Enough to do some tetris if need be.

I think I’ve been going about this all wrong.

dRAID; the answer to ZFS storage

Maybe I should be using dRAID instead. It provides a lot more flexibility and can scale up and down to however many drives I want.

It’s just as flexible as ZFS mirrors, but with RAID-Z-style parity.

I can remove drives from the vdev!

Storage Comparison (mirror vs dRAID)

My main storage pool with 60 drives (26 x 2TB, 34 x 4TB):

A new dRAID2 pool I spun up with 2 distributed hot spares (same as before), and the whole thing is only 32 x 4TB drives:

Even without any 2TB drives and with four fewer 4TB drives, the usable capacity of dRAID is pretty spectacular. Very similar to having many RAID-Z2s.
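
For anyone who wants to poke at the layout without committing real hardware, here’s roughly how I’d spin up a throwaway dRAID2 pool on sparse files (test-only sketch):

# 32 fake 1GiB disks
for i in $(seq -w 1 32); do truncate -s 1G /tmp/draid-disk$i; done
# dRAID2 vdev: 2 parity, 8 data per group, 32 children, 2 distributed spares
zpool create draidtest draid2:8d:32c:2s /tmp/draid-disk*
zpool status draidtest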

Special (metadata) vdevs

NOTE: You absolutely want to have Special Metadata devices defined for any zpool using dRAID vdevs.

If you don’t add a special metadata vdev, you lose a lot of usable capacity because each small metadata block gets padded out to a full fixed-width stripe.

Small files without a special vdev bloat your dRAID because of how it does fixed-width stripes.

RAID-Z uses dynamic-width stripes instead, but it suffers from slow resilvering and is difficult to expand or contract.
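
A hedged sketch of what adding one looks like (device and dataset names are placeholders):

# Mirrored special vdev for metadata
zpool add tank special mirror /dev/disk/by-id/nvme-OPTANE_1 /dev/disk/by-id/nvme-OPTANE_2
# Optionally also steer small file blocks to the special vdev instead of the dRAID
zfs set special_small_blocks=64K tank/media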

Drawbacks

dRAID2 is like one big RAID-Z2 in terms of redundancy, but it’s a lot more resilient than one super-large RAID-Z2 because of how it distributes stripes, parity, and spare space.

Differences with RAID-Z2

If I have a single RAID-Z2 vdev, then it’s not much different from dRAID except that today’s RAID-Z2 doesn’t support expansion and definitely doesn’t support contraction.

Losing a drive in RAID-Z2

If I have multiple RAID-Z2 vdevs, that’s where differences pop up.

If you lose a drive in one RAID-Z2 vdev, a second failure is more likely to land in a different vdev, which still has its full 2-drive redundancy, than in the already-degraded vdev.

Losing a drive in dRAID2

With dRAID2, the whole vdev of X drives (could be 100+) only has 2-drive redundancy, but that redundancy is spread out equally to all drives (with some randomization algorithm).

If a drive dies in this massive vdev, you’ll wanna replace it ASAP, which is why they suggest at least 1 distributed hot spare as part of the vdev. The spare capacity is spread across every drive in the vdev, so in a sense all drives act as the hot spare at the same time. That’s why it’s so much faster to resilver; the rebuild reads from and writes to the whole vdev rather than writing to a single drive.

Now, when a physical drive needs to be replaced, I think that is still slow. I’m completely unsure how this is handled, and the docs aren’t clear. I think it’s related to how easy it is to add and remove drives from a dRAID; something about “sequential” and “healing” types when resilvering.

dRAID expansion

I spoke too soon. You can’t remove drives in a dRAID (why not?):

[image: error when trying to remove a drive from the dRAID vdev]

I can’t seem to add them either. Anyone know the command?

I might’ve misread dRAID expansion somewhere. I know RAID-Z has expansion coming, and I thought dRAID already had it. Hmm… :thinking:.

Other Testing

In some other testing, I tried offlining 3 drives. Nope! You can only offline 2 with dRAID2, just like RAID-Z2.
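
The same check can be reproduced on the throwaway sparse-file pool from the earlier sketch:

zpool offline draidtest /tmp/draid-disk01
zpool offline draidtest /tmp/draid-disk02
# the third offline is refused because no valid replicas would remain
zpool offline draidtest /tmp/draid-disk03
zpool online draidtest /tmp/draid-disk01 /tmp/draid-disk02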

dRAID is for large installations with many disks, so I haven’t tested it yet. As far as I know, the biggest advantage of dRAID is the massively reduced resilvering time, but this comes with the disadvantage that the stripe width is fixed, so pool topology matters more again.
Short resilvering times are key in production environments with an SLA for storage. I don’t have an SLA signed with my wife, so I’m off the hook here :wink:
How long does it take to resilver a 4TB SSD, two hours? And is that a problem?


There is no such thing. Expansion follows the same rules as normal: You just add another vdev.

And dRAID support in TrueNAS is very, very new; I wouldn’t trust dRAID on any TrueNAS derivative right now. SCALE still has a lot of bugs and dRAID is low priority. I recommend testing in Proxmox or spinning up a Linux server and manually installing ZFS. Proxmox added dRAID a year ago and it works as expected.

Mirrors and dRAID use sequential resilver, which runs at pretty much max sequential write speed, so SSD and NVMe resilver time is trivial. Even if you use RAID-Z, NAND is damn fast.

But resilvering 100s of TB worth of HDDs… yeah, you will love that sequential resilver. Having sub-par performance while also running in a degraded state with a resilver job in the background… that pool’s performance will suck for extended periods if the failure rate isn’t a one-time incident.

That’s why dRAID is great. If your pool is always degraded because there is almost always a resilver in progress… you kinda want a solution that deals with the spares at the same time.

I listened to a lecture from CERN regarding resilvering PB-scale Ceph clusters… boy, if you are doing stuff at PB scale, you don’t want a chassis with 22TB drives failing, ever. I’ve heard of cloud providers not using anything larger than 4-8TB NVMe because resilver and maintenance drag down the cluster and networking for uncomfortable amounts of time.


RAID-Z Uncertainty

I’m uncertain about RAID-Z in general.

I’ve heard that even with just 11 drives in my RAID-Z2, my rebuild times are gonna be awful, so who knows.

I’m gonna run more tests coming up here soon.

But also, my arrays are 30 to 70+ drives. I can do it with RAID-Z2, but dRAID seems so much more appealing. Something cool and mysterious.

RAID-Z vs dRAID pool topology

Yes, dRAID is gonna be bad for small files, but even so, it’ll give me a lot more space than all the mirrors I’ve been expanding :stuck_out_tongue:.

Real dRAID Expansion is a thing

dRAID Expansion is a thing, but it’s in IBM’s implementation of who knows what filesystem:

Surprisingly, this article is the reason I finally understood dRAID in the first place. Sadly, the ZFS implementation leaves out that important expansion feature.

The upcoming RAID-Z expansion feature also doesn’t touch dRAID :frowning:.

My wife’s SLA

My wife does have an SLA for me.

Sometimes she wants to watch something on Plex with the kids while I’m working. This stuff needs to be up for that reason alone.

My kids also use Plex on their mobile devices sometimes. Surprising that in 6 months, it’s become so important.

Great points! I’m not in the PB territory, but I’m at 82TiB and growing.