Choosing the right drives for ZFS

I have a Synology with 8 drives; I’m happy with it and it works well. But now that I’ve been looking at a 2021/2022 model with 12 drives, they seem to have removed the NVMe M.2 cache slots, and I felt it’s time to jump over to ZFS.

The goal right now is TrueNAS Core or TrueNAS Scale, depending on what the status is in two months when I get the hardware.

I am getting a NetApp DS4246 enclosure, connected to an LSI SAS 9300-8e in a new (used) server.
I am looking to run 7x Exos 16TB per vdev in RAIDZ and have one or perhaps two hot spares. The goal is to start with one vdev and then add two more later, since the disks are expensive.
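
Roughly what I have in mind for the first vdev (just a sketch; pool and device names are placeholders):

zpool create tank raidz da1 da2 da3 da4 da5 da6 da7 spare da8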

I’ve got two CPUs and a few U.2 slots, and I thought, why not just do something fun?
I was thinking about two of the smaller 100GB Optane P4801X drives for SLOG.

For L2ARC I am thinking about two Intel DC P4600 1.6TB 2.5" U.2 PCIe NVMe SSDs.

I’ve got a dual-port 40GbE card for networking to my Arista switch.

So, is there anything more I need to think about?
I know that there are metadata (special) vdevs, but I don’t see the need right now; correct me if I am wrong.

First post here, I hope I got the right part of the forum.

Hey, welcome to the forum!

I’m sure someone with experience will drop by, but as you’ll see, there are plenty of opinions on setups here.

My $.02 would be to suggest dropping the spare and running an 8-drive Z2. If you already have Optane for caches, then great, the random access can be awesome, but L2ARC needs to warm up, so the effect isn’t always immediate.

40gig switch sounds pretty leet. Not sure what’s connected at the other end to source data from / send it to that fast, or if it’s just going to be between the storage server and the computers it’ll serve?

Otherwise, welcome to the forum, I’m sure you’ll find lots of differing ideas and reasonings around :slight_smile:

ZFS doesn’t really care. You just feed raw disks to the pool and ZFS is happy.

I went with Toshiba MG08 16TB, but that was more a matter of price; I was considering the Exos as well. Solid 250MB/s write performance during send/receive with the 512MB cache. Can’t complain about my choice, but I’ve only been using them for 2 months, so I can’t say anything about long-term failure rates yet.

Got all my metadata and small blocks on NVMe as a special allocation vdev, without any L2ARC. My memory is plenty to cover the MRU.
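
As a rough sketch of that setup (pool, dataset and device names here are placeholders):

# mirrored special vdev for metadata and small blocks
zpool add tank special mirror nvme0n1 nvme1n1
# send blocks up to 64K to the special vdev for this dataset
zfs set special_small_blocks=64K tank/data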

Don’t expect to write anywhere near that network bandwidth with a single RaidZ2 vdev.

800MB/s sequential writes / 1GB/s sequential reads should be within reach with 8x16T raidz2 …

(3-4GB/s … doubtful unless there’s a biiiig read/write cache somewhere and your workload makes it very likely for reads to come out of cache)

(There’s all kinds of things to consider)

A special vdev should help with random small files and metadata, but I’m not sure how much benefit you’d have out of L2ARC / SLOG at that point.

@Log might have useful insights

Sorry about the slow reply from my side.
Then I will go with an 8-drive Z2, thanks!
Now I am planning to use either a NetApp shelf or an HP 19" M6720 Storage 24x 3,5" LFF HDD SSD 2x JBOD 6G SAS SATA Controller 4x SFF-8088 2x PSU (TrueNas ZFS, Ceph, NAS, SAN) - Serverschmiede.com GmbH

and an LSI SAS9300-8e: LSI SAS9300-8e 9300-8e PCIe x8 2x SFF-8644 12G SAS3 S-ATA HBA for HDD SSD JBOD Controller (ZFS, Ceph, MS Storage Spaces) - Serverschmiede.com GmbH
Should I connect each shelf controller (head) to its own port on the card, or should I daisy-chain one head to the other and connect that one to the LSI card?

Thanks in advance!

Maybe as an alternative to the disk shelf approach, not sure if your mind is already made up, but how about this case: Inter-Tech 4F28 Mining-Rack ab € 140,76 (2021) | Preisvergleich geizhals.eu EU

… and then go with an HBA with internal ports and a bunch of SFF-8087 to 4x SATA cables, which are about $10-$15 a piece - for 16 drives you need 4.

… and btw, the cheaper SAS2/LSI 92xx cards are good enough for SATA3 HDDs; each lane is 6Gbps and spinning rust is slower than that.
… the only question then is how many PCIe lanes you can spare on these old refurbed cards… often they’re PCIe 2.0 x8 (~40Gbps raw), and that may be a reason to go with a more expensive card.

You may also want to consider using a “SAS expander”; it’s basically like a network switch for the SCSI protocol that just uses PCIe for power, but lets you connect a lot more drives to a single controller (you can even daisy-chain them)… of course, there’s the question of bandwidth into the controller: you get 24 Gbps for 4x SAS2 lanes, or roughly 10 HDDs’ worth of sequential spinning rust throughput…

Since I won’t have all the disks at the start, I expect to buy 8 at a time.
Should I set up 3 raidz2 or is it possible to use draid2 for this?

You don’t need to specify the grand plan right away, as you can only add to the pool the things you actually have.

zpool create poolname raidz2 disk1 disk2 disk3 ... disk8

Using disk-by-id is recommended; sda, sdb etc. are faster to type but can cause trouble when disks are changed or reordered.

and then zpool add to add your cache, log and special vdevs to the pool. When the new drives arrive, zpool add is your friend again. Really only a couple of simple commands to get a basic pool configuration working.
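
Something like this, as a sketch (pool and device names are placeholders; in practice use the /dev/disk/by-id/ paths):

# mirrored SLOG, a cache (L2ARC) device and a mirrored special vdev
zpool add poolname log mirror optane0 optane1
zpool add poolname cache nvme0
zpool add poolname special mirror nvme1 nvme2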

Can’t say much about draid as it was never relevant to my use.

Oh, is this a new thing? You can expand a zfs vdev these days?

No, he’s talking about adding another vdev (8-wide z2). He wants to start with 8 drives and upgrade in batches of 8 drives up to 24 later.
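
In other words, something like this later on (device names are placeholders):

# grow the pool with a second 8-wide raidz2 vdev
zpool add poolname raidz2 disk9 disk10 disk11 disk12 disk13 disk14 disk15 disk16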

I forgot to ask: is there any good U.2-to-PCIe adapter where I can connect the disk right on the PCIe card? Right now I only need PCIe 3.0. I would prefer that over M.2 to have more capacity.

Thanks!

You are limited to the IOPS of the slowest drive in a vdev. So having a single vdev means you would need enough IOPS out of a single drive to hit 800MB/s, which is not going to happen. He will either need multiple vdevs or go with mirrors.

For reads, if the data is accessed frequently enough, it should be in memory, or an L2ARC can help if memory is maxed out. For writes, he will see an initially high transfer rate, but once memory flushes to disk, it will slow down dramatically.

I have two TrueNAS boxes with 40GbE networking (Chelsio cards). The primary has 4 RAIDZ2 vdevs (6 drives each) along with redundant NVMe L2ARC, special and SLOG vdevs, using PCIe bifurcation 4x4 NVMe cards. The most throughput I’ve seen has been a little over 2GB/s when transferring to my desktop, which runs mirrored PCIe 4.0 NVMe drives on an sTRX4 3970X build.

I will have 1 vdev to start with, 2 vdevs after about two weeks, and a third vdev when I can afford it :slight_smile:

This is a common fallacy.

Which IOPS count, and how throughput scales with IOPS, is complicated.

It’s just not that simple. Sequential read/write speeds can scale with the number of drives in a vdev minus the number of parity drives, minus the overhead of metadata and logging.

If your drives can do 250MB/s each and you’re running an 8-disk raidz2, and then maybe move both the ZIL and metadata to vdevs separate from the data (avoiding seeks), and you have enough PCIe bandwidth to your disks, and enough CPU… you should be able to get 6 * 250 = 1500MB/s… theoretically, while e.g. copying things like 50GB files or similar.

However, most workloads are not that simple, so performance is never ideal.

There’s a few different setups considered here: Six Metrics for Measuring ZFS Pool Performance Part 2 - iXsystems, Inc. - Enterprise Storage & Servers

(debugging throughput issues is also a hard job, ZFS internals don’t really expose useful performance counters you can just look at on some dashboard and immediately tell where time is being wasted and where things are blocking)

OK, here is my plan.
vdevs: 2 vdevs with 8 Exos 16TB drives each, running raidz2
slog: 2x Intel Optane P4801X 100GB
l2arc: 4x Samsung PM983 960GB

Does that seem sane/OK? Should I do something differently? A larger/smaller L2ARC? Go for a special/metadata vdev?
My workload is mostly reads, and I will use NFS.
I am using a dual-port 40GbE Mellanox NIC and have a 40GbE switch.
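
Roughly what that layout would look like (just a sketch, device names are placeholders):

zpool create tank raidz2 d1 d2 d3 d4 d5 d6 d7 d8
zpool add tank raidz2 d9 d10 d11 d12 d13 d14 d15 d16
zpool add tank log mirror optane0 optane1
zpool add tank cache pm983a pm983b pm983c pm983d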

I’d partition these: half the space as metadata (special) in two mirror vdevs, the other half as a 4-way L2ARC.

Thank you! Would I need to buy a larger size then or is that OK?

I don’t know how big the files will be, how many there are, or how much metadata you’ll have; you can expand later if needed.

The partitioning is just so you’d make better use of the IOPS.
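
A sketch of what I mean, assuming the four PM983s show up as nvme0n1 … nvme3n1 and each is split into two partitions (p1/p2):

# first partitions as two mirrored special vdevs
zpool add tank special mirror nvme0n1p1 nvme1n1p1 mirror nvme2n1p1 nvme3n1p1
# second partitions as cache; L2ARC stripes across all four
zpool add tank cache nvme0n1p2 nvme1n1p2 nvme2n1p2 nvme3n1p2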

Thank you!