Calling on the collective experience of the Level1Techs people…
If you wanted to design a 100TB server based around ZFS that can saturate a 10G network when transferring large sequential files (10GB file size), what hardware do you think can get you there?
I have been playing around with ZFS and I am curious what hardware would be needed to accomplish the above task.
Fast, async sequential transfers to and from a small number of 10GbE clients are pretty trivial. You don’t need a fast CPU, SLOG, L2ARC, special vdev, or even tons of RAM. All you really need are 1. lots of spindles and 2. a 10GbE NIC. Five or six mirrored pairs of modern spinners will probably get you there for ~one client at a time, as will a few narrow raidz* vdevs or one very wide raidz* vdev.
All the complexity in zfs hardware sizing comes when you have lots of metadata and/or small files, random access over large working sets, sync writes, simultaneous clients, opaque block storage, or some combination thereof. Or you need high availability or what have you. Let me know if any of those are included in your hypothetical use case.
ETA:
* - raidz sequential write performance will be very fast on a fresh pool but degrade significantly with fragmentation as the pool ages.
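To put a rough number on the “lots of spindles” point, here’s a back-of-the-envelope sketch (the 150 MB/s per-drive figure is an illustrative assumption, not a measurement):

```shell
# Minimum spindle count to saturate a 10GbE link, assuming each
# spinner sustains ~150 MB/s sequential (illustrative; real drives
# vary with platter position and pool fragmentation).
link_mbs=1250          # 10 Gbit/s ≈ 1250 MB/s of payload
per_drive_mbs=150
# Ceiling division: drives needed so combined throughput >= link.
drives=$(( (link_mbs + per_drive_mbs - 1) / per_drive_mbs ))
echo "need ~${drives} data drives"   # prints: need ~9 data drives
```

Parity or mirror drives come on top of that, which is roughly where “five or six mirrored pairs” lands.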
Indeed, an AES-NI-capable CPU will probably help, as SMB / Samba more or less requires crypto these days on any modern platform, but I don’t know how well it scales/performs.
I wouldn’t go beneath a minimum of 8 cores / 16 threads, 64GB of RAM, and a CPU that’s at least 2nd-gen Zen or 8th-gen Intel (be it consumer or enterprise variants). Make sure your devices, like NIC or HBA cards, are not limited by the number of PCIe lanes available. Although you can technically get away with far lower specs (people used to do this more than a decade ago, when hardware wasn’t as advanced) - YMMV.
Spinning rust can saturate 10G in sequential reads. Assuming you get “decent” drives that can do about 150 MB/s reads, you’d need like what, 8 drives in striped mirrors? By that point, you can probably go 10 drives in RAID-Z2 for that sweet spot on capacity, if you care about capacity more than throughput.
Now, there’s a few more questions to raise:
what’s the workload? (one db reading files from a single NAS? / a hypervisor reading the VM’s vdisks or block devices or iscsi? etc.)
how many clients will it have?
how many connections do you expect to be open?
what protocols are you looking at for file sharing? (smb, nfs, s3?)
I played around with ZFS in both TrueNAS SCALE and by manually setting up ZFS inside Xubuntu. I had really disappointing results with both and was just curious if my hardware was too under-powered and was interested in what people thought would be adequate hardware to accomplish the task.
FWIW, my Xeon E5-2667v4 (configured to 8c8t) is quite capably saturating dual 10GbE links. I’d probably put Xeon E5 v3/v4 (Haswell-/Broadwell-EP) as the lower bound for this use-case, although if you’re paying more than 10¢/kWh or so it probably makes sense to go with something newer and more efficient.
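To put the power-cost point in numbers (the 150 W draw is an illustrative guess for an older E5 box, not a measurement):

```shell
# Yearly electricity cost at 10¢/kWh for a box idling at ~150 W
# (illustrative figures). cost = W * hours/yr / 1000 * $/kWh.
watts=150
cents_per_kwh=10
hours_per_year=8760
dollars=$(( watts * hours_per_year * cents_per_kwh / 1000 / 100 ))
echo "~\$${dollars}/year"    # prints: ~$131/year
```

A newer platform pulling half that wattage closes part of its price gap fairly quickly at higher electricity rates.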
The hardware I originally bought was for building a traditional RAID setup with T10 PI enabled.
AMD Epyc 7275
48GB ECC RAM
ASRockRack ROME6U-2L2T (built-in Intel X710 10G NICs)
Areca ARC-1226-8i RAID Controller
(8x) Seagate Exos X20 20TB SAS Drives
The above setup has no issues saturating the 10G (~1.1GBytes/s sustained) network.
Then I kept reading about how RAID is potentially unsafe and prone to “bit-rot,” and that ZFS was a better way to go for data integrity. So I replaced the Areca RAID card with an AOC-S3008L-L8i HBA card. This setup would only net me about 190 MB/s, with pauses at regular intervals. I got the same results using TrueNAS SCALE and building a ZFS filesystem on Xubuntu.
I also tried a ZFS build based on an Intel i5-13400 with 16GB RAM, using the same SAS HBA and drives. Got the same results as the Epyc system.
With the i5 system, I then created an MDADM software RAID6 under Xubuntu. That was able to saturate a 10G network. Since MDADM software RAID could saturate 10G, that tells me the hardware should be fast enough. I guess ZFS adds a lot more overhead.
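For anyone wanting to reproduce that comparison, a minimal mdadm RAID6 along the lines described might look like this (device names, array name, and mount point are all placeholders):

```shell
# Hypothetical 8-drive software RAID6; /dev/sd[b-i] stand in for
# the eight SAS drives behind the HBA.
mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]
mkfs.xfs /dev/md0              # any conventional filesystem on top
mount /dev/md0 /mnt/array
cat /proc/mdstat               # watch the initial resync progress
```

Note that sequential numbers taken before the initial resync finishes will be pessimistic.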
I assume you’re running all 8 drives as a single RAID-Z2 vdev, correct? And how were you testing (e.g. file explorer drag-and-drop over SMB, dedicated testing tools like iperf, CLI file copies via scp, etc.)?
The reason I ask is, if you’re utilizing SMB, you’d then want to set xattr=sa (system-attribute-based xattrs) on the dataset on the ZFS side. If this is only ever going to be highly sequential large-block IO, you’d also increase recordsize (and on and on).
If you’re looking for high IOPS, you’ll want to stick with mirrored vdevs. You’ll also want to ensure that the recordsize matches the workload’s requirements; in cases where your read or write requests are significantly larger than the configured recordsize, you’re incurring unnecessary overhead on the filesystem.
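Both of those tweaks are one-line dataset properties; a sketch, assuming a hypothetical pool/dataset called tank/media:

```shell
# Store xattrs as system attributes instead of hidden files: cuts
# per-file metadata IO, which matters a lot for SMB/Samba.
zfs set xattr=sa tank/media
# Larger records for big sequential files (default is 128K).
zfs set recordsize=1M tank/media
# Only newly written data picks up the new recordsize; verify with:
zfs get xattr,recordsize tank/media
```

Existing files keep their old record layout, so rewrite (or re-copy) the data after changing recordsize if you want it to take effect.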
ZFS almost certainly won’t match pure block storage IOPS, but it provides enough tuning options that you can really tailor the filesystem to multiple unique workloads quite well (and that’s not even to mention that this isn’t really an apples-to-apples comparison: one provides nothing but physical-drive-failure redundancy, while the other… well, ‘is ZFS’). I wrote this up a couple years back as a primer for UnRAID folks who were ‘ZFS-curious,’ and while it doesn’t explicitly address the question posed here, I do think it should at least give you some background. The biggest takeaway (or the TL;DR, if you just want the gist) is really the very last section.
Your HW can saturate a 10Gb link, though likely only for dedicated sequential workloads, at least with the number of drives you currently have (and whether that’s achievable with them in Z2 will probably take at least a little configuration to suit the use case). For anything else, you’re going to need more spindles IMO.
RAID 6 will correct inconsistencies (“bitrot” being one form of them) when scrubs/consistency checks are run; ZFS can do the same under the right circumstances. Both systems will let unread data potentially rot (this phenomenon is significantly overrepresented in media/articles) if scrubs are not run and the data is not accessed. The advantage ZFS has here is that every file read verifies that file’s checksums, effectively a scrub of just that file, whereas with RAID 6 a consistency check has to be run explicitly over the entire volume.
Just wanted to point out that ZFS doesn’t prevent bitrot without scrubs, the same as RAID 6.
I guess I’ve been sort of silently equating the term ‘drive failure’ with data inconsistency resulting from media issues (in addition to all other failure types, mechanical or otherwise lol), and hadn’t actually realized it till I read this.
Totally valid point!
And agreed on the bitrot front - while it is something to be aware of, and even a concern for those who retain unchanged data extremely long term, it’ll typically not be an issue with ZFS as long as you’re properly maintaining your pool against the various other (far more likely) causes of data loss.
I’m glad people are at least aware of the possibility of this happening, but the blast doors blew wide open with the number of folks asking what actions they can take to avoid it, seemingly highly concerned they’d be impacted by this, often while only ever having a single copy of their data
I tried organizing the drives as a single RAID-Z2, dual RAID-Z1s and four mirrors. The RAID-Z2 gave me the best results (~109MB/s), the dual RAID-Z1 was a little slower and the four mirrors had a significant drop off in performance.
All my computers run Linux, so I connect via NFS (v4) with sync disabled. It is a home server so transfers usually consist of one large file (25-50G) at a time and maybe one other video stream to a TV. (Probably 50Mbps at most.)
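For the record, that sync-disabled NFS setup boils down to a couple of dataset properties (tank/share and the subnet are placeholder names):

```shell
# Treat all writes as async even when the NFS client requests sync.
# Risky for in-flight data on power loss; a common trade-off for a
# home media box.
zfs set sync=disabled tank/share
# The NFS export can go through ZFS itself instead of /etc/exports;
# the subnet here is a placeholder.
zfs set sharenfs="rw=@192.168.1.0/24" tank/share
showmount -e localhost         # confirm the export is visible
```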
I bought some 118G P1600X Optane drives a while ago when they were on sale. I even tried them as a single 8-drive RAID-Z2, dual RAID-Z1s and quad mirrors. The best I could get was about 290MB/s sustained as a single RAID-Z2. The other configs were slower.
I used the term “bit-rot” earlier but it may not have been entirely correct for my situation.
A number of years ago I started noticing that some video files I would watch would start showing corruption. These files were written once and marked as read only. (My attempt at telling the filesystem to not let ME do anything stupid to the files.) At first I could watch them without problems but then one would show a problem, then more started showing issues. I would run a “scrub” and it would say that everything was fine. The RAID this happened on was MDADM software RAID5.
The next iteration of the system I moved to an Areca hardware RAID controller. I don’t remember encountering the same issue. I truly don’t know what the cause was - could have been hardware failure, a bug in MDADM or me doing something stupid. This scenario just made ZFS look very intriguing.
I wasn’t annoyed that the data was corrupt as that is always a possibility. It was the fact there was no message telling me the data was corrupt that was really annoying.
10 GbE is 1.25 GB/s of transfer. The practical limit for SATA HDDs is ~125 MB/s sustained. We simply need something that reads from 10 drives at the same time without slowing down. So stripe 10 drives in a RAID0 config via SAS and you should be done. Naturally, as soon as one of those 10 drives dies, all your data is lost, so make sure you have backups; but that is the bare minimum.
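In zpool terms, that bare-minimum stripe is just a pool of ten single-disk top-level vdevs; writes are striped across all of them, and one dead disk takes out the whole pool (device names are placeholders):

```shell
# RAID0-style zpool: ten single-disk vdevs, zero redundancy.
# ashift=12 assumes 4K-sector drives; device names are placeholders.
zpool create -o ashift=12 stripe0 \
  /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
  /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk
zpool status stripe0
```

Putting `raidz2` in front of that device list instead trades two drives of capacity for surviving two failures.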
If you think that is too unsafe, I would look into SSDs. Either buy a couple of 4TB NVMes to use as a transfer cache, or invest in proper server infrastructure with E1.S/E1.L form factors. It is only a matter of time before prices on 16TB and 32TB SSDs drop to affordable levels. Once 32TB is down to $500, I don’t see much reason to keep around HDDs, but right now we are at $800 for 8TB, and that is simply too much…