ZFS and huge file reads, would SSDs help?

In a hypothetical world, where I’m hypothetically speccing out a ZFS server of up to 4 PB (no joke). A big load of JBODs with big HDD spinners. On it are big image files of around 1 GB+ each (uncompressed video frames), and some 1 TB+ video files. I want to serve this to an office of people who want to read these files fast.

Random reads are basically not happening; this is big sequential streaming work.
How could I configure or add SSDs to the mix to speed up file reads? Would it even help?

Having 200+ HDDs would go pretty fast by itself, once they’re on track and the data is super sequential. But what if more people want different files at the same time?

Has anyone got/seen any numbers for the read speeds out of a big HDD JBOD, and how it copes with multiple clients hitting it for different large streaming reads?

Potentially, if two people wanted to read the same image and it was sitting in an L2ARC on SSD, that’s a win, but this will be rare. Most people will be copying a few TB at a time, and that’s an expensive amount of capacity for SSDs.

Bonus question: can ZFS prefetch across files? So if I’m asking for image1.png, can it preload image2.png into a fast cache? And if not, is there anything out there that can do this?

Thanks for reading.

How many simultaneous clients are you serving, and what minimum throughput do you want to each client? Are we talking the 10 Gbit/s range or the 100 Gbit/s+ range?

I’m thinking lots of ~10 wide RAIDZ2 vdevs if you want to keep it simple.
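Purely as a sketch of the shape, with a made-up pool name and device names (use /dev/disk/by-id or vdev aliases rather than sdX letters in real life):

```
# Two of the ten-wide raidz2 vdevs (8 data + 2 parity each); repeat the
# "raidz2 <10 disks>" group for each further set of ten drives in the JBODs.
zpool create tank \
  raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj \
  raidz2 sdk sdl sdm sdn sdo sdp sdq sdr sds sdt
```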

ZFS won’t, unless they’re small files that share a block, but I’m guessing that’ll be rare given your description. That’d be handled more at the fileserver protocol level, e.g. Samba has the vfs_preopen module, which might help with that:

This module assists applications that want to read numbered files in sequence with very strict latency requirements. One area where this happens in video streaming applications that want to read one file per frame.

When you use this module, a number of helper processes is started that speculatively open files and read a number of bytes to prime the file system cache, so that later on when the real application’s request comes along, no disk access is necessary.
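Roughly what that looks like in smb.conf — the share name, pattern, and numbers here are made up, so check the vfs_preopen man page for your Samba version before copying anything:

```
[frames]
    path = /tank/frames
    read only = yes
    vfs objects = preopen
    # Only files matching this pattern get speculatively opened
    preopen:names = /*.png/
    # How many bytes of each file the helpers read to warm the cache
    preopen:num_bytes = 16384
    # Number of helper processes doing the speculative opens
    preopen:helpers = 4
    # How many files ahead of the current one to open
    preopen:queuelen = 10
```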


Simultaneous clients
Fewer than 100 workstations, on 10 GbE.
Plus some CPU servers in a nearby rack, connected via a 100 GbE switch.
All are copying into local RAM.

Agreed, leaning towards the idea of many ~10-wide RAIDZ2 vdevs, lots of HBA fun, one fast headnode.
I just can’t tell whether filling that headnode full of SSDs nets any benefit. I can probably do a big flash L2ARC and assume some small set of files is read often enough that they’re a win.
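If I go that route, at least it’s cheap to experiment with, since cache devices can be added to and removed from a live pool. Rough sketch with made-up device names; the second line is the OpenZFS-on-Linux module parameter that, as I understand it, lets L2ARC keep prefetched/sequential reads, which it skips by default:

```
# Add a pair of NVMe drives as L2ARC (cache vdevs); they can be dropped
# later with `zpool remove` if they turn out not to help.
zpool add tank cache nvme0n1 nvme1n1

# By default L2ARC doesn't cache buffers that arrived via prefetch, which
# is most of a big sequential streaming workload. Let it cache them:
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
```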

Did not know about the Samba modules, neat. Thanks.

For a pool this big I’d be looking at dRAID. It provides much faster resilvering, so your data is safer. The downside is the fixed stripe width, meaning you lose (some) compression and small-file efficiency. Your use case appears to be ideal for it.
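For reference, a dRAID vdev spec encodes the parity level, data disks per redundancy group, total children, and distributed spares in its name. Something like this, with purely illustrative numbers (check the zpool concepts man page for the exact grammar and constraints):

```
# Hypothetical single dRAID vdev built from one 52-disk shelf:
#   draid2 = double parity per redundancy group
#   8d     = 8 data disks per group (10-wide groups, matching the raidz2 idea)
#   52c    = 52 children (total disks in the vdev)
#   2s     = 2 distributed spares
# (children - spares) must divide evenly by (data + parity): 52 - 2 = 50 = 5 x 10
zpool create tank draid2:8d:52c:2s <the 52 disks for this shelf>
```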