ZFS review (before it's too late)

Up front, thanks a lot for the excellent content already available on this forum. This helped me a lot, both when setting up the “old” service, and now when planning the new setup.

I’m running a well-known public FTP/mirror server (https://ftp.halifax.rwth-aachen.de), which currently serves around 35 TByte of data at 4 GBit/sec on average (peak 18 GBit/sec). The hardware for this service will be replaced soon, and I’d like to continue using ZFS (on Linux) as I’ve been very happy with this in the past couple of years.

I kindly ask you to criticize my setup, point out mistakes and suggest improvements. I already learned that some settings cannot easily be adjusted (dedup, ashift, disks per vdev, …), and I’d prefer to “get things right” this time.

Up front, if you’d like to see any additional information from the existing setup, just let me know!

Future setup [I’ll get the hardware for free, please don’t complain about it :slight_smile: ]

  • 19x 4 TByte HDD (512n, SATA 6 GBit/sec)
    • a prime number, isn’t that wonderful for planning?
  • 3x 800 GByte SSD (512e, SAS 24 GBit/sec, Mixed Use)
    • even though “Mixed Use” doesn’t sound like it, based on my experience with the old setup running on “Read Intensive” SSDs I’m pretty sure the SSDs would survive extended SPECIAL or CACHE usage
  • 2x 480 GByte SSD (M.2, hardware raid)
    • I might reserve most of this for OS stuff
  • 256 GByte RAM (DDR5, ECC)
  • 2x Intel Xeon Gold 5415+ (2x 8 Cores)
  • 2x 10 GBit/sec or 2x 25 GBit/sec of network/internet connectivity

Current and future use case:

  • Serving large files (ISO images, Debian packages, …) to users via HTTPS
  • Synchronizing this data from other (main) mirrors via rsync
  • Serving the same data set to other mirrors via rsync
  • Minor stuff like bitcoind, nostr relay
  • Ordered by importance: disk space, performance, reliability

My idea for the ZFS setup (a rough command sketch follows the list):

  • pool
    • raidz1 spanning 6 disks
    • raidz1 spanning 6 disks
    • raidz1 spanning 7 disks
    • I don’t really care about “disasters”: 100% of the data is available on other public servers, the service is offered for free, and plenty of alternatives exist
  • as much L2ARC as possible (making use of the 800 GByte SSDs, possibly also the 480 GByte “OS” SSDs)
  • SPECIAL vdev on the 800 GByte SSDs (3-way mirror)
    • around 100 GByte?
    • on the current system I have around 1.8 GByte of “SPECIAL” usage per TByte of data
  • no SLOG device: barely any sync writes, performance on current server didn’t change when I removed the existing SSD based SLOG device
  • ashift=12 (4k) to be on the safe side
  • recordsize=1M (lower for some specific filesystems)
  • compression=zstd
  • redundant_metadata=most
  • relatime=off
  • atime=off
  • xattr=sa
  • ZFS based deduplication is not worth it. It’s easier to use hardlinks for redundant data, which I can easily do via some background job.
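
In command form, this is roughly what I have in mind (just a sketch: pool name, disk and partition names are placeholders, and I'd still double-check every property before going live):

```
# 19 HDDs split into 6+6+7 raidz1, special vdev as a 3-way mirror on SSD partitions,
# remaining SSD space as L2ARC (cache devices are never mirrored, just listed)
zpool create -o ashift=12 \
  -O recordsize=1M -O compression=zstd -O atime=off -O relatime=off \
  -O xattr=sa -O redundant_metadata=most \
  tank \
  raidz1 hdd01 hdd02 hdd03 hdd04 hdd05 hdd06 \
  raidz1 hdd07 hdd08 hdd09 hdd10 hdd11 hdd12 \
  raidz1 hdd13 hdd14 hdd15 hdd16 hdd17 hdd18 hdd19 \
  special mirror ssd1-part1 ssd2-part1 ssd3-part1 \
  cache ssd1-part2 ssd2-part2 ssd3-part2
```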

Performance of the existing setup, running with 8x 6 TByte ST6000NM0034 disks and some SSDs, is OK, but not great. The disks benefit from SSD based L2ARC and SPECIAL caching. Even though most requests correspond to large files, the sheer number of users turns this into a festival of many random reads (mixed with quite a lot of somewhat random writes for updated/new data).

Thanks a lot!
Carsten


If random is the game, consider mirrors instead. The idea here is that your IOPS are limited by the number of vdevs in a pool. You'll see an almost linear increase in random I/O performance with every vdev you add. 8 vdevs of 2-way mirrors will have roughly 4x the random I/O performance of the three raidz1 vdevs.

Plus, no parity calculation is required, so you free up some CPU overhead too.


I’m not sure how random it is - certainly less random than picking tiny bits out of a large database. I assume that nginx (web server) and Linux/ZFS already pre-fetch most of the data, reducing the randomness of read requests. However, as lots of data needs to be retrieved, this translates into disk seeks.

With 8 mirrors (2 disks each) I’d get around 32 TByte, which is less than what the current system offers. I need to provide more (usable) disk space, and performance isn’t as important.

My gut feeling is that three raidz1 vdevs is good enough (taking into account that ARC and L2ARC do their thing, too). The current system only has a single raidz1 spanning over all 8 disks.

Without more detail about the workload it's hard to say, but IMO, based on what you described, random is as random does.

I could measure it? The old system is running as we speak :slight_smile: I don’t really know what to look for, though.

Unless you have historical trends logged/graphed I'm not sure what it will tell us. We can run some commands to give us a snapshot-in-time idea of what's happening right now, but that doesn't really tell us much given the random nature of the workload.

I’ll send you some collectd based links via DM, maybe this helps. EDIT: There’s no such thing as DMs. Bugger.

Not for me, I guess my account is too new to be allowed to spam random people :slight_smile: Could you send a message my way please?


A hero of the people emerges from the crowd!

First look…looks very good! You’ve done your research.

You probably know that HDDs don't like random reads (as in many concurrent users, as you mentioned), and you've faced this problem before. ZFS can't change that. But with caching, you can offload the most frequently and recently used data to main memory (ARC) or SSD (L2ARC). And with that kind of workload you can expect ARC to perform very well, because workloads like this are exactly what it was built for.
I agree on getting ~2TB of L2…if you know the Pareto principle, getting the upper 5-10% of the pool available on high performance storage will certainly have an impact on HDD load.
There are some tuning parameters to optimize your L2ARC capacity for your workload, e.g. by restricting L2 to MFU data.
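
For example (runtime toggle, assuming a reasonably recent OpenZFS on Linux; parameter names can differ between versions, and the value resets on reboot unless persisted):

```
# Only feed L2ARC from the MFU side of ARC, so read-once data doesn't churn it
echo 1 > /sys/module/zfs/parameters/l2arc_mfuonly
```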

L2 has barely any writes…“read intensive” is plenty. Mixed use is always better for endurance, but I doubt any L2 has ever worn out. I'd run smartctl a few days/weeks after deployment and look at the TBW values of all SSDs, just to see what you can expect in terms of writes.
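
Something along these lines (device names are placeholders; the exact attribute names differ between SATA and SAS drives):

```
# SATA SSDs: look for Total_LBAs_Written / wear-leveling attributes
smartctl -A /dev/sdx
# SAS SSDs: the extended output includes an endurance / "percentage used" indicator
smartctl -x /dev/sdy
```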

That's a whole lot. And I'm pretty sure most of the metadata will be handled in ARC, along with a lot of MRU/MFU (compressed) data.

Will be fine. If it gets full, ZFS allocates further metadata to the HDDs, so you don't get I/O errors or anything like that. I'd check allocation via zpool list -v tank so you can see what your metadata/data ratio is. recordsize=1M means the amount of metadata is low anyway. It will mostly be an idle SSD, as ARC handles most of it by itself.

Totally fine. As it is an FTP mirror, 99.999% availability isn’t necessary. RAIDZ is good for capacity, but bad for IOPS. Mirrors are bad for capacity but good for IOPS.

Auto-detection usually figures it out by itself. All my pools are ashift=12 without me doing anything.

It's great. Check your CPU usage in production. You can change compression at the dataset level at will if CPU usage is too high; this only applies to blocks written afterwards.
I don't think 2x8 cores will have trouble decompressing zstd; compressing at 18 GBit/s is a different matter…htop in production or during a test run helps to figure this out.
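
A quick way to check later (dataset names are placeholders; switching compression only affects newly written blocks):

```
# How well the stored data actually compresses
zfs get compression,compressratio tank
# Fall back to lz4 on a specific dataset if zstd turns out too expensive
zfs set compression=lz4 tank/some-dataset
```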


:bowing_man:

Mine did, after ~4 years or so. They didn’t die, but had to be replaced (within warranty/extended server support). I lowered the L2 write rate because of that.

I’ll have a look at the L2 restrictions. The current setup has zfs_arc_max=128849018880 l2arc_noprefetch=0 l2arc_write_max=2097152 l2arc_write_boost=251658240 l2arc_headroom=8 zfs_arc_min=107374182400.
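
For reference, written out as the usual modprobe.d snippet this would look roughly like:

```
# /etc/modprobe.d/zfs.conf -- applied when the zfs module loads
options zfs zfs_arc_min=107374182400 zfs_arc_max=128849018880 l2arc_noprefetch=0 l2arc_write_max=2097152 l2arc_write_boost=251658240 l2arc_headroom=8
```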

Based on your post, I might want to drop the SPECIAL vdev and allocate that SSD disk space to CACHE instead?

Assuming some extremely hot file is served over and over again (out of ARC or L2ARC), would it need to be decompressed for every (HTTP based) access?

The current system uses lz4 compression (on some 10 core Xeon), and I didn’t observe any CPU related issues.


Taking a look at the graphs provided, you guys have a lot of good documentation of your current workload.

Surprisingly, the system is less busy than I had expected given the description. You're able to keep up with demand on a single RAIDZ1 currently. Disk latency is higher than I would like to see, but if it isn't affecting production, meh, no biggie.

Comparing NEW v OLD

3 vdevs vs 1 vdev - great improvement
256 GB RAM vs 128 GB - doubled up

Without even diving deeper than that, you’re good. :stuck_out_tongue:

Also, ARC is kicking ass in your deployment: pretty solidly 90+% hits for ARC and 60-80% for L2. You're doubling up on RAM, so expect to see even better.


I edited it, yeah, my bad.

No. ARC stores the state which is present on the disks: if it is compressed on disk, then ARC stores the compressed state. That's how everything works. If the same blocks get requested frequently, ARC has a separate dbuf cache where it stores the uncompressed state, so CPU cycles for decompression won't be a problem.
L2ARC also stores the compressed state.

With 150GB of ARC you can end up with something like 500GB of data in it, if it all compresses well.
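
You can see the compressed vs. logical sizes directly in the ARC kstats (Linux; field names as of recent OpenZFS):

```
# compressed_size vs. uncompressed_size shows the effective ARC compression
grep -E '^(size|compressed_size|uncompressed_size|overhead_size) ' /proc/spl/kstat/zfs/arcstats
```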

That's below the default, interesting. I wouldn't change that. It also provides resistance against being flooded with read-once data. Enable persistent L2ARC to keep the contents across a reboot.
L2ARC failing isn't a problem…it's just storing copies of data already present on redundant drives. So it's more a matter of inconvenience and replacement than a disaster.
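
Persistent L2ARC is just a module parameter, and on recent OpenZFS it should already default to on:

```
# 1 = rebuild the L2ARC index from the cache device after a reboot/import
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
```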

Hard to tell. But judging from your metadata/data ratio, 256GB of memory, and new ZFS features that introduce an MRU/MFU slider for metadata…I don't think the metadata vdev is very useful. But raising L2ARC by another 800GB has diminishing returns too…I'd test both setups; L2 can be removed at any time.
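
Cache devices really are zero-commitment for such a test (device names are placeholders); the special vdev is the one to decide on up front, since a top-level vdev generally can't be removed again once the pool contains raidz vdevs:

```
# Add or drop an L2ARC device at any time without touching the pool layout
zpool add tank cache /dev/disk/by-id/ssd1-part2
zpool remove tank /dev/disk/by-id/ssd1-part2
```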

Be careful with that. If other applications need memory, ZFS won't give memory back to the OS below that threshold.


Not worth it, “up 1009 days” :smiley:

Luckily, there are no other applications. Memory usage is very stable (and low), aside from ARC.


Very nice. But most people don't have that luxury, and warming up L2 can be long and tedious if l2arc_write_max sits at a low level. You don't want a cold cache and lower performance for days after booting.
But with years of uptime, ARC is very warm and knows exactly which blocks are the best to keep :wink:

ZFS prefetches very aggressively and optimizes SATA queues and vdev balancing as well as possible. There are also a lot of tuning parameters for these behaviors that can help, and most can be altered at runtime, so you can test different things if you see problems. With mostly reads, ARC+L2ARC are your bread and butter to save the HDDs from the most repetitive and tedious work.
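
Most of these are plain module parameters, so they can be inspected and changed live, e.g. (parameter names and defaults vary between OpenZFS versions):

```
# Show all current ZFS module parameters and their values
grep -H . /sys/module/zfs/parameters/* | less
# Example: temporarily raise the prefetcher's maximum read-ahead distance (bytes)
echo 67108864 > /sys/module/zfs/parameters/zfetch_max_distance
```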

edit:

About RAIDZ1 and Mirror…Just throwing enough memory at the problem solves a lot of things when talking ZFS.

But with a given number of disks, IOPS get worse the more disks there are per RAIDZ (higher width).
Instead of a 6-wide RAIDZ1, you can also decrease the width to e.g. 3- or 4-wide Z1 and get more vdevs that way, without dropping down to 50% mirror storage efficiency.

So it’s a matter of how much storage efficiency is needed and how important IOPS are. Low-width is also good for just plugging in a new vdev later. I don’t think long resilver times are a problem with 4TB drives…but high width obviously becomes a problem with 20+TB drives.
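
Back-of-the-envelope numbers for the 19x 4 TByte disks (raw capacity, before compression and ZFS overhead):

```
# 3x RAIDZ1 (6/6/7-wide), 19 disks:      16 data disks -> ~64 TB usable,  3 vdevs of IOPS
# 5x RAIDZ1 (4/4/4/4/3-wide), 19 disks:  14 data disks -> ~56 TB usable,  5 vdevs of IOPS
# 9x 2-way mirror + 1 spare, 19 disks:    9 data disks -> ~36 TB usable,  9 vdevs of IOPS
```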


I think the idea is to serve most data off of ARC/L2ARC. Do you have any insight into the percentage of data that is commonly accessed? Does your setup (especially L2ARC) hold that volume?

Some thoughts:

  • Consider a draid setup, e.g. draid1:5d:19c:1s (single parity, 5 data disks per redundancy group, 19 disks in total, 1 distributed spare); see the sketch below the list
  • Consider having the special devices store not only the metadata but also the small blocks that slow down the HDDs: special_small_blocks=, e.g. 64K (smaller than the recordsize)
  • If reliability is a secondary concern, consider partitioning the 480GB drives and using a mirror of partitions as the special device. That leaves the full capacity of the 800G drives for L2ARC
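
Very rough sketch of both ideas (pool and device names are placeholders; the draid spec is the one from the first bullet):

```
# dRAID1: 5 data disks per group, 19 children in total, 1 distributed spare
zpool create -o ashift=12 tank draid1:5d:19c:1s hdd01 hdd02 ... hdd19

# Send blocks <= 64K (i.e. below the 1M recordsize) to the special vdev as well
zpool add tank special mirror ssd1 ssd2 ssd3
zfs set special_small_blocks=64K tank
```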

Oche Alaaf!


As per the statistics at Server stats, the cache hit ratio is rather great: above 90% for ARC, around 60% for L2ARC. This is with the current setup (128 GByte RAM). Issues mostly happen when larger updates (say, a new Fedora release) coincide with busy (read) demand via HTTPS or rsync.

ARC+L2ARC is not enough to serve all of the requested data, though; the disks constantly serve ~15 MByte/sec (each) to fill the gap.

I don’t know enough about draid, but I’ll look into it. So far, it seems rather arcane.

I don’t have enough small files to make this SPECIAL use case worth it.

Kölle alaaf :slight_smile:


Exactly the reason to consider it. Your HDDs like reading large blocks of data and slow down whenever they have to read small blocks. I'd move away from this strategy only if you manage to serve a high percentage of read requests (basically nearly all of them) from ARC+L2ARC.
If most of the frequently accessed data fits in 256GB + 3x800GB, then go for this (most activity on SSDs); otherwise consider using the SSDs as special devices to improve the speed of the HDDs.


It's a neat IBM thing adopted by ZFS 1-2 years ago. It aims to solve the long resilver times of RAIDZ and to integrate spare disks and put them to work: spare capacity rather than spare (cold) disks.

400km safety distance from Cologne is one of my achievements in life :partying_face:
