Introduction
ZFS Allocation Classes: they aren’t storage tiers or caching, but gosh darn it, you can really REALLY speed up your ZFS pool.
From the manual:
Special Allocation Class
The allocations in the special class are dedicated to specific block types. By default this includes all metadata, the indirect blocks of user data, and any deduplication tables. The class can also be provisioned to accept small file blocks.
A pool must always have at least one normal (non-dedup/special) vdev before other devices can be assigned to the special class. If the special class becomes full, then allocations intended for it will spill back into the normal class.
Inclusion of small file blocks in the special class is opt-in. Each dataset can control the size of small file blocks allowed in the special class by setting the special_small_blocks dataset property. It defaults to zero, so you must opt-in by setting it to a non-zero value.
ZFS dataset property special_small_blocks=size - This value represents the threshold block size for including small file blocks into the special allocation class. Blocks smaller than or equal to this value will be assigned to the special allocation class while greater blocks will be assigned to the regular class. Valid values are zero or a power of two from 512B to 1M normally, and up to 16M when zfs_max_recordsize is set accordingly (pull#9355). The default size is 0 which means no small file blocks will be allocated in the special class. Before setting this property, a special class vdev must be added to the pool.
VDEV type special - A device dedicated solely for allocating various kinds of internal metadata, and optionally small file blocks. The redundancy of this device should match the redundancy of the other normal devices in the pool. If more than one special device is specified, then allocations are load-balanced between those devices.
The ZFS special device can store metadata (where files are, allocation tables, etc.) AND, optionally, small files up to a user-defined size on this special vdev. This would be a great use case for solid-state storage.
This will dramatically speed up random I/O and cut in half the average number of spinning-rust I/Os needed to find and access a file. Operations like ls -lR will be insanely fast.
The ZFS special vdev should still have a good level of redundancy (like all vdevs: if it fails, your pool fails).
In my mind there are two questions:
How much space do you need for Metadata storage?
In the leftover space, how many files below size X can also be stored on the special vdev (for speeeeed)?
How much space do you need for Metadata storage?
The rule of thumb is 0.3% of the pool size for a typical workload. It can vary!
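As a quick back-of-the-envelope (the 172TB figure is just this pool's rough size, used for illustration), 0.3% works out to about half a terabyte:
# 0.3% of a 172TB pool, in TB
echo "172 * 0.003" | bc
# .516  -> roughly 516GB of metadata for a "typical" workload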
I was looking at the existing Level1Techs ZFS dataset for production workloads:
# zdb -Lbbbs elbereth
Traversing all blocks ...
348M completed ( 39MB/s) estimated time remaining: 432hr 31min 50sec
The ASIZE column should help you. Our dataset was only 1.9%, which makes sense since most of our space is consumed by Very Large Video Files.
How many small files can fit in the leftover space?
What we need is a histogram of file sizes, and how many files are in that size range.
Here’s some CLI magic to give you that:
# bin every file under . into power-of-two size buckets (1k minimum), then pretty-print
find . -type f -print0 | xargs -0 ls -l |
  awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ }
       END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' |
  sort -n |
  awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } }
       { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
And the Output:
1k: 270448
2k: 72548
4k: 74510
8k: 62799
16k: 36386
32k: 74070
64k: 21528
128k: 30624
256k: 25892
512k: 83264
1M: 140595
2M: 23716
4M: 14846
8M: 4843
16M: 3735
32M: 2996
64M: 4876
128M: 6730
256M: 5312
512M: 3726
1G: 2235
2G: 1819
4G: 908
8G: 700
16G: 277
32G: 42
64G: 10
128G: 1
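How much flash would those small files actually eat? A rough upper bound using the bucket counts above: a file in the 1k bucket can be almost 2k, one in the 128k bucket almost 256k, so double each bucket label and multiply by its count (sizes in KB, result in GB):
# worst case for every bucket up to 128k landing on the special vdev
echo "(270448*2 + 72548*4 + 74510*8 + 62799*16 + 36386*32 + 74070*64 + 21528*128 + 30624*256) / 1024 / 1024" | bc
# 18  -> roughly 18GB worst case, a drop in the bucket on any decent-sized special vdev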
The metadata special device also stops allocating small blocks once it is more than 75% full, though this can be tweaked. I’d be more comfortable keeping it around 10% full, give or take.
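On OpenZFS on Linux, that 75% cutoff comes from the zfs_special_class_metadata_reserve_pct module parameter (default 25, i.e. the last 25% of the special class is reserved for metadata). A sketch of checking and tweaking it, assuming the usual /sys/module path:
# read the current reserve percentage (default 25 -> small blocks stop at 75% full)
cat /sys/module/zfs/parameters/zfs_special_class_metadata_reserve_pct
# reserve more room for metadata (takes effect immediately, not persistent across reboots)
echo 50 > /sys/module/zfs/parameters/zfs_special_class_metadata_reserve_pct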
Looks like if I’ve got 8TB of space on the special device (8TB of flash, in other words) I’d be safe storing small files on the flash, no problem.
Technically I’d want ~5TB of space just for metadata with a full pool (~172TB of pool space works out to ~5TB for metadata).
These calculations seem a bit like voodoo to me, so it is something I’ll be keeping an eye on as pool usage changes.
The “files stored on special vdev” cutoff is set on a per-dataset basis.
Adding the special vdev to an existing zpool
# implied stripe of mirrors, for this special device
zpool add -o ashift=12 elbereth special mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1
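It’s worth sanity-checking that the new vdevs landed in the right class (pool name here matches the running example):
# the mirrors should show up under a "special" heading in the pool config
zpool status elbereth
# per-vdev capacity, so you can watch how full the special class gets
zpool list -v elbereth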
Setting the Small Blocks size:
# zfs set special_small_blocks=128K elbereth
Or if you have a “videos” dataset like ours:
# zfs set special_small_blocks=128K elbereth/videos
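The property inherits down the dataset tree like any other ZFS property, so it’s worth confirming where it’s actually set (same names as above):
# show the threshold for every dataset, plus where each one inherits it from
zfs get -r special_small_blocks elbereth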
Also note my ZFS recordsize is 512KB. The default recordsize is 128KB, and the default cap on special_small_blocks is 128KB, so you want to set this value smaller than your overall ZFS recordsize, because files that are larger are broken into more than one record. This may be non-intuitive, so let me restate it another way:
If your zfs recordsize is 128KB, then files larger than that are broken into chunks that big. The magic of special_small_blocks, ideally, applies only to files that are smaller than your zfs recordsize. If you set it to the same size as your zfs recordsize, then everything will go to the special device (not an ideal situation).
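Putting the two knobs together (values from this article; the dataset name is just the running example):
# big records for the video data itself...
zfs set recordsize=512K elbereth/videos
# ...and only genuinely small files (<= 128K, well below the recordsize) go to flash
zfs set special_small_blocks=128K elbereth/videos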
… and now, with that, enjoy blazing fast video editing and directory load/list commands.