Well, that was fast: the last release had an issue involving encryption that caused errors to be reported back when there weren’t any. This release reverts that particular patch.
The internet says there is no ZFS option anymore in the recent Lunar Lobster beta ISO installer. Seems like Canonical is phasing out ZFS support.
So much for the ZFS-on-root easy mode. Ditching flatpak & ZFS at the same time would let me reconsider Ubuntu on the two machines I’m running Ubuntu on.
And still no 6.2 kernel support for ZFS. Really fucks up basically all leading edge distros. My Tumbleweed is on 6.2.x for what feels like ages now.
Can’t wait for 2.1.10 or even 2.2.
New features sound nice and all, like progress on RAIDZ for special vdevs or shared L2ARC. But RAIDZ for storing metadata and small blocks just screams “write amplification”, the same problem RAIDZ already has with small record/block sizes.
The most interesting item for me is an ARC change that separates data and metadata in the MFU and MRU lists, which really optimizes the metadata eviction policy and comes with a tunable parameter. This will help many pools keep their metadata in memory, especially pools with little RAM or above-average amounts of metadata.
This is a very interesting update, lots of new stuff including block cloning.
As always, I’d strongly suggest waiting a few months for a minor point release (e.g. “2.2.1”) to come out before upgrading unless you have something that can take the risk of testing. There’s often some unpleasant bugs that get found once a major release is made.
Supported Platforms
Linux: compatible with 3.10 - 6.5 kernels
FreeBSD: compatible with releases starting from 12.2-RELEASE
New Features
Block cloning (#13392) - Block cloning is a facility that allows a file (or parts of a file) to be “cloned”, that is, a shallow copy made where the existing data blocks are referenced rather than copied. Later modifications to the data will cause a copy of the data block to be taken and that copy modified. This facility is used to implement “reflinks” or “file-level copy-on-write”. Many common file copying programs, including newer versions of /bin/cp on Linux, will try to create clones automatically.
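To see this from the user side: on Linux with a reasonably recent GNU coreutils, you can request a clone explicitly. This sketch works on any filesystem because `--reflink=auto` falls back to a regular copy where block cloning isn’t available:

```shell
# Create a test file, then copy it.
# --reflink=auto uses a shallow clone when the filesystem supports it
# and silently falls back to a normal copy when it does not;
# --reflink=always would fail instead of falling back.
echo "hello zfs" > original.txt
cp --reflink=auto original.txt clone.txt
cmp original.txt clone.txt && echo "contents identical"
```

On a ZFS 2.2 pool with the cloning feature enabled, the clone initially consumes no extra data blocks; writes to either file trigger copy-on-write of just the modified blocks.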
Linux container support (#12209, #14070, #14097, #12263) - Added support for Linux-specific container interfaces such as renameat(2), support for overlayfs, idmapped mounts in a user namespace, and namespace delegation support for containers.
Scrub error log (#12812, #12355) - zpool status will report all filesystems, snapshots, and clones affected by a shared corrupt block. zpool scrub -e can be used to scrub only the known damaged blocks in the error log to perform a fast, targeted repair when possible.
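A sketch of that workflow, assuming a hypothetical pool named `tank` (these commands need a live pool, so treat them as illustrative):

```shell
# List the datasets, snapshots, and clones affected by known corrupt blocks
zpool status -v tank

# Scrub only the blocks recorded in the error log,
# instead of walking the entire pool
zpool scrub -e tank
```

On a large pool this turns a multi-day full scrub into a targeted repair pass over just the known-bad blocks.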
BLAKE3 checksums (#12918) - BLAKE3 is a modern cryptographic hash algorithm focused on high performance. It is much faster than sha256 and sha512, and can be up to 3x faster than Edon-R. BLAKE3 is the recommended secure checksum.
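Enabling it is a one-liner, sketched here for a hypothetical dataset `tank/data` (the pool’s `blake3` feature flag must be enabled first, e.g. via `zpool upgrade`):

```shell
# Switch new writes on this dataset to BLAKE3 checksums;
# existing blocks keep their old checksums until rewritten.
zfs set checksum=blake3 tank/data

# Verify the property took effect
zfs get checksum tank/data
```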
Corrective “zfs receive” (#9372) - A new type of zfs receive which can be used to heal corrupted data in filesystems, snapshots, and clones when a replica of the data already exists in the form of a backup send stream.
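A minimal sketch of a corrective receive, assuming hypothetical pool/dataset names and that the backup stream contains the same snapshot as the damaged one:

```shell
# Heal corrupted blocks in tank/data@snap1 using a matching
# send stream from a backup pool; -c tells receive to repair
# in place rather than create a new dataset.
zfs send backup/data@snap1 | zfs receive -c tank/data@snap1
```

Only blocks whose checksums don’t match are rewritten, so this is far cheaper than destroying and restoring the dataset.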
Vdev properties (#11711) - Provides observability of individual vdevs in a programmatic way.
Vdev and zpool user properties (#11680) - Lets you set custom user properties on vdevs and zpools, similar to the existing zfs dataset user properties.
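Usage looks like the existing dataset property commands, with the vdev name appended; a sketch with a hypothetical pool `tank` and vdev `raidz1-0` (user property names must contain a colon, same as dataset user properties):

```shell
# Inspect all properties of a single vdev
zpool get all tank raidz1-0

# Set a custom user property on a vdev...
zpool set org.example:location=rack4-shelf2 tank raidz1-0

# ...and one on the pool itself
zpool set org.example:owner=storage-team tank
```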
Performance
Fully adaptive ARC (#14359) - A unified ARC which relaxes the artificial limits imposed by both the MRU/MFU distribution and data/metadata distinction. This allows the ARC to better adjust to highly dynamic workloads and minimizes the need for manual workload-dependent tuning.
SHA2 checksums (#13741) - Optimized SHA2 checksum implementation to use hardware acceleration when available.
Edon-R checksums (#13618) - Reworked the Edon-R variants and optimized the code to make several minor speed ups.
ZSTD early abort (#13244) - When using the zstd compression algorithm, data that can not be compressed is detected quickly, avoiding wasted work.
Module options - The default values for the module options were selected to yield good performance for the majority of workloads and configurations. They should not need to be tuned for most systems but are available for performance analysis and tuning. See the module parameters documentation for the complete list of the options and what they control.
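On Linux the options surface through sysfs; a sketch using `zfs_arc_max` as one example parameter (writing requires root, and not every parameter can be changed at runtime):

```shell
# Read the current value; 0 means "use the built-in default"
cat /sys/module/zfs/parameters/zfs_arc_max

# Change it at runtime (here: cap the ARC at 8 GiB)
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# Or make the setting persistent across reboots via modprobe config
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
```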
Offtopic: It took me way too long to figure out how to align the expand and collapse formatting to the prior bullet points.
Click here for my notes
```markdown
# Header

* Bullet1 - These have **no** spaces in front of the *
* Bullet2
* Bullet3

<details open>
<summary>Dropdown1 which loads expanded</summary>

  * Item1 - This first item must have a blank line above it, and **two** spaces before the *
  * Item2
  * Item3
  * Item4
  * Item5
</details>

<details>
<summary>Dropdown2 which loads collapsed</summary>

  * Item1
  * Item2
  * Item3
  * Item4
  * Item5
</details>
```
This results in: the header and the three bullets rendered normally, Dropdown1 expanded with its five items visible, and Dropdown2 collapsed.
Is this a revamp? I remember this feature from shortly after some 2.0 release, and I’m pretty sure I used the -c flag in the past. Or is this something new?
Scrubbing 200T+ isn’t trivial. I’m sure many pools will benefit from this. It’s situational. But if you face the situation, you can save yourself days.
I’m not sure what differences and possibilities it will bring. To optimize and tweak the heterogeneous vdev arrangements we often see in homelab scenarios? Not sure.
This is really game-changing, and I’ve seen the devs talking about it at the OpenZFS leadership meetings. Gone are the days where your ARC evicts metadata seemingly at random and in bulk. You could get this under control with tuning parameters, but having this adaptivity out of the box is great for us homelabbers and basically removes the need for most special metadata vdevs, because the ARC will now take metadata just as seriously as data, with counters and weight in the algorithm.
You want cached metadata? You can rely on your ARC to do this out of the box. Amazing work from Alexander Motin and others.
LZ4’s early abort was the holy grail of the last decade. ZSTD is great, but wasted so much CPU on barely compressible stuff.
I’m probably changing all my remaining LZ4 datasets to ZSTD. I only had LZ4 because of the early abort on barely compressible data.
Great news for the average compressratio for basically any pool out there.
To add something amusing but constructive - it turns out that one pass of Brotli is even better than using LZ4+zstd-1 for predicting this, but integrating a compression algorithm just as an early pass filter seemed like overkill. Heh.
It’s actually even funnier than that.
It tries LZ4, and if that fails, then it tries zstd-1, and if that fails, then it doesn’t try your higher level. If either of the first two succeed, it just jumps to the thing you actually requested.
Because it turns out, just using LZ4 is insanely faster than using zstd-1 is insanely faster than any of the higher levels, and you burn a lot of CPU time on incompressible data if you’re wrong; using zstd-1 as a second pass helps you avoid the false positive rate from LZ4 having different compression characteristics sometimes.
I’m not enough of a maths nerd to dig deeper into those algorithms, but I remember seeing LZ4’s abort mechanic proposed as a helper to boost ZSTD performance. If it gets the job done, so be it… LZ4’s speed is insane.
Can’t wait for QAT (which is also mentioned in the linked thread) to support LZ4 and ZSTD. Or other standardized hardware offloading, be it CPU, GPU, or whatever. If a freaking 6-core Ryzen (or its attached iGPU) could offload ZSTD compression like QAT can, that’s your perfect storage server CPU right there, compressing 100 Gbit/s without the need to boost any of the cores.
Those lazy coders…writing code and developing new features just to avoid copy&pasting some URL from your browser into some forums
ZFS Interface for Accelerators (ZIA) is an upcoming feature that gives us hardware offloading for a lot of operations. So if you want QAT or DPUs to turbo-up your CPU-intensive ZFS workload, that’s it.
Compression, checksumming, and RAIDZ are the most prominent beneficiaries.
Developed for OpenZFS by Los Alamos National Laboratory… they have 3.2 TB/s of InfiniBand and 100 PB of storage according to the slides. So yeah, you need some offloading if you don’t want to buy and run thousands of CPU cores just for that. At 100 PB, every 0.01 of compressratio is equal to a disk shelf worth of drives.
Yeah…looks good. It’s been quite the odyssey ever since it was sponsored by the FreeBSD Foundation. But you don’t want to roll out these kinds of things if you are not 100% confident. We’re talking about ZFS after all.
Also keep in mind that RAIDZ expansion leaves a permanent footprint in memory (uses RAM). It shrinks over time as you modify and delete data… a small price considering the next-level wizardry necessary to make this possible at all while still keeping all of RAIDZ’s features, like avoiding the RAID5 write hole.
I personally don’t have a use case for this. I don’t think making a wide RAIDZ vdev even wider is a good choice because you eventually end up with a single 20-wide vdev instead of 3x 7-wide. Adding a vdev gets more expensive the more you use RAIDZ expansion.
But for pools of 3-7 disks it’s great, and that’s probably where people will use it.
I’m personally of the same inclination, as the feature seems to mostly be demanded by those who want to perform a risky operation, but don’t even have backups to do that age old “backup, check your goddamn backup, destroy, recreate, restore from backup” dance. I’m certainly not against the feature, but I’m hoping people don’t footgun themselves.
What I actually look forward to regarding the feature is there being one less reason that people can complain about pointlessly. Fortunately for those that want to derail online forum discussions, they’ll always have “license issues” as the gift that keeps on giving.
I’ve seen some news a couple of weeks/months ago on Phoronix. Looked convincing, but it sacrifices some performance as a compromise. I run BTRFS on 3 machines… no need for parity RAID there.
But that ASUS NVMe consumer NAS has ext4 and BTRFS. So we see BTRFS spreading and parity RAID as an option is always good to have. Synology certainly lost their edge with SHR and I bet we see many more BTRFS NAS products in the future. ZFS isn’t for everyone and that’s totally fine.
I admire the dedication…but there is still a lot of work to do in terms of features before I would consider it. I’m a spoiled ZFS kid.
My personal focus is on Ceph right now, and I really want that Crimson OSD update (soon). It’s the Ceph equivalent of ZFS Direct I/O, boosting NVMe capability by quite a bit. CPU cycles on NVMe drives are insanity… 2-4 cores per NVMe drive on heavy writes. The metric used is IOPS/core, and that tells you a lot about where the bottlenecks are.