The ZFS updates thread - Latest: 2.2.0

Log · January 21, 2023, 2:32am

This is a thread for posting and discussing ZFS updates and newly added features for those interested, rather than making a new thread for each one.

Updates:

2.2.0: Big update with block cloning
2.1.9: Bugfix
2.1.8: Linux kernel 6.1 compatibility

Log · January 21, 2023, 2:33am

Compatibility and bug fixes for this one.

https://github.com/openzfs/zfs/releases/tag/zfs-2.1.8

Supported Platforms

Linux: compatible with 3.10 - 6.1 kernels
FreeBSD: compatible with releases starting from 12.2-RELEASE

Changes

change how d_alias is replaced by du.d_alias #14377
Linux ppc64le ieee128 compat: Do not redefine __asm on external headers #14308 #14384
include systemd overrides to zfs-dracut module #14075 #14076
Activate filesystem features only in syncing context #14304 #14252
Illumos #15286: do_composition() needs sign awareness #14318 #14342
dracut: fix typo in mount-zfs.sh.in #13602
removal of LegacyVersion broke ax_python_dev.m4 #14297
FreeBSD: catch up to 1400077 #14328
Fix shebang for helper script of deb-utils #14339
Add quotation marks around $PATH for deb-utils #14339
Documentation corrections #14298 #14307
systemd: set restart=always for zfs-zed.service #14294
Add color output to zfs diff.
libzfs: diff: simplify superfluous stdio #12829
libzfs: diff: print_what() can return the symbol => get_what() #12829
FreeBSD: Remove stray debug printf #14286 #14287
Zero end of embedded block buffer in dump_write_embedded() #13778 #14255
Change ZEVENT_POOL_GUID to ZEVENT_POOL to display pool names #14272
Restrict visibility of per-dataset kstats inside FreeBSD jails #14254
Fix dereference after null check in enqueue_range #14264
Fix potential buffer overflow in zpool command #14264
FreeBSD: zfs_register_callbacks() must implement error check correctly #14261
fgrep → grep -F #13259
egrep → grep -E #13259
Update META to 6.1 kernel #14371
ztest fails assertion in zio_write_gang_member_ready() #14250 #14356
Introduce ZFS_LINUX_REQUIRE_API autoconf macro #14343
linux 6.2 compat: bio->bi_rw was renamed bio->bi_opf #14324 #14331
linux 6.2 compat: get_acl() got moved to get_inode_acl() in 6.2 #14323 #14331
Linux 6.1 compat: open inside tmpfile() #14301 #14343
ZTS: close in mmapwrite.c #14353
ZTS: limit mmapwrite file size #14277 #14345
skip permission checks for extended attributes
Allow receiver to override encryption properties in case of replication
zed: unclean disk attachment faults the vdev
FreeBSD: Fix potential boot panic with bad label #14291
Add workaround for broken Linux pipes #13309
initramfs: Fix legacy mountpoint rootfs #14274
vdev_raidz_math_aarch64_neonx2.c: suppress diagnostic only for GCC
tests: mkfile: usage: () → (void)
Use Ubuntu 20.04 and remove Ubuntu 18.04 from workflows #14238
dracut: skip zfsexpandknowledge when zfs_devs is present in dracut #13121

Log · January 26, 2023, 12:51am

Well, that was fast. The last release had an encryption-related issue that caused errors to be reported when there weren’t any. This release reverts that particular patch.

https://github.com/openzfs/zfs/releases/tag/zfs-2.1.9

Supported Platforms

Linux: compatible with 3.10 - 6.1 kernels
FreeBSD: compatible with releases starting from 12.2-RELEASE

Changes

linux 6.2 compat: zpl_set_acl arg2 is now struct dentry
Revert “ztest fails assertion in zio_write_gang_member_ready()” #14413

jxdking · March 24, 2023, 5:37pm

Waiting for Ubuntu Lunar, I will get zfs 2.1.9 from there.

Exard3k · April 8, 2023, 5:31am

The internet says there is no ZFS option anymore in the recent Lunar Lobster Beta ISO installer. Seems like Canonical is phasing out ZFS support.

So much for the ZFS-on-root easy mode. Ditching flatpak & ZFS at the same time would make me reconsider running Ubuntu on the two machines I have it on.

And still no 6.2 kernel support for ZFS. Really fucks up basically all leading edge distros. My Tumbleweed has been on 6.2.x for what feels like ages now.

Can’t wait for 2.1.10 or even 2.2.

New features sound nice and all, like progress on RAIDZ for special vdev or shared L2ARC. But RAIDZ for storing metadata and small stuff just screams “write amplification”. Same problem RAIDZ has with low record/blocksize stuff.
The most interesting stuff for me is an ARC change to separate data and metadata MFU and MRU, which really optimizes the eviction policy for metadata and comes with a tunable parameter. This will help many pools keep their metadata in memory, which is especially useful for pools with little memory or above-average amounts of metadata.

Log · April 17, 2023, 7:00pm

Note that while ZFS v2.1.10 has been released, you should hold off on updating as there appears to be a data corruption bug that may be caused by an intermittent race condition, and thus wasn’t caught: Data corruption with 519851122b1703b8 ("ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()") · Issue #14753 · openzfs/zfs · GitHub

So you’ll need to revert the specific commit or wait for v2.1.11 for official kernel 6.2 compatibility.

Log · April 25, 2023, 7:00pm

The bug fix release is now out: Release zfs-2.1.11 · openzfs/zfs · GitHub

Log · October 13, 2023, 11:00pm

This is a very interesting update, lots of new stuff including block cloning.

As always, I’d strongly suggest waiting a few months for a minor point release (e.g. “2.2.1”) to come out before upgrading, unless you have something that can take the risk of testing. There are often some unpleasant bugs that get found once a major release is made.

Supported Platforms

Linux: compatible with 3.10 - 6.5 kernels
FreeBSD: compatible with releases starting from 12.2-RELEASE

New Features

Block cloning (#13392) - Block cloning is a facility that allows a file (or parts of a file) to be “cloned”, that is, a shallow copy made where the existing data blocks are referenced rather than copied. Later modifications to the data will cause a copy of the data block to be taken and that copy modified. This facility is used to implement “reflinks” or “file-level copy-on-write”. Many common file copying programs, including newer versions of /bin/cp on Linux, will try to create clones automatically.
Linux container support (#12209, #14070, #14097, #12263) - Added support for Linux-specific container interfaces such as renameat(2), support for overlayfs, idmapped mounts in a user namespace, and namespace delegation support for containers.
Scrub error log (#12812, #12355) - zpool status will report all filesystems, snapshots, and clones affected by a shared corrupt block. zpool scrub -e can be used to scrub only the known damaged blocks in the error log to perform a fast, targeted repair when possible.
BLAKE3 checksums (#12918) - BLAKE3 is a modern cryptographic hash algorithm focused on high performance. It is much faster than sha256 and sha512, and can be up to 3x faster than Edon-R. BLAKE3 is the recommended secure checksum.
Corrective “zfs receive” (#9372) - A new type of zfs receive which can be used to heal corrupted data in filesystems, snapshots, and clones when a replica of the data already exists in the form of a backup send stream.
Vdev properties (#11711) - Provides observability of individual vdevs in a programmatic way.
Vdev and zpool user properties (#11680) - Lets you set custom user properties on vdevs and zpools, similar to the existing zfs dataset user properties.
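
The block cloning entry above can be pictured with a toy model. This is only a sketch of the general idea (shared blocks plus copy-on-write), not how OpenZFS implements it; `ToyPool` and all of its methods are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPool:
    """Hypothetical in-memory pool: block id -> data, with a refcount per block."""
    blocks: dict = field(default_factory=dict)   # block id -> bytes
    refs: dict = field(default_factory=dict)     # block id -> reference count
    files: dict = field(default_factory=dict)    # file name -> list of block ids
    next_id: int = 0

    def _alloc(self, data: bytes) -> int:
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        self.refs[bid] = 1
        return bid

    def _release(self, bid: int) -> None:
        self.refs[bid] -= 1
        if self.refs[bid] == 0:          # last reference gone: free the block
            del self.refs[bid]
            del self.blocks[bid]

    def create(self, name: str, chunks: list) -> None:
        self.files[name] = [self._alloc(c) for c in chunks]

    def clone(self, src: str, dst: str) -> None:
        # Shallow copy: reference the existing blocks instead of copying data.
        ids = list(self.files[src])
        for bid in ids:
            self.refs[bid] += 1
        self.files[dst] = ids

    def write(self, name: str, index: int, data: bytes) -> None:
        # Copy-on-write: the modified file gets a new block, the other
        # file keeps pointing at the old one.
        old = self.files[name][index]
        self.files[name][index] = self._alloc(data)
        self._release(old)

    def read(self, name: str) -> bytes:
        return b"".join(self.blocks[b] for b in self.files[name])


pool = ToyPool()
pool.create("a", [b"AAAA", b"BBBB"])
pool.clone("a", "b")
print(len(pool.blocks))                    # 2: the clone added no data blocks
pool.write("b", 1, b"CCCC")
print(pool.read("a"), pool.read("b"))      # b'AAAABBBB' b'AAAACCCC'
print(len(pool.blocks))                    # 3: only the modified block was duplicated
```

The point of the model is that a clone costs only references until one side is modified, and then only the touched blocks take extra space.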

Performance

Fully adaptive ARC (#14359) - A unified ARC which relaxes the artificial limits imposed by both the MRU/MFU distribution and data/metadata distinction. This allows the ARC to better adjust to highly dynamic workloads and minimizes the need for manual workload-dependent tuning.
SHA2 checksums (#13741) - Optimized SHA2 checksum implementation to use hardware acceleration when available.
Edon-R checksums (#13618) - Reworked the Edon-R variants and optimized the code to make several minor speed ups.
ZSTD early abort (#13244) - When using the zstd compression algorithm, data that can not be compressed is detected quickly, avoiding wasted work.
Prefetch improvements (#14603, #14516, #14402, #14243, #13452) - Extensive I/O prefetching analysis and optimization.
General optimization (#14121, #14123, #14039, #13680, #13613, #13606, #13576, #13553, #12789, #14925, #14948) - Numerous performance improvements throughout.

Additional Information

Documentation - OpenZFS documentation for Linux and FreeBSD.
Change log - Complete v2.1.0 - v2.2.0 change log. Thanks to 202 contributors.
Module options - The default values for the module options were selected to yield good performance for the majority of workloads and configurations. They should not need to be tuned for most systems but are available for performance analysis and tuning. See the module parameters documentation for the complete list of the options and what they control.
New module options
Removed module options
Modified module options

Offtopic: It took me way too long to figure out how to align the expand and collapse formatting to the prior bullet points.

Click here for my notes

# Header
* Bullet1 - These have **no** spaces in front of the *
* Bullet2
* Bullet3
  <details open>
  <summary>Dropdown1 which loads expanded</summary>

  * Item1 - This first item must have a blank line above it, and **two** spaces before the *
  * Item2
  * Item3
  * Item4
  * Item5
  </details>
  <details>
  <summary>Dropdown2 which loads collapsed</summary>

  * Item1
  * Item2
  * Item3
  * Item4
  * Item5
  </details>

This results in:

Header

Bullet1 - These have no spaces in front of the *
Bullet2
Bullet3
Dropdown1 which loads expanded
- Item1 - This first item must have a blank line above it, and two spaces before the *
- Item2
- Item3
- Item4
- Item5
Dropdown2 which loads collapsed
- Item1
- Item2
- Item3
- Item4
- Item5

Exard3k · October 14, 2023, 11:59am

On corrective “zfs receive”: Is this a revamp? I remember this feature from shortly after some 2.0 release, and I’m pretty sure I used the -c flag in the past. Or is this something new?

On the scrub error log: scrubbing 200T+ isn’t trivial, so I’m sure many pools will benefit from this. It’s situational, but if you face the situation, you can save yourself days.

On vdev properties: I’m not sure what difference it will make or what possibilities it will bring. Maybe to optimize and tweak the heterogeneous vdev arrangements we often see in homelab scenarios? Not sure.

On the fully adaptive ARC: this is a real game changer, and I’ve seen the devs talking about it at the OpenZFS Leadership meetings. Gone are the days where your ARC evicts metadata seemingly at random and in bulk. You could get this under control with tuning parameters, but having this adaptability out of the box is great for us homelabbers and basically removes the need for most special metadata vdevs, because ARC will now take metadata just as seriously as data, with counters and weight in the algorithm.

You want cached metadata? You can rely on your ARC to do this out of the box. Amazing work from Alexander Motin and others.

On ZSTD early abort: LZ4’s early abort feature was the holy grail of the last decade. ZSTD is great, but it wasted so much CPU on barely compressible stuff.
I’ll probably change all my remaining LZ4 datasets to ZSTD. I only kept LZ4 because of the early abort on barely compressible data.
Great news for the average compressratio of basically any pool out there.

xzpfzxds · October 14, 2023, 12:42pm

zfs send has always had a -c for compression, zfs receive didn’t?

Exard3k · October 14, 2023, 12:51pm

Yeah I probably confused recv with send. The recv -c stands for corrective instead of compression.

Log · October 14, 2023, 6:24pm

Here’s some interesting and amusing info on how the early abort feature works, from a Hacker News comment:

Posted by rincebrain

To add something amusing but constructive - it turns out that one pass of Brotli is even better than using LZ4+zstd-1 for predicting this, but integrating a compression algorithm just as an early pass filter seemed like overkill. Heh.

It’s actually even funnier than that.

It tries LZ4, and if that fails, then it tries zstd-1, and if that fails, then it doesn’t try your higher level. If either of the first two succeed, it just jumps to the thing you actually requested.

Because it turns out, just using LZ4 is insanely faster than using zstd-1 is insanely faster than any of the higher levels, and you burn a lot of CPU time on incompressible data if you’re wrong; using zstd-1 as a second pass helps you avoid the false positive rate from LZ4 having different compression characteristics sometimes.

Source: I wrote the feature.
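
The decision order described in that comment can be sketched roughly as follows. This is not the OpenZFS code: Python’s standard library has neither LZ4 nor zstd, so zlib level 1 stands in for LZ4, zlib level 3 for zstd-1, and zlib level 9 for the higher level you actually requested. The function name and the `min_saving` threshold are assumptions made for the example.

```python
import os
import zlib


def compress_with_early_abort(block: bytes, min_saving: float = 0.125):
    """Return (method, payload) for one block.

    Stand-ins (assumptions): zlib 1 ~ LZ4 probe, zlib 3 ~ zstd-1 probe,
    zlib 9 ~ the expensive level the user asked for.
    min_saving is a hypothetical "worth keeping" threshold.
    """
    limit = len(block) * (1 - min_saving)

    def shrinks(level: int) -> bool:
        return len(zlib.compress(block, level)) <= limit

    # Cheapest probe first; the second probe only runs if the first fails.
    if not shrinks(1) and not shrinks(3):
        return ("raw", block)            # both probes failed: skip the expensive level
    packed = zlib.compress(block, 9)     # a probe succeeded: do the requested work
    if len(packed) > limit:
        return ("raw", block)
    return ("high", packed)


noise = os.urandom(16384)                # effectively incompressible
text = b"openzfs " * 2048                # highly repetitive
print(compress_with_early_abort(noise)[0])   # raw
print(compress_with_early_abort(text)[0])    # high
```

Incompressible blocks pay for two cheap probes at most, and the expensive level only ever runs on data that at least one probe says is worth it.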

Exard3k · October 14, 2023, 6:50pm

I’m not enough of a maths nerd to dig deeper into those algorithms, but I remember seeing the LZ4 abort mechanic used as a helper algorithm to boost ZSTD performance. If this gets the job done, so be it…LZ4 speed is insane.

Can’t wait for QAT (which is also mentioned in the linked thread) to support LZ4 and ZSTD. Or other standardized hardware offloading, be it CPU, GPU or whatever. If a freaking 6-core Ryzen (or attached iGPU) could offload ZSTD compression like QAT can, that’s your perfect storage server CPU right there: compressing 100Gbit/sec without the need to boost any of the cores.

Those lazy coders…writing code and developing new features just to avoid copy&pasting some URL from your browser into some forums

Exard3k · November 1, 2023, 9:28pm

ZFS Interface for Accelerators (ZIA) is an upcoming feature so we get hardware offloading on a lot of stuff. So if you want QAT or DPUs to turbo-up your CPU-intensive ZFS workload, that’s it.

Compression, checksumming, and RAIDZ are the most prominent beneficiaries.

Developed for OpenZFS by Los Alamos National Laboratory…they’ve got 3.2TB/s of Infiniband and 100PB of storage according to the slides. So yeah, you need some offloading if you don’t want to buy and run thousands of CPU cores just for that. With 100PB, every 0.01 of compressratio is equal to a disk shelf worth of drives.
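
A rough back-of-the-envelope check of that last claim, as a sketch only. It assumes the 100PB figure is logical (uncompressed) data, uses decimal units (1PB = 1000TB), and picks a hypothetical 20TB drive size; none of those numbers come from the slides.

```python
import math


def physical_pb(logical_pb: float, compressratio: float) -> float:
    """Physical space needed to store logical_pb at the given compressratio."""
    return logical_pb / compressratio


def saved_pb(logical_pb: float, ratio: float, delta: float = 0.01) -> float:
    """Physical space saved when compressratio improves from ratio to ratio + delta."""
    return physical_pb(logical_pb, ratio) - physical_pb(logical_pb, ratio + delta)


def drives_saved(logical_pb: float, ratio: float, drive_tb: float = 20.0) -> int:
    """Whole drives saved, assuming hypothetical drive_tb-sized drives."""
    return math.ceil(saved_pb(logical_pb, ratio) * 1000 / drive_tb)


print(round(saved_pb(100, 1.00), 3), drives_saved(100, 1.00))   # 0.99 50
print(round(saved_pb(100, 2.00), 3), drives_saved(100, 2.00))   # 0.249 13
```

Under these assumptions, going from 1.00 to 1.01 frees roughly a petabyte, which is in disk-shelf territory. The same 0.01 step is worth less the higher the starting ratio already is.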

Sharkbyte · November 2, 2023, 8:18pm

I thought RAIDZ expansion made it into this update, am I wrong?

Log · November 2, 2023, 9:34pm

It’s only in testing, not in the master repository, as far as I’m aware. Getting closer, but not quite there yet.

Even when it does drop, you’ll want to wait a few more months for others to painfully run into the edge cases.

Exard3k · November 2, 2023, 9:47pm

Yeah…looks good. It’s been quite the odyssey ever since it was sponsored by the FreeBSD Foundation. But you don’t want to roll out these kinds of things if you are not 100% confident. We’re talking about ZFS after all.

Also keep in mind that RAIDZ expansion leaves a permanent footprint in memory (it uses RAM). It gets smaller over time as you modify and delete stuff…a small price if you consider the next-level wizardry necessary to make this possible at all while still keeping all of RAIDZ’s features, like avoiding the RAID5 write hole.

I personally don’t have a use case for this. I don’t think making a wide RAIDZ vdev even wider is a good choice because you eventually end up with a single 20-wide vdev instead of 3x 7-wide. Adding a vdev gets more expensive the more you use RAIDZ expansion.

But if talking about pools of 3-7 disks, it’s great and that’s probably where people use it.

Log · November 2, 2023, 10:15pm

I’m personally of the same inclination, as the feature seems to mostly be demanded by those who want to perform a risky operation, but don’t even have backups to do that age old “backup, check your goddamn backup, destroy, recreate, restore from backup” dance. I’m certainly not against the feature, but I’m hoping people don’t footgun themselves.

What I actually look forward to regarding the feature is there being one less reason that people can complain about pointlessly. Fortunately for those that want to derail online forum discussions, they’ll always have “license issues” as the gift that keeps on giving.

In other FS news, in the upcoming 6.7 kernel:

Supposedly BTRFS will be having its raid5/6 write hole addressed: Btrfs updates for 6.7 with new raid-stripe-tree | Hacker News, but I don’t know enough about what’s involved to say if that’s really true.
BcacheFS is apparently getting merged as well.

Exard3k · November 2, 2023, 10:59pm

I’ve seen some news a couple of weeks/months ago at Phoronix. Looked convincing, but sacrifices performance as a compromise. I run BTRFS on 3 machines…no need for parity RAID there.

But that ASUS NVMe consumer NAS has ext4 and BTRFS. So we see BTRFS spreading and parity RAID as an option is always good to have. Synology certainly lost their edge with SHR and I bet we see many more BTRFS NAS products in the future. ZFS isn’t for everyone and that’s totally fine.

I admire the dedication…but there is still a lot of work to do in terms of features before I would consider it. I’m a spoiled ZFS kid.

My personal focus is on Ceph right now. And I really want that Crimson-OSD update (soon). It’s the Ceph equivalent of ZFS Direct_IO, boosting NVMe capability by quite a bit. CPU cycles on NVMe drives are insanity…2-4 cores per NVMe drive on heavy writes. The metric that is used is IOPS/Core, and that tells you a lot about where the bottlenecks are.

Exard3k · November 3, 2023, 4:17pm

RAIDZ Expansion does indeed seem to be coming VERY soon.

Fun fact: you can’t expand a 255-wide RAIDZ because a width of 255 is hardcoded as the maximum. So expansion is not infinite.