The ZFS updates thread - Latest: 2.2.0

This is a thread for posting and discussing ZFS updates and newly added features for those interested, rather than making a new thread for each one.

Updates:

  • 2.2.0: Big update with block cloning
  • 2.1.9: Bugfix
  • 2.1.8: Linux kernel 6.1 compatibility
1 Like

Compatibility and bug fixes for this one.

https://github.com/openzfs/zfs/releases/tag/zfs-2.1.8

Supported Platforms

  • Linux: compatible with 3.10 - 6.1 kernels
  • FreeBSD: compatible with releases starting from 12.2-RELEASE

Changes

  • change how d_alias is replaced by du.d_alias #14377
  • Linux ppc64le ieee128 compat: Do not redefine __asm on external headers #14308 #14384
  • include systemd overrides to zfs-dracut module #14075 #14076
  • Activate filesystem features only in syncing context #14304 #14252
  • Illumos #15286: do_composition() needs sign awareness #14318 #14342
  • dracut: fix typo in mount-zfs.sh.in #13602
  • removal of LegacyVersion broke ax_python_dev.m4 #14297
  • FreeBSD: catch up to 1400077 #14328
  • Fix shebang for helper script of deb-utils #14339
  • Add quotation marks around $PATH for deb-utils #14339
  • Documentation corrections #14298 #14307
  • systemd: set restart=always for zfs-zed.service #14294
  • Add color output to zfs diff.
  • libzfs: diff: simplify superfluous stdio #12829
  • libzfs: diff: print_what() can return the symbol => get_what() #12829
  • FreeBSD: Remove stray debug printf #14286 #14287
  • Zero end of embedded block buffer in dump_write_embedded() #13778 #14255
  • Change ZEVENT_POOL_GUID to ZEVENT_POOL to display pool names #14272
  • Restrict visibility of per-dataset kstats inside FreeBSD jails #14254
  • Fix dereference after null check in enqueue_range #14264
  • Fix potential buffer overflow in zpool command #14264
  • FreeBSD: zfs_register_callbacks() must implement error check correctly #14261
  • fgrep → grep -F #13259
  • egrep → grep -E #13259
  • Update META to 6.1 kernel #14371
  • ztest fails assertion in zio_write_gang_member_ready() #14250 #14356
  • Introduce ZFS_LINUX_REQUIRE_API autoconf macro #14343
  • linux 6.2 compat: bio->bi_rw was renamed bio->bi_opf #14324 #14331
  • linux 6.2 compat: get_acl() got moved to get_inode_acl() in 6.2 #14323 #14331
  • Linux 6.1 compat: open inside tmpfile() #14301 #14343
  • ZTS: close in mmapwrite.c #14353
  • ZTS: limit mmapwrite file size #14277 #14345
  • skip permission checks for extended attributes
  • Allow receiver to override encryption properties in case of replication
  • zed: unclean disk attachment faults the vdev
  • FreeBSD: Fix potential boot panic with bad label #14291
  • Add workaround for broken Linux pipes #13309
  • initramfs: Fix legacy mountpoint rootfs #14274
  • vdev_raidz_math_aarch64_neonx2.c: suppress diagnostic only for GCC
  • tests: mkfile: usage: () → (void)
  • Use Ubuntu 20.04 and remove Ubuntu 18.04 from workflows #14238
  • dracut: skip zfsexpandknowledge when zfs_devs is present in dracut #13121
1 Like

Well, that was fast. The last release had an issue involving encryption that caused errors to be reported back when there weren't any; this release reverts that particular patch.

https://github.com/openzfs/zfs/releases/tag/zfs-2.1.9

Supported Platforms

  • Linux: compatible with 3.10 - 6.1 kernels
  • FreeBSD: compatible with releases starting from 12.2-RELEASE

Changes

  • linux 6.2 compat: zpl_set_acl arg2 is now struct dentry
  • Revert "ztest fails assertion in zio_write_gang_member_ready()" #14413
1 Like

Waiting for Ubuntu Lunar, I will get zfs 2.1.9 from there.

The internet says that there is no ZFS option anymore in the recent Lunar Lobster beta ISO installer. Seems like Canonical is phasing out ZFS support :frowning:

So much for the ZFS-on-root easy mode. Ditching Flatpak and ZFS at the same time would make me reconsider Ubuntu on the two machines I'm running it on.

And still no 6.2 kernel support for ZFS. Really fucks up basically all leading edge distros. My Tumbleweed is on 6.2.x for what feels like ages now.

Can't wait for 2.1.10 or even 2.2.

New features sound nice and all, like progress on RAIDZ for the special vdev or a shared L2ARC. But RAIDZ for storing metadata and small blocks just screams "write amplification", the same problem RAIDZ already has with low recordsize/blocksize data.
The most interesting item for me is an ARC change that separates data and metadata in the MFU and MRU, which really optimizes the eviction policy for metadata and comes with a tunable parameter. This will help many pools keep their metadata in memory, and is especially useful for pools with little memory or above-average amounts of metadata.
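Once 2.2 lands on your distro you can poke at that tunable directly. A minimal sketch (Linux paths); I believe the knob is called zfs_arc_meta_balance, so treat the name as an assumption and check `man 4 zfs` on your build:

```sh
# Read the current metadata/data balance weight (name assumed to be
# zfs_arc_meta_balance on OpenZFS 2.2 -- verify against man 4 zfs).
cat /sys/module/zfs/parameters/zfs_arc_meta_balance

# Raise the weight to bias eviction further towards keeping metadata
# (higher = metadata valued more, if I'm reading the new docs right).
echo 1000 | sudo tee /sys/module/zfs/parameters/zfs_arc_meta_balance
```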

1 Like

Note that while ZFS v2.1.10 has been released, you should hold off on updating as there appears to be a data corruption bug that may be caused by an intermittent race condition, and thus wasn't caught: Data corruption with 519851122b1703b8 ("ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()") · Issue #14753 · openzfs/zfs · GitHub

So you'll need to revert the specific commit or wait for v2.1.11 for official kernel 6.2 compatibility.
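If you build from source and want to stay on 2.1.10 in the meantime, reverting is the usual git-and-rebuild dance. A rough sketch; the hash below is the one from the issue title, so the backported commit on the 2.1-release branch may have a different hash and the revert may not apply cleanly:

```sh
# Check out the 2.1.10 tag, revert the commit blamed in the issue, rebuild.
git clone https://github.com/openzfs/zfs.git && cd zfs
git checkout zfs-2.1.10
git revert 519851122b1703b8   # hash from the issue title; may differ on the release branch
sh autogen.sh && ./configure && make -j"$(nproc)"
sudo make install && sudo ldconfig
```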

1 Like

The bug fix release is now out: Release zfs-2.1.11 · openzfs/zfs · GitHub

1 Like

This is a very interesting update, with lots of new stuff including block cloning.

As always, I'd strongly suggest waiting a few months for a minor point release (e.g. "2.2.1") to come out before upgrading, unless you have something that can take the risk of testing. There are often some unpleasant bugs that get found once a major release is made.

Supported Platforms

  • Linux: compatible with 3.10 - 6.5 kernels
  • FreeBSD: compatible with releases starting from 12.2-RELEASE

New Features

  • Block cloning (#13392) - Block cloning is a facility that allows a file (or parts of a file) to be "cloned", that is, a shallow copy made where the existing data blocks are referenced rather than copied. Later modifications to the data will cause a copy of the data block to be taken and that copy modified. This facility is used to implement "reflinks" or "file-level copy-on-write". Many common file copying programs, including newer versions of /bin/cp on Linux, will try to create clones automatically. (A quick CLI sketch of a few of these features follows this list.)
  • Linux container support (#12209, #14070, #14097, #12263) - Added support for Linux-specific container interfaces such as renameat(2), support for overlayfs, idmapped mounts in a user namespace, and namespace delegation support for containers.
  • Scrub error log (#12812, #12355) - zpool status will report all filesystems, snapshots, and clones affected by a shared corrupt block. zpool scrub -e can be used to scrub only the known damaged blocks in the error log to perform a fast, targeted repair when possible.
  • BLAKE3 checksums (#12918) - BLAKE3 is a modern cryptographic hash algorithm focused on high performance. It is much faster than sha256 and sha512, and can be up to 3x faster than Edon-R. BLAKE3 is the recommended secure checksum.
  • Corrective "zfs receive" (#9372) - A new type of zfs receive which can be used to heal corrupted data in filesystems, snapshots, and clones when a replica of the data already exists in the form of a backup send stream.
  • Vdev properties (#11711) - Provides observability of individual vdevs in a programmatic way.
  • Vdev and zpool user properties (#11680) - Lets you set custom user properties on vdevs and zpools, similar to the existing zfs dataset user properties.
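A few of these are easy to try from the shell once you're on 2.2. A minimal sketch with made-up pool/dataset/vdev names (tank, tank/data, raidz2-0); double-check the man pages on your build, since some of the exact spellings here are from memory:

```sh
# Block cloning: newer coreutils cp will attempt a clone (reflink) on its own;
# asking for it explicitly makes the intent clear. Both files share blocks
# until one of them is modified.
cp --reflink=auto /tank/data/bigfile /tank/data/bigfile.clone

# BLAKE3 as the checksum for new writes (needs the blake3 pool feature enabled).
zfs set checksum=blake3 tank/data

# Scrub only the blocks already recorded in the pool's error log.
zpool scrub -e tank

# Vdev properties: query a single vdev, then tag it with a user property
# (the property name org.example:location is just an example).
zpool get all tank raidz2-0
zpool set org.example:location=shelf-3 tank raidz2-0
```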

Performance

  • Fully adaptive ARC (#14359) - A unified ARC which relaxes the artificial limits imposed by both the MRU/MFU distribution and data/metadata distinction. This allows the ARC to better adjust to highly dynamic workloads and minimizes the need for manual workload-dependent tuning.
  • SHA2 checksums (#13741) - Optimized SHA2 checksum implementation to use hardware acceleration when available.
  • Edon-R checksums (#13618) - Reworked the Edon-R variants and optimized the code to make several minor speed ups.
  • ZSTD early abort (#13244) - When using the zstd compression algorithm, data that cannot be compressed is detected quickly, avoiding wasted work.
  • Prefetch improvements (#14603, #14516, #14402, #14243, #13452) - Extensive I/O prefetching analysis and optimization.
  • General optimization (#14121, #14123, #14039, #13680, #13613, #13606, #13576, #13553, #12789, #14925, #14948) - Numerous performance improvements throughout.

Additional Information

Off-topic: it took me way too long to figure out how to align the expand/collapse formatting with the prior bullet points.

Click here for my notes
# Header
* Bullet1 - These have **no** spaces in front of the *
* Bullet2
* Bullet3
  <details open>
  <summary>Dropdown1 which loads expanded</summary>

  * Item1 - This first item must have a blank line above it, and **two** spaces before the *
  * Item2
  * Item3
  * Item4
  * Item5
  </details>
  <details>
  <summary>Dropdown2 which loads collapsed</summary>

  * Item1
  * Item2
  * Item3
  * Item4
  * Item5
  </details>

This results in:

Header

  • Bullet1 - These have no spaces in front of the *

  • Bullet2

  • Bullet3

    Dropdown1 which loads expanded
    • Item1 - This first item must have a blank line above it, and two spaces before the *
    • Item2
    • Item3
    • Item4
    • Item5
    Dropdown2 which loads collapsed
    • Item1
    • Item2
    • Item3
    • Item4
    • Item5
3 Likes

Is this a revamp? I remember this feature from shortly after some 2.0 release, and I'm pretty sure I used the -c flag in the past. Or is this something new?

Scrubbing 200T+ isn't trivial, so I'm sure many pools will benefit from this. It's situational, but if you face the situation, you can save yourself days.

I'm not sure what differences and possibilities it will bring. To optimize and tweak the heterogeneous vdev arrangements we often see in homelab scenarios? Not sure.

This is a real game-changer, and I've seen the devs talking about it at the OpenZFS Leadership meetings. Gone are the days when your ARC evicts metadata seemingly at random and in large chunks. You could get this under control with tuning parameters, but having this adaptability out of the box is great for us homelabbers and basically removes the need for most special metadata vdevs, because the ARC will now take metadata just as seriously as data, with counters and weights in the algorithm.

You want cached metadata? You can rely on your ARC to do this out of the box. Amazing work from Alexander Motin and others.
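Those counters are visible straight from arcstats if you want to watch it adapt. A rough sketch (Linux path; the exact stat names vary a bit between releases, so I grep loosely):

```sh
# Watch how the ARC currently splits MRU/MFU and data/metadata.
grep -E 'mru|mfu|metadata_size|data_size' /proc/spl/kstat/zfs/arcstats

# arc_summary (ships with OpenZFS) presents the same numbers more readably.
arc_summary | less
```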

LZ4's early abort behaviour was the holy grail of the last decade. ZSTD is great, but wasted so much CPU on barely compressible stuff.
I'm probably changing all my remaining LZ4 datasets to ZSTD; I only kept LZ4 because of its early abort on barely compressible data.
Great news for the average compressratio of basically any pool out there.
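Worth remembering when switching datasets over: the property only affects newly written blocks, so old data keeps its LZ4 compression until it gets rewritten. A quick sketch with a made-up dataset name:

```sh
# Switch a dataset to zstd ("zstd" is level 3; zstd-1 .. zstd-19 also work).
zfs set compression=zstd tank/media

# Existing blocks stay lz4-compressed until rewritten (send/receive, or a
# plain copy). Check the effect on new writes via the compressratio property.
zfs get compression,compressratio tank/media
```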

1 Like

zfs send has always had a -c for compression, zfs receive didn't?

1 Like

Yeah, I probably confused recv with send. The recv -c stands for corrective rather than compressed.
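Easy to mix up, since the two -c flags sit on opposite ends of the pipe. A small sketch with made-up names; check zfs-send(8)/zfs-receive(8) for the exact constraints on corrective receive, since the stream has to match the damaged snapshot:

```sh
# zfs send -c: send blocks as they are compressed on disk (compressed stream).
zfs send -c tank/data@snap1 | ssh backup zfs receive backuppool/data

# zfs receive -c (2.2's "corrective receive"): feed a matching send stream back
# into an existing dataset to heal corrupted blocks instead of creating a new one.
zfs receive -c tank/data@snap1 < /backups/data-snap1.zfs
```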

Here's some interesting and amusing info on how the early abort feature works: edit: I can't tell if you edited this since I read it, or I just skipped the las... | Hacker News

Posted by rincebrain

To add something amusing but constructive - it turns out that one pass of Brotli is even better than using LZ4+zstd-1 for predicting this, but integrating a compression algorithm just as an early pass filter seemed like overkill. Heh.

Itā€™s actually even funnier than that.

It tries LZ4, and if that fails, then it tries zstd-1, and if that fails, then it doesn't try your higher level. If either of the first two succeed, it just jumps to the thing you actually requested.

Because it turns out, just using LZ4 is insanely faster than using zstd-1 is insanely faster than any of the higher levels, and you burn a lot of CPU time on incompressible data if you're wrong; using zstd-1 as a second pass helps you avoid the false positive rate from LZ4 having different compression characteristics sometimes.

Source: I wrote the feature.

I'm not enough of a maths nerd to dig deeper into those algorithms, but I remember seeing LZ4's abort mechanic pitched as a helper algorithm to boost ZSTD performance. If this gets the job done, so be it… LZ4's speed is insane.

Can't wait for QAT (which is also mentioned in the linked thread) to support LZ4 and ZSTD. Or other standardized hardware offloading, be it CPU, GPU or whatever. If a freaking 6-core Ryzen (or its attached iGPU) could offload ZSTD compression like QAT can, that's your perfect storage server CPU right there: compressing 100Gbit/sec without the need to boost any of the cores.

Those lazy coders… writing code and developing new features just to avoid copy&pasting some URL from your browser into some forums :wink:

ZFS Interface for Accelerators (ZIA) is an upcoming feature that will give us hardware offloading for a lot of stuff. So if you want QAT or DPUs to turbo-charge your CPU-intensive ZFS workloads, that's the one to watch.

Compression, checksumming, and RAIDZ are the most prominent beneficiaries.

Developed for OpenZFS by Los Alamos National Laboratory… they have 3.2TB/s of InfiniBand and 100PB of storage according to the slides. So yeah, you need some offloading if you don't want to buy and run thousands of CPU cores just for that. At 100PB, every 0.01 of compressratio is equal to a disk shelf's worth of drives.

1 Like

I thought RAIDZ expansion made it into this update, am I wrong?

It's only in testing, not in the master repository, as far as I'm aware. Getting closer, but not quite there yet.

Even when it does drop, you'll want to wait a few more months for others to painfully run into the edge cases.

1 Like

Yeah… looks good. It's been quite the odyssey ever since it was sponsored by the FreeBSD Foundation. But you don't want to roll out these kinds of things if you are not 100% confident. We're talking about ZFS after all.

Also keep in mind that RAIDZ expansion leaves a permanent footprint in memory (it uses RAM). It gets smaller over time as you modify and delete stuff… a small price if you consider the next-level wizardry necessary to make this even possible while still keeping all of RAIDZ's features, like avoiding the RAID5 write hole.

I personally don't have a use case for this. I don't think making a wide RAIDZ vdev even wider is a good choice because you eventually end up with a single 20-wide vdev instead of 3x 7-wide. Adding a vdev gets more expensive the more you use RAIDZ expansion.

But for pools of 3-7 disks it's great, and that's probably where people will use it.
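For what it's worth, the interface in the test branches looks like a plain zpool attach aimed at the raidz vdev instead of a mirror. A heavily hedged sketch, since it isn't merged yet and the syntax may still change:

```sh
# Proposed RAIDZ expansion (not in any release yet -- treat as illustrative):
# attach one new disk to an existing raidz vdev, widening it by one.
zpool attach tank raidz1-0 /dev/sdf

# The expansion/reflow progress shows up in zpool status while it runs.
zpool status tank
```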

1 Like

I'm personally of the same inclination, as the feature seems to mostly be demanded by those who want to perform a risky operation but don't even have backups to do that age-old "backup, check your goddamn backup, destroy, recreate, restore from backup" dance. I'm certainly not against the feature, but I'm hoping people don't footgun themselves.

What I actually look forward to regarding the feature is there being one less thing for people to complain about pointlessly. Fortunately for those who want to derail online forum discussions, they'll always have "license issues" as the gift that keeps on giving.

In other FS news, in the upcoming 6.7 kernel:

  1. Supposedly BTRFS will be having its raid5/6 write hole addressed (Btrfs updates for 6.7 with new raid-stripe-tree | Hacker News), but I don't know enough about what's involved to say if that's really true.

  2. bcachefs is apparently getting merged as well.

1 Like

I saw some news about it a couple of weeks/months ago on Phoronix. It looked convincing, but sacrifices some performance as a compromise. I run BTRFS on three machines… no need for parity RAID there.

But that ASUS consumer NVMe NAS ships with ext4 and BTRFS, so we're seeing BTRFS spread, and having parity RAID as an option is always good. Synology certainly lost their edge with SHR, and I bet we'll see many more BTRFS NAS products in the future. ZFS isn't for everyone and that's totally fine.

I admire the dedication… but there is still a lot of work to do in terms of features before I would consider it. I'm a spoiled ZFS kid.

My personal focus is on Ceph right now, and I really want that Crimson OSD update (soon :tm:). It's the Ceph equivalent of ZFS Direct IO, boosting NVMe capability by quite a bit. The CPU cycles spent on NVMe drives are insanity… 2-4 cores per NVMe drive on heavy writes. The metric used is IOPS/core, and it tells you a lot about where the bottlenecks are.

1 Like

RAIDZ expansion does indeed seem to be coming VERY soon:

Fun fact: you can't expand a 255-wide RAIDZ because a width of 255 is hardcoded as the maximum. So expansion is not infinite :wink:

2 Likes