OpenZFS 2.2.0 Silent Data Corruption Bug

If you’re running OpenZFS 2.2.0, upgrade to 2.2.1 or 2.2.2 immediately.

This was mentioned in another topic, but I wanted to repost it here as its own thread because it might bite some folks badly.

OpenZFS 2.2.0 appears to have a bad data corruption bug with the new block cloning feature that replaces data with zeroes with no warning, no detection, no indication at all of anything wrong until you try to read your data and realize it’s bad. Scrubs do not fix it.

Issue thread: some copied files are corrupted (chunks replaced by zeros) · Issue #15526 · openzfs/zfs · GitHub

Separate discussion on detecting corruption: How to detect bad block cloning? · Issue #15554 · openzfs/zfs · GitHub

2.2.1 was released 4 days ago; it turns off the block cloning feature by default to stop further corruption: Release zfs-2.2.1 · openzfs/zfs · GitHub

2.1.x and earlier versions are unaffected.
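If you’re not sure which OpenZFS you’re running, 2.x will tell you directly (output naming varies slightly by distro):

zfs version    # prints the userspace and kernel module versions, e.g. zfs-2.2.0 / zfs-kmod-2.2.0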

UPDATED DEC 1

OpenZFS 2.2.2 released late Nov 30th

Summary:

Note : This release contains an important fix for a data corruption bug. Full details are in the issue (#15526) and bug fix (#15571). There’s also a developer’s bug summary that gives a good overview. We recommend everyone either upgrade to 2.2.2 or 2.1.14 to get the fix. The bug can cause data corruption due to an incorrect dirty dnode check. This bug is very hard to hit, and really only came to light due to changes in cp in coreutils 9.x. It’s extremely unlikely that the bug was ever hit on EL7 or EL8, when running cp since they all use coreutils 8.x which performs file copies differently.
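(If you want to know whether your cp is from the coreutils 9.x line that exposed this, GNU cp reports its coreutils version:)

cp --version | head -1    # e.g. "cp (GNU coreutils) 9.4"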

5 Likes

Thank god for LTS Kernels and FreeBSD. All my stuff is still running on 2.11 and 2.13. But we already had a thread about the bug…people just like to run the latest and “greatest”.

3 Likes

Does anyone know how early a ZFS version this corruption bug affects?
It’s now coming to light that the bug has been around since before 2.2.0, still exists after 2.2.0, and that the block cloning feature just made the problem more likely to encounter.

FreeBSD project just sent out an advisory

Dear FreeBSD community,

We want to bring your attention to a potential data corruption issue
affecting multiple versions of OpenZFS. It was initially reported
against OpenZFS 2.2.0 but also affects earlier and later versions.

This issue can be reproduced with targeted effort but it has not been
observed frequently in real-world scenarios. This issue is not related
to block cloning, although it is possible that enabling block cloning
increases the probability of encountering the issue.

It is unclear if the issue is reproducible on the version of ZFS in
FreeBSD 12.4.

https://lists.freebsd.org/archives/freebsd-stable/2023-November/001726.html

1 Like

So after more investigation, this silent data corruption issue with ZFS has been happening in the wild since at least Feb 11, 2021, and was probably introduced into the original ZFS code back in 2006.

The silent corruption issue became more prevalent with the release of ZFS 2.1.4 back in March of last year, and release 2.2.0 opened the floodgates; it finally got attention after being falsely attributed to the block cloning feature in 2.2.0.

Would be really interesting if someone decided to test for this bug on Solaris.

2 Likes

If there is a script that I can run in Solaris, I’ll test it in Solaris 10 (or whatever the oldest version of Solaris I am able to find and install turns out to be).

(I think that the first time ZFS was moved from the OpenSolaris project to “mainline” Solaris was in the Solaris 10 6/06 (“U2”) release, so if someone has a test script I can run, I’ll try to get my hands on a copy of that Solaris 10 install ISO and run it.)

I started using ZFS in Solaris 10 as early as sometime between October 2006 and January 2007, because I still have the email thread from January 2007 where I reported an issue I was having with ZFS to Sun Microsystems Premium support. (The email dates back to Jan 3, 2007.)

2 Likes

Here’s the reproducer script to trigger it (the reproducer.sh from the GitHub issue):

2 Likes

Thank you.

Yeah, I posted the same question on the GitHub thread. They’re using a specific version of coreutils (or more specifically, a specific version of the cp command), and I’m not sure whether it will have the same issue or effect on Solaris 10 6/06 as it does on modern-day Linuxes.

I’m currently in the process of downloading Solaris 10 6/06 and I am going to try and see if I can at least get a VM up and running where I might be able to test this, but we’ll have to see.

It’ll also be interesting to see whether this issue manifests if, say, I only have one drive (which ZFS would technically treat as a striped ZFS pool, even though it’s only one drive) or whether I would need a different ZFS configuration to trigger it.
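(For the VM test I’ll probably just use a throwaway file-backed pool rather than a real disk; something like this, with the file path as a placeholder:)

mkfile 2g /var/tmp/zhammer.img              # Solaris way to create a 2 GB backing file
zpool create testpool /var/tmp/zhammer.img  # single-vdev "striped" pool, for testing only
zfs create testpool/work                    # dataset to point the reproducer's workdir at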

…and down this rabbit hole I go…lol…

1 Like

My read on the situation was that a bunch of stuff changing over successive versions just made the issue more and more likely: the zfs_dmu_offset_next_sync default change in 2.1.4, and then, most recently, the newer version of coreutils.

It seemed like most users could trigger corruption within minutes of running the script, but if they changed zfs_dmu_offset_next_sync as a mitigation, the corruption wasn’t so fast/likely… it did still happen, though. This makes me think it might take a long time to provoke the corruption on Solaris 10.
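For anyone stuck on an affected version for now, the tunable they were flipping lives here on Linux builds; note that setting it back to 0 only narrows the window, it does not eliminate the bug:

echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync    # runtime change
# to persist across reboots, e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_dmu_offset_next_sync=0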

I’m not really a developer, but from reading what’s going on, it doesn’t sound like they have a definitive answer on the root cause, just mitigations and things that are the cause in specific circumstances, like the dnode_is_dirty issue.

1 Like

I think that @wendell should make a video about this.

2 Likes

Agreed.

I’m not a programmer/developer either.

But because I’ve been using ZFS (technically) since at least 2007, maybe even 2006 – it would be interesting to see whether this happens or not.

Yeah, I don’t really know what’s in the code for ZFS in the original “mainlining” of ZFS into Solaris 10 6/06 vs. what it is now, so it’s hard to tell.

I’ll see whether that reproducer script even runs in Solaris 10 or not. It might not, as some of the commands in it are VERY Linux-specific (tools/commands that may not be found in a typical Solaris 10 install).
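Before diving in, I’ll probably just check which of the commands the script leans on even exist there, something like (just a sketch):

for c in dd tr cp cmp hexdump diff sha256sum; do
  command -v "$c" >/dev/null 2>&1 || echo "missing: $c"
done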

So, we’ll see how that goes.

1 Like

If the bug is old, is Oracle ZFS immune?

The script enables the block cloning feature (and other stuff), so you need OpenZFS 2.2.0. There is no block cloning feature flag in prior versions.

And Solaris 10 shouldn’t be able to run 2.2.0. But you can test this on Illumos, which should have access to 2.2.0.

All of this deals with the block cloning feature flag being enabled (which was added in 2.2.0).
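If you want to see where a given system stands, something along these lines should work (the pool name is a placeholder; the module parameter only exists on Linux builds of 2.2.1+):

zpool get feature@block_cloning tank                 # disabled / enabled / active
cat /sys/module/zfs/parameters/zfs_bclone_enabled    # 0 on 2.2.1+ unless it's been turned back on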

It sounds like versions prior to ~2006 were immune in Sun’s implementation, but still, that’s pretty early on in its life. That is assuming the “bad” dn_dirty_link code is the source of all the problems, which I’m not convinced of yet.
It would be sad/upsetting if this bug was fixed in Oracle’s ZFS after it became closed source and OpenZFS was just left out to dry.

You’re right, that script is relying on 2.2.0 features to quickly produce corruption. I suppose this is why people have been so unsure until recently about prior versions of ZFS corrupting data; there’s no easy way to quickly test for it.

Here’s a newer reproducer script that should be more universal (zhammer.sh):

What is interesting is that, in practice, the bug behaves like a race condition, and there is a “sweet spot” of load that triggers it often. It would seem that systems with HDDs are affected more than systems with SSDs.

1 Like

I think that depending on which article/media source you read, this can vary.

“However, it now appears, as of 1945 UTC, November 27, that this cloning feature simply exacerbates a previously unknown underlying bug.”

(Source: Data-destroying defect found after OpenZFS 2.2.0 release • The Register)

Furthermore, I don’t know enough about coding in general, let alone the state of the code at the point (in time and in the code base) when the OpenZFS project was created and split from Sun ZFS, so it’s difficult to tell how far back this goes.

From the same article in The Register, it looks like the block cloning feature introduced in OpenZFS 2.2.0 only exacerbated this problem; the problem may have been lurking in the background all along, and without block cloning, IF the bug was present, the chances of hitting it were significantly lower. The way I try to understand it: if you take something with a very low probability of occurrence and then make a change that increases that probability by, say, 1e6, then something that would have been statistically unlikely to happen is now actually happening and noticeable. Not being a developer, that’s the most I can gather from people’s comments on the GitHub thread about this issue.

This is part of the reason why I want to test the “original” ZFS implementation in “mainline” Solaris 10 (the 6/06 “U2” release): if the bug had been there all along, it’s just that now we know how to look for it.

So, again, quoting:

"…as Bronek Kozicki spells out on GitHub:

You need to understand the mechanism that causes corruption. It might have been there for decade and only caused issues in a very specific scenarios, which do not normally happen. Unless you can match your backup mechanism to the conditions described below, you are very unlikely to have been affected by it.

  1. a file is being written to (typically it would be asynchronously – meaning the write is not completed at the time when writing process “thinks” it is)
  2. at the same time when ZFS is still writing the data, the modified part of file is being read from. The same time means “hit a very specific time”, measured in microseconds (that’s millionth of a second), wide window. Admittedly, as a non-developer for ZFS project, I do not know if using HDD as opposed to SSD would extend that time frame.
  3. if it is being read at this very specific moment, the reader will see zeros where the data being written is actually something else
  4. if the reader then stores the incorrectly read zeroes somewhere else, that’s where the data is being corrupted"

I think that the requirement for the ZFS pool to be in async mode is important to note.

I don’t know whether that’s the default setting for Sun ZFS vs. OpenZFS, but I think it’s an important distinction (probably because if you are using synchronous writes, this probably won’t be an issue); it’s the effort to speed things up with async writes that can lead to this scenario. However, I also think that with async writes, this risk has ALWAYS been present, REGARDLESS of whether you’re using ZFS or not.
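For what it’s worth, the per-dataset knob for this is the sync property (standard is the default; tank/data is just a placeholder, and whether forcing it actually changes anything for this particular bug is a separate question):

zfs get sync tank/data           # standard | always | disabled
zfs set sync=always tank/data    # force synchronous semantics for every write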

I’m aware of this. But I was referring to the script @twin_savage posted. And at this point the only way to replicate the problem was using block cloning. And this just isn’t available to most people.

I learned Unix on Solaris 9 and 10 “back in the days!”. Good times with too expensive hardware, we had x86 later.

There are a LOT of commands/options that aren’t available in Solaris 10 1/13.

log isn’t available, ${BASHPID} isn’t available. zfs version isn’t available.

I’ve tried to comment out all of the lines that call those, but I think commenting them out is also causing other stuff to break.

This is the error message that I am getting right now:

# ./zhammer.sh
./zhammer.sh: line 111: unexpected EOF while looking for matching `"'

And here is a copy of my modified script file:

#!/usr/bin/env bash
set -euo pipefail

default_workdir="."
default_count=1000000
default_blocksize=16k
default_check_every=10000

workdir="$default_workdir"
count="$default_count"
blocksize="$default_blocksize"
check_every="$default_check_every"

if [ $# -ge 1 ]; then
  workdir="$1"
fi

if [ $# -ge 2 ]; then
  count="$2"
fi

if [ $# -ge 3 ]; then
  blocksize="$3"
fi

if [ $# -ge 4 ]; then
  check_every="$4"
fi

# log() {
#   echo "[zhammer::${BASHPID}] $1" >&2
# }

print_zfs_version_info() {
  local uname
  local version
  local kmod
  uname="$(uname -a)"
#  version="$(zfs version | grep -v kmod)"
#  kmod="$(zfs version | grep kmod)"

  local module
  local srcversion
  local sha256
#  module="$(modinfo -n zfs || true)"
#  srcversion="$(modinfo -F srcversion zfs || true)"

  if [ -f "$module" ]; then
    local cmd=cat
    if [[ "$module" = *.xz ]]; then
      cmd=xzcat
    fi

    sha256="$("$cmd" "$module" | sha256sum | awk '{print $1}')"
  fi

#   log "Uname: $uname"
#   log "ZFS userspace: $version"
#   log "ZFS kernel: $kmod"
#   log "Module: $module"
#   log "Srcversion: $srcversion"
#   log "SHA256: $sha256"
}

if [ ! -d "$workdir" ] || [ ! "$count" -gt 0 ] || [ -z "$blocksize" ] || [ ! "$check_every" -gt 0 ]; then
  log "Usage: $0 <workdir=$default_workdir> <count=$default_count> <blocksize=$default_blocksize> <check_every=$default_check_every>"
  exit 1
fi

# log "zhammer starting!"

print_zfs_version_info

# log "Work dir: $workdir"
# log "Count: $count files"
# log "Block size: $blocksize"
# log "Check every: $check_every files"

# Create a file filled with 0xff.
cd "$workdir"
# prefix="zhammer_${BASHPID}_"
# dd if=/dev/zero bs="$blocksize" count=1 status=none | LC_ALL=C tr "\000" "\377" > "${prefix}0"
dd if=/dev/zero bs="$blocksize" count=1 | LC_ALL=C tr "\000" "\377" > zhammer_0"

cleanup() {
  rm -f "$prefix"* || true
}

trap cleanup EXIT

for (( n=0; n<=count; n+=check_every )); do
  log "writing $check_every files at iteration $n"
  h=0
  for (( i=1; i<=check_every; i+=2 )); do
    j=$((i+1))
    cp --reflink=never --sparse=always "${prefix}$h" "${prefix}$i" || true
    cp --reflink=never --sparse=always "${prefix}$i" "${prefix}$j" || true
    h=$((h+1))
  done

  log "checking $check_every files at iteration $n"
  for (( i=1; i<=check_every; i++ )); do
    old="${prefix}0"
    copy="${prefix}$i"
    if [ -f "$old" ] && [ -f "$copy" ] && ! cmp -s "$old" "$copy"; then
      log "$old differed from $copy!"
      hexdump -C "$old" > "$old.hex"
      hexdump -C "$copy" > "$copy.hex"

      log "Hexdump diff follows"
      diff -u "$old.hex" "$copy.hex" >&2 || true

      print_zfs_version_info
      exit 1
    fi
  done
done
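The substitutions I’m thinking of trying next look something like this (an untested sketch, not a working port):

log() { echo "[zhammer::$$] $1" >&2; }    # $$ in place of ${BASHPID}; the bash 3.x on Solaris 10 has no BASHPID
prefix="zhammer_$$_"                      # keep prefix defined so cleanup() and the copy loops still work
zpool upgrade -v | head -3                # closest thing I can find to "zfs version" on Solaris 10
# sha256sum would become "digest -a sha256", and Solaris cp has no --reflink/--sparse at all,
# so those cp lines would need plain cp (which changes what the test actually exercises)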

Yeah, so (again – I’m not a developer, so this is pushing me quite far beyond my limits), but in skimming through the reproducer.sh script file that the author originally used, I would agree with you that it was originally calling for zfs_bclone…

But since I only learned about this via the article from The Reg, which mentioned that it might be a problem that has always existed/been lurking in the background, I was looking to test a) whether it existed in Sun ZFS and b) how bad it is.

I understand that people can replicate the block cloning issue with the script quite easily and quite readily. But now, I am testing to see if this problem exists without said block cloning script.

(more along the lines of a root cause analysis)

Yeah – my dad used to work at one of the banks in Canada, in the server room, so we were running either Solaris 2.6 or Solaris 7 on x86 hardware as early as 1997, IIRC.

I remember running Solaris 8 (originally) on a Pentium MMX 166 because we didn’t have a lot of money, and dad wanted to keep up with his Solaris skills, so he brought the Solaris Certified System Administrator (SCSA) exam books home for him (and me) to work out of, and so I learned a bit of that just by doing it.

And then, I ended up using and deploying a Solaris web server (which was also my Samba server at the time) and I remember needing the packages from SunFreeware (which was actually free back then) to be able to bring Samba and Apache online because the packages that shipped with the OS didn’t work out of the box, and I also remember that by then, I was teaching my dad how to do this stuff.

Right now, I have Solaris 10 1/13 running on my AMD Ryzen 9 5900HX mini PC via Oracle VirtualBox (and I am using the Common Desktop Environment (CDE), whatever version it is now). But yeah, it’s DEFINITELY a throwback for me as well.

So, I’ve pretty much always learned it on x86 hardware first, and it wasn’t until I got to university (ca. 2007), when I bought a used Sun Ultra 45 workstation, that I got to experience Solaris on SPARC. Good times.

It’s taken me a few seconds to remember some of the more basic commands, because I now use Linux wayyyy more than Solaris, so I have to “flip my brain” around a little bit to remember what some of those Solaris commands are and which commands (from Linux) aren’t available in Solaris.

But yeah, as far as testing that more generic script goes – I am likely going to need some help with that.

1 Like

Check out Illumos. It’s the open-source continuation of OpenSolaris. And it has a modern OpenZFS. All the OpenZFS devs are either FreeBSD or Illumos guys.

1 Like

Thank you.

Yeah, I saw that.

But what I am trying to test is whether this bug has been lurking in the background all along, vs. the OpenZFS project having changed something when it was created and split off from Sun ZFS.

Or, to put the question a slightly different way: how safe is my data (currently all residing under Proxmox 7.4-3), or do I need to run an “emergency backup” of all of my data (to tape), deploy Solaris, and have ZFS managed under Solaris instead?

That’s the risk assessment that’s really going through my head right now, and that’s what’s driving this Solaris testing.