OpenZFS 2.2.0 Silent Data Corruption Bug

It’s rare and unusual. You don’t just have some creeping bug causing trouble that nobody cares about. I’m not too worried about my Proxmox 7.4-3 and my TrueNAS Core with 2.13.x. But I always have a backup anyway. The backup pool is still on an older 2.0.0 or sth like that. Scrub runs fine, all well.

The attention the problem is now getting is certainly beneficial in fixing it (and other things in the process). But I doubt every pool on the planet is a ticking bomb.


My understanding with this issue is that this silent data corruption will NOT be caught with scrub. I think that this is why people are so concerned about it.

(Because I would also think that if scrub caught it, then it probably wouldn’t be as big of a deal as it is now. But from what I’ve been able to read about this – scrub does NOT catch this.)

Well…that’s the thing – I don’t know.

I mean – that’s the killer about silent data corruption, right? You don’t know that your data has been corrupted silently until you need said data, and find out that it’s been corrupted where you had thought and/or assumed that it was not corrupted all along.

I think that’s where at least some of the major risk lies (is inherent in the silent nature of this data corruption).

Following on the remark that @Bronek made about async writes being a condition for this to occur: ZFS pools under Proxmox have, by default, the sync property set to standard, which, from what I’ve been able to read about the ZFS sync property, means that it’s at least A form of sync writes.

What isn’t clear to me is whether this issue also affects ZFS pools that have this sync=standard property, because if it does, then the risk to all ZFS pools worldwide could potentially be much greater than we might think right now – that is, if this bug/issue exists even without async writes and even without the block cloning feature.

This is a part of the reason why I want to run the Solaris tests.

I haven’t seen anything about “silent”. We have checksums for that, and comparing checksums is an easy operation. If the blocks are wrong, ZFS will notice. And that’s why we know this exists in the first place.

And we know block cloning triggers a potential corruption issue. Not that it isn’t detectable – if it weren’t, we wouldn’t know about it. The cause is unknown as of today, but we know what the consequences are; we even had people on the forums affected by it. ZFS knew something was wrong for them – they didn’t have to dig into internals or stumble over corrupt data. ZFS told them there was a problem.

That’s the default for every ZFS. And async writes are the normal kind of writes. Most stuff is async. There have been Zettabytes written in async this year…it’s fine.

You’re exaggerating. It won’t end the world, there won’t be an apocalypse. It will get fixed and we’ll go on. And if you’re not on 2.2.0, you’re probably fine and will get a backport soon. And we have backups with snapshots, so we’re prepared for disaster recovery…just in case.

Just keep calm.


In the original issue report on GitHub, it states:

" ZFS does not see any errors with the pool.

$ zpool status
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:07:24 with 0 errors on Wed Nov  1 00:06:45 2023
config:

        NAME                                          STATE     READ WRITE CKSUM
        zroot                                         ONLINE       0     0     0
          nvme-WDS100T1X0E-XXXXXX_XXXXXXXXXXXX-part2  ONLINE       0     0     0

errors: No known data errors

$ zpool status -t
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:07:24 with 0 errors on Wed Nov  1 00:06:45 2023
config:

        NAME                                          STATE     READ WRITE CKSUM
        zroot                                         ONLINE       0     0     0
          nvme-WDS100T1X0E-XXXXXX_XXXXXXXXXXXX-part2  ONLINE       0     0     0  (100% trimmed, completed at Tue 31 Oct 2023 11:15:47 PM GMT)

errors: No known data errors

(Source: some copied files are corrupted (chunks replaced by zeros) · Issue #15526 · openzfs/zfs · GitHub)

(If you do a CTRL+F for “silent”, it shows up.)

I mean – the title of this thread is “OpenZFS 2.2.0 Silent Data Corruption Bug”.

(also cf. here and here)

That’s not what the title of the GitHub issue that was reported states.

(Also, skimming through some of the other comments, other users have reported known, actual data corruption as well, so it isn’t just a potential issue. It’s happening out in the wild/field.)

That’s not what this states:

sync=standard
  This is the default option. Synchronous file system transactions
  (fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log)
  and then secondly all devices written are flushed to ensure
  the data is stable (not cached by device controllers).

According to that, this is like a hybrid between full sync and full async writes because of the existence of the ZFS intent log (whereby, in the event of a sudden loss of power, when the system is powered back on, it will detect that there is a difference between the data that’s been written out to the devices vs. the ZIL, so it will play back the ZIL to make sure that all of the data is flushed to the devices).

What isn’t as explicitly clear is whether this flush to devices will wait for confirmation from the device, back to ZFS, that the flush is 100% completed and successful before moving on.

I invite you to read the comments from the github issues thread.

…in the meantime… this is still an open problem, and as it has also been noted in the comments on the github issue thread – they don’t really fully understand HOW this happens, but just that it does, and it is exacerbated by the block cloning feature of OpenZFS 2.2.0.

Only if you aren’t backing up and snapshotting corrupted data, and/or you have backups and snapshots from PRIOR to said corruption, which may not necessarily be universally true.

Think about it – if the answer to every storage/data corruption problem is “restore from backup”, then NONE of them would really be problems at all.

You shouldn’t have to be constantly restoring from backup and/or reverting back to a snapshot due to persistent (silent) data corruption.

I can’t tell you how many people use ZFS AS their backup solution.

When it comes to potential data corruption issues – I would MUCH rather overcall the severity of the issue than undercall it (because in the latter case, you are at a greater risk of ending up with corrupted data than in the former).

I can’t see how trying to make sure that the data is safe and not corrupted is a bad thing.


Sync=standard just means ZFS does what the application wants. If it’s async, then ZFS does async. If the application wants a sync, ZFS will sync. sync=disabled or sync=always just makes ZFS use only async or only sync no matter what the application calls for.

If you are afraid of async writes (you shouldn’t), just set sync=always. And have fun with it :joy: And no, I don’t get money from Optane salesmen :confused:
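
If anyone actually wants to go that route, it’s a one-liner per dataset (the pool/dataset names below are just placeholders):

# Force every write for this dataset to be treated as synchronous
zfs set sync=always tank/important

# Confirm the property and where it was set from
zfs get sync tank/important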

2.2.0, correct. We know this. Don’t use 2.2.0 or block cloning. I’ll check back for Christmas '24, certainly not earlier…I’m not an advocate for rolling-release storage :slight_smile:

I like the article on TheRegister. I don’t read them that much these days, but I like the writing…it’s very objective and journalistic. Thanks for the link!

That’s not true.

sync=standard means that it is a synchronous write to the ZFS intent log (so if the ZIL reports back that the write has been committed to the ZIL, then ZFS thinks it’s good to go and moves on to the next operation). It doesn’t wait for the ZIL flush before moving on (which is what sync=always does).

Also, cf.:

I made a new fs with sync=always and mounted it as the Portage TMPDIR, and while it improved the situation (likely by changing the timing), I was still able to reproduce this issue.

/var/tmp/portage/dev-lang/go-1.21.4/image # file usr/lib/go/pkg/tool/linux_amd64/vet
usr/lib/go/pkg/tool/linux_amd64/vet: data

/var/tmp/portage/dev-lang/go-1.21.4/image # hexdump -C usr/lib/go/pkg/tool/linux_amd64/vet
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000fa0  00 00 00 00 00 00 00 00  00 00 00 00 69 74 6a 6e  |............itjn|
00000fb0  6f 69 72 46 39 78 39 66  6b 4c 73 59 4d 77 47 50  |oirF9x9fkLsYMwGP|
00000fc0  2f 59 69 50 4c 58 72 47  65 55 53 57 56 65 4c 76  |/YiPLXrGeUSWVeLv|
00000fd0  4d 4c 6f 74 36 2f 44 53  30 46 37 55 4d 66 5f 6c  |MLot6/DS0F7UMf_l|
00000fe0  45 66 72 32 68 6e 6b 62  71 32 2f 47 57 4e 70 78  |Efr2hnkbq2/GWNpx|
00000ff0  5f 68 2d 43 78 38 69 6d  63 34 76 6b 37 74 62 00  |_h-Cx8imc4vk7tb.|
00001000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
003954a0  69 74 6a 6e 6f 69 72 46  39 78 39 66 6b 4c 73 59  |itjnoirF9x9fkLsY|
003954b0  4d 77 47 50 2f 59 69 50  4c 58 72 47 65 55 53 57  |MwGP/YiPLXrGeUSW|
003954c0  56 65 4c 76 4d 4c 6f 74  36 2f 44 53 30 46 37 55  |VeLvMLot6/DS0F7U|
003954d0  4d 66 5f 6c 45 66 72 32  68 6e 6b 62 71 32 2f 47  |Mf_lEfr2hnkbq2/G|
003954e0  57 4e 70 78 5f 68 2d 43  78 38 69 6d 63 34 76 6b  |WNpx_h-Cx8imc4vk|
003954f0  37 74 62 00 00 00 00 00  00 00 00 00 00 00 00 00  |7tb.............|
00395500  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00595460  00 00 00 00 00 00 00                              |.......|
00595467

Edit to add:

/var/tmp/portage/dev-lang/go-1.21.4/image # findmnt /var/tmp/portage/
TARGET           SOURCE              FSTYPE OPTIONS
/var/tmp/portage zroot/local/portage zfs    rw,relatime,xattr,posixacl,casesensitive

/var/tmp/portage/dev-lang/go-1.21.4/image # zfs get sync zroot/local/portage
NAME                 PROPERTY  VALUE     SOURCE
zroot/local/portage  sync      always    local

(via https://github.com/openzfs/zfs/issues/15526#issuecomment-1815149384)

Again, if you skim through the comments on the Github issues page – it’s been stated that they don’t think that this issue is ONLY limited to OpenZFS 2.2.0 with block cloning.

I used to read them more, but these days I find that I don’t have as much time as I used to for reading articles about stuff in general.

I like their style of writing, and how they also present the technical information objectively.

You’re welcome.

As TheReg notes, “This issue is not related to block cloning, although it is possible that enabling block cloning increases the probability of encountering the issue.”

Thanks.

It is.

https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops.7.html#sync

It complies with sync requests, as I mentioned two postings above, and otherwise uses the default behaviour (async).
Lying about it and essentially denying the requests is what sync=disabled is.

async doesn’t deal with the ZIL as it isn’t necessary for TXGs. It only comes into play for sync because you need ad-hoc stable storage, which is either a part of the disks (default) or a separate SLOG (because random writes are fast on NVMe).

And I can confirm that my clients not demanding sync calls don’t result in screaming HDD noises. It’s handled in memory. That’s the main reason why async is so great on ZFS.

So sync isn’t a solution. I don’t think I understand the underlying issue, but many administrators will be happy to not have to go sync=always to mitigate it.

Looks like there is an upcoming fix according to the Proxmox forum link. We’ll see about that. I remain cautious and calm. Business as usual and not upgrading to 2.2.x :wink:


Really hope there is a fix for this and that it applies to existing pools. The mitigations aren’t clear, or don’t seem to be comprehensive. That being said, I haven’t detected any problems with the zeroing of files. Just migrated everything over from btrfs; I could go back, but I really thought ZFS was an improvement in features and speed.

For those that use the proxmox opt-in kernel, the recent release includes backported fixes from ZFS 2.2.1.

https://old.reddit.com/r/Proxmox/comments/1877yaz/bugfix_now_available_for_dataloss_bug_in_zfs/

You will likely need to reboot to reload the module.
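
If you want to double-check which module is actually loaded after the update and reboot, something like this should do it:

# Version of the currently loaded ZFS kernel module
cat /sys/module/zfs/version

# Userland tools and kernel module versions reported together
zfs version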


I like how the Proxmox team is grabbing code off GitHub and backporting stuff. They’re a small company in comparison to e.g. RedHat or SuSE. And 2.2.0 was opt-in, not like in TrueNAS Scale where everyone got it.

I’m still on Proxmox 7.4 enterprise repo. “Never change a running system”. But this entire debacle will certainly be solved once I get my new Server and I might opt for Proxmox 8. I think Ceph 18.2 (Reef) is a bit too fresh for my liking too. I’m confident that Proxmox will release this for enterprise repo somewhere in Q2 or Q3 '24.

It will fix the bug but it won’t fix the damage that has been done. If you write the wrong stuff with wrong checksums in the first place, you just can’t repair it. Old snapshots should work from my understanding. ZFS not messing with already written stuff can be handy.

I’m not sure how the bugfix handles the block cloning feature flag. Because you can’t just disable it once the pool worked and ran with that feature.
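
You can at least see where a given pool stands; something like this should show it (the pool name is a placeholder, and IIRC the module tunable only exists from 2.2.1 onwards):

# enabled vs. active shows whether the feature has actually been used on this pool
zpool get feature@block_cloning tank

# On 2.2.1+, block cloning is additionally gated behind a module tunable (0 = off)
cat /sys/module/zfs/parameters/zfs_bclone_enabled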

I use both, I like both. I found ZFS easier to work with despite it having more features. You’ll have a good time. And with that bug…yeah, you migrated with bad timing. Don’t get used to it, it won’t get any worse. ZFS has a good reputation for a reason. But everything has bugs; better to have it public and fix it than closed source and “our product is safe, trust me!”.

ZFS users losing the uptime counter internet battle? A bug is bad, but resetting your uptime is worse :wink: (I reboot every 2 weeks, don’t tell anyone)

I don’t think it touches this feature, the fix is just adding more dirty dnode contingencies from my understanding.


This script is nested in the reddit link, but I thought I’d post it directly here because it is going to be very useful (credit to bsdice); it should allow people to scan for any corruption from this specific bug that might have accumulated on their pools:

#!/bin/bash
# Scan a directory tree for files whose first 4 KiB is all zeros (credit: bsdice).
# Note: process substitution and read -d '' need bash rather than plain sh.

# BLAKE2b checksum of 4096 bytes of zeros
B2SUM_4K_ZEROS="f18c1ba69b093034faa8ef722382f086547fdd178186ae216b5ae2434a318691032f3e7f55e8acc841d07e7519e33c0f850afba006ec4ef9e2fc6cd91c7d5dfb"

if [ $# -eq 0 ]; then
    echo "Syntax: $0 <directory>" >&2
    exit 1
fi

while IFS= read -r -d '' file; do
    # Checksum only the first 4 KiB of each file
    SUM=$(dd if="$file" bs=4096 count=1 status=none | b2sum -b | cut -f1 -d' ')
    if [ "$SUM" = "$B2SUM_4K_ZEROS" ]; then
        echo "File starts with 4K worth of zeros: $file"
    fi
done < <(find "$1" -type f -print0)
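
If you save it as, say, scan-zeros.sh (the name is just a placeholder), running it against a dataset looks like:

chmod +x scan-zeros.sh
./scan-zeros.sh /tank/mydata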

Users of this should note: There will be false positives, and it cannot detect cases where garbage data (not straight zeros) overwrote something.

wait, how does that happen? I thought the way the bug was manifesting was that sparsity was inserted into files because of the dirty dnode fall through race condition.

It’s more like there are legitimate cases where files may have contiguous zeros which would cause false positives.

Also, there is a gist on GitHub that explains how the bug was introduced. Block cloning was just a red herring that happened to trigger the bug more frequently, because GNU coreutils 9.0+ changed a default flag, causing more frequent use of SEEK_HOLE/DATA operations when block cloning is enabled, and the dirty dnode check for those was flawed. The original 2013 commit for SEEK_HOLE/DATA had the flawed check; it’s just that these were very rarely used operations until recently due to new features.
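
If you want to see the hole-reporting behaviour the gist talks about in action, a quick sparse-file demo with GNU coreutils looks something like this (the paths are placeholders):

# Create a 1 GiB file that is all hole, with no data actually written (hypothetical path)
truncate -s 1G /tank/test/sparse

ls -lh /tank/test/sparse    # apparent size: 1G
du -h /tank/test/sparse     # allocated size: ~0, because the hole is reported

# coreutils 9.x cp probes for holes (SEEK_HOLE/SEEK_DATA) and keeps the copy sparse
cp /tank/test/sparse /tank/test/sparse-copy
du -h /tank/test/sparse-copy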


So, if I am interpreting this for the average ZFS user and not the file system nerds: this issue is actually kind of difficult to trigger in an average use case, particularly before 2.2.0.

Lots of places point to how it’s not cause for panic.

Even on 2.2.0, it requires high throughput parallel operations on primarily slow disks to reliably trigger.

Okay, I think it’s a really good idea to preface this at the top of the original post.

The panic headlines aren’t helping right now, and we have a lot of users on the forum who are relatively new to ZFS.

Referencing the bolded part: a lot of people need not hurry, based on what you just said.

They shouldn’t distro hop to get the latest version. They shouldn’t compile it. They should just patiently wait for the fixes, and if they are extra paranoid they can take it easy on their pool for a bit.
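
For the extra-paranoid crowd, the stopgap that was being passed around in the GitHub thread (and which, as I understand it, reduces the window for the race but does not fully eliminate it) is flipping the dmu_offset_next sync tunable off until a fixed release lands; something like:

# Partial mitigation discussed in the issue thread; takes effect immediately,
# but does not persist across reboots unless also set as a module option
echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

# Check the current value
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync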


If you read the OpenZFS documentation for ZIL, it also states:

“The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous transactions.”

Therefore, if you put the sync=standard description from the OpenZFS documentation together with that statement about the ZIL – put two and two together – that is what you get.

The Oracle ZFS documentation actually combines the two in a more succinct manner:

" |sync|String|standard|Determines the synchronous behavior of a file system’s transactions. Possible values are:

  • standard, the default value, which means synchronous file system transactions, such as fsync, O_DSYNC, O_SYNC, and so on, are written to the intent log.
  • always, ensures that every file system transaction is written and flushed to stable storage by a returning system call. This value has a significant performance penalty.
  • disabled, means that synchronous requests are disabled. File system transactions are only committed to stable storage on the next transaction group commit, which might be after many seconds. This value gives the best performance, with no risk of corrupting the pool.

Caution - This disabled value is very dangerous because ZFS is ignoring the synchronous transaction demands of applications, such as databases or NFS operations. Setting this value on the currently active root or /var file system might result in unexpected behavior, application data loss, or increased vulnerability to replay attacks. You should only use this value if you fully understand all the associated risks.”

(Source: Introducing ZFS Properties - Managing ZFS File Systems in Oracle® Solaris 11.3)

Well…it’s only HALF “lying” about it.

The ZIL satisfies POSIX requirements for synchronous transactions.

Therefore, if sync=standard writes it out to the ZIL, then by how OpenZFS has documented this mode of operation, that is what makes their statement about sync=standard satisfying the POSIX requirements for synchronous transactions true.

Except for the fact that ZIL isn’t the actual, underlying storage device.

So, by introducing the ZIL, the “standard” “sync” write in ZFS is actually a hybrid: to the device, it’s an asynchronous write, but to the ZIL, it’s synchronous (and that satisfies the POSIX requirement for said synchronous transaction).

The OpenZFS documentation doesn’t connect the two dots for readers automatically, but the Oracle ZFS documentation does.

(And I would imagine that it’s probably been like that ever since ZFS was invented at Sun Microsystems.)

(And also IIRC, this was introduced to help speed up writes without being “fully” asynchronous, because back then, UltraWide SCSI disks still sucked, especially for random writes. So, if you can collect a bunch of them in the ZIL, and then flush the ZIL to disk, you can make that process a bit faster.)

I think what’s confusing to people (and even sometimes for myself) is that the separate LOG device is the SECOND (or dedicated) ZFS intent log device. (call it “ZIL2” if you want.) Because that’s what it really, actually is.

“By default, the ZIL is allocated from blocks within the main pool. However, you can obtain better performance by using separate intent log devices such as NVRAM or a dedicated disk.”

(Source: Creating a ZFS Storage Pool With Log Devices - Managing ZFS File Systems in Oracle® Solaris 11.3)

Both the Oracle ZFS documentation and the OpenZFS documentation say this:

“By default, the ZIL is allocated from blocks within the main pool”

(Sources: Creating a ZFS Storage Pool With Log Devices - Managing ZFS File Systems in Oracle® Solaris 11.3, zpoolconcepts.7 — OpenZFS documentation)

So, based on that – the ZFS intent log, by default – takes up space on the devices/virtual devices that make up the pool.

Therefore, the only thing that a separate (ZFS intent) log device does is “suck up all of the ZFS intent log from the devices in the pool” and move it onto that separate (ZFS intent) log device. That’s all that it does.
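
In command terms, that’s literally all it is – you attach a log vdev and the ZIL traffic lands there instead (pool name and device path are placeholders):

# Put the ZIL for synchronous writes onto a dedicated device
zpool add tank log /dev/disk/by-id/nvme-example-slog

# The log vdev then shows up under its own "logs" section
zpool status tank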

As far as POSIX compliance for synchronous transfers goes – given that the ZIL satisfies said POSIX requirement for synchronous transfers – therefore:

sync (String, default: standard) - Determines the synchronous behavior of a file system's transactions. Possible values are:

* standard, the default value, which means synchronous file system transactions, such as fsync, O_DSYNC, O_SYNC, and so on, are written to the intent log.

To me, at least, it looks like the OpenZFS docs “partially” mirrored (copied?) the Sun ZFS docs and then revamped them a little bit to make them “their own”, but in doing so, they might have edited out some rather important details about how it works.

This is why I will often refer to the Sun/Solaris ZFS Administration Guides for the documentation (given that it started off as an OpenSolaris project and then got rolled into “mainline” Solaris a little bit later). (I remember looking at the presentations from Jeff Bonwick about ZFS when it was introduced into “mainline” Solaris 10.)

And I also vaguely remember him and two other guys talking about why they invented the ZFS intent log (UltraWide and/or Ultra160 SCSI drives in the Sun Ultra 20 and Sun Ultra 45 workstations back then were still very slow), so the idea was, rather than writing to the drive frequently, especially for random writes – you “collect” the data in the ZFS intent log and then flush the ZIL to disk periodically, which reduces the disk thrashing that comes from a lot of tiny writes to the disk.

Therefore, if you write it to the ZFS intent log, and that is deemed to meet the POSIX requirements for synchronous transfers, then sync=standard would be synchronous to the ZFS intent log, but asynchronous to the actual (storage) device.

(Again, I refer to the original Solaris ZFS Administration Guide for details since this was how the inventors of ZFS made it.)

Are you using HDDs or SSDs on your clients?

I would think that if you’re using SSDs – it probably isn’t making a bunch of noise (maybe other than some SSDs having above average coil whine) – but if you’re using HDDs – I would think that modern HDDs are built very differently than the UltraWide SCSI drives back when ZFS was invented.

And systems now also typically have more RAM than they did back then, which means that, between that and how ZFS uses RAM for caching, the TXG is able to collect more data in the buffer before flushing to disk.

That was certainly NOT the case on Sun Ultra 45 workstations back then.
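
On the TXG point above: on Linux, you can see the interval that async writes are batched under via the module tunable (5 seconds is the default, IIRC):

# How often (in seconds), at the latest, ZFS commits a transaction group to disk
cat /sys/module/zfs/parameters/zfs_txg_timeout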

I agree.

However, again, from reading the comments on the GitHub issue that has been raised – the high-level synopsis that I’ve been able to infer from skimming the comments is that they’re not really 100% sure how it happened and/or how long this issue has existed. The way I think about it is that there is a possibility that this was present even in the original Sun ZFS code, but the probability may have been something like 1 in 1e12, and the block cloning thing made the issue vastly more apparent. But as has also been noted, this isn’t strictly a block cloning issue – the block cloning feature just brought this bug to the forefront of the consciousness.

From what I’ve read so far on github, etc. – the “fix” is that they’re going to disable this feature until the devs understand more about it.

I think that there’s a LOT of speculation about the root cause of the issue, but without the data and the evidence to confirm or deny those theories – they’re just theories at this point (like brainstorming, where you then test each hypothesis individually to see which hypotheses are supported by data vs. those that aren’t).

But like I said – this is why I am interested in testing this with Sun/Solaris ZFS to see whether it was prevalent in the “original” ZFS code vs. whether this was brought about (or exacerbated) by changes that were made to the OpenZFS code/project. (cf. terse 15526.md · GitHub)

At this point, I don’t think that anybody knows for sure yet.

I’m trying to test it on Solaris. Their test script needs quite a lot of modifications because some of the command calls aren’t available in Solaris, and with me NOT being a programmer/developer – I REALLY don’t know what I am doing when I try to change the script to get it to work on Solaris.

I think that IF this issue ONLY manifests with sync=standard or sync=disabled, then I would surmise that sync=always CAN potentially be a solution to this, IF async writes (or standard) is a condition/requirement for this silent data corruption to occur.

But I agree - it’s not a great solution. But until the devs can deep-dive this further, figure out the root cause, and then implement a permanent corrective action for it, there may be an argument to be made that slow data is better than corrupted data.

So I don’t know.

Right now, I’m just focused on Solaris testing.

Thanks.

(Sorry it took so long for me to reply – I started writing my response early this morning, but then work, then life, got in the way, so I’m only JUST now getting around to sending my reply. Thanks.)


Most definitely, but the script should only be looking for files with 4K of zeros at the beginning, which doesn’t seem to occur organically very often, so false positives shouldn’t be as prevalent as with just looking for holes in files randomly.

I might sound a little sadistic but I think we need more attention, headlines and agitation on this. This issue could have been quashed back in 2021 when evidence of it was first reported but it didn’t get enough people digging into the problem.


My comment is based on the summary here by Rincebrain: terse 15526.md · GitHub

Basically, the bug when triggered by the block cloning code seems to mostly express as zeros where they shouldn’t be, but that’s not the only way for the bug to express.