OpenZFS 2.2.0 Silent Data Corruption Bug

My pool is HDDs (Toshiba Enterprise SATA) and they are very noisy on drive access; it's very old-school :slight_smile: But it also tells me immediately whether the drives are doing random I/O. Let's call it a "diagnostic feature". I had SCSI drives on my 2490-UW controller that were way more subtle in drive noise.

And I'm running with sync=disabled pool-wide. I like my RAM and know what I'm doing (mostly). And I know my drives don't do random I/O with that setting; it's all being aggregated in ARC. Well, random reads that aren't cache hits obviously make the drives work, but we were talking about writes.

The limit on the maximum amount of dirty data is a percentage of system memory, so it depends on how much RAM you have. You can change the percentage if you like, but the defaults are good and balanced so that data doesn't sit in memory for too long.
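For reference, a minimal sketch of how you might inspect those knobs on Linux; "tank" is a placeholder pool name, and I'm assuming the OpenZFS module parameters are exposed under /sys/module/zfs/parameters as usual:

```sh
# Show the sync setting for the pool (placeholder pool name "tank")
zfs get sync tank

# Current dirty-data limits exposed by the OpenZFS kernel module
cat /sys/module/zfs/parameters/zfs_dirty_data_max          # absolute cap in bytes
cat /sys/module/zfs/parameters/zfs_dirty_data_max_percent  # cap as a % of RAM

# Example: raise the absolute cap to 8 GiB for this boot only
# (persist it via /etc/modprobe.d/zfs.conf if you really mean it)
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max
```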

That would be quite the revelation. And 1 in 1e12 isn't likely to be hit often, but it certainly doesn't match today's standards for integrity. Even modern disks advertise 1 in 1e16 or 1e18.

Which is tragic, but also an opportunity to get this straight. We need to be sure ZFS is writing the same data we entrusted it with. If that isn't always the case, you can't rely on the other ZFS mechanisms that ensure integrity anymore: data that was corrupt when written is just verified/authenticated corruption as far as the filesystem is concerned.

No stress…a 2-page reply takes time. And forums are designed to be read over time, not as real-time communication. We've got a good discussion here.

My comment on this might have been slightly misleading: apparently block cloning is going to be disabled by default in 2.2.2; there is at least a worry that something else is going on when it is enabled that could cause problems beyond the original silent data corruption bug.


I can see that for newly created pools…but for existing pools, I can't see a way to disable it after creation. It just doesn't work that way. You can't disable compression either, because you've got compressed data on the pool, and the feature has to stay around for that data to remain readable.
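For what it's worth, you can at least see whether the feature has ever been activated on an existing pool. A quick sketch ("tank" is a placeholder pool name, and zfs_bclone_enabled is the 2.2.x tunable as I understand it, so check your version's docs):

```sh
# "enabled" means the feature is available but no clones exist yet;
# "active" means cloned blocks are already on disk and the flag can't go back
zpool get feature@block_cloning tank

# On 2.2.x, the module parameter that gates new clone operations
# (2.2.2 is supposed to ship with this off by default)
cat /sys/module/zfs/parameters/zfs_bclone_enabled
```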

Existing 2.2.x pools certainly aren't having a good time. But I'm glad that when 2.2.2 hits, it's disabled by default and I can "safely" upgrade from pre-2.2.0 straight to 2.2.2. I basically get the FreeBSD defaults :wink:


At least the 2.2.x pools will get the bug fix for the dirty dnode contingency that was the cause of at least the vast majority of the corruption issues.


That’s good. At least you’re somewhat safe. And future updates certainly will include further changes. And upgrading from 2.2.0 won’t make things worse…that’s the good part :slight_smile:

State of Emergency more or less…happens to the best of us. Didn’t expect this for ZFS after 20 years of existence.

“There is no such thing as bad publicity” Everyone needs their 15min of fame apparently.

https://cgit.freebsd.org/src/commit/?h=releng/14.0&id=305be1f36b3e7ec1e4ca3cb34f1704c00b1c8321

FreeBSD 14.0-RELEASE-p1 #0 releng/14.0-n265385-305be1f36b3e: Fri Dec  1 21:08:15 CET 2023

Yey :smiley:


You have the rare gift of promoting FreeBSD without being too direct or coming across as proselytizing. I should spin up some more jails for my stuff in the future.


2.2.2 dropped

So I think that at this point, it is important to clear up some misconceptions that people generally have about the adaptive replacement cache (ARC) (sources: here and here; a.k.a. "adaptive responsive cache", cf. here): per the blog post, it is almost exclusively a read cache. (Again, consult the Megiddo and Modha (IBM) paper that was published at FAST in 2003.)

(If you google it, there will also be a lot of places that erroneously call it the "adaptive read cache", even though that is not its name, it is just primarily what it is used for.)

Also, the blog post (which was originally hosted here, but no longer exists at that location) is where they describe, pictorially, how the adaptive replacement cache works. Again, even the pictures show that its primary mode of operation is for reads rather than for writes.

Yes and no.

For reads, your statement would be correct.

For writes – from what I have learned ever since ZFS was invented back in 2005-2006 – it gets a little bit more complicated, because writes, again, primarily go to the ZFS intent log rather than to the adaptive replacement cache. (Again, referring to the Megiddo and Modha paper: "A commonly used criterion for evaluating a replacement policy is its hit ratio – the frequency with which it finds a page in the cache." (Megiddo and Modha, p. 1).)

Thus, when it comes to writes, from reading what people have written about this, I think this is a HUGE misconception: a lot of the time, people tend to think of the adaptive replacement cache as being for both reads and writes, when that's not really how it works, given that writes are sent to the ZFS intent log rather than to the adaptive replacement cache. (cf. Steve Tunstall, Principal Storage Engineer, Oracle, here)

Also cf. Oracle ZFS Storage Appliance Analytics Guide (here).

Also per Steve Tunstall, quote: "(Logzillas, on the other hand, are for fast synchronous write acknowledgements, and have nothing to do with ARC at all)."

Likewise, this is reflected on page 51 of the ZFS Administration Guide:

"The ZFS intent log (ZIL) is provided to satisfy POSIX requirements for synchronous transactions. For example, databases often require their transactions to be on stable storage devices when returning from a system call. NFS and other applicatiosn can also use fsync() to ensure data stability.

By default, the ZIL is allocated from blocks within the main pool. However, better performance might be possible by using separate intent log devices, such as NVRAM or a dedicated disk.

Consider the following points when determining whether setting up a ZFS log device is appropriate for your environment:

  • Log devices for the ZFS intent log are not related to the database log files.

  • Any performance improvement seen by implementing a separate log device depends on the device type, the hardware configuration of the pool, and the application workload. For preliminary performance information, see this blog:

http://blogs.oracle.com/perrin/entry/slog_blog_or_blogging_on"

Basically, what this means is that the "dirty data" sitting in the buffer for the transaction group, for writes, is sitting in the ZFS intent log rather than in the adaptive replacement cache.

I think that it is important for people to realise and understand the distinction between these two as it is one of the most common misconceptions about ZFS and how it works.

Reads ==> adaptive replacement cache.
Writes ==> ZFS intent log.
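If anyone wants to watch the two sides for themselves rather than take my word for it, here is a rough sketch of what I would look at on an OpenZFS-on-Linux box ("tank" is a placeholder pool name):

```sh
# ARC side: hit/miss counters and current size
awk '/^(hits|misses|size|c_max) /{print $1, $3}' /proc/spl/kstat/zfs/arcstats

# Transaction group activity for the pool (dirty data being synced out)
tail -5 /proc/spl/kstat/zfs/tank/txgs

# ZIL side: whether a separate log device exists, and the sync policy in effect
zpool status tank | grep -A 2 logs
zfs get sync tank
```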

Well…that was what I was trying to test for, by trying to see if I could hammer a Solaris 10 ZFS install. But I ran into issues running zhammer.sh, as it uses some commands/options that are rather Linux-centric and aren't available on Solaris natively.

re: 1 in 1e16 or 1 in 1e18 – that's just the rated bit error rate on reads: it means that if you read 1e16 or 1e18 bits, you should expect roughly 1 bit out of that many to be wrong.

But that's just the storage device itself – not the FS/OS nor anything else running on top of it (and I would also bet that number was measured in a lab environment, with who knows how many layers of lead shielding) vs. an actual deployment environment, where we aren't trying to advertise the lowest possible number.
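To put those ratings into perspective, a quick back-of-the-envelope calculation, assuming you read a hypothetical 16 TB drive end to end exactly once:

```sh
# Expected bit errors from reading 16 TB (= 1.28e14 bits) at various rated error intervals
awk 'BEGIN {
    bits = 16e12 * 8                          # 16 TB expressed in bits
    n = split("1e14 1e16 1e18", rates, " ")
    for (i = 1; i <= n; i++)
        printf "1 in %s: ~%.4f expected bit errors per full read\n", rates[i], bits / rates[i]
}'
```

At 1 in 1e16 that works out to roughly 0.013 expected bit errors per full read, which is why the 1-in-1e12 figure mentioned earlier looks so out of place.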

Agreed.

Agreed – again – this is why I wanted to test this on ZFS version 32 in Solaris: to see whether ZFS has always had this issue and this feature/bug just brought it to the forefront, or whether there were other changes (i.e. in OpenZFS since this commit in 2013), along with changes to coreutils (on the Linux side), that changed and/or exacerbated a behaviour/problem that was otherwise previously lurking or unknown.

It's hard to say at this point because, again, there are a lot of theories and hypotheses from a lot of people, but only some of them have been supported by shared data, while the rest remain theories/hypotheses.

Hard to tell without testing.

Well…for what it's worth, I've been advocating AGAINST the use of ZFS for nearly the entire duration of its existence, and even more so now that more people are realising/discovering it thanks to OpenZFS (ZFS on *BSD and ZFS on Linux).

Like people were talking about ZFS on root for Ubuntu as being the newest, greatest thing since sliced bread, meanwhile, users (and former users) of Solaris are like “yeah…we’ve had that since October 2008 in the Solaris 10 10/08 (U6) release.” So it’s really NOT that special to us. (And yes, I know that the percentage of people who use Solaris is VERY, VERY, small.)

Part of the reason I can recall when I started using it is the email thread that I still have with Sun Premium Support (which I had purchased back then), about an unrecoverable issue I had with ZFS on January 3rd, 2007, where ZFS failed and was causing a kernel panic loop in Solaris, and not even the inventors of ZFS could help me fix it. So my very first experience with ZFS was one where it had problems that can lead to unrecoverable data loss, and since that WAS my backup server at the time, I couldn't restore from backup.

I only use it now, because when I did my mass consolidation project in January this year (2022), I was pulling in data, and then the drives, from 4 NAS servers down to a single system – out of convenience, really.

edit
@robn actually has a pretty good explanation over on github that, I think, is a worthwhile read.

I can’t say that I fully understand it all – but it’s still a worthwhile read, nevertheless.


To add to this – I wasn't able to get a Solaris 10 6/06 (U2) VM up and running in either Proxmox or Oracle VirtualBox, but I was able to get Solaris 10 1/13 (U11) up and running.

I was not able to re-write (or modify) the zhammer.sh script so that I could do the same kind of testing in said Solaris 10 1/13 VM (which I was able to create via Oracle VirtualBox first, and then also get up and running on my Proxmox 7.4-3 server).

Having said that, from reading the latest comments on the github issue thread, it sounds like this really happens when (Open)ZFS is used in conjunction with GNU coreutils, or any other software that uses SEEK_HOLE; that's when this issue is more likely to be present.
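If you want to see for yourself whether a given cp actually issues SEEK_HOLE/SEEK_DATA, a quick sketch using strace on Linux (truss would be the rough equivalent on Solaris; file names here are just placeholders):

```sh
# Build a test file with a hole at the end, then watch cp's lseek calls
dd if=/dev/urandom of=/tmp/src bs=1M count=4 2>/dev/null
truncate -s 16M /tmp/src

# coreutils 9.x cp should show lseek(..., SEEK_HOLE)/(..., SEEK_DATA) here;
# older cp implementations typically won't
strace -e trace=lseek cp /tmp/src /tmp/dst 2>&1 | grep -E 'SEEK_(HOLE|DATA)'
```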

Thus, in my Solaris 10 1/13 VM, where I haven't been able to run the zhammer.sh script (so this is without having tested it), Solaris gets cp, I am guessing, from the Solaris package SUNWadmc, "Core software libraries used for system administration".

(I haven't been able to explicitly confirm that that's the package cp comes from in Solaris, but that's what I've been able to find out about it so far, so I'm guesstimating that's where my Solaris VM got cp from.)

Therefore, given this, I am also further guesstimating that Solaris ZFS is likely to be safe from this bug, as the SEEK_HOLE/cp interaction with ZFS doesn't appear to be present in Solaris 10 1/13.

This is what I’ve been able to find out so far.

And as always, I can always be wrong, and if there are other people who are able to educate me more definitively, that would always be welcomed and greatly appreciated. But as far as I can tell so far, this issue doesn't appear to be present (or at least it would be highly unlikely) in Solaris ZFS.

(sidebar: I tested 640 million files via the parallel implementation of the zhammer.sh script on my 32-core/64-thread Proxmox server, which runs Proxmox 7.4-3 with coreutils 8.32-4+b1, zfsutils-linux 2.1.11-pve1, zfs-2.1.11-pve1, and zfs-kmod-2.1.11-pve1, and so far I haven't been able to find nor replicate this issue.)

Why the patch on 2.1.x if this affects 2.2.x? Sure seems confusing with all the talk about the bug in 2.2 but a patch for 2.1 as well.

Sorry for the probably dumb question, but skimming over this wall of posts didn’t help, though I’m sure I missed something important. I do see a lot of talk that this may have been around for years, maybe that’s why, for safety. It’s just not clear at all, at this time, is it?

I suppose the slightly good news is this could be hard to trigger, even if it exists on older systems. Like some have mentioned, I tend to keep production systems behind on ZFS, none are even at v2. I only have a few at home on 2.1.11.

The bug that was made more visible and triggered way more often by 2.2.0 had ultimately been in the ZFS code for a long time…during the recent OpenZFS leadership meeting, they talked about the 0.6.x versions and Illumos being affected too. So that's old stuff, and all versions can benefit from the fix. Although the circumstances to trigger it prior to 2.2.0 were exceedingly rare, there was still a chance.
All my ZFS versions received a backport as far as I've seen, so distros and maintainers were quick to respond with the fix.

They talk about it in the first 5-10 minutes, if you want the in-depth OpenZFS developer explanation.


I would definitely patch if you are using coreutils 9.0 or newer, regardless of what ZFS version you're on. Looking back, people seemed to be encountering this bug back on 2.0.3, but didn't understand what in ZFS was breaking.
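If you're not sure what you're running, the versions are quick to check; a small sketch (the `zfs version` subcommand exists on OpenZFS 0.8 and later, if memory serves):

```sh
# coreutils version: 9.x is where cp started leaning on SEEK_DATA/SEEK_HOLE
cp --version | head -1

# OpenZFS userland and kernel module versions
zfs version
```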

Thanks for the extra info. I’ll watch that for fun, but it will probably go over my head!

Got all my home Debian systems updated once I realized I didn't have the apt prefs set for backports. Many of my primary production servers are still on 0.8 on CentOS, so hopefully we aren't going to run into the bug there too. Although it sounds like we might. I haven't seen anything that might be corrupt, but maybe I won't until I least expect it.

EDIT:
just re-read, saw this. I bet I’m just fine on my older CentOS systems. Yay!

> really only came to light due to changes in cp in coreutils 9.x. It's extremely unlikely that the bug was ever hit on EL7 or EL8


I’m still reading through the issues, but this comment might be of interest when it comes to the failure rate:

So maybe I’m naive, but I think the chance of undetected corrupt data is quite unlikely for most scenarios.

I'm just about done scanning three servers with a total of ~160 TB of data, a mix of mostly videos and photos. Out of ~2.5 million files I found… 7 broken ones with empty blocks at the start. These are indeed damaged and cannot be opened.
For perspective's sake, it's 0.00027% corruption.
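For anyone who wants to run a similar sweep, here's a rough sketch of the kind of check involved: flag regular files whose first 4 KiB is nothing but zero bytes. It's only a heuristic (legitimately sparse or zero-leading files will also be flagged), and the path argument is a placeholder:

```bash
#!/bin/bash
# Usage: ./scan-zeros.sh /tank/data
# Flags non-empty regular files whose first 4 KiB contains only NUL bytes.
find "$1" -type f -size +0c -print0 |
while IFS= read -r -d '' f; do
    if ! head -c 4096 "$f" | tr -d '\0' | grep -q .; then
        printf 'suspicious: %s\n' "$f"
    fi
done
```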


The corruption is ONLY a PART of the problem.

If you read the original post from the bug report that was raised on Github, ZFS scrub DOES NOT detect this data corruption.

ZFS thinks that the ZEROS at the beginning of the file are VALID.

That, to me, is a MUCH bigger problem – the silent nature of this data corruption is a VASTLY bigger problem than the data corruption in and of itself.

And by the way, that's only after running either the reproducer.sh script or the zhammer.sh ZFS hammering script, once they had found out what they needed to look for so that they could MANUALLY find said data corruption.

The silent nature of the data corruption is, of course, more dangerous than the statistical improbability of corrupting your data/files.

Nah, that’s not really the issue. Scrub doesn’t detect data corruption because technically there is no corruption and the drive stores exactly the same data that was written to it.

The problem arises from reads that return zeros where they shouldn’t, or more precisely from hole detection mechanisms that report holes where there are none.

So as long as you simply store data in ZFS, you are safe. It's when you start reading/modifying/copying that the problem arises. Which is how the reproducer scripts work: by copying the same file over and over and over.
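In spirit, the reproducer scripts boil down to something like the loop below. This is a crude sketch, not the actual reproducer.sh; the file name is a placeholder, and the real scripts run many copies in parallel to actually hit the race:

```bash
#!/bin/bash
# Crude sketch of the reproduce-by-copying idea: copy a file repeatedly
# and flag any copy whose checksum differs from the source.
src=/tank/testfile
ref=$(sha256sum "$src" | cut -d' ' -f1)

for i in $(seq 1 10000); do
    cp "$src" "$src.copy.$i"
    sum=$(sha256sum "$src.copy.$i" | cut -d' ' -f1)
    if [ "$sum" != "$ref" ]; then
        echo "mismatch on copy $i"
    fi
    rm -f "$src.copy.$i"
done
```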


You do realise that these two statements, when juxtaposed together like this, cannot exist simultaneously, right?

(i.e. if it is reading zeros where there shouldn't be zeros, that's data corruption, because it would read the data as zeros and then write it back out as zeros; cf. some copied files are corrupted (chunks replaced by zeros) · Issue #15526 · openzfs/zfs · GitHub)

That REALLY depends, given that ZFS works on the copy-on-write principle for editing/modifying files (i.e. the copy operations are inherent to the write operations). Thus, if it uses the same basic method to copy as was found in cp from coreutils 9.x, then this problem MAY be inherent to this copy-on-write mechanism for editing files.

(Both reproducer scripts just accelerate this condition in order to make the probability of the occurrence higher within a significantly shorter timeframe.)

Again, I refer to the original github issue/bug report, where the OP was trying to build Go, and it was during compilation that they noticed this issue.


Pretty sure they meant the data isn't wrong, it's missing.

Perhaps? The ricebrain writeup of it was the most helpful-looking one (last link in the OP's post).
