If you read the OpenZFS documentation for ZIL, it also states:
“The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous transactions.”
Therefore, if you put the first statement from the OpenZFS documentation together with its statement about the ZIL – you put two and two together – and that is what you get.
The Oracle ZFS documentation actually combines the two in a more succinct manner:
" |sync|String|standard|Determines the synchronous behavior of a file system’s transactions. Possible values are:
- standard, the default value, which means synchronous file system transactions, such as fsync, O_DSYNC, O_SYNC, and so on, are written to the intent log.
- always, ensures that every file system transaction is written and flushed to stable storage by a returning system call. This value has a significant performance penalty.
- disabled, means that synchronous requests are disabled. File system transactions are only committed to stable storage on the next transaction group commit, which might be after many seconds. This value gives the best performance, with no risk of corrupting the pool.
Caution - This disabled value is very dangerous because ZFS is ignoring the synchronous transaction demands of applications, such as databases or NFS operations. Setting this value on the currently active root or /var file system might result in unexpected behavior, application data loss, or increased vulnerability to replay attacks. You should only use this value if you fully understand all the associated risks.|
| — | — | — |"
(Source: Introducing ZFS Properties - Managing ZFS File Systems in Oracle® Solaris 11.3)
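For reference, these modes can be inspected and switched per dataset with the `zfs` command. A minimal sketch, assuming a hypothetical pool/dataset named `tank/data`:

```shell
# Show the current sync setting for a dataset (dataset name is an example)
zfs get sync tank/data

# Force every write to be committed synchronously (safest, but slow)
zfs set sync=always tank/data

# Return to the POSIX-default behavior described above
zfs set sync=standard tank/data
```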
Well…it’s only HALF “lying” about it.
The ZIL satisfies POSIX requirements for synchronous transactions.
Therefore, if sync=standard writes the data out to the ZIL – which is how OpenZFS documents this mode of operation actually working – then OpenZFS’s first statement, that sync=standard satisfies the POSIX requirements for synchronous transactions, is true.
Except for the fact that ZIL isn’t the actual, underlying storage device.
So, by introducing the ZIL, the “standard” “sync” write in ZFS is actually a hybrid: to the device, it’s an asynchronous write, but to the ZIL, it’s synchronous (and that satisfies the POSIX requirement for said synchronous transaction).
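To make the POSIX side of this concrete, here is a minimal Python sketch of what an application’s “synchronous transaction” looks like (the file name is arbitrary). On ZFS with sync=standard, the `fsync()` call can return once the record is safely on the ZIL, even though the write to the main pool devices happens later, asynchronously:

```python
import os

# A POSIX-level synchronous write: the application considers the data
# durable the moment fsync() returns. This is the guarantee the ZIL
# is allowed to satisfy on behalf of the main pool devices.
path = "demo.txt"  # example file name
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    os.write(fd, b"important record\n")
    os.fsync(fd)  # must not return until the data is on stable storage
finally:
    os.close(fd)
```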
The OpenZFS documentation doesn’t connect the two dots for readers automatically, but the Oracle ZFS documentation does.
(And I would imagine that it’s probably been like that ever since ZFS was invented at Sun Microsystems.)
(And also, IIRC, this was introduced to help speed up writes without them being “fully” asynchronous, because back then, UltraWide SCSI disks still sucked, especially for random writes. So, if you can collect a bunch of writes in the ZIL, and then flush the ZIL to disk, you can make that process a bit faster.)
I think what’s confusing to people (and even sometimes for myself) is that the separate LOG device is the SECOND (or dedicated) ZFS intent log device. (call it “ZIL2” if you want.) Because that’s what it really, actually is.
“By default, the ZIL is allocated from blocks within the main pool. However, you can obtain better performance by using separate intent log devices such as NVRAM or a dedicated disk.”
(Source: Creating a ZFS Storage Pool With Log Devices - Managing ZFS File Systems in Oracle® Solaris 11.3)
Both the Oracle ZFS documentation and the OpenZFS documentation say this:
“By default, the ZIL is allocated from blocks within the main pool”
(Sources: Creating a ZFS Storage Pool With Log Devices - Managing ZFS File Systems in Oracle® Solaris 11.3, zpoolconcepts.7 — OpenZFS documentation)
So, based on that, the ZFS intent log, by default, takes up space on the devices/virtual devices that make up the pool.
Therefore, the only thing that a separate (ZFS intent) log device does is “suck up all of the ZFS intent log from the devices in the pool” and move it onto a separate (ZFS intent) log device. That’s all it does.
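Moving the ZIL off the main vdevs is a one-line `zpool` operation. A hedged sketch, assuming a hypothetical pool named `tank` and example device names:

```shell
# Move the ZIL off the pool's main vdevs onto a dedicated log device
# (device paths below are examples only)
zpool add tank log /dev/nvme0n1

# Or use a mirrored log, so losing one device doesn't lose in-flight syncs
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# Confirm where the log now lives
zpool status tank
```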
As far as POSIX compliance for synchronous transactions goes – given that the ZIL satisfies said POSIX requirement – the same documentation applies:

> **sync** (String, default: `standard`) – Determines the synchronous behavior of a file system's transactions. Possible values are:
>
> - standard, the default value, which means synchronous file system transactions, such as fsync, O_DSYNC, O_SYNC, and so on, are written to the intent log.
To me, at least, it looks like the OpenZFS docs “partially” mirrored (copied?) the Sun ZFS docs, and then revamped them a little bit to make them “their own”, but in doing so, they might have edited out some rather important details about how it works.
This is why I will often refer to the Sun/Solaris ZFS Administration Guides for documentation (given that ZFS started off as an OpenSolaris project, and then got rolled into “mainline” Solaris a little bit later). (I remember looking at the presentations from Jeff Bonwick about ZFS when it was introduced into “mainline” Solaris 10.)
And I also vaguely remember him and two other guys talking about why they invented the ZFS intent log (the UltraWide and/or Ultra160 SCSI drives in Sun Ultra 20 and Sun Ultra 45 workstations back then were still very slow). The idea was that rather than writing to the drive frequently, especially for random writes, you “collect” the data in the ZFS intent log, and then flush the ZIL to disk periodically, which reduces the disk thrashing that comes from a lot of tiny writes.
Therefore, if a write goes to the ZFS intent log, and that is deemed to meet the POSIX requirements for synchronous transactions, then sync=standard would be synchronous to the ZFS intent log, but asynchronous to the actual (storage) device.
(Again, I refer to the original Solaris ZFS Administration Guide for details since this was how the inventors of ZFS made it.)
Are you using HDDs or SSDs on your clients?
I would think that if you’re using SSDs – it probably isn’t making a bunch of noise (maybe other than some SSDs having above average coil whine) – but if you’re using HDDs – I would think that modern HDDs are built very differently than the UltraWide SCSI drives back when ZFS was invented.
And systems now also typically have more RAM than they did back then, which means that, between that and how ZFS uses RAM for caching, the TXG is able to collect more data in the buffer before flushing to disk.
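On Linux OpenZFS, how long a TXG is allowed to accumulate writes before being forced out is exposed as a module tunable. A sketch, assuming a Linux system with the `zfs` kernel module loaded (paths and defaults are from the OpenZFS module parameters; your platform may differ):

```shell
# Maximum seconds a transaction group may accumulate before commit
# (the OpenZFS default is 5 seconds)
cat /sys/module/zfs/parameters/zfs_txg_timeout

# Can be changed at runtime, e.g. to 10 seconds (requires root)
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout
```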
That was certainly NOT the case on Sun Ultra 45 workstations back then.
I agree.
However, again, from reading the comments on the GitHub issue that was raised – the high-level synopsis I’ve been able to infer from skimming the comments is that they’re not really 100% sure how it happened and/or how long this issue has existed. The way I think about it: there is a possibility that this was present even in the original Sun ZFS code, but with a probability of something like 1 in 1e12, and the block cloning feature made the issue vastly more apparent. As has also been noted, this isn’t strictly a block cloning issue – the block cloning feature just brought the bug to the forefront of the consciousness.
From what I’ve read so far on github, etc. – the “fix” is that they’re going to disable this feature until the devs understand more about it.
I think that there’s a LOT of speculation about the root cause of the issue, but without the data and the evidence to be able to confirm or deny those theories – they’re theories at this point. (like brainstorming, and then you test each hypothesis individually to see which hypothesis is support by data vs those that aren’t).
But like I said – this is why I am interested in testing this with Sun/Solaris ZFS to see whether it was prevalent in the “original” ZFS code vs. whether this was brought about (or exacerbated) by changes that were made to the OpenZFS code/project. (cf. terse 15526.md · GitHub)
At this point, I don’t think that anybody knows for sure yet.
I’m trying to test it on Solaris. Their test script needs quite a lot of modifications made to it because some of the command calls aren’t available in Solaris, and with me NOT being a programmer/developer – I REALLY don’t know what I am doing when I am trying to change the script to be able to get it to work in Solaris.
I think that IF this issue ONLY manifests with sync=standard or sync=disabled, then I would surmise that sync=always CAN potentially be a solution, IF async writes (i.e., standard or disabled) are a condition/requirement for this silent data corruption to occur.
But I agree - it’s not a great solution. But until the devs can deep dive this further, and figure out what is the root cause, and then implement a permanent corrective action for it, then there may be an argument that can be made that slow data is better than corrupted data.
So I don’t know.
Right now, I’m just focused on Solaris testing.
Thanks.
(Sorry it took so long for me to reply – I started writing my response early this morning, but then work, then life, got in the way, so I’m finally JUST getting around to finally sending my reply. Thanks.)