Online I read various forum posts saying that you shouldn’t use block-layer write caching with a BTRFS partition.
But last week I asked a bcache developer about it and he said it was a non-issue: for checkpoints BTRFS issues FLUSH commands, and these are honoured by bcache.
This is necessary for BTRFS to function properly: Hardware considerations — BTRFS documentation
dm-cache, the kernel module used by lvmcache, also states in its docs that FLUSHes are honoured.
And I guess it’s an “obvious” thing to honour when you implement a caching solution (oh, Hector can tell stories about Apple’s “normal” regarding SSD flushing).
I ran an lvchange command to switch my cache mode from writethrough to writeback on my NAS a few days ago. So far there have been no BTRFS errors in the logs, nor did a scrub reveal any corruption. Even after a few reboots and days of uptime I have yet to encounter issues.
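For reference, this is roughly what I did (a minimal sketch; vg0/data and /mnt/data are made-up names, adjust to your setup):

```sh
# Switch the cached LV from writethrough to writeback:
lvchange --cachemode writeback vg0/data

# Confirm the active cache mode:
lvs -o lv_name,cache_mode vg0/data

# Run a foreground scrub to verify all data and metadata checksums:
btrfs scrub start -B /mnt/data
```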
Honestly these online comments seem to just be a misunderstanding. You get weird looks when you ask such a newbie question in a room full of kernel developers… Not nice.
What will you all think when my NAS is a single 18TB HDD with bi-annual backups of the less important data?
But hey, since I’m using an LV it shouldn’t be too hard to turn this into a raid1 at some point, hehe.
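Something along these lines should work when the time comes (untested sketch with made-up device names; a cached LV may first need an lvconvert --uncache before the conversion):

```sh
# Add a second disk to the volume group, then mirror the LV:
vgextend vg0 /dev/sdb1
lvconvert --type raid1 --mirrors 1 vg0/data
```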
BTRFS only knows about the data you gave it. If the cache sends corrupt stuff, the corrupt data gets checksummed and you now have verified corruption, impossible to fix or detect.
Bi-annual? How about every 5 minutes with a retention of 7 days? crontab and snapshots make the world go round.
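Something like this is all it takes (a rough sketch, hypothetical paths; % is special in crontab, hence the wrapper script; tools like snapper or btrbk do this more robustly):

```sh
#!/bin/sh
# Hypothetical script, e.g. /usr/local/sbin/btrfs-snap.sh, run from root's crontab:
#   */5 * * * * /usr/local/sbin/btrfs-snap.sh
set -eu
snapdir=/data/.snapshots

# Read-only snapshot; -r matters if you ever want to `btrfs send` it later.
btrfs subvolume snapshot -r /data "$snapdir/$(date +%Y-%m-%d_%H%M)"

# Naive retention: keep the newest 2016 snapshots (7 days x 288 per day).
# Names sort chronologically, so everything before the last 2016 is old.
ls -1d "$snapdir"/* | head -n -2016 | while read -r old; do
    btrfs subvolume delete "$old"
done
```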
Yeah well, I got nowhere else to put my terabytes of ISOs. Maybe I should at least create a file index that I can include in my regular backups. It’s not like I don’t do automated backups, but those cover only gigabytes of important data.
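If I go the index route, even something as simple as this would preserve at least the file list (hypothetical paths, GNU find):

```sh
# Size, mtime, and path of every file in the ISO store,
# written to a small text file that travels with the real backups:
find /data/isos -type f -printf '%s\t%TY-%Tm-%Td\t%p\n' > /backup/iso-index.txt
```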
No, this is not how things work. Checksums and redundancy can only detect changes that happen after the data was written. If you write bad data in the first place, no mechanism can detect or fix that, because that’s what you gave it.
This is why we recommend ECC memory and proper cabling and controllers, and why we are skeptical about external disk caches: to get the data from the application to the disk in a risk-minimized way. Because once it’s there, the integrity features will protect it until we read it again.
If you send the snapshot to a backup target, then the snapshot is a backup. It’s a replicated copy. No one is forcing you to store the snapshot only on the same subvolume (which obviously isn’t a backup, because it’s the same device).
Replicated storage on another device/machine, sent as a sequential stream, is really the best you can get. Which is why I love BTRFS and ZFS so much: no need for backup tools or manually copying stuff ever again.
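For the curious, a minimal sketch of what that replication looks like (hypothetical snapshot names and host; -p makes the send incremental against a previous read-only snapshot):

```sh
# Initial full transfer of a read-only snapshot:
btrfs send /data/.snapshots/2024-06-01 | ssh backup-host btrfs receive /backup/data

# Later transfers only send the delta against the previous snapshot:
btrfs send -p /data/.snapshots/2024-06-01 /data/.snapshots/2024-06-02 \
  | ssh backup-host btrfs receive /backup/data
```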
Oh, the BTRFS send snapshot command. I see I misunderstood, sorry. Yeah, that’s a good one.
Are we talking about the same thing?
If a write to the cache goes bad undetected and is then written as such to the backing storage, either the data checksum or the metadata checksum will indicate that something went wrong. A read will catch that immediately.
If a write goes bad on the way to the backing storage (rather than to the cache) undetected, it’ll take longer to detect: only once the block is evicted from the cache, or the cache is disabled or bypassed (which BTRFS doesn’t do), will the read come from the backing storage, and then we will detect the error.
Cache as in dm-cache cache.
Btw, BTRFS tries to verify in-memory metadata blocks before they are written, as a defence against bugs or system memory corruption. This is not equivalent to ECC memory, but it’s at least a nice preventive mechanism. https://btrfs.readthedocs.io/en/latest/Tree-checker.html
If you have a parking lot that guarantees you get your car back in the same state you left it, that doesn’t mean you can drive in a car wreck and expect a shiny new car later.
There is no “I’m wrong” bit in computing. If the application sends a 1010 and your memory, cache, cable, or controller writes a 0010 (for some reason) and passes this on to the filesystem, it will stay a 0010, and BTRFS will make sure that you always get a 0010 on a read and nothing else. BTRFS never knew about the 10, it was always a 2. And the 2 is stored and checksummed, so if some later error turns the block into a 12, BTRFS uses the checksum and a replica to restore the 2 (because you gave it a 2, not a 10 or a 12).
There is no data or information available to BTRFS to do anything other than that.
If you give BTRFS corrupt data, you will only ever get authenticated and verified corruption and no errors.
I can’t explain it any better than that. I hope that helps with your misconception on how this stuff works.
In system memory: application → dcache and inode cache → filesystem, e.g. BTRFS (updates metadata structs and calculates checksums for metadata and data) → struct bio → dm-cache
Write to cache SSD: dm-cache → bio → SSD
Writeback to backing HDD: SSD → dm-cache → bio → HDD
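You can inspect that stacking on a live system, e.g. (assuming an LV vg0/data, which device-mapper exposes as vg0-data):

```sh
lsblk -s /dev/mapper/vg0-data   # walks the stack down to the SSD and the HDD
dmsetup table vg0-data          # shows the dm-cache target line and its parameters
```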
Why do you assume that BTRFS would calculate a checksum from corrupt data? Sure, RAM bit flips and bugs exist. But apart from that, if you consider how VFS and block-layer caching work…
If I interpret this right, your conception is that the data is first written to a medium and then read back to calculate its checksum? Well, it is not: the checksum is calculated in memory before the write is ever submitted. ZFS doesn’t do that either.