SSD caching for btrfs raid?

I have a NAS with 6 HDDs in btrfs RAID1c3 running on an RPi5, plus a 128GB SSD attached to it. I’m wondering if there is a good way to use the SSD as a cache for the HDD raid. I know the “go to” solutions are bcache and lvmcache, but from what I’ve read, both of them require bcache/LVM to manage the physical disks at the low level. That contradicts the recommended way of running a btrfs raid, which is on the raw disks rather than abstracted volumes. I also want the least possible complexity on the devices actually holding the data, in case I ever have to deal with disk failures at some point. So I would like a caching mechanism that lets me access my data both with and without the caching device, without requiring anything from me beyond accessing a different mount point.

The option that seems to fit the bill is dm-cache. But I’m having a problem with it: it wants to be pointed at a “block device” to enable caching. A btrfs raid doesn’t have a single block device associated with it, so I would have to create a cache for each individual disk, which doesn’t seem to make much sense. Am I missing something, or does this mean I cannot use dm-cache for my current application? And if I can, how would I go about it?
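For context, here is roughly what the per-disk approach would look like via lvmcache (which drives dm-cache underneath). This is only an illustrative sketch with placeholder device names, and it runs into exactly the problem described above: every HDD would have to become an LVM volume first.

```shell
# Hypothetical sketch: caching ONE btrfs member disk with lvmcache.
# /dev/sda = one HDD, /dev/sdg = the SSD (placeholder names).
# This would have to be repeated for every HDD in the raid.
pvcreate /dev/sda /dev/sdg
vgcreate nas /dev/sda /dev/sdg
lvcreate -n data  -l 100%PVS nas /dev/sda   # LV on the HDD
lvcreate -n cache -l 100%PVS nas /dev/sdg   # LV on the SSD
# Attach the SSD LV as a dm-cache cache volume for the HDD LV:
lvconvert --type cache --cachevol nas/cache nas/data
# btrfs would then use /dev/nas/data instead of the raw disk,
# which is precisely the low-level abstraction being avoided here.
```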

The other options that I’ve seen are flashcache and EnhanceIO, but both of them seem to have been unmaintained for about a decade, so I haven’t looked much further into them.

So I’m wondering if and how I could use dm-cache if possible, and if not, what other avenues do you think I could explore?

Caches always mean more complexity, and hacking a filesystem into doing things it wasn’t designed for always adds more friction, more complexity, and way more sources of errors.

The sane and proper way to do this is choosing storage that supports this by design, which is why people use things like ZFS, bcachefs, Ceph, etc.

The main question is…why do you want a cache and what is it supposed to do? Easiest way to make a quick read cache is to make a ginormous swap space and put swap on the SSD in question. Or add RAM. OS comes with a read cache by default after all.

If it’s intended for tiering…a HDD tier and an SSD tier…well, it’s complicated and rarely efficient or performant and still ends in HDDs being the bottleneck at some point.

BTRFS is just a filesystem. It doesn’t do block storage so you can’t present your storage as a block device to do all kinds of fancy stuff with it.

The best and most performant options are how ZFS and Ceph handle this: a (redundant) write(-ahead) log and a dedicated metadata cache in RAM for reads, plus caching hot data in RAM too.

If I check the throughput of my 4 HDDs and imagine a single 128GB SSD having to accelerate them… nope. I read at 1GB/s and write at 600MB/s with my setup. That’s what ZFS or Ceph gets you.

3-way mirror is pretty fast by HDD standards…so what stuff is slow? random IO? That 128GB bcache won’t produce 100% cache hits, I can tell you that much. Nor does it aggregate random IO into streams of sequential IO.

And a single SSD as cache doesn’t have the same redundancy as your 3-way mirror. So that’s also bad practice when talking redundancy and data integrity. And 128GB drives are usually old gear too and not too trustworthy as a funnel for every bit of data read and/or written imo.

ZFS doesn’t work for me because my HDDs vary wildly in size (1TB, 2TB, 2x3TB, 4TB, 5TB) and a ZFS raid is inflexible: I couldn’t just remove or add disks at will. I’m not sure the configuration with 6 HDDs is permanent, nor am I sure I will never downgrade disk size, which is something ZFS can’t deal with.

BcacheFS I admit I don’t know much about. But another reason that I picked btrfs over ZFS is built-in support on Linux, and BcacheFS is being kicked out of the kernel, so in that sense it’s in the same boat as ZFS. BcacheFS is also new and not as stable as btrfs.

As for Ceph, I haven’t heard about it before. I’ll look into it. Thanks.

The honest answer is, because I want to try it out, learn how to implement it and see how it behaves. But there is one use case that would benefit from improved random reads right now: VMs that I offloaded to the NAS to make space on my PC’s drive. None of the VMs are bigger than the 128GB SSD, so if I were to launch a VM off of the NAS, it could entirely be loaded onto the SSD (not that it would need to go there in its entirety), and give me better responsiveness when I use that VM.

If this caching thing works out, but I need more space on the cache, I also have a 512GB SSD that I could replace the 128GB one with. I just wanted to use old hardware that I can’t use for anything else on the NAS. And while the 128GB SSD is old, it wasn’t used much, so it should have plenty of life left in it.

In write-through mode that shouldn’t be an issue. It would still improve the reads without any risk to data integrity.

Just adding RAM is not an option, it’s an RPi after all. But if just adding the SSD as swap works, I’d like to hear about it. How does the OS know that the swap is faster than the file system and hence would improve read performance by expanding the read cache there? And would that interfere with the other OS functions? I would want everything that isn’t read cache to remain in RAM instead of being pushed to the swap. Also, would the read cache persist after a reboot of the NAS if done this way?

Distributed filesystem, cluster… so probably not what you are looking for. I just mentioned it because I use it and it has very good handling of HDDs boosted by DRAM or SSDs.

It doesn’t magically appear on the SSD. Data has to be fetched from the HDD first. So if you start stuff, you read everything from the HDD and then write it to the SSD to then read it from there. Unless you only ever read those 128GB, the HDDs still need to work and everything runs at HDD speed :wink:

yeah, reads are easy. Which is why the OS does read-caching in RAM and swap all the time…it’s just a copy of the data.

If you put swap on an SSD and raise swappiness so Linux aggressively pushes anonymous pages out to swap, more RAM stays free for the page cache, which should be an improvement for reads. But read caches have their limits, especially with rewritten/modified blocks.
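A minimal sketch of that tuning, assuming /dev/sdg1 stands in for the SSD partition (the value 100 is illustrative, not a recommendation):

```shell
# Illustrative sketch: swap on the SSD, reclaim biased toward swapping
# anonymous pages out so more RAM remains for the page cache.
mkswap /dev/sdg1
swapon --priority 100 /dev/sdg1
# Higher swappiness = kernel prefers evicting anon pages over page cache.
sysctl vm.swappiness=100
# Persist the setting across reboots (the page cache itself does NOT
# persist -- it is rebuilt from disk after every boot):
echo 'vm.swappiness=100' > /etc/sysctl.d/90-swap-cache.conf
```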

That’s some proper storage zoo btw. You can do ZFS with a bit of partitioning hackery: make 1TB partitions, one partition on the 1TB drive and 5 partitions on the 5TB drive. Partitions are block devices, and ZFS just needs block devices. You have to be really careful with the failure domain of your redundancy though. It’s not best practice, but it will work.
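A hypothetical sketch of that partitioning hack, with placeholder device names (here /dev/sde stands in for the 5TB drive):

```shell
# Carve the 5TB drive into five 1TB GPT partitions; repeat the idea
# for each drive with one 1TB partition per TB of capacity.
for i in 1 2 3 4 5; do
  sgdisk -n "$i":0:+1T /dev/sde
done
# Then build the pool from partitions instead of whole disks.
# Failure-domain caveat from above: no vdev should contain two
# partitions from the same physical drive, or one dead HDD can
# take out multiple "disks" in that vdev at once.
zpool create tank raidz1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
```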

Best solution would be to get an 8TB SSD, put all HDDs into the dumpster, bypass all the tiering/caching stuff in the process, and gain performance + power efficiency + ease of use too.

(1+2+3+3+4+5)/3 = 6 TB usable storage. This kind of capacity can be had in a single SSD these days, superior in every metric.
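The back-of-the-envelope math, with drive sizes in TB and the 3-copy profile dividing raw capacity by three:

```shell
# Usable capacity of btrfs raid1c3 (3 copies of everything) over
# 1TB + 2TB + 3TB + 3TB + 4TB + 5TB drives: roughly raw total / 3.
raw=$((1 + 2 + 3 + 3 + 4 + 5))
usable=$((raw / 3))
echo "${usable} TB usable"   # prints "6 TB usable"
```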

But if that’s mostly for tinkering, I understand using all the drives and having fun learning with them, especially seeing how this all works with bcache or similar. More power to you then. I used 8x crappy USB flash drives in my ZFS childhood days and it was great fun to work and learn with.

BcacheFS kinda was the only champion making this kind of approach work inside the kernel as a “just works” filesystem. It still is, btw. But I know how running non-kernel (DKMS) modules feels on systems you just want to work. BTRFS (and also Ceph) certainly has the edge there.

But ZFS isn’t in the kernel because of Oracle, and BcacheFS isn’t in the kernel because one or two eccentric men can’t come to terms. Neither is out for maturity or technical reasons.

Just to sum things up:

There is no easy or ready-to-go solution. Everything involves non-kernel hacky stuff, janky workarounds, and/or migrating away from BTRFS. Bcache is what people usually use for their xfs/ext4, from what I’ve heard. BTRFS is great, but it isn’t perfect and block storage isn’t on the feature list. Synology also hacked their BTRFS with nested ext4 shenanigans to make their SHR work well for their demands.

In the end you can abstract storage to a degree, turning single files into a block device and backing it with bcache. But this abstraction isn’t easy to run, adds complexity, and sacrifices performance.
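One way to get such a file-backed block device is a loop device over a file on the btrfs raid; a rough sketch with placeholder paths (whether the stacked overhead is worth it is exactly the concern above):

```shell
# Rough sketch: expose a file on the btrfs raid as a block device,
# which bcache (or anything expecting block storage) can then use.
truncate -s 100G /mnt/nas/blockfile.img
losetup --find --show /mnt/nas/blockfile.img   # prints e.g. /dev/loop0
# The resulting /dev/loopN could then become a bcache backing device:
# make-bcache -B /dev/loop0
```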

I’m pretty sure the guys here in L1T using bcacheFS, bcache and the likes have other opinions and suggestions. I’m just a single ZFS+Ceph guy at the end of the day with a special opinion :wink:

P.S.: Love seeing HDDs and stuff running on SBCs. Making the most out of that 5W CPU and low-power footprint :wink:

P.P.S.: Tiering is a trap

I would go with ZFS.
You could use LVM to stitch your hard drives together, like 1+5, 2+4, 3+3, and run raidz1 on top of it.
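A hypothetical sketch of that layout (device letters are placeholders): pair the drives into three ~6TB linear LVs, then build raidz1 from those. Losing one HDD kills its whole LV, and raidz1 tolerates the loss of one of the three.

```shell
# Placeholders: sda=1TB sdb=2TB sdc=3TB sdd=3TB sde=4TB sdf=5TB
pvcreate /dev/sd[a-f]
vgcreate pair1 /dev/sda /dev/sdf   # 1TB + 5TB
vgcreate pair2 /dev/sdb /dev/sde   # 2TB + 4TB
vgcreate pair3 /dev/sdc /dev/sdd   # 3TB + 3TB
for vg in pair1 pair2 pair3; do
  lvcreate -n concat -l 100%FREE "$vg"   # linear concatenation
done
# raidz1 over the three ~6TB logical volumes:
zpool create tank raidz1 /dev/pair1/concat /dev/pair2/concat /dev/pair3/concat
```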

bcachefs could be everything one could want in a file system, but it’s not now.

NVRAM journaling support is on the roadmap, which I believe should be another nail in the coffin for the other file systems… if Kent could find it in himself to be more politically savvy when working with the larger Linux ecosystem.

Except redundancy. I’d need at least 3 such SSDs, and even just 1 of them costs more than my entire NAS, especially considering that for this project I didn’t buy any of the HDDs or the RPi (the 1 and 2TB drives were in my old PCs before I upgraded to full SSD storage in my current machine, the 5TB one was an external drive I used for backups before I replaced it with one of those 8TB SSDs, and the 3 and 4TB drives are basically e-waste that I “rescued” from work).

Too much complexity, too little redundancy and no flexibility for future upgrades. If I were to go with ZFS, I’d want RAID-Z2, on bare metal with all 6 drives. Because that would mean I would at least be able to upgrade the drives eventually without reformatting the filesystem and without risking the data getting into an unrecoverable state if something goes wrong while the array is being rebuilt after a disk swap.

Yeah, it’s sad really. I do hope that in a year or two, once BcacheFS becomes stable enough not to require major code changes ASAP to address bugs lest people lose data, it will return to the kernel.

Maybe ask on the Ubiquiti forums or the Ubiquiti Reddit if anyone knows the method they use on the UNAS for SSD caching of the NAS storage in those products. They use BTRFS under the hood, and the caching is just “press a button to select a mode” easy, so it has to be all packaged up and rather simple, which means it should definitely be possible to do what you want.

is it their own thing tho?