A bcached ZFS pool weekend project - What, how and benches

Update: Check the post below for further testing with an L2ARC instead. Spoilers: L2ARC is far better.

Introduction

While reading around the VFIO reddit scene I came across the following thread from last year: https://www.reddit.com/r/VFIO/comments/eryb71/best_storage_for_vfiokvmqemu_experiment_results/

Basically the TL;DR is that this guy "etherael" (henceforth nicknamed /r/vfio guy) ran a striped ZFS setup on top of bcache, using a 512GB NVMe as the cache instead of an L2ARC/ZIL/SLOG and two regular 2TB HDDs as backing devices, and got some blazing fast speeds and high cache hit ratios. So I decided to copy his setup on my workstation as a weekend project to learn a little more about bcache and ZFS in general.

The goal here was a ZFS RAIDZ pool on top of bcache in writeback mode, with the following setup:

  • 48GB of RAM

  • 3x 4TB HDDs (Two 5400rpm Toshiba P300s and a 7200rpm WD Black 4TB)

  • 1x 1TB WD Black SN750 NVMe

  • Arch with ZoL

Unlike the /r/vfio guy I decided to opt out of TrueCrypt/VeraCrypt and go with ZFS' native encryption instead, plus RAIDZ1 instead of RAID1, which is a large departure in terms of performance.

Disclaimer:

Before anyone goes screaming about RAIDZ1 being crap, that it's a waste of space and resources, that you might as well have no redundancy at all, that it's dumb, it's bad, that I should have bought an extra HDD and gone with RAIDZ2, that it's not supported by DELL anymore…

I know.

This is not a NAS. Important data is not staying on these disks, otherwise I'd mirror it. So I know exactly the risks and downsides I'm taking by doing this. If one disk dies and I need to resilver, the chance of the resilver annihilating another drive, especially at this size, is immense, and that is fine.

With that out of the way

Steps:

This isn't exactly meant as a complete guide; I'm skipping the general ZFS stuff after creating a dataset. Most of this is just from reading instructions on the Arch wiki and bcache's manpages.

I started off by wiping the disks in GParted so they only contained unallocated space.

wipefs was then needed to clear any leftover superblocks from previous filesystems, otherwise bcache will complain.

# Wipe all the drives completely clean.
wipefs -a /dev/sdb
wipefs -a /dev/sdf
wipefs -a /dev/sdg
wipefs -a /dev/nvme0n1

Next up was creating the underlying bcache layer.

make-bcache -B /dev/sdb     # "Wyld"   - HDD Backing device
make-bcache -B /dev/sdf     # "Weaver" - HDD Backing device
make-bcache -B /dev/sdg     # "Wyrm"   - HDD Backing device
make-bcache -C /dev/nvme0n1 # "Umbra"  - NVMe Cache

Afterwards I needed the cset.uuid of the caching device to attach it to the backing devices:

bcache-super-show /dev/nvme0n1 | grep cset
# a6641b9e-2238-3119-c779-32aab53621b8

Now with the uuid in hand I attached it to each backing device.

echo a6641b9e-2238-3119-c779-32aab53621b8 > /sys/block/bcache0/bcache/attach
echo a6641b9e-2238-3119-c779-32aab53621b8 > /sys/block/bcache1/bcache/attach
echo a6641b9e-2238-3119-c779-32aab53621b8 > /sys/block/bcache2/bcache/attach

Then I enabled writeback mode (read and write cache) for these devices.

I don’t really care if I lose dirty data if the NVMe dies, as I do have a backup setup of important data and have mostly non-critical data on these drives anyway. So I’m sticking with a single drive instead of the recommended “one cache per backing device”.

echo writeback > /sys/block/bcache0/bcache/cache_mode
echo writeback > /sys/block/bcache1/bcache/cache_mode
echo writeback > /sys/block/bcache2/bcache/cache_mode

At this point we're ready for ZFS. I created a RAIDZ zpool on the /dev/bcacheX devices:

zpool create -f -o ashift=12 -O mountpoint=none Triat raidz /dev/bcache0 /dev/bcache1 /dev/bcache2

The rest is just usual setup of datasets on the ZFS drives.

As an example, I did exactly like the /r/vfio guy and created a dataset to hold my VM stuff this way, adding encryption plus zstd compression (instead of lz4):

zfs create -o sync=disabled -o xattr=sa -o atime=off -o relatime=off \
    -o recordsize=64k -o encryption=on -o keyformat=passphrase \
    -o compression=zstd -o mountpoint=/run/media/mechanical/Dreaming \
    Triat/Dreaming

Works like a charm! Next up was setting up the ZFS import cache and the rest of the boot plumbing, which I won't go into detail on here; read the ZFS manpages or wikis instead.
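
For completeness, on Arch with ZoL this boils down to roughly the following (these are the standard OpenZFS systemd units; treat it as a sketch, not a full guide):

# Point the pool at the standard cachefile and enable the import/mount units.
zpool set cachefile=/etc/zfs/zpool.cache Triat
systemctl enable zfs-import-cache.service zfs-mount.service zfs-import.target zfs.target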

However I noticed an issue. After rebooting the bcache drives were not detected anymore and the pool could not be imported. How come?

The long-standing ZFS uberblock bug:

After a lot of research into why the bcache udev rule couldn't find the bcache drives anymore, I found a persistent bug in ZFS on Linux.

It has been reported multiple times before, but is yet to be fixed.

From what I could gather, ZFS writes some uberblocks at the end of the disk/partition, which makes blkid think the whole disk is of type zfs_member. This means the usual TYPE="bcache" that blkid would report is gone on everything except the cache device, which doesn't have ZFS on top of it. That confuses the bcache udev rules, since they cannot find any backing drives with the bcache type.

The bcache superblock itself is not removed, mind you. It's still there, so the device can be registered manually just fine. However, with the ZFS blocks at the end of the disk screwing with the signatures, blkid will not directly report that the disks are in fact bcache devices.

[Image: zfsblocks]
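
For reference, registering a backing device by hand is just a matter of echoing it into bcache's sysfs register node (device names from my setup):

# Manually register a backing device; /dev/bcacheX reappears once the superblock is read.
echo /dev/sdb > /sys/fs/bcache/register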

I tried one of the suggested fixes: creating the zpool on a partition instead of the whole disk, shrinking that partition by 10MB in fdisk, adding a small extra partition at the end, and wiping the extra partition to clear the uberblocks so ZFS would not put new blocks at the very end of the disk. Unfortunately that didn't work and ZFS recreated the end blocks after a reboot.

So what I decided to do was change how the udev rule identifies the bcache devices, since registering the bcache devices works fine.

Here's a replacement 69-bcache.rules (put it in /etc/udev/rules.d/):

# register bcache devices as they come up
# man 7 udev for syntax
SUBSYSTEM!="block", GOTO="bcache_end"
ACTION=="remove", GOTO="bcache_end"
ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}=="1", GOTO="bcache_end"
KERNEL=="fd*|sr*", GOTO="bcache_end"

# blkid was run by the standard udev rules
# It recognised bcache (util-linux 2.24+)
ENV{ID_FS_AMBIVALENT}=="other:bcache*", GOTO="bcache_backing_found"
# It recognised something else; bail
ENV{ID_FS_TYPE}=="?*", GOTO="bcache_backing_end"

# Backing devices: scan, symlink, register
IMPORT{program}="probe-bcache -o udev $tempnode"
ENV{ID_FS_AMBIVALENT}=="other:bcache*", GOTO="bcache_backing_end"
ENV{ID_FS_UUID_ENC}=="?*", SYMLINK+="disk/by-uuid/$env{ID_FS_UUID_ENC}"

LABEL="bcache_backing_found"
RUN{builtin}+="kmod load bcache"
RUN+="/usr/lib/udev/bcache-register $tempnode"
LABEL="bcache_backing_end"

# Cached devices: symlink
DRIVER=="bcache", ENV{CACHED_UUID}=="?*", \
SYMLINK+="bcache/by-uuid/$env{CACHED_UUID}"
DRIVER=="bcache", ENV{CACHED_LABEL}=="?*", \
SYMLINK+="bcache/by-label/$env{CACHED_LABEL}"
LABEL="bcache_end"

The big difference here is this part:

ENV{ID_FS_AMBIVALENT}=="other:bcache*"

blkid does report the ID_FS_AMBIVALENT property, which says the disk looks like both bcache and a zfs_member. With that match the rule can still tell it is a bcache device and register it as usual.
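
You can see this yourself with a low-level blkid probe. The output below is abbreviated and only illustrative of what the ambivalent signature looks like:

blkid -p -o udev /dev/sdb
# ...
# ID_FS_AMBIVALENT=other:bcache filesystem:zfs_member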

Now how the hell the /r/vfio guy got around this, I have no idea. Maybe he used an older version of ZoL without this bug? ZoL didn't have persistent L2ARC back when he posted, so I bet he was on an older release of ZFS on Linux.

Benches:

You’re probably interested in the benches, right? How is this thing actually performing? I’m no Wendell so I don’t exactly have much benchmarking experience, but I did the same benchmarks the /r/vfio guy did. I can probably run some more if needed.

I don't have any comparisons with an L2ARC setup, so I can't vouch that this performs better than L2ARC or anything like that. The reason the /r/vfio guy used bcache was L2ARC's low hit rate and its lack of persistence, which it has since gained. So using bcache might have been dumb for all I know.

The I/O scheduler used was mq-deadline.

Bare metal disk performance:

These are done straight on the filesystem, no VM.

sysbench fileio --file-total-size=10G prepare

[Image: sysbench-prepare]

sysbench fileio --file-test-mode=rndrd --file-total-size=10G --file-block-size=4K --threads=4 --time=60 run

dd tests:

[Image: dd-results]

Windows 10 VM’s CrystalMark performance:

This is my gaming VM, with a 180GB raw image that Windows lives on and a 1TB raw data image in a dataset with the options mentioned at the end of the Steps section.

I'm using VirtIO SCSI for this one, with io=threads and cache=none.
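
For reference, the relevant bits of the libvirt XML look roughly like this (the image path and target name here are just placeholders, not my exact config):

<controller type='scsi' model='virtio-scsi'/>
<disk type='file' device='disk'>
  <!-- raw image on the ZFS dataset, with cache=none and io=threads as mentioned above -->
  <driver name='qemu' type='raw' cache='none' io='threads'/>
  <source file='/run/media/mechanical/Dreaming/win10.img'/>
  <target dev='sda' bus='scsi'/>
</disk>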

MB/s:
[Image: Screenshot 2021-06-07 235910]

IOPS:
[Image: Screenshot 2021-06-07 235944]

Cache stats after a relatively heavy day

Most of this was already cached: a few hours of WoW TBC Classic, a few hours of Fallout 3 and some Adobe Premiere editing.

[Image: cache-results]

Normally the hit ratio is around 90% unless something new is being installed.
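
These stats live in bcache's sysfs, so keeping an eye on them is just a matter of reading a couple of files (bcache0 shown as an example):

# Day and lifetime hit ratios for one backing device.
cat /sys/block/bcache0/bcache/stats_day/cache_hit_ratio
cat /sys/block/bcache0/bcache/stats_total/cache_hit_ratio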

Conclusion

Very fun project to do and I’ve learned a lot. Plus the results are actually quite promising.

Almost everything I use regularly is cached on the NVMe, which means a ridiculously high 90-ish percent hit ratio on the cache.

My Windows VM and the stuff I commonly use in the separate 1TB data image are almost always cached. bcache does not cache the entire image, just the data it needs. The write speeds are also excellent, since writes hit the NVMe first and are flushed to the HDDs afterwards. Loading a game that is not in the cache is of course like running it from an HDD until things get cached, but the next time it's launched it's fast as all hell.

I regret not testing L2ARC to get some data to compare against and check whether bcache is a good alternative or whether the "new" persistent L2ARC is the better option. The /r/vfio guy was not impressed by the L2ARC, and you can read his reasoning in the reddit post. Might need to do another weekend project at some point and give L2ARC a spin.

Other than that this is more or less exactly what I wanted from the setup.

Would I recommend this for any critical data? Probably not. If I did, I'd go with RAIDZ2 or mirrors and an NVMe/SSD cache per backing device, and even then this is kinda experimental.

Thanks a lot to the /r/vfio guy etherael for even coming up with and testing this mad scientist project in the first place. I just stole what I saw.

Feel free to speak up if this is incredibly stupid, dumb, useless and wasteful. Hopefully this helped or was informative to some people anyway.


Hello again, sorry for resurrecting this, but I wanted to set some things straight!

I did a few extra checks now by cleanly detaching the cache NVMe, wiping it completely and creating a mirrored SLOG + mirrored special vdev + L2ARC on it (the second half of each mirror sits on a secondary consumer SSD), to compare the performance of pure persistent L2ARC against bcache.
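
Roughly, the new vdevs were added along these lines (pool name taken from the first post; the partition layout here is just a placeholder):

# SLOG, mirrored between the NVMe and the SATA SSD.
zpool add Triat log mirror /dev/nvme0n1p1 /dev/sda1
# Special (metadata/small blocks) vdev, also mirrored.
zpool add Triat special mirror /dev/nvme0n1p2 /dev/sda2
# The L2ARC doesn't need redundancy, so the rest of the NVMe goes here.
zpool add Triat cache /dev/nvme0n1p3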

To start off, I ran CrystalMark benchmarks on all of these inside the same VM and they are all about the same, even without the cache. It probably has to do with the ARC, so I won't bother posting the CrystalMark numbers here as they're almost identical.

The rest of my findings were more interesting as a "real world scenario", though.

Bcache

VM Bootup with a warm cache. All of these timings include dynamic hugepage allocation, which means it takes a while before the VM even begins to read the raw storage.

  • 1 minute 15 seconds

Without the allocation the boot time is about 30 seconds.

The “Welcome” screen in Windows lasted:

  • 7 seconds.

Starting World of Warcraft TBC Classic (From “Play” to character screen):

  • 12 seconds

From “Enter world” to character control:

  • 15 seconds

No Cache

VM Bootup with no cache:

  • 2 minutes 11 seconds

The “Welcome” screen in Windows lasted:

  • 26 seconds.

Starting World of Warcraft TBC Classic (From “Play” to character screen):

  • 23 seconds

From “Enter world” to character control:

  • 42 seconds

Persistent L2ARC + SPECIAL + SLOG, first run

VM Bootup with a cold cache:

  • 2 minutes 13 seconds

The “Welcome” screen in Windows lasted:

  • 29 seconds.

Starting World of Warcraft TBC Classic (From “Play” to character screen):

  • Did not start WoW

From “Enter world” to character control:

  • Did not start WoW

Persistent L2ARC + SPECIAL + SLOG, second run

VM Bootup with a warm cache:

  • 1 minute 20 seconds

The “Welcome” screen in Windows lasted:

  • 7 seconds.

Starting World of Warcraft TBC Classic (From “Play” to character screen):

  • 22 seconds

From “Enter world” to character control:

  • 42 seconds

Remember that WoW was not cached here since I didn’t run it before now.

Persistent L2ARC + SPECIAL + SLOG, third run

L2ARC hit rate: 89.6%.

L2ARC size: 4.8GiB

VM Bootup with a warm cache:

  • 1 minute 10 seconds

The “Welcome” screen in Windows lasted:

  • 7 seconds.

Starting World of Warcraft TBC Classic (From “Play” to character screen):

  • 12 seconds

From “Enter world” to character control:

  • 13 seconds

L2ARC Tuning

You should add the l2arc_noprefetch=0 tuning option to ZFS. By default the L2ARC skips prefetched (streaming) reads and only picks up buffers as they get evicted from the ARC; setting this to 0 makes it cache those reads as well. Otherwise you only get fast random I/O on things that have already been tossed out of the ARC (out of memory). I needed to set this before creating the L2ARC. I'm not sure why, but when I set it after creating it, for some reason it kept only caching evicted data.

Another option you probably want is to increase l2arc_write_max and l2arc_write_boost. They default to 8MB/s, which caps how quickly the cache on the NVMe gets filled. Considering an NVMe can read and write significantly faster, you should probably bump them up a lot. From what I could read, don't go all the way to your maximum theoretical write speed though, as the drive probably needs to service reads at the same time.

Added to /etc/modprobe.d/zfs.conf:

options zfs l2arc_rebuild_enabled=1         # Persistent L2ARC; 1 by default, so this shouldn't be necessary.
options zfs l2arc_noprefetch=0              # Also cache prefetched (streaming) reads, not just ARC evictions.
options zfs l2arc_write_max=536870912       # Raise the maximum cache write rate from 8MB/s to 512MB/s.
options zfs l2arc_write_boost=1073741824    # Allow up to an extra 1GB/s while the L2ARC is still cold.

There's a common recommendation not to go overboard with the L2ARC size, because the headers for the cached entries are kept in RAM and thus take up space the ARC could otherwise use. This was remedied significantly a few versions ago when the header size was reduced, so a larger L2ARC probably isn't a big issue anymore.

Conclusion

Persistent L2ARC's performance is excellent and more or less exactly matches bcache's. You probably won't get the same write speeds even with a dedicated SLOG, but you get cache compression and in general a smarter cache, and you don't need to lie to ZFS about the underlying layers, which is far safer. Your entire array won't die if the SSD dies either.

I think the tests show that it's far better to use L2ARC in this case, since ZFS sits on bare metal like it's designed to, without an extra block layer underneath it.

So the TL;DR is do NOT use Bcache with ZFS. Use L2ARC!


Thanks for this breakdown! Especially the configuration for the write_max tuning. I have some small 250/500GB SSDs that I rarely use due to their size. I've now migrated my Steam libraries over to some HDDs that those SSDs are caching, and it has worked great!

Good tuning. With both L2 and special, consider

 l2arc_exclude_special=1

You don't want to waste L2ARC capacity by caching stuff that's already present on the other NVMe (assuming L2 and special are both NVMe, or both SATA SSD).

I'm not a fan of raising l2arc_write_max, because you end up with basically useless data in your L2ARC after backups/installs/lots of reads. The L2 gets all the evicted MRU stuff that is only accessed once. The throttle isn't just there to save drive endurance; it also improves scan resistance. L2ARC doesn't have an adaptive MRU/MFU "slider" like the ARC and is more primitive in general.

You can bypass this problem via l2arc_mfuonly=1 to exclude any eligible MRU data from the L2, but this may be counterproductive if you have a considerable fraction of MRU hits in normal operation.

I also use l2arc_trim_ahead=100 so the L2 gets TRIMmed. But be careful with the percentage, as it is derived from your l2arc_write_max and can result in worse NVMe performance, since the background TRIM obviously puts load on the drive.
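
In the same /etc/modprobe.d/zfs.conf format as above, those suggestions would look something like this (all standard OpenZFS module parameters; pick the ones that fit your workload):

options zfs l2arc_exclude_special=1   # Don't cache data that already lives on the special vdev.
options zfs l2arc_mfuonly=1           # Only feed MFU data into the L2ARC, skipping once-read MRU data.
options zfs l2arc_trim_ahead=100      # TRIM ahead of L2ARC writes, as a percentage of l2arc_write_max.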


This is a great experiment! Do you have any data on how write performance compared in the two setups?

I've read a couple of posts about using LVM cache plus ZFS as well. Bcache's writeback mode is superior to LVM cache's writeback performance, since LVM only caches hot writes unless you're using writecache mode (which gives no read cache), whereas bcache writeback sends all new writes to the cache first, giving a big boost to slow backing disks.
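
For comparison, a writeback lvmcache setup looks roughly like this (the volume group and LV names are made up):

# Create a cache pool on the fast device and attach it to the slow LV in writeback mode.
lvcreate --type cache-pool -L 400G -n cachepool vg0 /dev/nvme0n1
lvconvert --type cache --cachemode writeback --cachepool vg0/cachepool vg0/data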

In a few years we will likely see a stable BcacheFS in the kernel and more of these discussions will be relics of the past.

But in the mean time so many of us want the benefits of ZFS and the high performance of NVMe cache for writes and reads.

Thank you for this detailed writeup and the follow-up notes on L2ARC testing. I loved how you even demonstrated how to skip the "64GB RAM ARC requirement" most people state as the cost of entry to L2ARC.

I'm trying to set this up right now and I must be missing something. I'm on Proxmox 8.

I created /etc/modprobe.d/zfs.conf (it didn't exist) and added a single line:

options zfs l2arc_exclude_special=1

After a reboot:

cat /sys/module/zfs/parameters/l2arc_exclude_special
0

I checked the syntax, directory and parameter name; it all looks correct.

I don't run ZFS with altered parameters on Proxmox; I use it on FreeBSD (TrueNAS Core) and Ubuntu (23.10, which has all my parameters set during boot via zfs.conf). It might be a Debian or Proxmox quirk. I remember asking a similar thing in the Linux problems megathread here (can't find my post anymore, just checked), and creating and editing the zfs.conf in another directory solved the problem. /etc/modprobe.d/zfs.conf may be the standard location, but I've seen problems with it myself.

There is always the option to set the correct value manually (echoing into /sys/module/…) during the init process or via a cron job. The zfs.conf is just a more proper and straightforward way to set the things you always want for your pool.

I'd start by checking whether the modprobe.d drop-in is even parsed; you should see something in syslog/journalctl. You can, for instance, add a deliberately invalid property and check whether modprobe throws an error.

I'm not using Proxmox, so the following is a wild guess, but it's possible that the zfs module is loaded from the initramfs before your root filesystem is mounted, and thus the modprobe options don't apply. Possibly updating the initramfs is enough. Setting the parameter on the kernel command line should also be an option.
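
As a concrete check (Debian/Proxmox tooling assumed), you can set the parameter live first, then make sure the early-boot path picks it up:

# Confirm the parameter itself works by setting it at runtime.
echo 1 > /sys/module/zfs/parameters/l2arc_exclude_special
cat /sys/module/zfs/parameters/l2arc_exclude_special

# Rebuild the initramfs so the modprobe.d option applies when zfs loads at early boot.
update-initramfs -u -k all

# Or pass it on the kernel command line instead:
#   zfs.l2arc_exclude_special=1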