Will ZFS ever support NVMe as cache like bcache/dm-cache?

I have a bit of a performance problem with ZFS. I have a system with a mix of hard drives and some spare NVMe drives (only consumer grade, which may be a red flag even though I will be running them in RAID1/mirror), but as far as I remember:

The ZFS Intent Log (ZIL), which can be placed on a separate log device (SLOG), is not a cache; it is just a temporary buffer that stores sync transaction logs (edit: thanks @Ghan for the correction)

L2ARC, now that it is persistent, is an interesting one. Still, the way L2ARC works requires long-term use to determine which ARC entries can be evicted to the L2ARC device (ideally Optane/NVMe/flash storage).

So in reality L2ARC works more like a tiered read cache and cannot act as a writeback cache in the traditional sense, which is what I want.

And a metadata device is just for storing filesystem information. It is not really related to IO at all.

I’m going to evaluate LVM on top of ZFS, more specifically on a ZVOL, to see if it helps performance. There is an obvious problem though: the functionality of LVM and ZFS overlaps heavily, so I really hope ZFS bakes in something like bcache/dm-cache. If it did, I could speed up my iSCSI to the level of the QNAP/Synology NVMe cache setups I used to speed things up at my company.


I have set up a 10GbE home network, connecting my 544+FLR, which is originally a 40GbE/50GbE card, to a 10GbE link using a passive DAC cable. It didn’t work well at first, but after some tinkering it works like a charm! The speedup is phenomenal, and I will have to redesign some of my cable management to accommodate the new 10GbE switch. Unfortunately, that switch is a value-oriented TP-Link model, so I cannot set up LACP to get a further speedup by aggregating links. My ZFS setup is untouched since installing the cards. Here’s a quick CB to show that the improvement was immediate:

Here’s the original speed over 1Gbe:
image

Here’s the new speed using the setup I mentioned above:
image

I will be trying to tune ZFS to get as close to NVMe “wire speed” as possible, though many more factors come into play, such as SLC caching, memory throughput, TXG time, latency fluctuation… all sorts of things need to be taken into account for maximum performance. I haven’t had time recently, but I will call this a small victory. The material shown above is for reference and relative comparison only, and I will try to do more intensive benchmarking and research over the coming weeks in my free time.

The saga continues…


Review Wendel’s post on ZFS special devices and look for the ‘special_small_blocks’ parameter.

When set to >0 the special device stores data chunks up to that size. If the special devices are significantly faster than the main storage media for accessing these smaller chunk sizes (e.g. nvme vs. hdd) this setup will result in a significant performance gain.

The gain should at least be higher than with a construct of LVM on top of ZFS zvols.


This is correct.

This is not correct.
The ZIL is always used (even if there is no SLOG device defined) for synchronous I/O write requests. The high level data flow for a sync write is as follows:

  1. Write I/O is issued to the OS.
  2. Data is written to both the RAM buffer and the ZFS ZIL.
  3. Once the data is persisted on the ZIL, ZFS returns the write operation as complete on disk.
  4. ZFS then comes back and flushes the RAM buffer to the durable pool VDEVs.

Using fast NVMe devices as a SLOG (Separate Log Device) to store the ZIL can improve performance because it eliminates the need to double-write the data, and it reduces the performance penalty if the SLOG device is faster than the pool itself (latency). However, it will not help for asynchronous writes, and it will not help if there is a huge sequential write workload on the pool. It is primarily to improve IOPS responsiveness for random I/O that is able to fit in the buffer by smoothing out the writes and writing data to the pool in larger chunks.
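
As a rough sketch of what attaching a SLOG looks like in practice, the script below prints the commands involved rather than running them. The pool name rpool2 is the one that appears in this thread; the two device paths are placeholders I made up, and the `run` helper and `APPLY` switch are just a dry-run convenience, not part of ZFS.

```shell
#!/bin/sh
# Dry run by default: each command is printed, not executed.
# Set APPLY=1 only after replacing the placeholder device paths below.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

slog_plan() {   # slog_plan <pool> <log device A> <log device B>
    pool=$1; dev_a=$2; dev_b=$3
    # sync=standard honours application sync requests; sync=disabled skips the ZIL.
    run zfs get sync,logbias "$pool"
    # Add a mirrored log vdev so ZIL records go to NVMe instead of the pool disks.
    run zpool add "$pool" log mirror "$dev_a" "$dev_b"
    # Check that a "logs" section now shows up in the pool layout.
    run zpool status "$pool"
}

# rpool2 is the pool name from the thread; the device paths are hypothetical.
slog_plan rpool2 /dev/disk/by-id/nvme-EXAMPLE-A /dev/disk/by-id/nvme-EXAMPLE-B
```

If `zfs get sync` reports `disabled`, or the workload never issues sync writes, the log vdev will sit mostly idle, which is why the workload questions matter before buying hardware for it.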

So the question for your request is: What kind of workload do you have, and what performance optimization are you trying to achieve? You mention iSCSI, but do you have synchronous writes? Are you trying to improve IOPS or pool throughput? Write latency? Read latency?


I’m storing gaming footage, some 4K movies and some project code, and maybe editing a 4K video or two over the 40GbE network I piggybacked onto a 10GbE switch.

I even want to use iSCSI to boot my operating systems since I work on Linux and Windows simultaneously, and sometimes I want my main battle-workstation (yes…I run workstation stuff on my gaming PC…) to be a bare metal K8S node that runs Kubeflow without having to use WSL2, because WSL2 performance can sometimes be abysmal, especially for memory and raw device access.

I know this could work because I used to add iSCSI entries to the UEFI menu in a TianoCore-powered VM. I found it quite insane that it works out of the box without the OS noticing. I know it is unsafe, because once the network goes belly up so will the OS, but I don’t really care because most of my stuff is on Git anyway.

# zfs get special_small_blocks rpool2
NAME    PROPERTY              VALUE                 SOURCE
rpool2  special_small_blocks  128K                  local

So I would have to tune it…

The performance is blasphemous though:

# fio --name=randrw --ioengine=io_uring --rw=randrw --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
randrw: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=932KiB/s,w=932KiB/s][r=233,w=233 IOPS][eta 00m:00s]
randrw: (groupid=0, jobs=1): err= 0: pid=710136: Thu Jan  5 01:10:00 2023
  read: IOPS=303, BW=1212KiB/s (1242kB/s)(71.1MiB/60017msec)
    slat (usec): min=2, max=643255, avg=1447.85, stdev=8909.71
    clat (nsec): min=60, max=40362, avg=223.63, stdev=403.57
     lat (usec): min=2, max=643260, avg=1448.29, stdev=8909.83
    clat percentiles (nsec):
     |  1.00th=[   80],  5.00th=[   90], 10.00th=[   90], 20.00th=[  100],
     | 30.00th=[  110], 40.00th=[  120], 50.00th=[  131], 60.00th=[  141],
     | 70.00th=[  191], 80.00th=[  290], 90.00th=[  442], 95.00th=[  724],
     | 99.00th=[  972], 99.50th=[ 1560], 99.90th=[ 3216], 99.95th=[ 3792],
     | 99.99th=[10944]
   bw (  KiB/s): min=    8, max=52992, per=100.00%, avg=1325.72, stdev=5657.19, samples=109
   iops        : min=    2, max=13248, avg=331.43, stdev=1414.30, samples=109
  write: IOPS=303, BW=1214KiB/s (1243kB/s)(71.1MiB/60017msec); 0 zone resets
    slat (nsec): min=600, max=415986, avg=2502.02, stdev=9030.00
    clat (nsec): min=140, max=1934.8M, avg=1843585.48, stdev=26762311.91
     lat (usec): min=9, max=1934.8k, avg=1846.17, stdev=26763.01
    clat percentiles (usec):
     |  1.00th=[     13],  5.00th=[     17], 10.00th=[     27],
     | 20.00th=[     34], 30.00th=[     36], 40.00th=[     38],
     | 50.00th=[     40], 60.00th=[     44], 70.00th=[     62],
     | 80.00th=[    225], 90.00th=[   6390], 95.00th=[   8586],
     | 99.00th=[  10421], 99.50th=[  11207], 99.90th=[  51119],
     | 99.95th=[ 429917], 99.99th=[1937769]
   bw (  KiB/s): min=    8, max=53320, per=100.00%, avg=1328.59, stdev=5663.81, samples=109
   iops        : min=    2, max=13330, avg=332.15, stdev=1415.95, samples=109
  lat (nsec)   : 100=5.99%, 250=31.56%, 500=8.62%, 750=1.37%, 1000=2.02%
  lat (usec)   : 2=0.29%, 4=0.17%, 10=0.13%, 20=3.21%, 50=29.55%
  lat (usec)   : 100=4.90%, 250=2.37%, 500=0.34%, 750=0.08%, 1000=0.02%
  lat (msec)   : 2=0.04%, 4=1.89%, 10=6.58%, 20=0.80%, 50=0.03%
  lat (msec)   : 100=0.02%, 500=0.01%, 750=0.01%, 2000=0.01%
  cpu          : usr=0.10%, sys=1.20%, ctx=22155, majf=0, minf=20
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=18192,18209,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=1212KiB/s (1242kB/s), 1212KiB/s-1212KiB/s (1242kB/s-1242kB/s), io=71.1MiB (74.5MB), run=60017-60017msec
  WRITE: bw=1214KiB/s (1243kB/s), 1214KiB/s-1214KiB/s (1243kB/s-1243kB/s), io=71.1MiB (74.6MB), run=60017-60017msec

I’m not sure if I hit the SLC cache of my NVMe drive or what, but the IOPS is terrible, as if I’m still running on my rusty spinning disks, even though I have both a ZIL device and L2ARC on NVMe drives attached to my ZFS pool.
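
A quick sanity check on the fio numbers above, as a sketch: with numjobs=1 and iodepth=1 only one I/O is in flight at a time, so the IOPS figure should follow almost directly from the average latencies. The two helper functions are names I made up for the arithmetic; the inputs are the average total read and write latencies from the output, rounded to whole microseconds.

```shell
#!/bin/sh
# With one job and queue depth 1, a roughly 50/50 randrw mix completes about
# one read plus one write per (avg read latency + avg write latency).
pairs_per_sec() {   # pairs_per_sec <avg read lat in usec> <avg write lat in usec>
    echo $(( 1000000 / ($1 + $2) ))
}

bw_kib_per_sec() {  # bw_kib_per_sec <IOPS> <block size in KiB>
    echo $(( $1 * $2 ))
}

# "lat (usec)" averages from the fio report: 1448.29 (read), 1846.17 (write).
iops=$(pairs_per_sec 1448 1846)
echo "IOPS per direction: $iops"
echo "Bandwidth per direction: $(bw_kib_per_sec "$iops" 4) KiB/s"
```

This prints 303 IOPS and 1212 KiB/s per direction, which lines up with what fio reported. In other words, this particular test measures per-I/O latency at queue depth 1, not the throughput the drives could sustain.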

Yes, this is true in general for good zfs performance.

While the defaults work for “general” or “average” use, zfs lets you adjust its behavior to specific workloads and present all of this in what can be a single file system.

The default recordsize in ZFS is 128k. The special_small_blocks parameter should be smaller than that - or all the data for this dataset will be stored on the special device.

Please note that the special device is not a cache, but rather diverts small record size storage to a different set of vdevs.
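
The threshold rule can be modelled in a few lines of shell. This is a simplified sketch of the placement decision for file data blocks only (metadata goes to the special vdev regardless), and the function names `to_bytes` and `placement` are mine, not ZFS tooling.

```shell
#!/bin/sh
# Convert values such as 64K or 1M to bytes; plain numbers pass through.
to_bytes() {
    case "$1" in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

# placement <block size> <special_small_blocks>
# A data block lands on the special vdev when the threshold is non-zero
# and the block is smaller than or equal to it.
placement() {
    block=$(to_bytes "$1"); limit=$(to_bytes "$2")
    if [ "$limit" -gt 0 ] && [ "$block" -le "$limit" ]; then
        echo special
    else
        echo regular
    fi
}

placement 128K 128K   # special: the setting shown for rpool2 diverts full 128K records
placement 128K 64K    # regular
placement 16K 64K     # special
placement 16K 0       # regular: a threshold of 0 keeps data off the special vdev
```

With the default 128K recordsize and special_small_blocks=128K, every data block qualifies, which is the situation the `zfs get` output earlier in the thread shows.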

For a video/movie dataset you want to tune zfs to use a large recordsize, no compression and a special device that avoids HDDs slowing down when accessing small chunks.

btw: your fio test is simulating the exact opposite use case.


ZFS is not going to support traditional promotion of hot data to an SSD pool anytime soon, and I am not aware of any viable effort to implement this. To do this you’d basically set up an SSD and HDD pool, and use something like bcache over top of them. Or the reverse.

As an alternative that has worked great for me: if you have special vdevs made of (mirrored) SSDs that are 500GB-1TB in size, you can set special_small_blocks to 64K across the pool. That automatically puts both metadata and your worst-performing data on the SSDs instead of the HDDs, without worrying about space until you have a ridiculous amount of data (tens of TBs) or special storage needs. It only applies to newly written data, so you have to redo the pool or move the data first. This is why I greatly prefer regular flash special vdevs over Optane: capacity and cost versus the benefit gained. When implementing special vdevs, I recommend backing up and remaking the pool from scratch with special vdevs from the beginning, in order to prevent strange problems.

special_small_blocks can currently be set to 1M (at least I’m pretty sure, never tested it. Previously it was limited to 128K), but you’ll need to adjust the recordsize on the datasets to be higher than whatever the special_small_blocks is for that dataset. Otherwise it’ll all go to the special vdev.

If you need a particular set of data accelerated in its entirety, put it on a dataset, and set the special_small_blocks to match the recordsize of that particular dataset.

This allows you to send/recv or otherwise manage your pools without having to worry about what data is where.
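
A sketch of how the per-dataset approach might look, again as a dry run that only prints commands. The dataset names `media` and `projects` and the chosen sizes are illustrative assumptions on my part; the properties themselves (recordsize, compression, special_small_blocks) are standard ZFS dataset properties.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed. Set APPLY=1 to run them.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

dataset_plan() {   # dataset_plan <pool>
    pool=$1
    # Large sequential media: big records, no compression, only blocks of
    # 64K or less are diverted to the special vdev.
    run zfs create -o recordsize=1M -o compression=off -o special_small_blocks=64K "$pool/media"
    # Fully accelerated dataset: threshold equals recordsize, so every block qualifies.
    run zfs create -o recordsize=64K -o special_small_blocks=64K "$pool/projects"
    # Review what each dataset ended up with.
    run zfs get -r recordsize,special_small_blocks,compression "$pool"
}

# rpool2 is the pool name from the thread; the dataset names are hypothetical.
dataset_plan rpool2
```

Because these are ordinary dataset properties, the split between SSD and HDD is decided at write time per dataset, with no separate pool to manage.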

Eventually I think bcachefs will satisfy most of the desires of advanced Linux users once it matures. That’s a ways off yet, but it’s something to watch.


There’s bcache (different from bcachefs) currently which you could slip under zfs, but I’ve read some horror stories on the internet, and my setup isn’t that stable so I gave up on bcache.

Instead, because I have a specific workload, I move data not recently modified from SSD to HDD using a find … | sort … | rsync … shell script.
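
For illustration only, here is one possible shape for such a script. It is not the poster’s actual script: the age cutoff, the function names, and the choice of rsync options are all my assumptions, and it relies on GNU find (`-printf`). File names containing newlines would break it.

```shell
#!/bin/sh
# list_cold_files <dir> <days>
# Print files under <dir> not modified for more than <days> days,
# oldest first, as paths relative to <dir>.
list_cold_files() {
    find "$1" -type f -mtime +"$2" -printf '%T@ %P\n' | sort -n | cut -d' ' -f2-
}

# move_cold_files <ssd dir> <hdd dir> <days>
# Feeds the list to rsync. By default rsync runs with --dry-run and only
# reports what it would do; set APPLY=1 to actually move the files.
move_cold_files() {
    if [ "${APPLY:-0}" = "1" ]; then dry=""; else dry="--dry-run"; fi
    # $dry is intentionally unquoted so an empty value adds no argument.
    list_cold_files "$1" "$3" |
        rsync -a $dry --remove-source-files --files-from=- "$1"/ "$2"/
}
```

A call such as `move_cold_files /mnt/ssd/data /mnt/hdd/data 30` (both paths hypothetical) would then show which files older than 30 days are candidates to leave the SSD.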


There’s another thread on the forums where someone is asking about caching NFS content locally and someone mentioned fs-cache.

Maybe that’s an option for you too?


Back to block level caching, there’s dm-cache.

for zfs datasets yes

the default size for zvols is 8k last i checked.


I use memory as write cache. sync disabled and 40GB of dirty data maximum along with several other tunables geared towards “hey let’s keep it in RAM and let HDDs write at sequential speeds via TXG sync”. Works for me with 10Gbit networking and transfers up to around 100GB before performance isn’t wire-speed anymore and converges to HDD sequential writes.
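
A minimal sketch of the two settings named above, printed rather than applied. The dataset name is hypothetical, the “several other tunables” are not specified in the post and are left out, and `/sys/module/zfs/parameters/` is where ZFS on Linux exposes module parameters. Keep in mind that with sync=disabled, writes acknowledged to the client can be lost on power failure or a crash.

```shell
#!/bin/sh
# Convert GiB to bytes, the unit zfs_dirty_data_max expects.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# ram_cache_plan <dataset> <max dirty data in GiB>
# Prints the commands instead of running them; review before pasting as root.
ram_cache_plan() {
    dataset=$1; gib=$2
    # Treat sync writes as async so they are buffered in RAM until the next TXG sync.
    echo "zfs set sync=disabled $dataset"
    # Raise the ceiling on dirty (not yet written) data held in memory.
    echo "echo $(gib_to_bytes "$gib") > /sys/module/zfs/parameters/zfs_dirty_data_max"
}

# rpool2/scratch is a hypothetical dataset; 40 is the figure from the post.
ram_cache_plan rpool2/scratch 40
```

The second printed command sets the limit to 42949672960 bytes, which is 40 GiB.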

Reads…well, get more ARC+L2. Improving reads via ARC and special vdevs is far more straightforward.


Losing special vdevs would cause pool failure though.
Losing L2ARC vdevs wouldn’t.

Planning for special vdevs is not that straightforward for me (can’t just throw an old SSD at it)