ZVOL vs. file as VM backing - huge performance difference on nvme-based zpool

TL;DR

QCOW2 (and raw) image files on top of a gen4 NVMe ZFS pool are much slower than ZVOLs (and than QCOW2 on ext4) for backing a Windows 10 VM. I did not expect that.

Background

I’ve got 4x 1TB NVMe drives (WD SN850, gen4) backing VMs and /home on my Linux workstation. I’m benchmarking different storage models for the VMs, and I’m seeing (for me) unexpected performance differences between ZVOLs on the one hand and QCOW2/raw files on the other. So I’m asking the experts here whether I’m doing something wrong, or whether conventional wisdom has simply changed with new hardware.

In the past I’ve mostly used ZVOLs for VMs, on both HDD- and SSD-based pools. However, ever since reading Jim Salter’s tests of ZVOL vs. QCOW2 for VMs, where he found the performance loss with QCOW2 to be tiny, I’ve been considering QCOW2-on-ZFS for my next machine (this one). Yet the performance hit with QCOW2 for my Windows 10 VM is much larger than what the linked blog post observed (with Linux VMs). I’ll present representative tests here and see what L1Techs has to say.

Test setup

System: EPYC 7252 (8c 3.1-3.2GHz), 64GB RAM in 4 channels, Supermicro H12SSL-I
Host: Linux 5.13.0-28-generic #31~20.04.1-Ubuntu SMP (Kubuntu 20.04.3)
Storage: 4x WD SN850 1TB (4k format, striped-mirrors zpool, ashift=12)
ZFS version: zfs-0.8.3-1ubuntu12.13, zfs-kmod-2.0.6-1ubuntu2
KVM Guest (via libvirt): Windows 10 Pro, virtio-scsi storage driver (0.1.190)
NTFS cluster size: 4k (default)
Libvirt details:

Device model = virtio-scsi
Cache mode = none
IO mode = native
Discard = unmap
Detect Zeroes = off

Test software (in guest): CrystalDiskMark 8, "nvme ssd" preset (for exact tests see results below)
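For reference, the two backing stores were set up roughly like this. This is a sketch reconstructed from the property dumps below and the qcow2 flags I list in a later post (which may not exactly match this first round of tests); the image file name is a placeholder:

# ZVOL variant: sparse 10G volume, 64K volblocksize, compression inherited from the pool
zfs create -s -V 10G -o volblocksize=64K nvme/testvol64k

# file variant: dataset with 64K recordsize, plus a qcow2 image on top of it
zfs create -o recordsize=64K nvme/testbed64k
qemu-img create -f qcow2 -o cluster_size=64k,preallocation=metadata,lazy_refcounts=on /nvme/testbed64k/testdisk.qcow2 10G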

Zdb info
nvme:
    version: 5000
    name: 'nvme'
    state: 0
    txg: 258406
    pool_guid: 1480964832145129513
    errata: 0
    hostid: 255178082
    hostname: 'snucifer'
    com.delphix:has_per_vdev_zaps
    vdev_children: 2
    vdev_tree:
        type: 'root'
        id: 0
        guid: 1480964832145129513
        create_txg: 4
        children[0]:
            type: 'mirror'
            id: 0
            guid: 9182964367811329375
            metaslab_array: 146
            metaslab_shift: 33
            ashift: 12
            asize: 1000118681600
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 129
            children[0]:
                type: 'disk'
                id: 0
                guid: 7877337056442423979
                path: '/dev/disk/by-id/nvme-WDS100T1X0E-00AFY0_2139DK441905-part1'
                devid: 'nvme-WDS100T1X0E-00AFY0_2139DK441905-part1'
                phys_path: 'pci-0000:81:00.0-nvme-1'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 130
            children[1]:
                type: 'disk'
                id: 1
                guid: 8728541461798927995
                path: '/dev/disk/by-id/nvme-WDS100T1X0E-00AFY0_21413J443312-part1'
                devid: 'nvme-WDS100T1X0E-00AFY0_21413J443312-part1'
                phys_path: 'pci-0000:84:00.0-nvme-1'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 131
        children[1]:
            type: 'mirror'
            id: 1
            guid: 6136475571192416831
            metaslab_array: 135
            metaslab_shift: 33
            ashift: 12
            asize: 1000118681600
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 132
            children[0]:
                type: 'disk'
                id: 0
                guid: 15888088810331018339
                path: '/dev/disk/by-id/nvme-WDS100T1X0E-00AFY0_2139DK445807-part1'
                devid: 'nvme-WDS100T1X0E-00AFY0_2139DK445807-part1'
                phys_path: 'pci-0000:82:00.0-nvme-1'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 133
            children[1]:
                type: 'disk'
                id: 1
                guid: 7038756282300661913
                path: '/dev/disk/by-id/nvme-WDS100T1X0E-00AFY0_21413J448714-part1'
                devid: 'nvme-WDS100T1X0E-00AFY0_21413J448714-part1'
                phys_path: 'pci-0000:83:00.0-nvme-1'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 134
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

The tests

I tried a lot of different options and I’m not reviewing them all here; the results below are representative. ZVOLs are much faster than QCOW2 except for 4k IO at tiny queue depths. I tested with the ARC both effectively on and off (primarycache=all vs. metadata).
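Switching the ARC mode between runs was just a property change (same names as in the dumps below):

zfs set primarycache=all nvme/testvol64k        # cache data + metadata in the ARC (default)
zfs set primarycache=metadata nvme/testvol64k   # cache metadata only, effectively bypassing the ARC for data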

NB: In my limited testing with RAW files, they seem to perform pretty much the same as QCOW2. So these tests can generally be seen as ZVOL vs. file-backed IO, rather than QCOW2 specifically.

Use the drop-down lists below to compare any two sets of results.

ZVOL Tests

ZVOL properties
NAME             PROPERTY              VALUE                  SOURCE
nvme/testvol64k  type                  volume                 -
nvme/testvol64k  creation              ons dec 15 21:45 2021  -
nvme/testvol64k  used                  4,90M                  -
nvme/testvol64k  available             1,74T                  -
nvme/testvol64k  referenced            4,90M                  -
nvme/testvol64k  compressratio         4.81x                  -
nvme/testvol64k  reservation           none                   default
nvme/testvol64k  volsize               10G                    local
nvme/testvol64k  volblocksize          64K                    -
nvme/testvol64k  checksum              on                     default
nvme/testvol64k  compression           lz4                    inherited from nvme
nvme/testvol64k  readonly              off                    default
nvme/testvol64k  createtxg             42056                  -
nvme/testvol64k  copies                1                      default
nvme/testvol64k  refreservation        none                   default
nvme/testvol64k  guid                  9876788460306791810    -
nvme/testvol64k  primarycache          (varies, see below)
nvme/testvol64k  secondarycache        all                    default
nvme/testvol64k  usedbysnapshots       0B                     -
nvme/testvol64k  usedbydataset         4,90M                  -
nvme/testvol64k  usedbychildren        0B                     -
nvme/testvol64k  usedbyrefreservation  0B                     -
nvme/testvol64k  logbias               latency                default
nvme/testvol64k  objsetid              174                    -
nvme/testvol64k  dedup                 off                    default
nvme/testvol64k  mlslabel              none                   default
nvme/testvol64k  sync                  standard               default
nvme/testvol64k  refcompressratio      4.81x                  -
nvme/testvol64k  written               4,90M                  -
nvme/testvol64k  logicalused           23,3M                  -
nvme/testvol64k  logicalreferenced     23,3M                  -
nvme/testvol64k  volmode               default                default
nvme/testvol64k  snapshot_limit        none                   default
nvme/testvol64k  snapshot_count        none                   default
nvme/testvol64k  snapdev               hidden                 default
nvme/testvol64k  context               none                   default
nvme/testvol64k  fscontext             none                   default
nvme/testvol64k  defcontext            none                   default
nvme/testvol64k  rootcontext           none                   default
nvme/testvol64k  redundant_metadata    all                    default
nvme/testvol64k  encryption            off                    default
nvme/testvol64k  keylocation           none                   default
nvme/testvol64k  keyformat             none                   default
nvme/testvol64k  pbkdf2iters           0                      default
Test results (primarycache=all)
[Read]
  SEQ    1MiB (Q=  8, T= 1):  8780.844 MB/s [   8374.1 IOPS] <   945.28 us>
  SEQ  128KiB (Q= 32, T= 1):  5294.938 MB/s [  40397.2 IOPS] <   787.53 us>
  RND    4KiB (Q= 32, T=16):   217.263 MB/s [  53042.7 IOPS] <  4945.57 us>
  RND    4KiB (Q=  1, T= 1):    47.461 MB/s [  11587.2 IOPS] <    85.98 us>

[Write]
  SEQ    1MiB (Q=  8, T= 1):  3399.262 MB/s [   3241.8 IOPS] <  2431.31 us>
  SEQ  128KiB (Q= 32, T= 1):  2871.813 MB/s [  21910.2 IOPS] <  1455.00 us>
  RND    4KiB (Q= 32, T=16):   162.814 MB/s [  39749.5 IOPS] < 10739.13 us>
  RND    4KiB (Q=  1, T= 1):    41.492 MB/s [  10129.9 IOPS] <    98.27 us>

...
 OS: Windows 10 Professional N [10.0 Build 19043] (x64)
Comment: vioscsi zvol64k 4k
Test results (primarycache=metadata)
[Read]
  SEQ    1MiB (Q=  8, T= 1):  4657.319 MB/s [   4441.6 IOPS] <  1793.65 us>
  SEQ  128KiB (Q= 32, T= 1):  3700.376 MB/s [  28231.6 IOPS] <  1131.19 us>
  RND    4KiB (Q= 32, T=16):   178.914 MB/s [  43680.2 IOPS] <  8398.90 us>
  RND    4KiB (Q=  1, T= 1):    15.363 MB/s [   3750.7 IOPS] <   265.91 us>

[Write]
  SEQ    1MiB (Q=  8, T= 1):  2981.145 MB/s [   2843.0 IOPS] <  2783.38 us>
  SEQ  128KiB (Q= 32, T= 1):  2202.196 MB/s [  16801.4 IOPS] <  1895.05 us>
  RND    4KiB (Q= 32, T=16):   131.413 MB/s [  32083.3 IOPS] < 13802.99 us>
  RND    4KiB (Q=  1, T= 1):    14.597 MB/s [   3563.7 IOPS] <   279.71 us>

...    
 OS: Windows 10 Professional N [10.0 Build 19043] (x64)
Comment: vioscsi zvol64k 4k

ZVOL Comments
Here performance is around what I’d expect for zvols. I used a 64k volblocksize in these tests, but the default 16k did not make much difference. (I tried lots of other options besides those shown here.)

QCOW2 Tests

ZFS properties
NAME             PROPERTY              VALUE                  SOURCE
nvme/testbed64k  type                  filesystem             -
nvme/testbed64k  creation              tis dec 14 20:35 2021  -
nvme/testbed64k  used                  8,79M                  -
nvme/testbed64k  available             1,74T                  -
nvme/testbed64k  referenced            8,79M                  -
nvme/testbed64k  compressratio         4.07x                  -
nvme/testbed64k  mounted               yes                    -
nvme/testbed64k  quota                 none                   default
nvme/testbed64k  reservation           none                   default
nvme/testbed64k  recordsize            64K                    local
nvme/testbed64k  mountpoint            /nvme/testbed64k       default
nvme/testbed64k  sharenfs              off                    default
nvme/testbed64k  checksum              on                     default
nvme/testbed64k  compression           lz4                    inherited from nvme
nvme/testbed64k  atime                 off                    inherited from nvme
nvme/testbed64k  devices               on                     default
nvme/testbed64k  exec                  on                     default
nvme/testbed64k  setuid                on                     default
nvme/testbed64k  readonly              off                    default
nvme/testbed64k  zoned                 off                    default
nvme/testbed64k  snapdir               hidden                 default
nvme/testbed64k  aclinherit            restricted             default
nvme/testbed64k  createtxg             39146                  -
nvme/testbed64k  canmount              on                     default
nvme/testbed64k  xattr                 sa                     inherited from nvme
nvme/testbed64k  copies                1                      default
nvme/testbed64k  version               5                      -
nvme/testbed64k  utf8only              off                    -
nvme/testbed64k  normalization         none                   -
nvme/testbed64k  casesensitivity       sensitive              -
nvme/testbed64k  vscan                 off                    default
nvme/testbed64k  nbmand                off                    default
nvme/testbed64k  sharesmb              off                    default
nvme/testbed64k  refquota              none                   default
nvme/testbed64k  refreservation        none                   default
nvme/testbed64k  guid                  16607139284652682178   -
nvme/testbed64k  primarycache          (varies, see below)
nvme/testbed64k  secondarycache        all                    default
nvme/testbed64k  usedbysnapshots       0B                     -
nvme/testbed64k  usedbydataset         8,79M                  -
nvme/testbed64k  usedbychildren        0B                     -
nvme/testbed64k  usedbyrefreservation  0B                     -
nvme/testbed64k  logbias               latency                default
nvme/testbed64k  objsetid              168                    -
nvme/testbed64k  dedup                 off                    default
nvme/testbed64k  mlslabel              none                   default
nvme/testbed64k  sync                  standard               default
nvme/testbed64k  dnodesize             legacy                 default
nvme/testbed64k  refcompressratio      4.07x                  -
nvme/testbed64k  written               8,79M                  -
nvme/testbed64k  logicalused           35,3M                  -
nvme/testbed64k  logicalreferenced     35,3M                  -
nvme/testbed64k  volmode               default                default
nvme/testbed64k  filesystem_limit      none                   default
nvme/testbed64k  snapshot_limit        none                   default
nvme/testbed64k  filesystem_count      none                   default
nvme/testbed64k  snapshot_count        none                   default
nvme/testbed64k  snapdev               hidden                 default
nvme/testbed64k  acltype               off                    default
nvme/testbed64k  context               none                   default
nvme/testbed64k  fscontext             none                   default
nvme/testbed64k  defcontext            none                   default
nvme/testbed64k  rootcontext           none                   default
nvme/testbed64k  relatime              off                    default
nvme/testbed64k  redundant_metadata    all                    default
nvme/testbed64k  overlay               off                    default
nvme/testbed64k  encryption            off                    default
nvme/testbed64k  keylocation           none                   default
nvme/testbed64k  keyformat             none                   default
nvme/testbed64k  pbkdf2iters           0                      default
nvme/testbed64k  special_small_blocks  0                      default
Test results (primarycache=all)
[Read]
  SEQ    1MiB (Q=  8, T= 1):  2692.726 MB/s [   2568.0 IOPS] <  3099.41 us>
  SEQ  128KiB (Q= 32, T= 1):  2545.741 MB/s [  19422.5 IOPS] <  1643.67 us>
  RND    4KiB (Q= 32, T=16):   165.651 MB/s [  40442.1 IOPS] < 10757.95 us>
  RND    4KiB (Q=  1, T= 1):    77.860 MB/s [  19008.8 IOPS] <    52.32 us>

[Write]
  SEQ    1MiB (Q=  8, T= 1):  1507.677 MB/s [   1437.8 IOPS] <  5528.01 us>
  SEQ  128KiB (Q= 32, T= 1):  1269.009 MB/s [   9681.8 IOPS] <  3295.35 us>
  RND    4KiB (Q= 32, T=16):   117.596 MB/s [  28710.0 IOPS] < 16458.89 us>
  RND    4KiB (Q=  1, T= 1):    59.874 MB/s [  14617.7 IOPS] <    68.01 us>
...
     OS: Windows 10 Professional N [10.0 Build 19043] (x64)
Comment: vioscsi fs64k qcow2 4k
Test results (primarycache=metadata)
[Read]
  SEQ    1MiB (Q=  8, T= 1):   513.418 MB/s [    489.6 IOPS] < 16144.88 us>
  SEQ  128KiB (Q= 32, T= 1):   377.859 MB/s [   2882.8 IOPS] < 11042.59 us>
  RND    4KiB (Q= 32, T=16):    21.714 MB/s [   5301.3 IOPS] < 93358.63 us>
  RND    4KiB (Q=  1, T= 1):    16.399 MB/s [   4003.7 IOPS] <   249.20 us>

[Write]
  SEQ    1MiB (Q=  8, T= 1):   466.400 MB/s [    444.8 IOPS] < 17852.84 us>
  SEQ  128KiB (Q= 32, T= 1):   326.851 MB/s [   2493.7 IOPS] < 12753.79 us>
  RND    4KiB (Q= 32, T=16):    25.728 MB/s [   6281.2 IOPS] < 79446.86 us>
  RND    4KiB (Q=  1, T= 1):    15.033 MB/s [   3670.2 IOPS] <   271.44 us>

...
     OS: Windows 10 Professional N [10.0 Build 19043] (x64)
Comment: vioscsi fs64k qcow2 4k

QCOW2 Comments
This is where it gets much slower than expected, except for Q1T1 4k IO. In essence, QCOW2-backed volumes perform at <50% of ZVOL speed for sequential IO, 50-60% for high queue-depth 4k random IO, and ~150% for low queue-depth 4k random IO.

The performance hit for primarycache=metadata (vs. all) is much larger on QCOW2 than on the ZVOL. This suggests to me that qemu’s IO subsystem really struggles unless it can pool operations together and access larger chunks of data. I used a recordsize of 64k (rather than the default 128k) to align with the 64k QCOW2 cluster size.
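If anyone wants to verify that alignment on their own setup, a quick check could look like this (a sketch; the image path is a placeholder):

qemu-img info /nvme/testbed64k/testdisk.qcow2   # should report cluster_size: 65536 for 64k clusters
zfs get recordsize nvme/testbed64k              # should report 64K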

Other observations

CPU Utilization
I did not monitor CPU utilization closely in every test. However, my impression was that the CPU had to work harder with the ZVOL than with QCOW2, which argues against the CPU being the bottleneck in the slower file-backed case.

Adjustments I tried, that did not change the pattern
compression=off (vs. lz4)
recordsize=8k,128k (vs. 64k)
volblocksize=8k,128k (vs. 64k)
NTFS cluster size = 8k,64k (vs. 4k)
virtio disk model (vs virtio-scsi)
All disks performed as expected on their own
QCOW2 on Ext4 performs well
I’m pretty sure I tried mounting the QCOW2 file on the host and testing it with fio, and that it performed well, but I can’t find the results right now (a sketch of that kind of host-side test follows this list)
Last but not least, RAW files gave the same result as QCOW2 as far as I investigated
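A host-side check along these lines should reproduce the fio test mentioned above (a sketch, not my exact invocation; device and file names are placeholders):

modprobe nbd max_part=8
qemu-nbd --connect=/dev/nbd0 /nvme/testbed64k/testdisk.qcow2

# 4k random read, roughly matching the CrystalDiskMark RND4K Q32T16 case
fio --name=rndread --filename=/dev/nbd0 --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=16 --runtime=30 --time_based --group_reporting

qemu-nbd --disconnect /dev/nbd0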

Adjustments not tried
ashift = 13 (vs. 12)
Linux guest (vs. Windows 10)
Changing libvirt options except device model (IO mode, cache mode, …)
raidz (vs. striped mirrors)

The question

So what’s going on here? The performance loss for file-backed IO is surprisingly large, especially compared with Jim Salter’s results from 2018. The fact that raw files give similarly bad results suggests to me it could be an issue with how libvirt interacts with ZFS.

Thanks for reading this far! Any input on whether my expectations are reasonable, and if so what I could try next?


My answer is lol dunno, but this sounds similar to things that have come up in the past. This thread might have something relevant: https://www.reddit.com/r/zfs/comments/bedbwb/remarkable_performance_boost_after_switching_from/


Thanks, I had not seen that Reddit post before! It seems to be about a partially opposite problem though: a performance hit to random writes (relative to the rest) while using a writeback cache mode, whereas I see most problems with sequential IO and cache mode = none. But yeah, the discussion about cache modes could well be relevant.

I get similar percentage differences in performance (as in my original post) if I repeat the tests on a pool consisting of a single SATA3 SSD (Samsung 860 Evo 1TB). That drive is in the same ballpark as the ones Jim Salter used in his experiment, which makes me think this is an issue with my virtualization setup and/or MS Windows.

What I’ll probably try later today is to test with a Linux VM, keeping everything else equal, to check whether it could be an MS Windows thing.


I haven’t updated here in a while. I did some more testing, and it seems the problem appears mainly when using the “io=native” setting in libvirt. Using “io=threads” yields much less discrepancy between file and ZVOL backing, though at the cost of more CPU usage overall.
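For reference, switching between the two modes is just a matter of editing the io attribute on the disk’s driver element (a sketch; the domain name is a placeholder):

virsh edit win10
# in the disk definition, change
#   <driver name="qemu" type="qcow2" cache="none" io="native" discard="unmap"/>
# to
#   <driver name="qemu" type="qcow2" cache="none" io="threads" discard="unmap"/>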

I have some benchmark numbers, but not in an organized format, so I won’t post them right now. I will try to sort this issue out, but with a new Ubuntu LTS coming in a day I’ll probably upgrade the OS first and see if the discrepancies persist. I’m not ruling out software issues at this point.

(It’s not that performance is too slow for my needs; I’m just concerned that I might have introduced some unnecessary overhead somewhere.)


Here is a late update on this issue. As mentioned above, I’ve kind of pinpointed the problem to be related to Linux native AIO, or in QEMU/libvirt terms, io=native.

Here are some recent representative data (zvol left, qcow2 right, expand to view):

io=threads

io=native


The orange frames mark the discrepant results.

As seen above, performance tanks for qcow2 when io=native (but not when io=threads). Note that the test settings differ slightly from those in my first post, but the pattern is clear either way.

Detailed storage options (hidden for readability)
NTFS allocation unit size: 4k
qcow2 flags: cluster_size=64k,preallocation=metadata,lazy_refcounts=on
virtual disk options (apart from io): device model=virtio-scsi,cache=none, discard=unmap
iothreads: 1 per virtual scsi controller (effectively 1 per virtual disk; see the sketch after this list)
zfs/zvol options: recordsize/volblocksize=64k,compression=on,primarycache=metadata,(xattr=sa,atime=off when applicable)
file/volume size 10G
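The iothread wiring mentioned in the list above corresponds to domain XML along these lines (a sketch; the domain name, controller index and iothread id are examples):

virsh edit win10
# define one iothread for the domain:
#   <iothreads>1</iothreads>
# and bind it to the virtio-scsi controller carrying the disk:
#   <controller type="scsi" index="0" model="virtio-scsi">
#     <driver iothread="1"/>
#   </controller>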

I tried to match the parameters of the zvol+NTFS and the dataset+qcow2+NTFS stacks as much as possible. Again, raw files on a dataset give similarly bad performance as qcow2 with io=native, compared to io=threads.

(I’m aware that the 4k/64k mismatch present on both storage models gives rise to some write amplification - both in theory and as verified by zpool iostat - however this does not explain any variance for the issue discussed here)

Explanation?

Various sources report that native AIO has problems with sparsely provisioned storage. E.g. in this link:

When space efficient image files are used (QCOW2 without pre-allocation, or sparse raw images) the default of io=‘threads’ may be better suited. This is because writing to not yet allocated sectors may temporarily block the virtual CPU and thus decrease I/O performance.

and in this link (thanks to @Dratatoo and @vitoyxyz in another thread):

Sources of blocking in io_submit(2) depend on the file system and block devices being used. There are many different cases but in general they occur because file I/O code paths contain synchronous I/O (for metadata I/O or page cache write-out) as well as locks/waiting (for serializing operations). This is why the io_submit(2) system call can be held up while submitting a request.

This means io_submit(2) works best on fully-allocated files, volumes, or block devices. Anything else is likely to result in blocking behavior and cause poor performance.

Blocking is a reasonable characterisation of the behaviours I see. When monitoring with zpool iostat, I do see pauses with no disk activity coinciding with the observed performance dips for native IO. However, something is still off with this explanation: my zvols are sparse too (allocated with the -s -V flags to zfs create), and for them I don’t see any problems.

Moreover, as mentioned above, I saw the same issues with raw files on top of datasets - though those were created with qemu-img create -f raw ..., which produces sparse raw files by default.
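Both kinds of sparseness are easy to confirm on the host (a sketch; the raw image name is a placeholder):

zfs get volsize,refreservation,used nvme/testvol64k   # a sparse zvol shows refreservation=none and near-zero used
qemu-img create -f raw /nvme/testbed64k/testdisk.raw 10G
ls -lsh /nvme/testbed64k/testdisk.raw                 # allocated size (first column) is tiny vs. the 10G apparent size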

So I tested a preallocated raw file, created with dd if=/dev/zero of=file.img .... In addition, I put it on a dataset with compression=off, since ZFS would otherwise compress the zeroes to almost nothing and I wasn’t sure what that would imply in terms of sparseness. The result is a file about as un-sparse as one can get on a ZFS dataset. And still, the results look like those for the qcow2 files, i.e. bad:

Preallocated raw image on top of zfs dataset, io=native

Slightly better than a qcow2 file under the same conditions, but nowhere near what we see with threaded IO (regardless of storage model).
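For reference, the preallocation described above was roughly the following (a sketch; dataset and file names are placeholders):

zfs create -o compression=off nvme/rawtest
dd if=/dev/zero of=/nvme/rawtest/preallocated.img bs=1M count=10240 conv=fsync
ls -lsh /nvme/rawtest/preallocated.img   # allocated size should now match the 10G apparent size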

One final observation: when adding virtual disks in virt-manager, io=native is chosen as the default for zvols, whereas io=threads is the default for files on datasets. So the issue observed here seems to be known to the libvirt/virt-manager devs.

Consequences?

As I wrote in a previous post, I saw minor to medium increases in CPU usage for io=threads compared to io=native (in situations where performance was comparable). However, I’ve also come across combinations of options where the CPU load difference seemed negligible, though I haven’t investigated this systematically. There are a lot of optimizations I haven’t tried that could mitigate the drawbacks.

Conclusion: threaded IO for files, native IO for zvols

Only io=threads works reasonably well for files on datasets, regardless of whether the files are sparse or not. io=native works slightly better for zvols, sparse or not. Why this is the case is not entirely clear to me.


How is performance on a traditional filesystem like XFS or ext4? (And out of curiosity, how does btrfs do?) Also, I wonder if io_uring would show the same slowness as io=native - IIRC it is now the default in Proxmox.


Excellent questions!

file on ext4: native worked as well as threads for files, i.e. no problem
io_uring: same behaviour as native for files on datasets, i.e. same problem

I haven’t tried btrfs; I don’t know enough about it to tell whether it would be quick to test.
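(For reference, the io_uring case above uses the same driver attribute as before, just with a different value - a sketch, the domain name is a placeholder, and it needs a reasonably recent QEMU/libvirt:)

virsh edit win10
#   <driver name="qemu" type="qcow2" cache="none" io="io_uring" discard="unmap"/>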


Ah, sounds like the results are as expected, though I’m surprised io_uring doesn’t help. BTRFS is copy-on-write with no ZVOL equivalent, so I’d expect the same behaviour there: qcow2/raw acting badly with io=native.

Fascinating testing, thank you for looking into this.


OK, I tested btrfs as a control condition - I hope I did the device creation right!

Preparations and setup
root@(redacted):/dev/disk/by-id# mkfs.btrfs -d raid10 -m raid10 -f nvme-WDS100T1X0E-00AFY0_(redacted) nvme-WDS100T1X0E-00AFY0_(redacted) nvme-WDS100T1X0E-00AFY0_(redacted) nvme-WDS100T1X0E-00AFY0_(redacted)
btrfs-progs v5.16.2
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM nvme-WDS100T1X0E-00AFY0_(redacted) (931.51GiB) ...
NOTE: several default settings have changed in version 5.15, please make sure
      this does not affect your deployments:
      - DUP for metadata (-m dup)
      - enabled no-holes (-O no-holes)
      - enabled free-space-tree (-R free-space-tree)

Performing full device TRIM nvme-WDS100T1X0E-00AFY0_(redacted) (931.51GiB) ...
Performing full device TRIM nvme-WDS100T1X0E-00AFY0_(redacted) (931.51GiB) ...
Performing full device TRIM nvme-WDS100T1X0E-00AFY0_(redacted) (931.51GiB) ...
Label:              (null)
UUID:               0c784342-7400-4868-9ea1-cc006c36190f
Node size:          16384
Sector size:        4096
Filesystem size:    3.64TiB
Block group profiles:
  Data:             RAID10            2.00GiB
  Metadata:         RAID10          512.00MiB
  System:           RAID10           16.00MiB
SSD detected:       yes
Zoned device:       no
Incompat features:  extref, skinny-metadata, no-holes
Runtime features:   free-space-tree
Checksum:           crc32c
Number of devices:  4
Devices:
   ID        SIZE  PATH
    1   931.51GiB  nvme-WDS100T1X0E-00AFY0_(redacted)
    2   931.51GiB  nvme-WDS100T1X0E-00AFY0_(redacted)
    3   931.51GiB  nvme-WDS100T1X0E-00AFY0_(redacted)
    4   931.51GiB  nvme-WDS100T1X0E-00AFY0_(redacted)

root@(redacted):/mnt# mount /dev/disk/by-uuid/0c784342-7400-4868-9ea1-cc006c36190f /mnt
root@(redacted):/mnt# dd if=/dev/zero of=non-sparse.img bs=1M count=4096
root@(redacted):/mnt# qemu-img create -f qcow2 -o preallocation=metadata,lazy_refcounts=on testdisk.qcow2 10G
Formatting 'testdisk.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off preallocation=metadata compression_type=zlib size=10737418240 lazy_refcounts=on refcount_bits=16
root@(redacted):/mnt# qemu-img create -f qcow2 -o preallocation=metadata,lazy_refcounts=on,extended_l2=on,cluster_size=128k  testdisk_el2.qcow2 10G
Formatting 'testdisk_el2.qcow2', fmt=qcow2 cluster_size=131072 extended_l2=on preallocation=metadata compression_type=zlib size=10737418240 lazy_refcounts=on refcount_bits=16

root@(redacted):/mnt# ls -lsh
total 4.1G
4.0G -rw-r--r-- 1 libvirt-qemu kvm  4.0G Oct  4 21:44 non-sparse.img
1.8M -rw-r--r-- 1 libvirt-qemu kvm   11G Oct  4 21:44 testdisk_el2.qcow2
1.8M -rw-r--r-- 1 libvirt-qemu kvm   11G Oct  4 21:44 testdisk.qcow2

Is this a correct setup for a raid10 volume? “Filesystem size: 3.64TiB” is not what I expected for 4x 1TB drives in raid10. Either way, the raid topology should not affect the issue we are investigating!

I tried three types of backing files: preallocated raw (with dd), qcow2 with 64k clusters, and qcow2 with 128k clusters and 4k subclusters (extended_l2=on).

Results

files on btrfs, io=threads

files on btrfs, io=native

Interestingly, read speeds are lower with io=native, while write speeds are generally the same or faster. The baseline speed (for io=threads) is also generally lower at 1M blocksize compared to ZFS. Looking only at 1M reads, it seems like the same problem appears with btrfs, but the other tests don’t show it. I don’t know what to conclude; it is harder to say when the baseline speeds are so much lower. (Why are they lower? Did I in fact fail to create a raid10-style volume?)

For 4k it’s the opposite: btrfs seems faster than ZFS. The 4k advantage for btrfs could be because my btrfs filesystem ended up with a 4k “Sector size” by default, whereas my ZFS setups always had at least a 64k recordsize (I’d rather not go lower, as that would hurt compression). There are surely options to play around with here.


Thanks for taking the time to run the tests.

I’m not too surprised. IIRC the RAID1 mode on BTRFS can’t read from multiple drives to increase read speeds unless multiple processes are involved. I’ve also found it a little lacking in performance when multiple writers are involved.

It does have some uses though. Its RAID1 mode is really two-copy, so you can use it with multiple drives for unRAID-like storage. It also has reflink support, which many programs can take advantage of (though XFS now has it, and ZFS will eventually too).

BTRFS has fewer tunables than ZFS in general. The only relevant one I can think of here is turning off CoW for a file, but that kind of defeats the purpose of using BTRFS.


Right - sounds very useful for testing the impact of COW while keeping the rest constant, though. Will definitely try this and report back :slight_smile:

Edit: I tried mounting with the nodatacow option and created new backing files. There were some performance gains for writes on standard qcow2 images (without extended_l2), but no other interesting differences across the image types. However, I also found that the speeds I reported above are not super stable: when reverting to a CoW-enabled backing file I got up to 30% faster speeds than before under some conditions.
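For reference, the nodatacow run was roughly the following; nodatacow only applies to files created after the option is in effect, hence the new backing files (a sketch; paths other than the UUID are placeholders):

umount /mnt
mount -o nodatacow /dev/disk/by-uuid/0c784342-7400-4868-9ea1-cc006c36190f /mnt
qemu-img create -f qcow2 -o preallocation=metadata,lazy_refcounts=on /mnt/testdisk_nocow.qcow2 10G
# per-file alternative: chattr +C on an empty file before writing any data to it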

Here are the data from re-running the btrfs tests from a couple of days ago, with the fs mounted with the nodatacow option and new backing files created after that remount. BUT I would take them with a grain of salt because of the above-mentioned variability and my general inexperience with btrfs (did I get all the options right? etc.).

files on btrfs with nodatacow option, io=threads

files on btrfs with nodatacow option, io=native

(any patterns seen above need independent confirmation from other testers)

As I’m unlikely to try btrfs in production anytime this decade, I won’t go further into this rabbit hole. Now back to recreating the zpool and investigating blocksizes (which I think will become another thread).


Any idea how performance would compare when running a Windows guest from a bare-metal install on an NVMe? That is, you can import a Windows install that was used on bare metal into QEMU by selecting the NVMe device. I wonder how that compares to the qcow2 guest image performance on ZFS.

I’ll be using ZFS either way, but the native Windows install has the benefit that I can boot directly into it if something about its behavior as a guest isn’t good enough. Of course, the Windows guest as a qcow2 has the benefit of better snapshotting and of having the image on the host.

I would expect very near bare-metal performance in the case you describe. I’ve never tried it with an NVMe device, but I’ve done what you describe with SATA SSDs in the past, with no discernible difference in performance compared to bare metal. (@Wendell coined a term for using a device for both virtual and bare-metal boot: “duel booting”. There is at least one thread or article about it from some years back.)

With an NVMe you would technically have two options: either pass it as a block device (/dev/*) or with PCIe passthrough. I don’t know which method would be best.

The reason I don’t pass raw devices anymore is precisely the one you mention: convenience with snapshotting and backups using ZFS.


There seems to be a problem; your results should be better. I mean, you have four Gen4 NVMe drives, regardless of whether they’re in RAID10 or a stripe.

That’s a zvol on a single PM9A3 - sure, I disabled sync, so my writes are a bit inflated…

## zvol conf

<disk type="block" device="disk">
  <driver name="qemu" type="raw" cache="none" io="native" discard="unmap"/>
  <source dev="/dev/zvol/tank02/win11" index="2"/>
  <backingStore/>
  <target dev="sdf" bus="scsi"/>
# zfs get all tank02/win11
NAME          PROPERTY              VALUE                    SOURCE
tank02/win11  type                  volume                   -
tank02/win11  creation              Di Aug  8 16:45 2023     -
tank02/win11  used                  142G                     -
tank02/win11  available             4.56T                    -
tank02/win11  referenced            65.4G                    -
tank02/win11  compressratio         1.04x                    -
tank02/win11  reservation           none                     default
tank02/win11  volsize               120G                     local
tank02/win11  volblocksize          8K                       -
tank02/win11  checksum              on                       default
tank02/win11  compression           lz4                      inherited from tank02
tank02/win11  readonly              off                      default
tank02/win11  createtxg             472                      -
tank02/win11  copies                1                        default
tank02/win11  refreservation        none                     default
tank02/win11  guid                  3178970186585371231      -
tank02/win11  primarycache          all                      default
tank02/win11  secondarycache        all                      default
tank02/win11  usedbysnapshots       76.3G                    -
tank02/win11  usedbydataset         65.4G                    -
tank02/win11  usedbychildren        0B                       -
tank02/win11  usedbyrefreservation  0B                       -
tank02/win11  logbias               latency                  default
tank02/win11  objsetid              3080                     -
tank02/win11  dedup                 off                      default
tank02/win11  mlslabel              none                     default
tank02/win11  sync                  disabled                 inherited from tank02
tank02/win11  refcompressratio      1.07x                    -
tank02/win11  written               15.7G                    -
tank02/win11  logicalused           147G                     -
tank02/win11  logicalreferenced     69.9G                    -
tank02/win11  volmode               default                  default
tank02/win11  snapshot_limit        none                     default
tank02/win11  snapshot_count        none                     default
tank02/win11  snapdev               hidden                   default
tank02/win11  context               none                     default
tank02/win11  fscontext             none                     default
tank02/win11  defcontext            none                     default
tank02/win11  rootcontext           none                     default
tank02/win11  redundant_metadata    all                      default
tank02/win11  encryption            off                      default
tank02/win11  keylocation           none                     default
tank02/win11  keyformat             none                     default
tank02/win11  pbkdf2iters           0                        default
tank02/win11  snapshots_changed     Sa Dez 16  3:42:46 2023  -

win11_default-io-native

Thanks, I did find some threads referencing “duel” booting:

Though I’m still unsure whether having Windows on its own device (i.e. NVMe), separate from the host NVMe, is required for this - or whether you can just pass through/add a partition rather than a whole device (so the host and guest share an NVMe device).

I agree the ZFS way is simplest, but unfortunately the rough edges and reliability issues of VFIO still make me want to boot into Windows natively sometimes - driver quality / performance / etc. for content creation.