ZFS Performance

I am experimenting with ZFS for an NFS file server, following ewwhite’s HA ZFS guide, and I have run into some problems. The performance of the pool was considerably lower than I was expecting, so I have been working backwards to isolate the problem. First I tried changing the multipath settings and then disabling multipath entirely, but neither made a noticeable difference.
Now I am testing different RAID configurations in ZFS, and I have found that no matter what layout I use, the performance is always considerably lower than the speed of a single drive. I have mostly created the zpool with these options: -o ashift=12 -o autoexpand=on -o autoreplace=on -o cachefile=none, but I have also tried varying the ashift and creating the pool with no options at all.
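For reference, the pools I have been testing were created roughly along these lines (the pool name and device paths here are placeholders rather than my actual ones):

# 4-disk raidz1 with the options listed above (names/paths are examples only)
zpool create -o ashift=12 -o autoexpand=on -o autoreplace=on -o cachefile=none \
    tank raidz1 /dev/mapper/mpatha /dev/mapper/mpathb /dev/mapper/mpathc /dev/mapper/mpathd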

These numbers are from fio with the command below. I have heard that benchmarking ZFS is a bit odd, so I am not sure whether this test is even a fair measure for ZFS.

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --bs=4k --iodepth=64 --readwrite=randrw --rwmixread=75 --size=4G --filename=
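One thing I have seen suggested, though the numbers below do not use it, is to benchmark against a dedicated scratch dataset with compression off and the ARC limited to caching metadata, so the results reflect the disks rather than RAM. Roughly like this (pool and dataset names are examples only):

# scratch dataset for benchmarking: no compression, ARC caches metadata only
zfs create -o compression=off -o primarycache=metadata tank/fiotest
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test \
    --bs=4k --iodepth=64 --readwrite=randrw --rwmixread=75 --size=4G \
    --filename=/tank/fiotest/testfile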

Single Drive XFS

   READ: bw=307MiB/s (322MB/s), 307MiB/s-307MiB/s (322MB/s-322MB/s), io=3070MiB (3219MB), run=9995-9995msec
  WRITE: bw=103MiB/s (108MB/s), 103MiB/s-103MiB/s (108MB/s-108MB/s), io=1026MiB (1076MB), run=9995-9995msec

Single Drive ZFS

   READ: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=3070MiB (3219MB), run=27616-27616msec
  WRITE: bw=37.2MiB/s (38.0MB/s), 37.2MiB/s-37.2MiB/s (38.0MB/s-38.0MB/s), io=1026MiB (1076MB), run=27616-27616msec

Stripe ZFS

   READ: bw=92.8MiB/s (97.3MB/s), 92.8MiB/s-92.8MiB/s (97.3MB/s-97.3MB/s), io=3070MiB (3219MB), run=33089-33089msec
  WRITE: bw=31.0MiB/s (32.5MB/s), 31.0MiB/s-31.0MiB/s (32.5MB/s-32.5MB/s), io=1026MiB (1076MB), run=33089-33089msec

Raidz 4 Disks

   READ: bw=91.1MiB/s (95.5MB/s), 91.1MiB/s-91.1MiB/s (95.5MB/s-95.5MB/s), io=3070MiB (3219MB), run=33711-33711msec
  WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=1026MiB (1076MB), run=33711-33711msec

Specifications:
Dual Intel Xeon E5-2667 v3
256GB RAM
LSI SAS3008
Dell MD1420 Enclosure
8x Samsung PM1633a SAS SSDs

Any help would be appreciated.

That’s odd for an all-flash pool.
I mean, I have sync disabled, but I have normal disks and you have NVMe.

Specifications:
E5-2660 v3
256GB RAM
LSI SAS3008
10x TOSHIBA_MG07ACA12TE and TOSHIBA_MG08ACA16TE
Rotation Rate: 7200 RPM

Metadata (special vdev)
RAID 10, 4x 1TB ADATA SX8200PNP

Raidz 10 HDDs

   READ: bw=471MiB/s (494MB/s), 471MiB/s-471MiB/s (494MB/s-494MB/s), io=3070MiB (3219MB), run=6523-6523msec
  WRITE: bw=157MiB/s (165MB/s), 157MiB/s-157MiB/s (165MB/s-165MB/s), io=1026MiB (1076MB), run=6523-6523msec

root@truenas[/mnt/store01]# zfs get all store01/bench
NAME           PROPERTY               VALUE                  SOURCE
store01/bench  type                   filesystem             -
store01/bench  creation               Thu Jul 27  3:43 2023  -
store01/bench  used                   40.7G                  -
store01/bench  available              44.3T                  -
store01/bench  referenced             40.7G                  -
store01/bench  compressratio          1.00x                  -
store01/bench  mounted                yes                    -
store01/bench  quota                  none                   default
store01/bench  reservation            none                   default
store01/bench  recordsize             512K                   inherited from store01
store01/bench  mountpoint             /mnt/store01/bench     default
store01/bench  sharenfs               off                    default
store01/bench  checksum               on                     default
store01/bench  compression            lz4                    inherited from store01
store01/bench  atime                  off                    inherited from store01
store01/bench  devices                on                     default
store01/bench  exec                   on                     inherited from store01
store01/bench  setuid                 on                     default
store01/bench  readonly               off                    inherited from store01
store01/bench  zoned                  off                    default
store01/bench  snapdir                hidden                 inherited from store01
store01/bench  aclmode                discard                inherited from store01
store01/bench  aclinherit             discard                local
store01/bench  createtxg              24423773               -
store01/bench  canmount               on                     default
store01/bench  xattr                  sa                     local
store01/bench  copies                 1                      local
store01/bench  version                5                      -
store01/bench  utf8only               off                    -
store01/bench  normalization          none                   -
store01/bench  casesensitivity        sensitive              -
store01/bench  vscan                  off                    default
store01/bench  nbmand                 off                    default
store01/bench  sharesmb               off                    default
store01/bench  refquota               none                   default
store01/bench  refreservation         none                   default
store01/bench  guid                   6499519791997436066    -
store01/bench  primarycache           all                    default
store01/bench  secondarycache         all                    default
store01/bench  usedbysnapshots        0B                     -
store01/bench  usedbydataset          40.7G                  -
store01/bench  usedbychildren         0B                     -
store01/bench  usedbyrefreservation   0B                     -
store01/bench  logbias                latency                default
store01/bench  objsetid               33352                  -
store01/bench  dedup                  off                    inherited from store01
store01/bench  mlslabel               none                   default
store01/bench  sync                   disabled               inherited from store01
store01/bench  dnodesize              legacy                 default
store01/bench  refcompressratio       1.00x                  -
store01/bench  written                40.7G                  -
store01/bench  logicalused            40.7G                  -
store01/bench  logicalreferenced      40.7G                  -
store01/bench  volmode                default                default
store01/bench  filesystem_limit       none                   default
store01/bench  snapshot_limit         none                   default
store01/bench  filesystem_count       none                   default
store01/bench  snapshot_count         none                   default
store01/bench  snapdev                hidden                 default
store01/bench  acltype                posix                  local
store01/bench  context                none                   default
store01/bench  fscontext              none                   default
store01/bench  defcontext             none                   default
store01/bench  rootcontext            none                   default
store01/bench  relatime               off                    default
store01/bench  redundant_metadata     all                    default
store01/bench  overlay                on                     default
store01/bench  encryption             off                    default
store01/bench  keylocation            none                   default
store01/bench  keyformat              none                   default
store01/bench  pbkdf2iters            0                      default
store01/bench  special_small_blocks   128K                   inherited from store01

ZFS Stripe 2x NVME

   READ: bw=375MiB/s (394MB/s), 375MiB/s-375MiB/s (394MB/s-394MB/s), io=3070MiB (3219MB), run=8180-8180msec
  WRITE: bw=125MiB/s (132MB/s), 125MiB/s-125MiB/s (132MB/s-132MB/s), io=1026MiB (1076MB), run=8180-8180msec

That is quite the difference, and it reinforces my hunch that something is wrong with my particular setup. I tried the benchmark with sync disabled and it made no difference. These are just 12Gb SAS drives rather than NVMe, but I don’t think performance should be nearly this bad.
I have also tried two 4-disk raidz vdevs and I still get the same ~90 MiB/s read / ~30 MiB/s write as every other ZFS configuration I’ve tried.

Have you changed the recordsize of the datasets you are testing from the default?

That did it! I had mostly been testing with the default size. I tried a couple of other options like 1M and 512k, which made performance worse, but changing it to 4k gives me about the same performance as the single disk. I’ll have to look into tuning for specific workloads when I get further into testing.
Thanks for the help!

The config of each dataset really should reflect the needs of the data that’s in it. 1M is generally faster for sequential work, but the parameters in your tests were heavily random, so it could/should* perform better with 32k or 64k. The OpenZFS default of 128k is there because it’s the best all-round balance, but there is no hard and fast rule.
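Since recordsize is a per-dataset property, one approach is a dedicated dataset per workload, roughly like this (the dataset names here are only examples):

# small records for random 4k-style I/O, large records for sequential media
zfs create -o recordsize=16K store01/databases
zfs create -o recordsize=1M store01/media
# an existing dataset can be changed too; only newly written blocks use the new size
zfs set recordsize=4K store01/bench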

I think your test and pool are fine.
3x 12TB striped ZFS pool:

   READ: bw=1156KiB/s (1183kB/s), 1156KiB/s-1156KiB/s (1183kB/s-1183kB/s), io=44.4MiB (46.5MB), run=39337-39337msec
  WRITE: bw=387KiB/s (397kB/s), 387KiB/s-387KiB/s (397kB/s-397kB/s), io=14.9MiB (15.6MB), run=39337-39337msec

I canceled the test; your pool looks fine.

Even SSDs would “break”:

Run status group 0 (all jobs):
   READ: bw=469MiB/s (492MB/s), 469MiB/s-469MiB/s (492MB/s-492MB/s), io=3070MiB (3219MB), run=6545-6545msec
  WRITE: bw=157MiB/s (164MB/s), 157MiB/s-157MiB/s (164MB/s-164MB/s), io=1026MiB (1076MB), run=6545-6545msec

Disk stats (read/write):
  nvme0n1: ios=783736/262229, merge=0/240, ticks=257699/38057, in_queue=296030, util=98.57%

Only Optane performs better.

So, first things first: you can’t have high availability with a single server. If your server dies, nothing takes over, and that is not highly available, yet that guide is specifically for HA. So why follow an HA guide?

Check the ashift with zdb -C; if it is 12, it’s fine. You can’t change it once it has been set, not even by running commands.

Here’s a quote: “The ashift property is per-vdev (not per pool, as is commonly and mistakenly thought!) and is immutable, once set.” It comes from this article.
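For example, something like this should show it (the pool name is a placeholder; since you created the pool with cachefile=none, you may need zdb -e instead of reading the cache file):

# print the pool configuration and pull out the per-vdev ashift values
zdb -C poolname | grep ashift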

What is the OS?

Do these disks have data on them? Can you destroy the pool and create a new one?

If you can destroy the pool, recreate it with two mirror vdevs of two disks each. Also, before that, can you run a sequential test? The test here is random reads and writes, which will be slower anyway. After you run the sequential test, run a zfs send | zfs receive, send some movie files, and look at the speed and how long it takes.

You might watch the reads and writes with zpool iostat 1, but if you have lots of RAM and you send a 1GB file, you may not see any write activity on the disks at all, because it can all be buffered in RAM first due to the optimization ZFS does with RAM. Those are asynchronous writes. For your NFS use case you do need synchronous writes, so an Optane SLOG could help if you had HDDs, but you have SSDs, so it might not help that much.
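Something along these lines, where the pool name, device paths, and dataset/snapshot names are all placeholders:

# two mirror vdevs of two disks each
zpool create -o ashift=12 tank \
    mirror /dev/disk/by-id/diskA /dev/disk/by-id/diskB \
    mirror /dev/disk/by-id/diskC /dev/disk/by-id/diskD

# sequential 1M mixed read/write instead of 4k random
fio --name=seqtest --ioengine=libaio --direct=1 --bs=1M --iodepth=32 \
    --readwrite=rw --rwmixread=75 --size=4G --filename=/tank/testfile

# time a send/receive of a dataset holding some large files, watching zpool iostat 1
zfs snapshot tank/media@bench
zfs send tank/media@bench | zfs receive tank/media_copy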