Help: Tuning ZFS Write Bottleneck

I’ve been banging my head on this for a while (2 months), and I can’t figure it out.

All my clients are NFS, but I don’t think it’s a write sync issue. I notice I get really decent performance on smaller files. The instant I hit something larger than 4GB, performance falls off a cliff.


Using FIO locally on the server I can also force the system into a bad state using a 4GB test file, where it has so much work queued it blocks until everything clears. A 1GB test file will complete in seconds. A 4GB file takes hours.

This is a pretty big NAS server: two RAIDZ2 vdevs (10-wide and 8-wide) in a NetApp DS4246 shelf connected via an LSI 9207-8e HBA. I was messing around with a SLOG, but it didn’t do anything to improve performance and only added more variables to test.

These Seagate IronWolf NAS drives have a theoretical max write speed of 280MB/s, and with 1Gb networking being the bottleneck, writes would top out around 125MB/s. I’m pretty happy that it averages 50MB/s, until about 60% of the way through a 6.6GB file, when it just tanks to 2 or 3 MB/s.

I’m running ZFS on Proxmox (zfs-2.1.14-pve1). I’ve currently got 16GB allocated to ARC, but that has nothing to do with write performance, and I’ve never seen arc_summary report using that much.

If anyone could give me any clues about what to look at, any advice would be appreciated. I’m guessing there is txg_sync tuning or queue tuning I’m missing. I did look into tweaking zfs_vdev_sync_write_(min|max)_active, but I couldn’t come up with a good test strategy, and was a bit worried about messing something up.
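For anyone following along, this is roughly how I’ve been reading the current values before touching anything. These are the standard OpenZFS-on-Linux module parameters; the echo line just illustrates the mechanism, the value is not a recommendation:

# read the current sync write queue limits (harmless)
grep -H . /sys/module/zfs/parameters/zfs_vdev_sync_write_min_active
grep -H . /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active
# changing one at runtime (example value only; takes effect immediately)
echo 32 > /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active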

I’ve seen Adam Leventhal’s blog on the ZFS write throttle, but I haven’t been able to fully digest it. If anyone has an easier explanation, or can outline a more straightforward process to follow, that would be awesome.

idk, but maybe you wanna paste your pool and dataset props?

also, you should mount another physical filesystem. try copying some files locally (not over nfs) from that to your pool and see if you have the same write throttling.
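something like this, with the device path and filenames as placeholders:

mkdir -p /mnt/scratch
mount /dev/sdX1 /mnt/scratch
# copy a >4GB file onto the pool and force a flush at the end
dd if=/mnt/scratch/bigfile.iso of=/star/copytest bs=1M status=progress conv=fsync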

I was using FIO to test locally for that exact purpose. I don’t think it’s NFS getting in the way. I can write a test case using sync with FIO that trashes it as well.

Other symptoms: “iowait” on the Proxmox summary graph spikes to 50%, then sits around 10% for several minutes afterward.

Here is some data if you want to squint at it. I appreciate the time. Really at my wits end.

FIO Testing

# fio --filename=/star/iostat --size=6g --runtime=5m --name=SEQ1MQ8T1 --sync=1 --bs=128k --iodepth=8 --numjobs=1 --rw=readwrite --name=SEQ1MQ1T1 --sync=1 --bs=1m --iodepth=1 --rw=readwrite --numjobs=1
SEQ1MQ8T1: (g=0): rw=rw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=8
SEQ1MQ1T1: (g=0): rw=rw, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.25
Starting 2 processes

SEQ1MQ8T1: (groupid=0, jobs=1): err= 0: pid=2617438: Sun Feb 18 22:16:27 2024
  read: IOPS=32, BW=4189KiB/s (4290kB/s)(1227MiB/300004msec)
    clat (nsec): min=8015, max=53330, avg=16833.47, stdev=5631.01
     lat (nsec): min=8065, max=54092, avg=16999.47, stdev=5757.35
    clat percentiles (nsec):
     |  1.00th=[ 9536],  5.00th=[10048], 10.00th=[10560], 20.00th=[11456],
     | 30.00th=[12224], 40.00th=[12992], 50.00th=[16512], 60.00th=[19072],
     | 70.00th=[20608], 80.00th=[21888], 90.00th=[23424], 95.00th=[25728],
     | 99.00th=[31616], 99.50th=[33536], 99.90th=[43264], 99.95th=[48896],
     | 99.99th=[53504]
   bw (  KiB/s): min=  256, max=11008, per=29.63%, avg=4267.59, stdev=1709.76, samples=588
   iops        : min=    2, max=   86, avg=33.34, stdev=13.35, samples=588
  write: IOPS=32, BW=4183KiB/s (4283kB/s)(1225MiB/300004msec); 0 zone resets
    clat (msec): min=11, max=5441, avg=30.58, stdev=57.17
     lat (msec): min=11, max=5441, avg=30.58, stdev=57.17
    clat percentiles (msec):
     |  1.00th=[   15],  5.00th=[   18], 10.00th=[   19], 20.00th=[   21],
     | 30.00th=[   24], 40.00th=[   26], 50.00th=[   27], 60.00th=[   29],
     | 70.00th=[   32], 80.00th=[   36], 90.00th=[   43], 95.00th=[   51],
     | 99.00th=[   80], 99.50th=[  106], 99.90th=[  215], 99.95th=[  321],
     | 99.99th=[ 5470]
   bw (  KiB/s): min=  256, max= 5632, per=28.45%, avg=4252.76, stdev=873.66, samples=589
   iops        : min=    2, max=   44, avg=33.22, stdev= 6.82, samples=589
  lat (usec)   : 10=2.18%, 20=30.54%, 50=17.31%, 100=0.02%
  lat (msec)   : 20=8.08%, 50=39.22%, 100=2.39%, 250=0.22%, 500=0.04%
  lat (msec)   : >=2000=0.01%
  cpu          : usr=0.02%, sys=0.36%, ctx=30307, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=9819,9803,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
SEQ1MQ1T1: (groupid=0, jobs=1): err= 0: pid=2617439: Sun Feb 18 22:16:27 2024
  read: IOPS=20, BW=20.5MiB/s (21.5MB/s)(2991MiB/145578msec)
    clat (usec): min=68, max=234, avg=107.80, stdev=22.81
     lat (usec): min=68, max=235, avg=107.98, stdev=22.95
    clat percentiles (usec):
     |  1.00th=[   74],  5.00th=[   77], 10.00th=[   80], 20.00th=[   85],
     | 30.00th=[   88], 40.00th=[   93], 50.00th=[  115], 60.00th=[  122],
     | 70.00th=[  126], 80.00th=[  130], 90.00th=[  135], 95.00th=[  141],
     | 99.00th=[  155], 99.50th=[  161], 99.90th=[  194], 99.95th=[  198],
     | 99.99th=[  235]
   bw (  KiB/s): min= 2048, max=55296, per=100.00%, avg=22034.42, stdev=9818.67, samples=278
   iops        : min=    2, max=   54, avg=21.52, stdev= 9.59, samples=278
  write: IOPS=21, BW=21.7MiB/s (22.7MB/s)(3153MiB/145578msec); 0 zone resets
    clat (msec): min=22, max=5441, avg=46.05, stdev=98.78
     lat (msec): min=22, max=5441, avg=46.07, stdev=98.78
    clat percentiles (msec):
     |  1.00th=[   26],  5.00th=[   28], 10.00th=[   30], 20.00th=[   33],
     | 30.00th=[   36], 40.00th=[   39], 50.00th=[   41], 60.00th=[   43],
     | 70.00th=[   46], 80.00th=[   52], 90.00th=[   60], 95.00th=[   71],
     | 99.00th=[  111], 99.50th=[  199], 99.90th=[  359], 99.95th=[  481],
     | 99.99th=[ 5470]
   bw (  KiB/s): min= 2048, max=30720, per=100.00%, avg=22972.58, stdev=4569.25, samples=281
   iops        : min=    2, max=   30, avg=22.43, stdev= 4.46, samples=281
  lat (usec)   : 100=23.01%, 250=25.67%
  lat (msec)   : 50=39.93%, 100=10.71%, 250=0.55%, 500=0.11%, >=2000=0.02%
  cpu          : usr=0.04%, sys=0.86%, ctx=32281, majf=0, minf=15
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2991,3153,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=14.1MiB/s (14.7MB/s), 4189KiB/s-20.5MiB/s (4290kB/s-21.5MB/s), io=4218MiB (4423MB), run=145578-300004msec
  WRITE: bw=14.6MiB/s (15.3MB/s), 4183KiB/s-21.7MiB/s (4283kB/s-22.7MB/s), io=4378MiB (4591MB), run=145578-300004msec
# zpool iostat -qly 10
              capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim   syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read   trimq_write
pool        alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
star        73.5T   145T    169  1.90K  2.29M  38.9M   17ms    2ms   11ms    2ms  149us  642ns   17ms      -      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    149  1.90K  2.71M  38.4M   18ms    2ms   13ms    2ms  256us  660ns   10ms  144us      -      -      0      1      0      3      0      0      0      0      0      0      0      0
star        73.5T   145T    111  1.88K   894K  38.5M   11ms    2ms   11ms    2ms  837ns  634ns  463ns      -      -      -      0      1      0      1      0      0      0      0      0      0      0      0
star        73.5T   145T    148  1.88K  1.81M  38.2M   14ms    2ms   12ms    2ms  807ns  635ns    4ms      -      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    144  1.90K  2.86M  38.8M   19ms    2ms   13ms    2ms  516us  648ns   15ms      -      -      -      0      1      0      1      0      0      0      0      0      0      0      0
star        73.5T   145T    150  1.92K  2.70M  39.3M   26ms    2ms   12ms    2ms  815us  644ns   39ms      -      -      -      0      1      0      0      0      3      0      0      0      0      0      0
star        73.5T   145T    122  1.91K  1.72M  39.2M   16ms    2ms   13ms    2ms  807ns  633ns   13ms      -      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    134  1.82K  1.84M  37.3M   17ms    2ms   12ms    2ms  769ns  643ns   15ms      -      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    113  1.85K  1014K  37.9M   12ms    2ms   12ms    2ms  793ns  628ns  485us      -      -      -      0      1      0      3      0      0      0      0      0      0      0      0
star        73.5T   145T    147  1.79K  2.52M  36.6M   19ms    2ms   12ms    2ms  822ns  630ns   16ms      -      -      -      0      1      0     10      0      0      0      0      0      0      0      0
star        73.5T   145T    167  14.7K  3.67M   119M   21ms    5ms   12ms    2ms  126us    8us   16ms    2ms      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    148  1.81K  2.03M  37.0M   17ms    2ms   12ms    2ms   13us  642ns   18ms      -      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    152  1.67K  1.71M  34.2M   13ms    2ms   11ms    2ms  824ns  643ns    4ms      -      -      -      0      4      0     11      0      0      0      0      0      0      0      0
star        73.5T   145T    122  1.27K  1.12M  26.0M   12ms    1ms   11ms    1ms  805ns  647ns    6ms  661ns      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    149    413  2.42M  6.85M   17ms  463us   11ms  460us   25us    2us   17ms      -      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    141    416  1.83M  6.79M   19ms  552us   12ms  551us   92us  531ns   30ms      -      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    205    423  3.73M  6.99M   14ms  498us   10ms  497us   65us  521ns   10ms  490ns      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    157    430  2.92M  7.12M   17ms  434us   12ms  433us  868ns  476ns   16ms      -      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    178    415  3.44M  6.88M   16ms  429us   11ms  427us   54us  578ns   11ms      -      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    129    433  1.00M  7.15M   10ms  422us   10ms  421us  832ns  497ns  158us  394ns      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    165    425  2.23M  6.63M   16ms  540us   11ms  539us   96us  505ns   13ms      -      -      -      0      6      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    293    435  3.34M  6.60M   11ms  617us   10ms  616us  695ns  559ns    4ms      -      -      -      0      1      0      0      0      0      0      0      0      0      0      0
star        73.5T   145T    191    455  4.28M  6.83M   18ms  585us   11ms  584us   77us  492ns   14ms  554ns      -      -      0      1      0      0      0      0      0      0      0      0      0      0

Settings

# zpool get all
NAME  PROPERTY                       VALUE                          SOURCE
star  size                           218T                           -
star  capacity                       33%                            -
star  altroot                        -                              default
star  health                         ONLINE                         -
star  guid                           14090505498423290915           -
star  version                        -                              default
star  bootfs                         -                              default
star  delegation                     on                             default
star  autoreplace                    off                            default
star  cachefile                      -                              default
star  failmode                       wait                           default
star  listsnapshots                  off                            default
star  autoexpand                     off                            default
star  dedupratio                     1.54x                          -
star  free                           145T                           -
star  allocated                      73.5T                          -
star  readonly                       off                            -
star  ashift                         12                             local
star  comment                        -                              default
star  expandsize                     -                              -
star  freeing                        0                              -
star  fragmentation                  24%                            -
star  leaked                         0                              -
star  multihost                      off                            default
star  checkpoint                     -                              -
star  load_guid                      7069352757609303911            -
star  autotrim                       off                            default
star  compatibility                  off                            default
star  feature@async_destroy          enabled                        local
star  feature@empty_bpobj            active                         local
star  feature@lz4_compress           active                         local
star  feature@multi_vdev_crash_dump  enabled                        local
star  feature@spacemap_histogram     active                         local
star  feature@enabled_txg            active                         local
star  feature@hole_birth             active                         local
star  feature@extensible_dataset     active                         local
star  feature@embedded_data          active                         local
star  feature@bookmarks              enabled                        local
star  feature@filesystem_limits      enabled                        local
star  feature@large_blocks           active                         local
star  feature@large_dnode            enabled                        local
star  feature@sha512                 enabled                        local
star  feature@skein                  enabled                        local
star  feature@edonr                  enabled                        local
star  feature@userobj_accounting     active                         local
star  feature@encryption             enabled                        local
star  feature@project_quota          active                         local
star  feature@device_removal         enabled                        local
star  feature@obsolete_counts        enabled                        local
star  feature@zpool_checkpoint       enabled                        local
star  feature@spacemap_v2            active                         local
star  feature@allocation_classes     enabled                        local
star  feature@resilver_defer         enabled                        local
star  feature@bookmark_v2            enabled                        local
star  feature@redaction_bookmarks    enabled                        local
star  feature@redacted_datasets      enabled                        local
star  feature@bookmark_written       enabled                        local
star  feature@log_spacemap           active                         local
star  feature@livelist               enabled                        local
star  feature@device_rebuild         enabled                        local
star  feature@zstd_compress          enabled                        local
star  feature@draid                  enabled                        local
# arc_summary | grep active
        zfs_vdev_async_read_max_active                                 3
        zfs_vdev_async_read_min_active                                 1
        zfs_vdev_async_write_active_max_dirty_percent                 60
        zfs_vdev_async_write_active_min_dirty_percent                 30
        zfs_vdev_async_write_max_active                               10
        zfs_vdev_async_write_min_active                                2
        zfs_vdev_initializing_max_active                               1
        zfs_vdev_initializing_min_active                               1
        zfs_vdev_max_active                                         1000
        zfs_vdev_rebuild_max_active                                    3
        zfs_vdev_rebuild_min_active                                    1
        zfs_vdev_removal_max_active                                    2
        zfs_vdev_removal_min_active                                    1
        zfs_vdev_scrub_max_active                                      3
        zfs_vdev_scrub_min_active                                      1
        zfs_vdev_sync_read_max_active                                 10
        zfs_vdev_sync_read_min_active                                 10
        zfs_vdev_sync_write_max_active                                10
        zfs_vdev_sync_write_min_active                                10
        zfs_vdev_trim_max_active                                       2
        zfs_vdev_trim_min_active                                       1

fio test

oh right. missed that somehow.

if you can tolerate the downtime, can you baremetal boot some other distro like fedora, install zfs on it real quick, import the pool, and run the fio test again?

edit: or actually how about truenas core or scale, probably even faster to get going with for a quick test

yeah, that is a bad problem for your beautiful 218TB pool. i’d be pissed if i were you :frowning:

Sorry, that’s not really an option. This array serves a lot of media, and I unfortunately have some network-critical LXC containers like Pi-hole running on Proxmox.

In the future I want to separate the storage from the server cluster: maybe buy a few cheap OptiPlex systems to create a real HA Proxmox cluster, and move the NAS side over to TrueNAS. I just haven’t gotten around to it yet.

I would be surprised if there is anything distro-specific going on. ZFS on Linux is generally OS agnostic. I suppose different kernel versions could be a factor, but Proxmox being downstream of Debian, the kernel seems pretty fresh at 5.15.131.

yeah, i was just doing caveman troubleshooting since i have no idea what could be causing this

if you get desperate enough, it could be worth a shot with a fresh kernel, newer openzfs (2.2.2), or maybe an already-tuned system like truenas

Hey I appreciate the effort. At the very least talking about it may generate new ideas.

I at least have a consistent test methodology. I think I’m going to try to tune min and max active for sync writes.

Theoretically, if the sync write queue is hitting its max, it will block or deprioritize other IO, and sync reads could end up starving writes in that scenario. I’ll need to see if the iostat output reflects that.
https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html

This seems like a good blog with testing info as well.
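The rough loop I’m planning, so I can actually watch the per-class queues while a big fio job runs (intervals are arbitrary):

# terminal 1: the 6GB fio job from above
# terminal 2: queue depths per I/O class (pend/activ) every 5 seconds
zpool iostat -q star 5
# terminal 3: latency histograms per I/O class
zpool iostat -w star 60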

I could try to throw more RAM at ARC. The problem with that is it doesn’t speed up writes.

way more than 16GB would be nice for a pool that size depending on your workload ofc, but i am pretty skeptical that this could be the cause of your write throttle problem
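if you do end up bumping it, on proxmox it’s just the zfs_arc_max module parameter. rough sketch, 32GiB here is only an example number:

# takes effect right away, no reboot
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max
# make it stick across reboots
echo "options zfs zfs_arc_max=34359738368" >> /etc/modprobe.d/zfs.conf
update-initramfs -u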

did you try creating a new dataset with sync=disabled? try a write test on that?
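something along these lines; the dataset name is just an example, and you can destroy it afterwards:

zfs create -o sync=disabled star/synctest
# rerun the same fio job as above, pointed at /star/synctest/testfile
zfs destroy star/synctest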

I have not. I suppose I can try that. That will help confirm that it’s sync writes though, which I feel like I already know.

I’m not really comfortable with leaving sync disabled permanently either. I care about my data enough to avoid that.

if it is a sync write only problem that you can’t tune your way out of, you could always try adding an ssd and using that for sync log. maybe it could help.
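rough sketch, the device path is a placeholder (ideally an ssd with power loss protection):

zpool add star log /dev/disk/by-id/ata-EXAMPLE-SSD
# or mirrored: zpool add star log mirror <ssd1> <ssd2>
# a slog can be removed again later with: zpool remove star <log-device>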

so far, i am running wild with sync=disabled on all my datasets not used for anything transactional.

I see a dedupratio of 1.54x in your pool properties. You are running with deduplication turned on?


With the information posted so far, I would point at “running out of memory” as the most likely cause here. Can you post an arc_summary while mostly idle, while copying a 1GB file (that’s fast), and while copying a 4GB file (that’s slow), along with vmstat output for each one?
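For example, something like this for each of the three states (filenames are just suggestions):

arc_summary > arc_idle.txt
vmstat 5 12 > vmstat_idle.txt
# ...then the same two commands again during the 1GB copy, and again during the 4GB copy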

And I’ll point out that this kind of ridiculous slowdown can be caused by dedupe memory tables and running out of RAM. Dedupe will eat memory.


Doesn’t nfs mount async by default? Also, why are your vdevs different sizes? If you added one later you won’t necessarily be using both vdevs until they get closer in usage.

Even a small pool should be plenty to saturate 1G with a sequential write. I’m using TrueNAS, but I can’t imagine they did any special tuning just for 1G.

First, verify you don’t run out of memory.

What is the output of the command below? Run it before and after the fio test.

cat /proc/spl/kstat/zfs/dmu_tx

Thanks for the feedback, everyone. It is likely an issue of running out of write cache.

To answer some questions:

  • Yes, I have dedupe on, and no, it hasn’t really been a problem for me. I definitely require it due to some of my users.
  • No, NFS clients are synchronous by default, but the client can request sync or async. I believe SMB is async. I’ve explicitly used NFS because I was having so many issues with Samba, and it did help quite a bit in performance, maybe because the ZFS I/O scheduler treats sync writes with higher priority.

I was watching arcstat while some large writes were going on. You can watch the memory usage climb, and the metadata misses climb with it.

I think I’ll give it all the RAM I can afford at the moment, and consider whether I want to create a special metadata device. I know Wendell had a post/howto on that a while back.
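If I do go the special vdev route, it would be something like the below; the device names are placeholders, and the special vdev needs to be mirrored since losing it loses the pool:

zpool add star special mirror /dev/disk/by-id/nvme-SSD1 /dev/disk/by-id/nvme-SSD2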

Thanks for all the help!

Arcstat

    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
22:13:28   138     1      1     0    0     1  100     0    0   12G   16G    33G
22:13:38   144     0      0     0    0     0    0     0    0   12G   16G    33G
22:13:48   148     1      1     0    0     1  100     0    0   12G   16G    33G
22:13:58   137     0      0     0    0     0    0     0    0   12G   16G    33G
22:14:08   143     1      1     0    0     1  100     0    0   12G   16G    33G
22:14:18   145     0      0     0    0     0    0     0    0   12G   16G    33G
22:14:28   190     3      2     1    1     1  100     0    0   12G   16G    33G
22:14:38   165     1      1     1    1     0    0     0    0   12G   16G    33G
22:14:48   151     1      1     0    0     1  100     0    0   12G   16G    33G
22:14:58   148     0      0     0    0     0    0     0    0   12G   16G    33G
22:15:08   333     2      1     1    1     1  100     1    1   12G   16G    33G
22:15:18   140     1      1     1    1     0    0     1    1   12G   16G    33G
22:15:28   149     0      0     0    0     0    0     0    0   12G   16G    33G
22:15:38   145     1      1     0    0     1  100     0    0   12G   16G    33G
22:15:48   148     1      1     0    0     1  100     0    0   12G   16G    33G
22:15:58   145     0      0     0    0     0    0     0    0   12G   16G    33G
22:16:08  1.2K    78      7    29    4    49   16    73    7   12G   16G    33G
22:16:18   187     8      5     7    4     0    0     7    4   12G   16G    33G
22:16:28   158     0      0     0    0     0    0     0    0   12G   16G    33G
22:16:38   157     1      1     0    0     1  100     0    0   12G   16G    33G
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
22:16:48   498     2      1     1    1     1   13     1    1   13G   16G    33G
22:16:58   750    78     11    78   11     0    0    78   11   14G   16G    32G
22:17:08  1.0K    63      7    63    7     0    0    63    7   14G   16G    31G
22:17:18   761    84     12    83   11     1  100    83   11   15G   16G    30G
22:17:28   778    88     12    87   12     1  100    87   12   16G   16G    30G
22:17:38   800    89     12    89   12     0    0    89   12   16G   16G    30G
22:17:48   677    88     13    87   13     1  100    86   13   16G   16G    30G
22:17:58   387    73     19    73   19     0    0    73   19   16G   16G    30G
22:18:08   441    92     21    89   21     3  100    88   21   16G   16G    30G
22:18:18   597   123     21   123   21     0    0   123   21   16G   16G    30G
22:18:28   689   144     21   142   21     1  100   141   21   16G   16G    30G
22:18:38   729   149     21   149   21     0    0   149   21   16G   16G    30G
22:18:48   761   157     21   153   21     4  100   153   21   16G   16G    30G
22:18:58   773   158     21   158   21     0    0   158   21   16G   16G    30G
22:19:08   814   164     21   164   21     0    0   164   21   16G   16G    30G
22:19:18   810   166     21   162   21     3  100   162   21   16G   16G    30G
22:19:28   832   170     21   166   21     4  100   166   21  8.0G   16G    38G

dmu_tx

# cat /proc/spl/kstat/zfs/dmu_tx
11 1 0x01 13 3536 39858637551 2980090270779709
name                            type data
dmu_tx_assigned                 4    71370484
dmu_tx_delay                    4    0
dmu_tx_error                    4    0
dmu_tx_suspended                4    0
dmu_tx_group                    4    18
dmu_tx_memory_reserve           4    0
dmu_tx_memory_reclaim           4    0
dmu_tx_dirty_throttle           4    238
dmu_tx_dirty_delay              4    7740723
dmu_tx_dirty_over_max           4    768
dmu_tx_dirty_frees_delay        4    18
dmu_tx_wrlog_delay              4    0
dmu_tx_quota                    4    0

A couple of questions I have.

  1. Is there a way to reduce metadata size so there are fewer misses in ARC? Does recordsize affect this?
  2. Is it possible to make transaction groups flush more frequently, so that writing very large files needs less RAM? (The knobs I’m planning to read first are listed below.)
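For question 2, these are the values I’m planning to read before changing anything; they’re the standard OpenZFS module parameters for dirty data limits and TXG timing:

grep -H . /sys/module/zfs/parameters/zfs_dirty_data_max
grep -H . /sys/module/zfs/parameters/zfs_dirty_data_max_percent
grep -H . /sys/module/zfs/parameters/zfs_txg_timeout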

Thanks!

AFAIK ZFS Dedupe metadata never decreases, and ZFS basically has no way of reducing current metadata (unless you copy files between datasets). So I’d expect the dedupe memory issues to get worse over time.

As for the special device, new metadata can definitely be written to it, but I’m not sure if current metadata can.

As you noted, the simplest fix at this point is to install more RAM. The normal recommendation for ZFS / Dedupe is 1GB per 1TB of data - though I have no idea how much truth there is behind that.
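If you want an actual number instead of the rule of thumb, something like this should show the DDT entry count and its on-disk / in-core size (pool name taken from your output):

zpool status -D star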

Dedupe adds a whole 'nother layer of complexity into things. Can you create a dataset/pool without dedupe in order to test with? I forget which level lets you specify dedupe, as I don’t use it. Additionally, you need more RAM if you’re using dedupe; 16GB isn’t nearly enough.

Regarding NFS, what do you get when running the following command?

cat /proc/mounts | grep nfs

None of my nfs mounts show sync and I don’t have any extra options configured.

Depending on your setup, NFS performance can be a lot higher due to multithreading. When I was originally testing on 10G my speeds on SMB were much slower because it would peg a single core.

Neither my Windows nor my Linux NFS clients specify sync on the mount.

Looks like when it’s unset, sync is left up to the application on the client side, and NFS uses “memory pressure” to figure out when to flush.

So I was wrong about that.

The sync mount option

The NFS client treats the sync mount option differently than some other file systems (refer to mount(8) for a description of the generic sync and async mount options). If neither sync nor async is specified (or if the async option is specified), the NFS client delays sending application writes to the server until any of these events occur:
Memory pressure forces reclamation of system memory resources.

An application flushes file data explicitly with sync(2), msync(2), or fsync(3).

An application closes a file with close(2).

The file is locked/unlocked via fcntl(2).

In other words, under normal circumstances, data written by an application may not immediately appear on the server that hosts the file.

If the sync option is specified on a mount point, any system call that writes data to files on that mount point causes that data to be flushed to the server before the system call returns control to user space. This provides greater data cache coherence among clients, but at a significant performance cost.

Applications can use the O_SYNC open flag to force application writes to individual files to go to the server immediately without the use of the sync mount option.

As for dedupe, I don’t really have the resources right now to test it out. Once I found out that I had users creating several copies of the same file, I enabled it about a year ago. I suppose I could rethink my strategy and use something like hardlink instead, but running something like that against the whole filesystem is very resource intensive.
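If I ever do go the hardlink route, the plan would be to scope it to one dataset at a time rather than the whole pool. jdupes is just one example tool and the path is a placeholder; I’d run the listing pass first:

jdupes -r /star/media        # list duplicate sets only, changes nothing
jdupes -r -L /star/media     # then hard-link the duplicates in place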
