ZFS 12-drive Experiments

I am just making some notes on ZFS experiments I am running on a 45 Drives Q30 chassis with Rocky Linux 8.9 and ZFS 2.1.

Possibly for a future video, because “I Have 12 Disks And I Want The Best ZFS Performance” comes up quite a bit. And it’s juuust on the edge where folks burst in and start recommending dRAID… (I tend to think draid is good when you have… a lot… of disks. Maybe.)

[root@storinator ~]# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 00:01:36 with 0 errors on Sat Feb 10 10:57:49 2024
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            sdc1     ONLINE       0     0     0
            sdd1     ONLINE       0     0     0
            sde1     ONLINE       0     0     0
            sdf1     ONLINE       0     0     0
            sdh1     ONLINE       0     0     0
            sdg1     ONLINE       0     0     0
          raidz1-1   ONLINE       0     0     0
            sdc2     ONLINE       0     0     0
            sdd2     ONLINE       0     0     0
            sde2     ONLINE       0     0     0
            sdf2     ONLINE       0     0     0
            sdh2     ONLINE       0     0     0
            sdg2     ONLINE       0     0     0
          raidz1-2   ONLINE       0     0     0
            sdi1     ONLINE       0     0     0
            sdj1     ONLINE       0     0     0
            sdk1     ONLINE       0     0     0
            sdl1     ONLINE       0     0     0
            sdn1     ONLINE       0     0     0
            sdm1     ONLINE       0     0     0
          raidz1-3   ONLINE       0     0     0
            sdi2     ONLINE       0     0     0
            sdj2     ONLINE       0     0     0
            sdk2     ONLINE       0     0     0
            sdl2     ONLINE       0     0     0
            sdn2     ONLINE       0     0     0
            sdm2     ONLINE       0     0     0
        special
          mirror-4   ONLINE       0     0     0
            nvme0n1  ONLINE       0     0     0
            nvme1n1  ONLINE       0     0     0

errors: No known data errors
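For reference, a pool laid out like this can be built with something along these lines (a sketch using the same device names shown in the status above, not necessarily the exact command):

zpool create tank \
  raidz1 sdc1 sdd1 sde1 sdf1 sdh1 sdg1 \
  raidz1 sdc2 sdd2 sde2 sdf2 sdh2 sdg2 \
  raidz1 sdi1 sdj1 sdk1 sdl1 sdn1 sdm1 \
  raidz1 sdi2 sdj2 sdk2 sdl2 sdn2 sdm2 \
  special mirror nvme0n1 nvme1n1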

With Recordsize=1M
image

and
image
(notice the speed and load average here vs the 128k record size)

With Recordsize=128k
image

and
image
(my gut says this is lower than expected, given that the write speed of the individual devices in the vdevs is higher than this. Possibly the clue to why is in the load average.)

Writing

Even though this is a fresh pool, iostat shows a background write hum even when the pool is idle. Consider iotop:

vs

The pool is empty except for one dataset, which has dnodesize=auto. Dedup and other exotic options are off. The txg sync interval (zfs_txg_timeout) is changed from 5 to 60 seconds.
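For reference, those settings look roughly like this (the dataset name is hypothetical; zfs_txg_timeout is the module parameter behind the txg sync interval):

zfs set dnodesize=auto tank/data
echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout    # default is 5 seconds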

It might be Saturday-morning brain fog, but I don’t remember new ZFS pools having this much “idle” write bandwidth. I do have a vague recollection of this, though, and have forgotten what I knew then. I have only copied test data into this pool at different record sizes, and I’ve done one scrub, which is now finished.

dRAID

I decided to try dRAID even though I kinda don’t think this is enough disks. I added 4 more matching disks and re-created the array.

[root@storinator ~]# zpool status
  pool: tank
 state: ONLINE
config:

        NAME                  STATE     READ WRITE CKSUM
        tank                  ONLINE       0     0     0
          draid2:8d:16c:0s-0  ONLINE       0     0     0
            sdc1              ONLINE       0     0     0
            sdd1              ONLINE       0     0     0
            sde1              ONLINE       0     0     0
            sdf1              ONLINE       0     0     0
            sdg1              ONLINE       0     0     0
            sdh1              ONLINE       0     0     0
            sdi1              ONLINE       0     0     0
            sdj1              ONLINE       0     0     0
            sdk1              ONLINE       0     0     0
            sdl1              ONLINE       0     0     0
            sdm1              ONLINE       0     0     0
            sdn1              ONLINE       0     0     0
            sdo1              ONLINE       0     0     0
            sdp1              ONLINE       0     0     0
            sdq1              ONLINE       0     0     0
            sdr1              ONLINE       0     0     0
          draid2:8d:16c:0s-1  ONLINE       0     0     0
            sdc2              ONLINE       0     0     0
            sdd2              ONLINE       0     0     0
            sde2              ONLINE       0     0     0
            sdf2              ONLINE       0     0     0
            sdg2              ONLINE       0     0     0
            sdh2              ONLINE       0     0     0
            sdi2              ONLINE       0     0     0
            sdj2              ONLINE       0     0     0
            sdk2              ONLINE       0     0     0
            sdl2              ONLINE       0     0     0
            sdm2              ONLINE       0     0     0
            sdn2              ONLINE       0     0     0
            sdo2              ONLINE       0     0     0
            sdp2              ONLINE       0     0     0
            sdq2              ONLINE       0     0     0
            sdr2              ONLINE       0     0     0

Even with the added drives, we only gained a little bit more capacity.
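For the curious, the draid layout above can be expressed as follows (a sketch, assuming bash so that sd{c..r}1 expands to sdc1 sdd1 … sdr1; the 2 parity / 8 data / 16 children / 0 spares geometry comes straight from the vdev name draid2:8d:16c:0s):

zpool create tank \
  draid2:8d:16c:0s sd{c..r}1 \
  draid2:8d:16c:0s sd{c..r}2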

At 64k record size, with a simple Samba file copy, the performance is about the same as raidz1:

image

1M record size, however, is nicer:
image

… perhaps with the lowest CPU utilization I’ve seen yet? I think that is a bit unexpected, and this is sustained writes for more than a minute.

… at a smaller record size (128k), there’s a lot more CPU intensity when you have a LOT of small records to write:

CDM (CrystalDiskMark) performance at 64k is pretty good with draid, though:
image

One thing I like to do is check which processes are actually waiting on disk I/O. iostat doesn’t really help you with this, as far as I’m aware, but fortunately ps can tell us:

ps aux | awk '$8 ~ /D/ { print $0 }'   # column 8 of ps aux is STAT; D = uninterruptible sleep, i.e. blocked on I/O

This is what it looks like with a small record size:

… and with a large record size:

4 Likes

It would be interesting to see what performance you’d get using FreeBSD 14 (or newer) in comparison.

As well as TrueNAS Scale. I am using that on a 45Homelab HL15.

It would be really useful to compare the performance with other aspects such as network and CPU load. I coincidentally have 12 disks in raidz2. With a 2x10Gb LAGG and some E5-2670 CPUs, ZFS is not the bottleneck for my use case as a file server.

Many try for “the fastest” rather than use-case-appropriate options. Many discussions and guides ignore the application requirements.

the software platform shouldn’t make that much difference, really. I am kinda surprised the CPU utilization is that much higher; only 512k and 1M record sizes are reasonable at these throughputs… but the write amplification is undesirable.

Just goes to show that effective use of datasets and record sizes is important with ZFS (i.e. VM hosting vs DB hosting vs bulk storage).
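A minimal sketch of what that looks like in practice (dataset names and recordsize values here are just illustrative starting points, not gospel):

zfs create -o recordsize=16K tank/db      # small-page database files
zfs create -o recordsize=64K tank/vm      # VM images on a dataset (zvols would use volblocksize instead)
zfs create -o recordsize=1M  tank/bulk    # large sequential media/backup files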

1 Like

What’s the point in adding 2 partitions per disk like that? Are those the dual actuator drives or just normal HDDs?

Because if these are not dual-actuator drives, it’s likely the constant seeking between writing stripes.

Otherwise, what’s the output of (non-zfs) iostat -xyh?

Oh, and if it’s a dumb copy from/to the same pool, my gut feeling is that with smaller block size there are way too many seeks between reads and writes.
Perhaps a caching issue?

that’s the point, yes: two independent partitions, no penalty (they’re dual-actuator drives); the I/O to the pool is write-only, so seeks for reads don’t apply here

[root@storinator ~]# iostat -xyh
Linux 4.18.0-513.11.1.el8_9.x86_64 (storinator.hq.local)        02/10/2024      _x86_64_        (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.1%    0.0%    0.4%    2.3%    0.0%   97.2%

     r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util Device
    0.11    0.65      3.5k      7.6k     0.00     0.00   0.0%   0.0%    0.03    0.02   0.00    31.6k    11.6k   0.10   0.0% nvme1n1
    0.11    0.66      3.6k      7.6k     0.00     0.00   0.0%   0.0%    0.03    0.02   0.00    32.7k    11.6k   0.10   0.0% nvme0n1
    0.18    0.74     10.7k      3.9k     0.00     0.04   0.5%   4.7%    0.42    0.49   0.00    60.0k     5.3k   0.58   0.1% sda
    0.72    0.74     36.9k      3.9k     0.00     0.04   0.1%   4.7%    0.33    0.50   0.00    51.3k     5.3k   0.50   0.1% sdb
    0.02    0.00      2.4k      0.0k     0.00     0.00   0.0%   0.0%    0.95    0.75   0.00   133.0k     4.0k   0.44   0.0% md127
    0.01    0.00      0.3k      0.0k     0.00     0.00   0.0%   0.0%    0.38    0.00   0.00    25.3k     0.0k   0.32   0.0% md126
    0.78    0.39     43.7k      3.6k     0.00     0.00   0.0%   0.0%    0.34    1.39   0.00    55.9k     9.3k   0.34   0.0% md125
   10.15   22.37   1021.6k      1.8M     0.00     0.12   0.0%   0.5%    2.65    4.80   0.13   100.6k    83.8k   0.94   3.1% sdd
    9.96   22.13      1.0M      1.8M     0.00     0.14   0.0%   0.6%    2.78    4.93   0.14   103.0k    85.0k   1.00   3.2% sde
    9.79   22.13   1020.7k      1.8M     0.00     0.12   0.0%   0.5%    2.84    4.85   0.14   104.2k    85.1k   0.96   3.1% sdf
    9.98   22.28   1020.9k      1.8M     0.00     0.11   0.0%   0.5%    2.76    4.83   0.14   102.3k    84.4k   0.96   3.1% sdg
    9.81   22.24   1022.5k      1.8M     0.00     0.13   0.0%   0.6%    2.85    4.89   0.14   104.2k    84.9k   0.97   3.1% sdh
    9.75   22.21   1022.3k      1.8M     0.00     0.14   0.0%   0.6%    2.83    4.86   0.14   104.9k    84.8k   0.97   3.1% sdc
   10.12   22.44      1.0M      1.8M     0.00     0.12   0.0%   0.5%    2.80    4.74   0.13   101.9k    83.3k   0.94   3.1% sdj
   10.02   22.19      1.0M      1.8M     0.00     0.14   0.0%   0.6%    2.81    4.77   0.13   102.8k    84.1k   0.95   3.1% sdi
    9.79   22.12   1021.6k      1.8M     0.00     0.13   0.0%   0.6%    2.87    4.83   0.14   104.3k    84.4k   0.97   3.1% sdk
   10.09   22.12   1022.0k      1.8M     0.00     0.11   0.0%   0.5%    2.71    4.84   0.13   101.3k    84.4k   0.96   3.1% sdm
   10.16   22.14      1.0M      1.8M     0.00     0.13   0.0%   0.6%    2.71    4.86   0.14   100.8k    84.5k   0.96   3.1% sdn
    9.80   22.33   1020.6k      1.8M     0.00     0.12   0.0%   0.5%    2.86    4.84   0.14   104.1k    83.5k   0.97   3.1% sdl
    7.69   16.15    551.5k      1.3M     0.00     0.00   0.0%   0.0%    1.93    4.84   0.09    71.7k    79.9k   0.89   2.1% sdo
    7.80   15.86    554.0k      1.3M     0.00     0.00   0.0%   0.0%    1.90    4.86   0.09    71.0k    81.1k   0.88   2.1% sdp
    8.28   15.79    554.5k      1.3M     0.00     0.00   0.0%   0.0%    1.72    4.85   0.09    67.0k    81.5k   0.87   2.1% sdq
    8.33   15.80    549.4k      1.3M     0.00     0.00   0.0%   0.0%    1.70    4.89   0.09    66.0k    81.1k   0.87   2.1% sdr

This is when iotop shows 0/0 pretty reliably and the network has been idle for ~5 mins.

zpool iostat -yv looks far more reasonable during a copy:


                        capacity     operations     bandwidth 
pool                  alloc   free   read  write   read  write
--------------------  -----  -----  -----  -----  -----  -----
tank                   381G   262T      0  15.3K      0   345M
  draid2:8d:16c:0s-0   192G   131T      0  5.79K      0   174M
    sdc1                  -      -      0    155      0  2.44M
    sdd1                  -      -      0    155      0  2.44M
    sde1                  -      -      0    155      0  2.44M
    sdf1                  -      -      0      0      0      0
    sdg1                  -      -      0    779      0  12.2M
    sdh1                  -      -      0    155      0  2.44M
    sdi1                  -      -      0  1.07K      0  17.1M
    sdj1                  -      -      0  1.37K      0  95.7M
    sdk1                  -      -      0    624      0  9.75M
    sdl1                  -      -      0      0      0      0
    sdm1                  -      -      0      0      0      0
    sdn1                  -      -      0      0      0      0
    sdo1                  -      -      0    623      0  9.75M
    sdp1                  -      -      0      0      0      0
    sdq1                  -      -      0      0      0      0
    sdr1                  -      -      0    312      0  4.88M
  draid2:8d:16c:0s-1   189G   131T      0  7.01K      0   151M
    sdc2                  -      -      0      0      0      0
    sdd2                  -      -      0      0      0      0
    sde2                  -      -      0      0      0      0
    sdf2                  -      -      0      0      0      0
    sdg2                  -      -      0      0      0      0
    sdh2                  -      -      0      0      0      0
    sdi2                  -      -      0    468      0  7.33M
    sdj2                  -      -      0    939      0  24.5M
    sdk2                  -      -      0    783      0  22.0M
    sdl2                  -      -      0      0      0      0
    sdm2                  -      -      0    312      0  4.89M
    sdn2                  -      -      0    469      0  14.7M
    sdo2                  -      -      0      0      0      0
    sdp2                  -      -      0      0      0      0
    sdq2                  -      -      0  1.83K      0  44.0M
    sdr2                  -      -      0      0      0      0
--------------------  -----  -----  -----  -----  -----  -----

interestingly, during the initial txg in-flight period (60s) it does actually show higher write rates to the pool, which I wouldn’t expect if iostat reflects the actual I/O and not what’s going into the RAM write cache

the vdev imbalance suggests this is possibly seek-stall related, but it’s all a bit unusual

edit: the part I forgot, but then remembered after I posted earlier, is that iostat needs a time interval or -y to give you non-accumulated stats.
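i.e. something like this (the interval is in seconds; -y skips the first, accumulated-since-boot report):

zpool iostat -y -v tank 5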

init on alloc?

It is not really needed in 85% of scenarios. It zeroes memory before allocating. Let’s try to claw back some performance; ZFS is slow enough as it is.

init_on_alloc=0

is the way.
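On a RHEL-flavored system like this Rocky box, one way to make that stick is via grubby (an assumption on my part as to how you’d apply it here; takes effect after a reboot):

grubby --update-kernel=ALL --args="init_on_alloc=0"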

1 Like

Gotcha. But if the pool is write-only here, then what’s the source of the copy? Because now it seems the copy might actually be read-stalled and not write-stalled.

I was confused because in your screenshots we see:

  1. copy from backups to backups (and the file name suggests just duplicating within the same directory)
  2. again copy from backups to backups
  3. from Desktop to backups (filename w/out “Copy Copy Copy”)
  4. again copy from backups to backups (filename again w/out “Copy Copy Copy”)

I’d benchmark with fio on the pool first, no Samba, no Windows, nothing, but with larger payload (say size=200G rw=write). I don’t know how Samba handles block sizes and the like, but I suspect this might be relevant here.
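A minimal sketch of that kind of run, with a hypothetical directory on the pool:

fio --name=seqwrite --directory=/tank/fio --rw=write --bs=1M --size=200G \
    --ioengine=libaio --iodepth=16 --group_reporting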

well, we don’t have to guess at what is I/O-stall related or not; ps will tell us that, so I added that info to the OP.

the “wait, iostat doesn’t look right” note in the OP was because it’s accumulated; you need -y

even if you read and write from the same place, the reads will come from cache, and for cached reads, record size and “a few” extra I/Os won’t make a huge difference in this context. but in this case they’re named the same because I keep copy-pasting back and forth from the network, and the name thing is to avoid overwriting existing files, which has a different I/O behavior…

Mhm, I also forgot it was accumulated by default and thought something was amiss, but iostat -xyh never lied to me, so it’s clear it’s not the drives being pegged, but rather the system/ZFS.

I still suspect smbd is doing something awful with write patterns but can’t tell you what right now.
hi z_wr_iss while copying to non compressed drive · Issue #3077 · openzfs/zfs · GitHub ← this seems to be somewhat relevant

holy crap, check this out:

…this is with 128k record size and lz4 compression on, with lz4 compression as the only change. These files are not really that compressible.
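(That one change, for reference, with a hypothetical dataset name:)

zfs set compression=lz4 tank/testdata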

It can still get into a weird I/O storm state:

image

this peaked around 1.5 gigabytes/sec, then tanked, over about a 2-minute copy.

…for file copies that really do come from Samba, I can cheat a bit here and use Samba’s readahead VFS object to read ahead at that layer and make ZFS less busy, and that does seem to help as well.
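Roughly like this in smb.conf (a sketch; the share name, path, and readahead length are assumptions, not my exact config):

[tank]
    path = /tank/share
    vfs objects = readahead
    # read ahead 1 MiB at the Samba layer; tune to taste
    readahead:length = 1048576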

I ran fio as well:

Tool,Test Type,Record Size,Date,Reads (MB/s),Writes (MB/s),Random Reads (MB/s),Random Writes (MB/s),Reads (IOPS),Writes (IOPS),Random Reads (IOPS),Random Writes (IOPS)
fio,throughput,1M,"2/10/2024, 5:51:37 PM",2415.94,381.25,2386.26,485.90,4719,745,4661,949

at 128k record size I could produce decent-sized I/O storms, but still not as bad as Samba. and fio perf was “higher” for max throughput, like CDM.

this was an fio size of 10 gigabytes, 16 queues, run for 2 minutes.

Tool,Test Type,Record Size,Date,Reads (MB/s),Writes (MB/s),Random Reads (MB/s),Random Writes (MB/s),Reads (IOPS),Writes (IOPS),Random Reads (IOPS),Random Writes (IOPS)
fio,throughput,1M,"2/10/2024, 5:54:17 PM",2685.89,183.28,2617.06,173.56,5246,358,5111,339
2 Likes

Yes, it would indeed.

Funny you should make this post… I recently deployed a 12 drive setup for home with 10gbe making use of both SMB and iSCSI for VMs on another system.

RAIDz3, 128k block size, lz4 on, no dedup. Running on TrueNAS Core. 12x 4TB Seagate Constellation ES3 Drives.

At this time the bulk throughput limiting factor is the 10GbE link, and I am not using jumbo frames.

TrueNAS SMB → VFIO Passthrough SSD
image

iSCSI drive benchmarks.
image
image

I have ordered some parts to create a special device for the metadata but they are yet to arrive. Super interested in the performance after the change.

Note that out of the box, TrueNAS 13 (possibly 12 also) could not even get close to these numbers due to the mps driver being patched to default to a maximum of 32 I/O pages instead of the original auto-detected value.

They did this to prevent a crash with certain tape drives, but it absolutely tanks performance. You can restore the original behaviour by setting the following boot argument.

hw.mps.max_io_pages=-1

32 equates to a 128kB I/O transfer size limit; setting this back to -1 restored the old behaviour of a 4MB transfer size limit, bringing my read performance back up from ~20MB/s (direct from the disk using dd) to ~140MB/s.
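If I recall correctly, on TrueNAS Core this is set as a “loader”-type tunable in the UI, which is equivalent to the following /boot/loader.conf line on plain FreeBSD (the tunable name comes from the mps(4) driver):

hw.mps.max_io_pages="-1"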

Edit: Setting CrystalDiskMark to use 8 threads results in the following:
2024-02-11-193005_492x364_scrot
2024-02-11-192957_502x373_scrot

Edit2: Enabling Jumbo Frames (MTU 9000)
2024-02-11-204606_488x358_scrot
2024-02-11-204620_489x360_scrot

5 Likes

Not 12-drive, but may be relevant:

6-wide raidz1, default opts:

admin@zfserver:~ $ sudo fio --rw=write --bs=8M --threads=4 --iodepth=32 --ioengine=libaio --size=32G --name=/mnt/zfs/tank/fio-test
/mnt/zfs/tank/fio-test: (g=0): rw=write, bs=(R) 8192KiB-8192KiB, (W) 8192KiB-8192KiB, (T) 8192KiB-8192KiB, ioengine=libaio, iodepth=32
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=392MiB/s][w=49 IOPS][eta 00m:00s]
/mnt/zfs/tank/fio-test: (groupid=0, jobs=1): err= 0: pid=27340: Sun Feb 11 17:31:00 2024
  write: IOPS=46, BW=372MiB/s (390MB/s)(32.0GiB/88082msec); 0 zone resets
    slat (usec): min=1414, max=1959.5k, avg=21480.65, stdev=42791.37
    clat (usec): min=22, max=4447.8k, avg=666406.29, stdev=436681.85
     lat (msec): min=24, max=4503, avg=687.89, stdev=446.93
    clat percentiles (msec):
     |  1.00th=[   51],  5.00th=[   53], 10.00th=[  157], 20.00th=[  617],
     | 30.00th=[  634], 40.00th=[  651], 50.00th=[  667], 60.00th=[  684],
     | 70.00th=[  701], 80.00th=[  726], 90.00th=[  760], 95.00th=[  944],
     | 99.00th=[ 3440], 99.50th=[ 3775], 99.90th=[ 4396], 99.95th=[ 4396],
     | 99.99th=[ 4463]
   bw (  KiB/s): min=16384, max=4292608, per=100.00%, avg=391617.19, stdev=354907.10, samples=170
   iops        : min=    2, max=  524, avg=47.79, stdev=43.32, samples=170
  lat (usec)   : 50=0.02%
  lat (msec)   : 50=0.88%, 100=7.91%, 250=2.86%, 500=0.61%, 750=76.54%
  lat (msec)   : 1000=7.32%, 2000=2.34%, >=2000=1.51%
  cpu          : usr=2.30%, sys=11.76%, ctx=6792, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.4%, 32=99.2%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4096,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=372MiB/s (390MB/s), 372MiB/s-372MiB/s (390MB/s-390MB/s), io=32.0GiB (34.4GB), run=88082-88082msec

5-wide raidz1 pool (same default opts)

admin@zfserver:~ $ sudo fio --rw=write --bs=8M --threads=4 --iodepth=32 --ioengine=libaio --size=32G --name=/mnt/zfs/tank/fio-test
/mnt/zfs/tank/fio-test: (g=0): rw=write, bs=(R) 8192KiB-8192KiB, (W) 8192KiB-8192KiB, (T) 8192KiB-8192KiB, ioengine=libaio, iodepth=32
fio-3.36
Starting 1 process
/mnt/zfs/tank/fio-test: Laying out IO file (1 file / 32768MiB)
Jobs: 1 (f=1): [W(1)][98.8%][w=368MiB/s][w=46 IOPS][eta 00m:01s]m:41s]
/mnt/zfs/tank/fio-test: (groupid=0, jobs=1): err= 0: pid=27866: Sun Feb 11 17:34:15 2024
  write: IOPS=51, BW=413MiB/s (433MB/s)(32.0GiB/79338msec); 0 zone resets
    slat (usec): min=1861, max=449479, avg=19347.54, stdev=11743.60
    clat (usec): min=26, max=1712.3k, avg=599976.66, stdev=230376.44
     lat (msec): min=21, max=1735, avg=619.32, stdev=236.91
    clat percentiles (msec):
     |  1.00th=[   78],  5.00th=[   89], 10.00th=[  140], 20.00th=[  584],
     | 30.00th=[  600], 40.00th=[  617], 50.00th=[  634], 60.00th=[  659],
     | 70.00th=[  684], 80.00th=[  709], 90.00th=[  760], 95.00th=[  835],
     | 99.00th=[ 1083], 99.50th=[ 1569], 99.90th=[ 1703], 99.95th=[ 1703],
     | 99.99th=[ 1720]
   bw (  KiB/s): min=16384, max=2867200, per=99.35%, avg=420187.67, stdev=295938.28, samples=158
   iops        : min=    2, max=  350, avg=51.27, stdev=36.13, samples=158
  lat (usec)   : 50=0.02%
  lat (msec)   : 50=0.05%, 100=6.62%, 250=6.71%, 500=1.25%, 750=75.02%
  lat (msec)   : 1000=6.86%, 2000=3.47%
  cpu          : usr=2.38%, sys=12.36%, ctx=12008, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.4%, 32=99.2%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4096,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=413MiB/s (433MB/s), 413MiB/s-413MiB/s (433MB/s-433MB/s), io=32.0GiB (34.4GB), run=79338-79338msec

Also stalled at [z_wr_iss] but only periodically

Performance is below expectations, but not by as much as in your case. The interesting part is that 4+1 is faster than 5+1, and not only per-drive; it just… is faster. And the 6th drive should have the highest linear throughput (also, swapping it for another drive doesn’t influence the results).

@wendell can you compare 5+1 raidz1 with 4+1 raidz1 or even 4+2 raidz2? The remaining drives can always be used as spares :wink:

Hmm, that could be interesting, since essentially I automatically get double vdevs from the dual actuators.

Do you know if TrueNAS Scale works the same way?

My gut says it might be related to the stripe size and record size.
For example, with draid2 14+2, the stripe size is 4k * 14 = 56k. To write a 128k record, it has to write out 3 stripes, 168k in total (parity not included yet).
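Running the same back-of-the-envelope math against the draid2:8d:16c:0s pool above (assuming 4k sectors, i.e. ashift=12):

stripe width (data)  = 8 * 4k = 32k
128k record          = 128k / 32k = 4 full stripes, no padding
56k-stripe example   = ceil(128k / 56k) = 3 stripes = 168k written, ~1.3x amplification before parity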

If it’s FreeBSD and not TrueNAS on Linux, it’s extremely likely as it’s a change made to the driver upstream.

Scale is Debian-based if my knowledge is correct. So yes, Linux.

there are for sure a lot of datapoints washing out of this: 1) be mindful of draid/vdev/stripe size (as it works differently than raidz*), and 2) draid really is not making sense to me for any use case below 20-30 drives. I am trying to make sure it’s not “grr this is new grrrrr” crustiness, but I kinda like it, maybe, for a 96-drive pool (only 2tb drives, lulz, but fun for experiments).

otherwise, with just a little care and planning, a raidz1/2 looks good. I am also starting to feel like raidz3 has lost its lustre (haha, pun intended), given resilvering times aren’t that bad for these not-absurdly-huge-drive-count-and-almost-full pools.

but still churning on the data here. draid in some sizings uses much LESS CPU, it seems, for the same number of bytes transferred over the wire. Now to test that against raidz, which I suspect will be more consistent.