Linux mdraid NVME issues

I have had a plethora of NVME related performance issues with RAID and Linux. I have been following various issues as far back as Wendell helping Linus with his server NVME/ZFS/Intel issues. At the time I had 8 of the same Intel model.

I am trying to tune this system for PostgreSQL. Since all my issues appear storage related, I am starting from scratch, so there is nothing Postgres-specific in here. I am moving from a dual Xeon Gold 5120 system with 8 x 850 EVO. In things like pg_dump, replication, etc. I am experiencing performance degradation: 2 hours on the old system vs 8 hours on the new system for a pg_dump of a specific database. At this point, I get better (fio) performance with a single drive than with 8 in RAID10 with mdadm.

specs

Ubuntu 20.04
2 x AMD EPYC 7601
384GB of DDR4 2666 MT/s , 16GB sticks.
8 x KCM6XVUL1T60 Kioxia NVME drives.

I have NVME-formatted the disks to the 4096-byte LBA format:

LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format  1 : Metadata Size: 8   bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format  2 : Metadata Size: 0   bytes - Data Size: 1  bytes - Relative Performance: 0 Best
LBA Format  3 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
LBA Format  4 : Metadata Size: 8   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format  5 : Metadata Size: 64  bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
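
For anyone who wants to double-check which LBA format a namespace is really using, here is a rough sketch that pulls the "(in use)" data size out of a listing like the one above. The function name in_use_lba_size is made up for illustration, and it assumes the lines look exactly like the `nvme id-ns -H` output pasted here.

```shell
#!/bin/sh
# in_use_lba_size (hypothetical helper): reads "LBA Format" lines on stdin
# and prints the data size, in bytes, of the one marked "(in use)".
in_use_lba_size() {
    awk '/^LBA Format/ && /\(in use\)/ {
        for (i = 2; i <= NF; i++)
            if ($i == "Size:" && $(i - 1) == "Data") { print $(i + 1); exit }
    }'
}

# Assumed usage on a live system (needs nvme-cli and root):
#   nvme id-ns -H /dev/nvme0n1 | in_use_lba_size
```

On the listing above this should print 4096, which matches the sectsz=4096 that xfs_info reports.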

Single drive setup

mkfs.xfs /dev/nvme8n1
mount /dev/nvme8n1 /mnt/single
# xfs_info /dev/nvme8n1
meta-data=/dev/nvme8n1          isize=512    agcount=4, agsize=97675862 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=390703446, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=190773, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

8 in RAID10 setup

sudo mdadm --create --verbose /dev/md127 --level=10 --raid-devices=8 \
    /dev/nvme0n1 \
    /dev/nvme1n1 \
    /dev/nvme2n1 \
    /dev/nvme3n1 \
    /dev/nvme4n1 \
    /dev/nvme5n1 \
    /dev/nvme6n1 \
    /dev/nvme7n1

mkfs.xfs /dev/md127
mount /dev/md127 /mnt/raid10
# xfs_info /dev/md127
meta-data=/dev/md127             isize=512    agcount=32, agsize=48833792 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=1562681344, imaxpct=5
         =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
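
The sunit/swidth values in that xfs_info output are in filesystem blocks, so it is worth converting them to see what geometry mkfs.xfs picked up from md. A small sketch of the arithmetic (xfs_stripe_kib is just an illustrative name):

```shell
#!/bin/sh
# xfs_stripe_kib BSIZE SUNIT SWIDTH
# Converts xfs_info's sunit/swidth (given in filesystem blocks) to KiB and
# prints: "<stripe unit KiB> <stripe width KiB> <stripe units per stripe>"
xfs_stripe_kib() {
    bsize=$1; sunit=$2; swidth=$3
    echo "$(( sunit * bsize / 1024 )) $(( swidth * bsize / 1024 )) $(( swidth / sunit ))"
}

# Values from the md127 xfs_info output: bsize=4096, sunit=128, swidth=512
xfs_stripe_kib 4096 128 512
```

That works out to a 512 KiB stripe unit and a 2048 KiB stripe width, i.e. 4 stripe units per stripe, which is what I would expect for 8 drives arranged as 4 mirrored pairs.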

Using fio commands from a packt book on postgres tuning
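
Since the same two fio invocations are repeated on every mount point below, a tiny wrapper helps keep them identical between runs. This is only a sketch: fio_cmd is a made-up helper, and it just prints the command line rather than running it.

```shell
#!/bin/sh
# fio_cmd MODE (hypothetical helper): print the fio command line used in
# this thread for MODE = seq (sequential mixed r/w) or rand (random mixed r/w).
fio_cmd() {
    case "$1" in
        seq)  name=test_seq_mix_rw;  file=test_seq;  rw=rw ;;
        rand) name=test_rand_mix_rw; file=test_rand; rw=randrw ;;
        *)    echo "usage: fio_cmd seq|rand" >&2; return 1 ;;
    esac
    echo "fio --ioengine=libaio --direct=1 --name=$name --filename=$file" \
         "--bs=8k --iodepth=32 --size=1G --readwrite=$rw --rwmixread=50"
}

fio_cmd seq
fio_cmd rand
```

To actually run a test you would cd into the mount point first and execute the printed line, for example `cd /mnt/raid10 && eval "$(fio_cmd seq)"`.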

single disk fio results

/mnt/single# fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=test_seq --bs=8k --iodepth=32 --size=1G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=1)
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=10955: Mon Sep 20 19:19:18 2021
  read: IOPS=69.2k, BW=540MiB/s (567MB/s)(513MiB/950msec)
    slat (nsec): min=2826, max=58801, avg=4760.75, stdev=1984.89
    clat (usec): min=5, max=2378, avg=205.56, stdev=370.88
     lat (usec): min=16, max=2384, avg=210.46, stdev=370.75
    clat percentiles (usec):
     |  1.00th=[   14],  5.00th=[   28], 10.00th=[   41], 20.00th=[   65],
     | 30.00th=[   88], 40.00th=[  109], 50.00th=[  129], 60.00th=[  151],
     | 70.00th=[  174], 80.00th=[  198], 90.00th=[  249], 95.00th=[  371],
     | 99.00th=[ 2089], 99.50th=[ 2147], 99.90th=[ 2212], 99.95th=[ 2245],
     | 99.99th=[ 2278]
   bw (  KiB/s): min=520622, max=520622, per=94.08%, avg=520622.00, stdev= 0.00, samples=1
   iops        : min=65077, max=65077, avg=65077.00, stdev= 0.00, samples=1
  write: IOPS=68.8k, BW=537MiB/s (564MB/s)(511MiB/950msec); 0 zone resets
    slat (nsec): min=2895, max=48391, avg=4970.04, stdev=2028.48
    clat (usec): min=13, max=780, avg=246.87, stdev=87.14
     lat (usec): min=18, max=783, avg=251.97, stdev=87.27
    clat percentiles (usec):
     |  1.00th=[   32],  5.00th=[   82], 10.00th=[  133], 20.00th=[  188],
     | 30.00th=[  208], 40.00th=[  227], 50.00th=[  247], 60.00th=[  269],
     | 70.00th=[  293], 80.00th=[  318], 90.00th=[  355], 95.00th=[  383],
     | 99.00th=[  445], 99.50th=[  478], 99.90th=[  562], 99.95th=[  603],
     | 99.99th=[  725]
   bw (  KiB/s): min=521309, max=521309, per=94.72%, avg=521309.00, stdev= 0.00, samples=1
   iops        : min=65163, max=65163, avg=65163.00, stdev= 0.00, samples=1
  lat (usec)   : 10=0.01%, 20=1.42%, 50=6.70%, 100=13.20%, 250=49.49%
  lat (usec)   : 500=27.01%, 750=0.18%, 1000=0.01%
  lat (msec)   : 2=0.97%, 4=1.01%
  cpu          : usr=17.18%, sys=69.65%, ctx=3799, majf=0, minf=14
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=540MiB/s (567MB/s), 540MiB/s-540MiB/s (567MB/s-567MB/s), io=513MiB (538MB), run=950-950msec
  WRITE: bw=537MiB/s (564MB/s), 537MiB/s-537MiB/s (564MB/s-564MB/s), io=511MiB (535MB), run=950-950msec

Disk stats (read/write):
  nvme11n1: ios=60446/60321, merge=0/0, ticks=7766/892, in_queue=8658, util=90.11%
/mnt/single# fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=0)
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=11288: Mon Sep 20 19:20:29 2021
  read: IOPS=76.6k, BW=598MiB/s (627MB/s)(513MiB/858msec)
    slat (usec): min=2, max=223, avg= 4.72, stdev= 2.32
    clat (usec): min=30, max=1122, avg=193.68, stdev=90.37
     lat (usec): min=37, max=1125, avg=198.54, stdev=90.36
    clat percentiles (usec):
     |  1.00th=[   82],  5.00th=[   93], 10.00th=[  104], 20.00th=[  123],
     | 30.00th=[  141], 40.00th=[  157], 50.00th=[  176], 60.00th=[  192],
     | 70.00th=[  212], 80.00th=[  243], 90.00th=[  318], 95.00th=[  383],
     | 99.00th=[  515], 99.50th=[  545], 99.90th=[  627], 99.95th=[  652],
     | 99.99th=[  857]
   bw (  KiB/s): min=610848, max=610848, per=99.70%, avg=610848.00, stdev= 0.00, samples=1
   iops        : min=76356, max=76356, avg=76356.00, stdev= 0.00, samples=1
  write: IOPS=76.2k, BW=595MiB/s (624MB/s)(511MiB/858msec); 0 zone resets
    slat (nsec): min=2916, max=72217, avg=4857.61, stdev=2154.02
    clat (usec): min=14, max=699, avg=213.88, stdev=56.06
     lat (usec): min=27, max=702, avg=218.88, stdev=56.22
    clat percentiles (usec):
     |  1.00th=[  101],  5.00th=[  135], 10.00th=[  151], 20.00th=[  169],
     | 30.00th=[  184], 40.00th=[  196], 50.00th=[  208], 60.00th=[  223],
     | 70.00th=[  239], 80.00th=[  258], 90.00th=[  285], 95.00th=[  306],
     | 99.00th=[  355], 99.50th=[  408], 99.90th=[  594], 99.95th=[  635],
     | 99.99th=[  668]
   bw (  KiB/s): min=610528, max=610528, per=100.00%, avg=610528.00, stdev= 0.00, samples=1
   iops        : min=76316, max=76316, avg=76316.00, stdev= 0.00, samples=1
  lat (usec)   : 20=0.01%, 50=0.02%, 100=4.60%, 250=74.33%, 500=20.32%
  lat (usec)   : 750=0.72%, 1000=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=20.42%, sys=75.50%, ctx=3198, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=598MiB/s (627MB/s), 598MiB/s-598MiB/s (627MB/s-627MB/s), io=513MiB (538MB), run=858-858msec
  WRITE: bw=595MiB/s (624MB/s), 595MiB/s-595MiB/s (624MB/s-624MB/s), io=511MiB (535MB), run=858-858msec

Disk stats (read/write):
  nvme11n1: ios=48415/48094, merge=0/0, ticks=6179/678, in_queue=6856, util=86.46%
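
To compare runs like the ones above without scrolling through the full output, something along these lines pulls out just the summary bandwidth. It is a sketch that assumes fio 3.x style "Run status" lines as pasted here; fio_bw is an illustrative name.

```shell
#!/bin/sh
# fio_bw (hypothetical helper): read fio output on stdin and print
# "READ <bw>" / "WRITE <bw>" taken from the "Run status" summary lines.
fio_bw() {
    awk '$1 == "READ:" || $1 == "WRITE:" {
        dir = $1; sub(/:$/, "", dir)   # "READ:" -> "READ"
        bw = $2;  sub(/^bw=/, "", bw)  # "bw=540MiB/s" -> "540MiB/s"
        print dir, bw
    }'
}

# Assumed usage: fio ... | tee run.log; fio_bw < run.log
```

The per-job "read:" and "write:" lines are lowercase in fio's output, so they are not picked up; only the group summary is.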

8 in RAID10 fio results

/mnt/raid10# fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=test_seq --bs=8k --iodepth=32 --size=1G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_seq_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1)
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=11602: Mon Sep 20 19:21:42 2021
  read: IOPS=50.2k, BW=392MiB/s (411MB/s)(513MiB/1309msec)
    slat (usec): min=3, max=249, avg= 6.38, stdev= 3.42
    clat (usec): min=8, max=2362, avg=198.42, stdev=189.93
     lat (usec): min=16, max=2371, avg=204.96, stdev=190.05
    clat percentiles (usec):
     |  1.00th=[   19],  5.00th=[   40], 10.00th=[   62], 20.00th=[   98],
     | 30.00th=[  126], 40.00th=[  153], 50.00th=[  182], 60.00th=[  210],
     | 70.00th=[  237], 80.00th=[  265], 90.00th=[  302], 95.00th=[  334],
     | 99.00th=[ 1037], 99.50th=[ 1909], 99.90th=[ 2180], 99.95th=[ 2212],
     | 99.99th=[ 2311]
   bw (  KiB/s): min=399424, max=402032, per=99.78%, avg=400728.00, stdev=1844.13, samples=2
   iops        : min=49928, max=50254, avg=50091.00, stdev=230.52, samples=2
  write: IOPS=49.9k, BW=390MiB/s (409MB/s)(511MiB/1309msec); 0 zone resets
    slat (usec): min=5, max=220, avg= 9.53, stdev= 3.73
    clat (usec): min=16, max=1999, avg=423.26, stdev=101.90
     lat (usec): min=28, max=2011, avg=432.97, stdev=102.21
    clat percentiles (usec):
     |  1.00th=[  239],  5.00th=[  293], 10.00th=[  310], 20.00th=[  334],
     | 30.00th=[  359], 40.00th=[  388], 50.00th=[  416], 60.00th=[  445],
     | 70.00th=[  474], 80.00th=[  506], 90.00th=[  553], 95.00th=[  586],
     | 99.00th=[  668], 99.50th=[  717], 99.90th=[  979], 99.95th=[ 1156],
     | 99.99th=[ 1811]
   bw (  KiB/s): min=395264, max=401728, per=99.76%, avg=398496.00, stdev=4570.74, samples=2
   iops        : min=49408, max=50216, avg=49812.00, stdev=571.34, samples=2
  lat (usec)   : 10=0.01%, 20=0.60%, 50=3.18%, 100=6.79%, 250=27.42%
  lat (usec)   : 500=50.47%, 750=10.83%, 1000=0.16%
  lat (msec)   : 2=0.35%, 4=0.21%
  cpu          : usr=14.91%, sys=82.26%, ctx=2428, majf=0, minf=14
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=392MiB/s (411MB/s), 392MiB/s-392MiB/s (411MB/s-411MB/s), io=513MiB (538MB), run=1309-1309msec
  WRITE: bw=390MiB/s (409MB/s), 390MiB/s-390MiB/s (409MB/s-409MB/s), io=511MiB (535MB), run=1309-1309msec

Disk stats (read/write):
    md127: ios=59512/59363, merge=0/0, ticks=3652/1052, in_queue=4704, util=92.40%, aggrios=8214/16339, aggrmerge=0/0, aggrticks=511/238, aggrin_queue=749, aggrutil=90.55%
  nvme0n1: ios=9744/16320, merge=0/0, ticks=553/222, in_queue=775, util=90.21%
  nvme3n1: ios=6722/16320, merge=0/0, ticks=495/255, in_queue=750, util=90.21%
  nvme6n1: ios=9485/16335, merge=0/0, ticks=553/225, in_queue=777, util=90.21%
  nvme2n1: ios=9662/16320, merge=0/0, ticks=600/224, in_queue=823, util=90.21%
  nvme5n1: ios=6863/16384, merge=0/0, ticks=447/253, in_queue=700, util=90.48%
  nvme1n1: ios=6689/16320, merge=0/0, ticks=453/254, in_queue=707, util=90.27%
  nvme4n1: ios=9585/16384, merge=0/0, ticks=564/225, in_queue=789, util=90.55%
  nvme7n1: ios=6963/16335, merge=0/0, ticks=423/253, in_queue=676, util=90.27%
/mnt/raid10# fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=1)
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=12089: Mon Sep 20 19:22:25 2021
  read: IOPS=49.3k, BW=385MiB/s (404MB/s)(513MiB/1333msec)
    slat (usec): min=3, max=286, avg= 6.50, stdev= 3.66
    clat (usec): min=32, max=1247, avg=220.02, stdev=85.51
     lat (usec): min=36, max=1312, avg=226.67, stdev=85.60
    clat percentiles (usec):
     |  1.00th=[   82],  5.00th=[   96], 10.00th=[  110], 20.00th=[  137],
     | 30.00th=[  163], 40.00th=[  190], 50.00th=[  217], 60.00th=[  245],
     | 70.00th=[  273], 80.00th=[  297], 90.00th=[  326], 95.00th=[  347],
     | 99.00th=[  433], 99.50th=[  490], 99.90th=[  635], 99.95th=[  701],
     | 99.99th=[  889]
   bw (  KiB/s): min=394224, max=395840, per=100.00%, avg=395032.00, stdev=1142.68, samples=2
   iops        : min=49278, max=49480, avg=49379.00, stdev=142.84, samples=2
  write: IOPS=49.0k, BW=383MiB/s (402MB/s)(511MiB/1333msec); 0 zone resets
    slat (usec): min=5, max=357, avg= 9.42, stdev= 4.39
    clat (usec): min=45, max=1018, avg=412.76, stdev=87.73
     lat (usec): min=63, max=1246, avg=422.33, stdev=88.03
    clat percentiles (usec):
     |  1.00th=[  273],  5.00th=[  293], 10.00th=[  310], 20.00th=[  330],
     | 30.00th=[  351], 40.00th=[  379], 50.00th=[  404], 60.00th=[  433],
     | 70.00th=[  457], 80.00th=[  490], 90.00th=[  529], 95.00th=[  562],
     | 99.00th=[  635], 99.50th=[  685], 99.90th=[  807], 99.95th=[  889],
     | 99.99th=[  971]
   bw (  KiB/s): min=392128, max=392816, per=100.00%, avg=392472.00, stdev=486.49, samples=2
   iops        : min=49016, max=49102, avg=49059.00, stdev=60.81, samples=2
  lat (usec)   : 50=0.02%, 100=3.26%, 250=28.03%, 500=60.06%, 750=8.53%
  lat (usec)   : 1000=0.10%
  lat (msec)   : 2=0.01%
  cpu          : usr=19.74%, sys=77.48%, ctx=2485, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=385MiB/s (404MB/s), 385MiB/s-385MiB/s (404MB/s-404MB/s), io=513MiB (538MB), run=1333-1333msec
  WRITE: bw=383MiB/s (402MB/s), 383MiB/s-383MiB/s (402MB/s-402MB/s), io=511MiB (535MB), run=1333-1333msec

Disk stats (read/write):
    md127: ios=55970/55695, merge=0/0, ticks=4460/960, in_queue=5420, util=92.81%, aggrios=8214/16358, aggrmerge=0/0, aggrticks=686/234, aggrin_queue=921, aggrutil=89.78%
  nvme0n1: ios=10445/16350, merge=0/0, ticks=876/219, in_queue=1096, util=89.72%
  nvme3n1: ios=5903/16470, merge=0/0, ticks=492/253, in_queue=744, util=89.72%
  nvme6n1: ios=10538/16278, merge=0/0, ticks=880/217, in_queue=1097, util=89.72%
  nvme2n1: ios=10414/16470, merge=0/0, ticks=870/221, in_queue=1090, util=89.72%
  nvme5n1: ios=5972/16337, merge=0/0, ticks=501/248, in_queue=749, util=89.72%
  nvme1n1: ios=5992/16350, merge=0/0, ticks=503/252, in_queue=756, util=89.72%
  nvme4n1: ios=10478/16337, merge=0/0, ticks=875/222, in_queue=1097, util=89.72%
  nvme7n1: ios=5971/16278, merge=0/0, ticks=497/247, in_queue=745, util=89.78%
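
The per-member lines in those disk stats can be totalled to see how md spread the I/O. A sketch (member_totals is a made-up name, and it assumes members are named like nvme0n1):

```shell
#!/bin/sh
# member_totals (hypothetical helper): read fio "Disk stats" lines on stdin
# and print "<member count> <total reads> <total writes>" for nvmeXnY members.
member_totals() {
    awk '$1 ~ /^nvme[0-9]+n[0-9]+:$/ && $2 ~ /^ios=/ {
        s = $2
        sub(/^ios=/, "", s)   # "ios=10445/16350," -> "10445/16350,"
        sub(/,$/, "", s)      # drop the trailing comma
        split(s, a, "/")      # a[1] = reads, a[2] = writes
        reads += a[1]; writes += a[2]; n++
    }
    END { print n + 0, reads + 0, writes + 0 }'
}
```

For the random RAID10 run above, the eight members add up to 65713 reads, exactly the number fio issued, and 130870 writes, roughly twice the 65359 writes fio issued, which is what you would expect when every write goes to both halves of a mirror. The reads are not spread evenly, though: four members served around 10.4k each and the other four around 5.9k.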

I’m kind of at a loss here. After being so far in the weeds for a while now I am hoping there is just something simple I overlooked. Any other info I can provide just let me know. I’d appreciate any advice.

I believe newer SSDs have 8k (8192 to be more exact) byte sectors. I’m not sure what I’m doing here, but maybe try to increase that from 4k to 8k? On ZFS you would use ashift=13 when creating the pool. I haven’t used XFS, so I can’t help you with that.

I see that you are already using bs=8k in FIO. Are you sure the drive is correctly reporting the Data Size? If you don’t have anything on the array, try a quick format. A quick search shows that you should use:

mkfs.xfs -b size=8192 /dev/md127

Then try the FIO test again, first with bs=8k and then with bs=16k, just to be sure.

Again, I have no idea what I’m doing, I’m just throwing s**t against the wall and seeing what sticks.

Edit: welcome to the forum!


mkfs.xfs -b size=8192 /dev/nvme11n1
mount /dev/nvme11n1 /mnt/kioxia/

results in:

mount: /mnt/kioxia: mount(2) system call failed: Function not implemented
(dmesg) XFS (nvme11n1): File system with blocksize 8192 bytes. Only pagesize (4096) or less will currently work.

mkfs.xfs -b size=8192 -s size=8192 /dev/nvme11n1

results in mkfs.xfs: error - cannot set blocksize 8192 on block device /dev/nvme11n1: Invalid argument
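
Going by the dmesg line above, the kernel used here only mounts XFS when the filesystem block size is no larger than the memory page size. A quick pre-check could look like this sketch (xfs_bsize_ok is a made-up helper; whether newer kernels lift this limit is not something this thread shows):

```shell
#!/bin/sh
# xfs_bsize_ok BSIZE [PAGESIZE] (hypothetical helper): succeeds if BSIZE is
# no larger than the page size. PAGESIZE defaults to what getconf reports.
xfs_bsize_ok() {
    bsize=$1
    pagesize=${2:-$(getconf PAGESIZE)}
    [ "$bsize" -le "$pagesize" ]
}

if xfs_bsize_ok 8192 4096; then
    echo "8192-byte blocks should mount"
else
    echo "8192-byte blocks exceed the 4096-byte page size"
fi
```

With a 4096-byte page size, 8192 fails the check, which lines up with the "Function not implemented" mount error.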

I would think the drive is using 4k, not 8k, since that’s what nvme-cli is reporting, but I am not 100% sure. That said, the ZFS numbers below may indicate better 8k performance. I originally tried going with ZFS, but the actual Postgres performance was significantly worse than with XFS.
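
For reference, ashift is the base-2 logarithm of the sector size ZFS assumes, so the two values tried in the runs that follow map to bytes like this (ashift_bytes is just an illustrative name):

```shell
#!/bin/sh
# ashift_bytes ASHIFT: print the sector size in bytes, i.e. 2^ASHIFT.
ashift_bytes() {
    echo $(( 1 << $1 ))
}

ashift_bytes 12   # 4096
ashift_bytes 13   # 8192
```

So ashift=12 matches the 4096-byte LBA format the drives report, while ashift=13 treats them as 8k devices.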

ZFS ashift=12, recordsize=4k


zpool create -o ashift=12 KIOXIAZFS \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1 \
    mirror /dev/nvme4n1 /dev/nvme5n1 \
    mirror /dev/nvme6n1 /dev/nvme7n1

zfs create KIOXIAZFS/DATA \
  -o mountpoint=/mnt/data \
  -o atime=off \
  -o xattr=sa \
  -o relatime=on \
  -o compression=lz4 \
  -o recordsize=4k \
  -o logbias=throughput \
  -o primarycache=metadata

# fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=test_seq --bs=8k --iodepth=32 --size=1G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_seq_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=43.4MiB/s,w=42.2MiB/s][r=5553,w=5397 IOPS][eta 00m:00s]
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=89223: Tue Sep 21 08:18:24 2021
  read: IOPS=10.9k, BW=85.1MiB/s (89.2MB/s)(513MiB/6035msec)
    slat (usec): min=2, max=839, avg=65.03, stdev=56.92
    clat (usec): min=2, max=5757, avg=1432.47, stdev=935.32
     lat (usec): min=138, max=5958, avg=1497.72, stdev=983.07
    clat percentiles (usec):
     |  1.00th=[  297],  5.00th=[  334], 10.00th=[  355], 20.00th=[  392],
     | 30.00th=[  433], 40.00th=[ 1369], 50.00th=[ 1582], 60.00th=[ 1729],
     | 70.00th=[ 1876], 80.00th=[ 2040], 90.00th=[ 2442], 95.00th=[ 3228],
     | 99.00th=[ 4080], 99.50th=[ 4293], 99.90th=[ 4817], 99.95th=[ 5014],
     | 99.99th=[ 5538]
   bw (  KiB/s): min=35088, max=181360, per=100.00%, avg=87238.08, stdev=44443.02, samples=12
   iops        : min= 4386, max=22670, avg=10904.75, stdev=5555.37, samples=12
  write: IOPS=10.8k, BW=84.6MiB/s (88.7MB/s)(511MiB/6035msec); 0 zone resets
    slat (usec): min=15, max=513, avg=23.96, stdev=13.35
    clat (usec): min=202, max=5780, avg=1422.69, stdev=942.18
     lat (usec): min=220, max=5837, avg=1446.79, stdev=950.37
    clat percentiles (usec):
     |  1.00th=[  297],  5.00th=[  334], 10.00th=[  355], 20.00th=[  388],
     | 30.00th=[  429], 40.00th=[ 1336], 50.00th=[ 1565], 60.00th=[ 1713],
     | 70.00th=[ 1860], 80.00th=[ 2040], 90.00th=[ 2442], 95.00th=[ 3261],
     | 99.00th=[ 4113], 99.50th=[ 4359], 99.90th=[ 4948], 99.95th=[ 5014],
     | 99.99th=[ 5473]
   bw (  KiB/s): min=35680, max=180320, per=100.00%, avg=86767.08, stdev=44345.16, samples=12
   iops        : min= 4460, max=22540, avg=10845.83, stdev=5543.11, samples=12
  lat (usec)   : 4=0.01%, 250=0.05%, 500=34.46%, 750=0.93%, 1000=0.53%
  lat (msec)   : 2=42.13%, 4=20.67%, 10=1.24%
  cpu          : usr=4.09%, sys=55.42%, ctx=41928, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=513MiB (538MB), run=6035-6035msec
  WRITE: bw=84.6MiB/s (88.7MB/s), 84.6MiB/s-84.6MiB/s (88.7MB/s-88.7MB/s), io=511MiB (535MB), run=6035-6035msec


# fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_rand_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=50.2MiB/s,w=49.2MiB/s][r=6421,w=6302 IOPS][eta 00m:00s]
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=129283: Tue Sep 21 08:19:05 2021
  read: IOPS=5951, BW=46.5MiB/s (48.8MB/s)(513MiB/11042msec)
    slat (usec): min=6, max=977, avg=135.52, stdev=39.91
    clat (usec): min=2, max=5610, avg=2613.82, stdev=562.65
     lat (usec): min=122, max=5800, avg=2749.62, stdev=580.03
    clat percentiles (usec):
     |  1.00th=[ 1532],  5.00th=[ 1827], 10.00th=[ 1975], 20.00th=[ 2180],
     | 30.00th=[ 2311], 40.00th=[ 2442], 50.00th=[ 2573], 60.00th=[ 2671],
     | 70.00th=[ 2802], 80.00th=[ 2999], 90.00th=[ 3261], 95.00th=[ 3589],
     | 99.00th=[ 4555], 99.50th=[ 4817], 99.90th=[ 5145], 99.95th=[ 5276],
     | 99.99th=[ 5473]
   bw (  KiB/s): min=31168, max=59360, per=99.94%, avg=47578.82, stdev=6888.74, samples=22
   iops        : min= 3896, max= 7420, avg=5947.32, stdev=861.04, samples=22
  write: IOPS=5919, BW=46.2MiB/s (48.5MB/s)(511MiB/11042msec); 0 zone resets
    slat (usec): min=15, max=307, avg=28.81, stdev=11.24
    clat (usec): min=486, max=5728, avg=2609.74, stdev=559.32
     lat (usec): min=512, max=5799, avg=2638.70, stdev=565.08
    clat percentiles (usec):
     |  1.00th=[ 1532],  5.00th=[ 1827], 10.00th=[ 1975], 20.00th=[ 2180],
     | 30.00th=[ 2311], 40.00th=[ 2442], 50.00th=[ 2540], 60.00th=[ 2671],
     | 70.00th=[ 2802], 80.00th=[ 2966], 90.00th=[ 3261], 95.00th=[ 3589],
     | 99.00th=[ 4555], 99.50th=[ 4817], 99.90th=[ 5211], 99.95th=[ 5342],
     | 99.99th=[ 5604]
   bw (  KiB/s): min=28992, max=59168, per=99.94%, avg=47323.55, stdev=7068.76, samples=22
   iops        : min= 3624, max= 7396, avg=5915.41, stdev=883.55, samples=22
  lat (usec)   : 4=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=10.67%, 4=86.65%, 10=2.68%
  cpu          : usr=2.06%, sys=45.39%, ctx=62215, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=46.5MiB/s (48.8MB/s), 46.5MiB/s-46.5MiB/s (48.8MB/s-48.8MB/s), io=513MiB (538MB), run=11042-11042msec
  WRITE: bw=46.2MiB/s (48.5MB/s), 46.2MiB/s-46.2MiB/s (48.5MB/s-48.5MB/s), io=511MiB (535MB), run=11042-11042msec

ZFS ashift=12, recordsize=8k


zpool create -o ashift=12 KIOXIAZFS \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1 \
    mirror /dev/nvme4n1 /dev/nvme5n1 \
    mirror /dev/nvme6n1 /dev/nvme7n1

zfs create KIOXIAZFS/DATA \
  -o mountpoint=/mnt/data \
  -o atime=off \
  -o xattr=sa \
  -o relatime=on \
  -o compression=lz4 \
  -o recordsize=8k \
  -o logbias=throughput \
  -o primarycache=metadata

# fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=test_seq --bs=8k --iodepth=32 --size=1G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_seq_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=61.2MiB/s,w=61.2MiB/s][r=7831,w=7836 IOPS][eta 00m:00s]
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=31718: Tue Sep 21 08:14:33 2021
  read: IOPS=10.9k, BW=84.9MiB/s (89.0MB/s)(513MiB/6048msec)
    slat (usec): min=2, max=665, avg=73.20, stdev=59.51
    clat (usec): min=2, max=5064, avg=1437.35, stdev=948.45
     lat (usec): min=60, max=5155, avg=1510.75, stdev=999.76
    clat percentiles (usec):
     |  1.00th=[  215],  5.00th=[  233], 10.00th=[  247], 20.00th=[  265],
     | 30.00th=[  297], 40.00th=[ 1336], 50.00th=[ 1713], 60.00th=[ 1942],
     | 70.00th=[ 2114], 80.00th=[ 2311], 90.00th=[ 2540], 95.00th=[ 2737],
     | 99.00th=[ 3228], 99.50th=[ 3490], 99.90th=[ 4228], 99.95th=[ 4490],
     | 99.99th=[ 4686]
   bw (  KiB/s): min=51408, max=192816, per=100.00%, avg=86929.33, stdev=49806.79, samples=12
   iops        : min= 6426, max=24102, avg=10866.17, stdev=6225.85, samples=12
  write: IOPS=10.8k, BW=84.4MiB/s (88.5MB/s)(511MiB/6048msec); 0 zone resets
    slat (usec): min=8, max=216, avg=16.18, stdev= 8.02
    clat (usec): min=159, max=5076, avg=1423.53, stdev=948.80
     lat (usec): min=173, max=5110, avg=1439.84, stdev=952.79
    clat percentiles (usec):
     |  1.00th=[  212],  5.00th=[  233], 10.00th=[  245], 20.00th=[  265],
     | 30.00th=[  293], 40.00th=[ 1303], 50.00th=[ 1696], 60.00th=[ 1926],
     | 70.00th=[ 2114], 80.00th=[ 2311], 90.00th=[ 2540], 95.00th=[ 2737],
     | 99.00th=[ 3195], 99.50th=[ 3490], 99.90th=[ 4113], 99.95th=[ 4490],
     | 99.99th=[ 4686]
   bw (  KiB/s): min=49808, max=191824, per=100.00%, avg=86504.00, stdev=49354.47, samples=12
   iops        : min= 6226, max=23978, avg=10813.00, stdev=6169.31, samples=12
  lat (usec)   : 4=0.01%, 100=0.01%, 250=12.09%, 500=23.03%, 750=0.37%
  lat (usec)   : 1000=0.62%
  lat (msec)   : 2=27.55%, 4=36.22%, 10=0.13%
  cpu          : usr=3.19%, sys=46.30%, ctx=41920, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=84.9MiB/s (89.0MB/s), 84.9MiB/s-84.9MiB/s (89.0MB/s-89.0MB/s), io=513MiB (538MB), run=6048-6048msec
  WRITE: bw=84.4MiB/s (88.5MB/s), 84.4MiB/s-84.4MiB/s (88.5MB/s-88.5MB/s), io=511MiB (535MB), run=6048-6048msec


# fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_rand_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=59.4MiB/s,w=58.5MiB/s][r=7598,w=7488 IOPS][eta 00m:00s]
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=50934: Tue Sep 21 08:15:03 2021
  read: IOPS=8180, BW=63.9MiB/s (67.0MB/s)(513MiB/8033msec)
    slat (usec): min=4, max=574, avg=103.64, stdev=38.28
    clat (nsec): min=1844, max=4425.3k, avg=1901616.08, stdev=396829.96
     lat (usec): min=61, max=4563, avg=2005.51, stdev=407.88
    clat percentiles (usec):
     |  1.00th=[ 1090],  5.00th=[ 1303], 10.00th=[ 1434], 20.00th=[ 1582],
     | 30.00th=[ 1696], 40.00th=[ 1795], 50.00th=[ 1876], 60.00th=[ 1975],
     | 70.00th=[ 2073], 80.00th=[ 2212], 90.00th=[ 2376], 95.00th=[ 2540],
     | 99.00th=[ 3195], 99.50th=[ 3458], 99.90th=[ 3851], 99.95th=[ 3949],
     | 99.99th=[ 4146]
   bw (  KiB/s): min=48800, max=74192, per=100.00%, avg=65445.00, stdev=7991.49, samples=16
   iops        : min= 6100, max= 9274, avg=8180.62, stdev=998.94, samples=16
  write: IOPS=8136, BW=63.6MiB/s (66.7MB/s)(511MiB/8033msec); 0 zone resets
    slat (usec): min=7, max=120, avg=15.33, stdev= 5.11
    clat (usec): min=429, max=4384, avg=1898.58, stdev=395.52
     lat (usec): min=445, max=4411, avg=1914.04, stdev=397.29
    clat percentiles (usec):
     |  1.00th=[ 1090],  5.00th=[ 1303], 10.00th=[ 1418], 20.00th=[ 1582],
     | 30.00th=[ 1696], 40.00th=[ 1795], 50.00th=[ 1876], 60.00th=[ 1975],
     | 70.00th=[ 2073], 80.00th=[ 2180], 90.00th=[ 2376], 95.00th=[ 2540],
     | 99.00th=[ 3130], 99.50th=[ 3458], 99.90th=[ 3916], 99.95th=[ 4047],
     | 99.99th=[ 4293]
   bw (  KiB/s): min=47952, max=76224, per=100.00%, avg=65092.00, stdev=8576.42, samples=16
   iops        : min= 5994, max= 9528, avg=8136.50, stdev=1072.05, samples=16
  lat (usec)   : 2=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.39%
  lat (msec)   : 2=62.39%, 4=37.16%, 10=0.05%
  cpu          : usr=3.19%, sys=31.21%, ctx=58644, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=63.9MiB/s (67.0MB/s), 63.9MiB/s-63.9MiB/s (67.0MB/s-67.0MB/s), io=513MiB (538MB), run=8033-8033msec
  WRITE: bw=63.6MiB/s (66.7MB/s), 63.6MiB/s-63.6MiB/s (66.7MB/s-66.7MB/s), io=511MiB (535MB), run=8033-8033msec

ZFS ashift=13, recordsize=8k


zpool create -o ashift=13 KIOXIAZFS \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1 \
    mirror /dev/nvme4n1 /dev/nvme5n1 \
    mirror /dev/nvme6n1 /dev/nvme7n1

zfs create KIOXIAZFS/DATA \
  -o mountpoint=/mnt/data \
  -o atime=off \
  -o xattr=sa \
  -o relatime=on \
  -o compression=lz4 \
  -o recordsize=8k \
  -o logbias=throughput \
  -o primarycache=metadata

# fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=test_seq --bs=8k --iodepth=32 --size=1G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_seq_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1)
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=151051: Tue Sep 21 08:21:21 2021
  read: IOPS=45.2k, BW=353MiB/s (370MB/s)(513MiB/1454msec)
    slat (nsec): min=4999, max=99749, avg=9933.32, stdev=3793.92
    clat (usec): min=23, max=2812, avg=395.48, stdev=363.80
     lat (usec): min=31, max=2820, avg=405.56, stdev=363.45
    clat percentiles (usec):
     |  1.00th=[   57],  5.00th=[  217], 10.00th=[  253], 20.00th=[  285],
     | 30.00th=[  302], 40.00th=[  314], 50.00th=[  322], 60.00th=[  330],
     | 70.00th=[  343], 80.00th=[  367], 90.00th=[  437], 95.00th=[  603],
     | 99.00th=[ 2245], 99.50th=[ 2343], 99.90th=[ 2507], 99.95th=[ 2573],
     | 99.99th=[ 2704]
   bw (  KiB/s): min=345920, max=353680, per=96.75%, avg=349800.00, stdev=5487.15, samples=2
   iops        : min=43240, max=44210, avg=43725.00, stdev=685.89, samples=2
  write: IOPS=44.0k, BW=351MiB/s (368MB/s)(511MiB/1454msec); 0 zone resets
    slat (usec): min=2, max=267, avg= 7.53, stdev= 3.35
    clat (usec): min=37, max=697, avg=294.93, stdev=69.13
     lat (usec): min=41, max=704, avg=302.58, stdev=69.52
    clat percentiles (usec):
     |  1.00th=[   64],  5.00th=[  145], 10.00th=[  204], 20.00th=[  258],
     | 30.00th=[  285], 40.00th=[  302], 50.00th=[  314], 60.00th=[  322],
     | 70.00th=[  330], 80.00th=[  338], 90.00th=[  355], 95.00th=[  375],
     | 99.00th=[  433], 99.50th=[  457], 99.90th=[  523], 99.95th=[  562],
     | 99.99th=[  627]
   bw (  KiB/s): min=345184, max=349456, per=96.58%, avg=347320.00, stdev=3020.76, samples=2
   iops        : min=43148, max=43682, avg=43415.00, stdev=377.60, samples=2
  lat (usec)   : 50=0.57%, 100=1.71%, 250=11.56%, 500=82.62%, 750=1.35%
  lat (usec)   : 1000=0.10%
  lat (msec)   : 2=0.67%, 4=1.42%
  cpu          : usr=15.42%, sys=77.49%, ctx=801, majf=0, minf=14
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=353MiB/s (370MB/s), 353MiB/s-353MiB/s (370MB/s-370MB/s), io=513MiB (538MB), run=1454-1454msec
  WRITE: bw=351MiB/s (368MB/s), 351MiB/s-351MiB/s (368MB/s-368MB/s), io=511MiB (535MB), run=1454-1454msec

Disk stats (read/write):
    dm-1: ios=56881/56649, merge=0/0, ticks=10348/3096, in_queue=13444, util=92.87%, aggrios=65713/65359, aggrmerge=0/0, aggrticks=11320/3480, aggrin_queue=14800, aggrutil=90.43%
    dm-0: ios=65713/65359, merge=0/0, ticks=11320/3480, in_queue=14800, util=90.43%, aggrios=65713/65359, aggrmerge=0/0, aggrticks=10120/1824, aggrin_queue=11944, aggrutil=90.43%

    md0: ios=65713/65359, merge=0/0, ticks=10120/1824, in_queue=11944, util=90.43%, aggrios=32856/52550, aggrmerge=0/12809, aggrticks=4604/1123, aggrin_queue=5728, aggrutil=90.43%

  nvme9n1: ios=18/52550, merge=0/12809, ticks=1/985, in_queue=986, util=90.43%

  nvme8n1: ios=65695/52550, merge=0/12809, ticks=9208/1262, in_queue=10470, util=90.43%


# fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50

test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32

fio-3.16

Starting 1 process

test_rand_mix_rw: Laying out IO file (1 file / 1024MiB)

Jobs: 1 (f=1)

test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=151447: Tue Sep 21 08:21:57 2021

  read: IOPS=52.0k, BW=414MiB/s (434MB/s)(513MiB/1240msec)

    slat (usec): min=4, max=210, avg= 8.44, stdev= 3.77

    clat (usec): min=85, max=1167, avg=331.94, stdev=83.33

     lat (usec): min=93, max=1174, avg=340.51, stdev=83.56

    clat percentiles (usec):

     |  1.00th=[  215],  5.00th=[  247], 10.00th=[  260], 20.00th=[  277],

     | 30.00th=[  285], 40.00th=[  297], 50.00th=[  310], 60.00th=[  326],

     | 70.00th=[  347], 80.00th=[  371], 90.00th=[  437], 95.00th=[  502],

     | 99.00th=[  652], 99.50th=[  693], 99.90th=[  775], 99.95th=[  914],

     | 99.99th=[ 1020]

   bw (  KiB/s): min=403776, max=431728, per=98.54%, avg=417752.00, stdev=19765.05, samples=2

   iops        : min=50472, max=53966, avg=52219.00, stdev=2470.63, samples=2

  write: IOPS=52.7k, BW=412MiB/s (432MB/s)(511MiB/1240msec); 0 zone resets

    slat (nsec): min=2655, max=85683, avg=7335.15, stdev=3054.53

    clat (usec): min=88, max=634, avg=255.64, stdev=39.96

     lat (usec): min=96, max=637, avg=263.10, stdev=40.43

    clat percentiles (usec):

     |  1.00th=[  182],  5.00th=[  202], 10.00th=[  212], 20.00th=[  225],

     | 30.00th=[  235], 40.00th=[  241], 50.00th=[  249], 60.00th=[  258],

     | 70.00th=[  269], 80.00th=[  285], 90.00th=[  306], 95.00th=[  326],

     | 99.00th=[  379], 99.50th=[  408], 99.90th=[  494], 99.95th=[  537],

     | 99.99th=[  594]

   bw (  KiB/s): min=403376, max=428896, per=98.69%, avg=416136.00, stdev=18045.37, samples=2

   iops        : min=50422, max=53612, avg=52017.00, stdev=2255.67, samples=2

  lat (usec)   : 100=0.01%, 250=28.44%, 500=68.95%, 750=2.53%, 1000=0.06%

  lat (msec)   : 2=0.01%

  cpu          : usr=15.74%, sys=84.10%, ctx=40, majf=0, minf=12

  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%

     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%

     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0

     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):

   READ: bw=414MiB/s (434MB/s), 414MiB/s-414MiB/s (434MB/s-434MB/s), io=513MiB (538MB), run=1240-1240msec

  WRITE: bw=412MiB/s (432MB/s), 412MiB/s-412MiB/s (432MB/s-432MB/s), io=511MiB (535MB), run=1240-1240msec

Disk stats (read/write):

    dm-1: ios=53245/53054, merge=0/0, ticks=7332/2856, in_queue=10188, util=91.15%, aggrios=65713/65359, aggrmerge=0/0, aggrticks=9124/3340, aggrin_queue=12464, aggrutil=89.56%

    dm-0: ios=65713/65359, merge=0/0, ticks=9124/3340, in_queue=12464, util=89.56%, aggrios=65713/65359, aggrmerge=0/0, aggrticks=7896/1528, aggrin_queue=9424, aggrutil=89.56%

    md0: ios=65713/65359, merge=0/0, ticks=7896/1528, in_queue=9424, util=89.56%, aggrios=32856/65340, aggrmerge=0/19, aggrticks=3955/1253, aggrin_queue=5208, aggrutil=89.56%

  nvme9n1: ios=34157/65340, merge=0/19, ticks=4208/1305, in_queue=5513, util=89.56%

  nvme8n1: ios=31556/65340, merge=0/19, ticks=3703/1202, in_queue=4904, util=89.56%

What a huge increase in IOPS on ZFS going to ashift=13.
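
For anyone following along: ashift is the base-2 exponent of the sector size ZFS assumes, so ashift=13 works out to 8 KiB, the same size as the recordsize=8k and the fio bs=8k used above. A quick sanity check of that arithmetic (the function name is just something I made up for illustration):

```shell
# ashift is the base-2 exponent of the sector size ZFS assumes for a vdev.
ashift_to_bytes() {
    echo $(( 1 << $1 ))
}

for a in 9 12 13; do
    printf 'ashift=%s -> %s bytes\n' "$a" "$(ashift_to_bytes "$a")"
done
```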

Reading through `man mkfs.xfs`, I see that:

-b block_size_options

This option specifies the fundamental block size of the filesystem. The valid block_size_options are: log= value or size= value and only one can be supplied. The block size is specified either as a base two logarithm value with log= , or in bytes with size= . The default value is 4096 bytes (4 KiB), the minimum is 512, and the maximum is 65536 (64 KiB). XFS on Linux currently only supports pagesize or smaller blocks.

Which coincides with your dmesg log.

The "-s" value cannot be larger than the "-b" value:

-s sector_size

This option specifies the fundamental sector size of the filesystem. The sector_size is specified either as a value in bytes with size= value or as a base two logarithm value with log= value. The default sector_size is 512 bytes. The minimum value for sector size is 512; the maximum is 32768 (32 KiB). The sector_size must be a power of 2 size and cannot be made larger than the filesystem block size.

So maybe try:

mkfs.xfs -b size=8192 -s size=4096 /dev/nvme11n1

Or do that for the whole md127 array.

Yeah, it probably needs some tuning. XFS is supposed to be better for large files, so it’s no wonder it is faster in its default configuration (also, I believe it has lower overhead).

One thing to note: have you tried FIO on each individual drive? Maybe there’s one that slows down the array. If they all have similar results, try a test with md raid-devices=4, then 6, and see if it makes a difference.


Again, I’m just throwing s*** at the wall and seeing what sticks

@wendell if it’s not too much to ask, mind giving us chocamo a hand with ZFS or maybe XFS?
:upside_down_face:

That is the highest I’ve seen across the various configurations I’ve tried. Given the specs of these drives, I still feel like I’m on the lower end of things.

mkfs.xfs -b size=8192 -s size=4096 /dev/nvme11n1

results:

mount: /mnt/kioxia: mount(2) system call failed: Function not implemented.
(dmesg) XFS (nvme11n1): File system with blocksize 8192 bytes. Only pagesize (4096) or less will currently work.
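
The dmesg line spells out the limit: on this kernel the XFS block size can't be bigger than the page size, so 8192 is out on a 4096-byte-page machine. A small sketch for checking that before running mkfs (the function name is my own; `getconf PAGESIZE` is the standard way to read the page size):

```shell
# Succeeds if an XFS block size (in bytes) is no larger than the page size.
# The second argument is optional and defaults to this machine's page size.
xfs_block_size_ok() {
    [ "$1" -le "${2:-$(getconf PAGESIZE)}" ]
}

for bs in 4096 8192; do
    if xfs_block_size_ok "$bs"; then
        echo "bs=$bs: no larger than the page size, should be mountable here"
    else
        echo "bs=$bs: larger than the page size ($(getconf PAGESIZE))"
    fi
done
```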

Not recently. I’m checking some postgres stuff on the current ZFS configuration. I’ll check that and update in a bit

At this point, that makes 2 of us.


The disks all have very similar fio numbers when tested directly. These are the direct-disk results for block sizes from 4k to 1024k:

test_seq_mix_rw: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process

test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3782475: Wed Sep 22 10:11:37 2021
  read: IOPS=99.7k, BW=389MiB/s (408MB/s)(512MiB/1315msec)
    slat (nsec): min=1953, max=36419, avg=3153.28, stdev=1499.19
    clat (usec): min=10, max=2272, avg=192.07, stdev=196.03
     lat (usec): min=14, max=2274, avg=195.35, stdev=195.97
    clat percentiles (usec):
     |  1.00th=[   52],  5.00th=[  116], 10.00th=[  137], 20.00th=[  147],
     | 30.00th=[  151], 40.00th=[  153], 50.00th=[  155], 60.00th=[  161],
     | 70.00th=[  178], 80.00th=[  196], 90.00th=[  249], 95.00th=[  306],
     | 99.00th=[ 1778], 99.50th=[ 2008], 99.90th=[ 2147], 99.95th=[ 2180],
     | 99.99th=[ 2245]
   bw (  KiB/s): min=376840, max=412136, per=98.97%, avg=394488.00, stdev=24958.04, samples=2
   iops        : min=94210, max=103034, avg=98622.00, stdev=6239.51, samples=2
  write: IOPS=99.7k, BW=389MiB/s (408MB/s)(512MiB/1315msec); 0 zone resets
    slat (nsec): min=2024, max=35157, avg=3555.56, stdev=1617.24
    clat (usec): min=8, max=317, avg=120.72, stdev=34.81
     lat (usec): min=12, max=322, avg=124.42, stdev=34.96
    clat percentiles (usec):
     |  1.00th=[   17],  5.00th=[   43], 10.00th=[   65], 20.00th=[   96],
     | 30.00th=[  117], 40.00th=[  127], 50.00th=[  133], 60.00th=[  139],
     | 70.00th=[  143], 80.00th=[  147], 90.00th=[  153], 95.00th=[  155],
     | 99.00th=[  165], 99.50th=[  167], 99.90th=[  202], 99.95th=[  221],
     | 99.99th=[  277]
   bw (  KiB/s): min=374392, max=413312, per=98.76%, avg=393852.00, stdev=27520.60, samples=2
   iops        : min=93598, max=103328, avg=98463.00, stdev=6880.15, samples=2
  lat (usec)   : 10=0.01%, 20=1.09%, 50=2.63%, 100=8.89%, 250=82.43%
  lat (usec)   : 500=4.25%, 750=0.12%, 1000=0.05%
  lat (msec)   : 2=0.28%, 4=0.25%
  cpu          : usr=33.87%, sys=61.72%, ctx=1054, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=131040,131104,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=389MiB/s (408MB/s), 389MiB/s-389MiB/s (408MB/s-408MB/s), io=512MiB (537MB), run=1315-1315msec
  WRITE: bw=389MiB/s (408MB/s), 389MiB/s-389MiB/s (408MB/s-408MB/s), io=512MiB (537MB), run=1315-1315msec

Disk stats (read/write):
  nvme0n1: ios=113936/113631, merge=0/0, ticks=9782/1291, in_queue=11073, util=92.78%
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process

test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3782616: Wed Sep 22 10:11:43 2021
  read: IOPS=77.7k, BW=607MiB/s (636MB/s)(513MiB/846msec)
    slat (nsec): min=2375, max=55906, avg=3743.51, stdev=1649.55
    clat (usec): min=10, max=2424, avg=285.78, stdev=352.78
     lat (usec): min=15, max=2428, avg=289.65, stdev=352.67
    clat percentiles (usec):
     |  1.00th=[   37],  5.00th=[   87], 10.00th=[  129], 20.00th=[  163],
     | 30.00th=[  169], 40.00th=[  174], 50.00th=[  182], 60.00th=[  208],
     | 70.00th=[  255], 80.00th=[  318], 90.00th=[  420], 95.00th=[  510],
     | 99.00th=[ 2114], 99.50th=[ 2147], 99.90th=[ 2245], 99.95th=[ 2278],
     | 99.99th=[ 2343]
   bw (  KiB/s): min=580448, max=580448, per=93.41%, avg=580448.00, stdev= 0.00, samples=1
   iops        : min=72556, max=72556, avg=72556.00, stdev= 0.00, samples=1
  write: IOPS=77.3k, BW=604MiB/s (633MB/s)(511MiB/846msec); 0 zone resets
    slat (nsec): min=2474, max=35828, avg=4404.78, stdev=1764.10
    clat (usec): min=9, max=595, avg=117.12, stdev=53.01
     lat (usec): min=14, max=604, avg=121.66, stdev=53.17
    clat percentiles (usec):
     |  1.00th=[   15],  5.00th=[   23], 10.00th=[   38], 20.00th=[   62],
     | 30.00th=[   86], 40.00th=[  108], 50.00th=[  130], 60.00th=[  149],
     | 70.00th=[  159], 80.00th=[  169], 90.00th=[  174], 95.00th=[  176],
     | 99.00th=[  196], 99.50th=[  233], 99.90th=[  322], 99.95th=[  375],
     | 99.99th=[  469]
   bw (  KiB/s): min=581680, max=581680, per=94.12%, avg=581680.00, stdev= 0.00, samples=1
   iops        : min=72710, max=72710, avg=72710.00, stdev= 0.00, samples=1
  lat (usec)   : 10=0.01%, 20=2.23%, 50=7.00%, 100=12.03%, 250=63.20%
  lat (usec)   : 500=12.85%, 750=0.77%, 1000=0.06%
  lat (msec)   : 2=0.82%, 4=1.04%
  cpu          : usr=31.36%, sys=54.20%, ctx=1925, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=607MiB/s (636MB/s), 607MiB/s-607MiB/s (636MB/s-636MB/s), io=513MiB (538MB), run=846-846msec
  WRITE: bw=604MiB/s (633MB/s), 604MiB/s-604MiB/s (633MB/s-633MB/s), io=511MiB (535MB), run=846-846msec

Disk stats (read/write):
  nvme0n1: ios=48887/48555, merge=0/0, ticks=8772/788, in_queue=9560, util=87.94%
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process

test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3782757: Wed Sep 22 10:11:50 2021
  read: IOPS=56.2k, BW=877MiB/s (920MB/s)(512MiB/583msec)
    slat (nsec): min=2766, max=64623, avg=4264.59, stdev=1883.07
    clat (usec): min=13, max=2772, avg=427.73, stdev=573.49
     lat (usec): min=18, max=2775, avg=432.13, stdev=573.35
    clat percentiles (usec):
     |  1.00th=[   23],  5.00th=[   53], 10.00th=[   90], 20.00th=[  155],
     | 30.00th=[  180], 40.00th=[  190], 50.00th=[  198], 60.00th=[  225],
     | 70.00th=[  302], 80.00th=[  469], 90.00th=[ 1598], 95.00th=[ 2057],
     | 99.00th=[ 2212], 99.50th=[ 2278], 99.90th=[ 2409], 99.95th=[ 2606],
     | 99.99th=[ 2769]
   bw (  KiB/s): min=926880, max=926880, per=100.00%, avg=926880.00, stdev= 0.00, samples=1
   iops        : min=57930, max=57930, avg=57930.00, stdev= 0.00, samples=1
  write: IOPS=56.3k, BW=879MiB/s (922MB/s)(512MiB/583msec); 0 zone resets
    slat (nsec): min=2835, max=43553, avg=5180.80, stdev=2009.20
    clat (usec): min=5, max=1843, avg=130.07, stdev=75.49
     lat (usec): min=17, max=1846, avg=135.39, stdev=75.55
    clat percentiles (usec):
     |  1.00th=[   17],  5.00th=[   22], 10.00th=[   33], 20.00th=[   59],
     | 30.00th=[   86], 40.00th=[  113], 50.00th=[  141], 60.00th=[  159],
     | 70.00th=[  174], 80.00th=[  186], 90.00th=[  192], 95.00th=[  204],
     | 99.00th=[  392], 99.50th=[  465], 99.90th=[  586], 99.95th=[  627],
     | 99.99th=[ 1156]
   bw (  KiB/s): min=929024, max=929024, per=100.00%, avg=929024.00, stdev= 0.00, samples=1
   iops        : min=58064, max=58064, avg=58064.00, stdev= 0.00, samples=1
  lat (usec)   : 10=0.01%, 20=2.42%, 50=8.17%, 100=12.86%, 250=57.93%
  lat (usec)   : 500=9.36%, 750=3.28%, 1000=0.28%
  lat (msec)   : 2=2.38%, 4=3.31%
  cpu          : usr=27.15%, sys=42.44%, ctx=1748, majf=0, minf=14
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=32737,32799,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=877MiB/s (920MB/s), 877MiB/s-877MiB/s (920MB/s-920MB/s), io=512MiB (536MB), run=583-583msec
  WRITE: bw=879MiB/s (922MB/s), 879MiB/s-879MiB/s (922MB/s-922MB/s), io=512MiB (537MB), run=583-583msec

Disk stats (read/write):
  nvme0n1: ios=22036/21938, merge=0/0, ticks=6864/704, in_queue=7568, util=82.26%
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 32.0KiB-32.0KiB, (W) 32.0KiB-32.0KiB, (T) 32.0KiB-32.0KiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process

test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3782898: Wed Sep 22 10:11:57 2021
  read: IOPS=38.8k, BW=1214MiB/s (1273MB/s)(512MiB/422msec)
    slat (nsec): min=2534, max=73940, avg=4650.70, stdev=2095.33
    clat (usec): min=20, max=2825, avg=690.39, stdev=711.76
     lat (usec): min=25, max=2829, avg=695.18, stdev=711.60
    clat percentiles (usec):
     |  1.00th=[   34],  5.00th=[   75], 10.00th=[  111], 20.00th=[  182],
     | 30.00th=[  225], 40.00th=[  281], 50.00th=[  371], 60.00th=[  519],
     | 70.00th=[  668], 80.00th=[ 1221], 90.00th=[ 2114], 95.00th=[ 2212],
     | 99.00th=[ 2343], 99.50th=[ 2409], 99.90th=[ 2540], 99.95th=[ 2638],
     | 99.99th=[ 2802]
  write: IOPS=38.8k, BW=1213MiB/s (1272MB/s)(512MiB/422msec); 0 zone resets
    slat (nsec): min=2615, max=42871, avg=5712.32, stdev=2388.70
    clat (usec): min=16, max=2203, avg=120.68, stdev=133.91
     lat (usec): min=21, max=2209, avg=126.51, stdev=133.95
    clat percentiles (usec):
     |  1.00th=[   22],  5.00th=[   24], 10.00th=[   28], 20.00th=[   38],
     | 30.00th=[   52], 40.00th=[   68], 50.00th=[   84], 60.00th=[  102],
     | 70.00th=[  125], 80.00th=[  155], 90.00th=[  219], 95.00th=[  429],
     | 99.00th=[  652], 99.50th=[  709], 99.90th=[ 1045], 99.95th=[ 1123],
     | 99.99th=[ 2180]
  lat (usec)   : 20=0.12%, 50=15.68%, 100=17.97%, 250=29.50%, 500=14.36%
  lat (usec)   : 750=9.06%, 1000=2.32%
  lat (msec)   : 2=4.05%, 4=6.94%
  cpu          : usr=17.81%, sys=34.92%, ctx=2933, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=16390,16378,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1214MiB/s (1273MB/s), 1214MiB/s-1214MiB/s (1273MB/s-1273MB/s), io=512MiB (537MB), run=422-422msec
  WRITE: bw=1213MiB/s (1272MB/s), 1213MiB/s-1213MiB/s (1272MB/s-1272MB/s), io=512MiB (537MB), run=422-422msec

Disk stats (read/write):
  nvme0n1: ios=14858/14766, merge=0/0, ticks=9883/962, in_queue=10844, util=82.26%
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process

test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3783039: Wed Sep 22 10:12:07 2021
  read: IOPS=23.8k, BW=1487MiB/s (1559MB/s)(510MiB/343msec)
    slat (nsec): min=4528, max=37170, avg=6283.33, stdev=2047.06
    clat (usec): min=37, max=3490, avg=1239.57, stdev=622.83
     lat (usec): min=42, max=3495, avg=1246.00, stdev=622.68
    clat percentiles (usec):
     |  1.00th=[  153],  5.00th=[  424], 10.00th=[  537], 20.00th=[  644],
     | 30.00th=[  750], 40.00th=[  898], 50.00th=[ 1123], 60.00th=[ 1385],
     | 70.00th=[ 1696], 80.00th=[ 1926], 90.00th=[ 2114], 95.00th=[ 2212],
     | 99.00th=[ 2474], 99.50th=[ 2573], 99.90th=[ 3261], 99.95th=[ 3359],
     | 99.99th=[ 3490]
  write: IOPS=23.0k, BW=1499MiB/s (1571MB/s)(514MiB/343msec); 0 zone resets
    slat (nsec): min=4578, max=58120, avg=8480.63, stdev=2813.91
    clat (usec): min=26, max=832, avg=84.97, stdev=67.67
     lat (usec): min=34, max=840, avg=93.59, stdev=67.84
    clat percentiles (usec):
     |  1.00th=[   31],  5.00th=[   32], 10.00th=[   34], 20.00th=[   40],
     | 30.00th=[   48], 40.00th=[   56], 50.00th=[   67], 60.00th=[   78],
     | 70.00th=[   94], 80.00th=[  115], 90.00th=[  147], 95.00th=[  194],
     | 99.00th=[  412], 99.50th=[  474], 99.90th=[  529], 99.95th=[  586],
     | 99.99th=[  832]
  lat (usec)   : 50=17.29%, 100=19.84%, 250=12.13%, 500=4.68%, 750=11.24%
  lat (usec)   : 1000=7.31%
  lat (msec)   : 2=19.54%, 4=7.97%
  cpu          : usr=16.96%, sys=28.95%, ctx=4714, majf=0, minf=15
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=8160,8224,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1487MiB/s (1559MB/s), 1487MiB/s-1487MiB/s (1559MB/s-1559MB/s), io=510MiB (535MB), run=343-343msec
  WRITE: bw=1499MiB/s (1571MB/s), 1499MiB/s-1499MiB/s (1571MB/s-1571MB/s), io=514MiB (539MB), run=343-343msec

Disk stats (read/write):
  nvme0n1: ios=3808/3909, merge=0/0, ticks=4035/321, in_queue=4356, util=63.16%
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process

test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3783181: Wed Sep 22 10:12:15 2021
  read: IOPS=14.7k, BW=1834MiB/s (1923MB/s)(506MiB/276msec)
    slat (nsec): min=6162, max=45607, avg=8454.63, stdev=1975.36
    clat (usec): min=62, max=5533, avg=1836.18, stdev=793.24
     lat (usec): min=70, max=5541, avg=1844.76, stdev=793.06
    clat percentiles (usec):
     |  1.00th=[  343],  5.00th=[  685], 10.00th=[  922], 20.00th=[ 1205],
     | 30.00th=[ 1401], 40.00th=[ 1582], 50.00th=[ 1745], 60.00th=[ 1926],
     | 70.00th=[ 2180], 80.00th=[ 2409], 90.00th=[ 2769], 95.00th=[ 3163],
     | 99.00th=[ 4490], 99.50th=[ 4883], 99.90th=[ 5473], 99.95th=[ 5473],
     | 99.99th=[ 5538]
  write: IOPS=15.0k, BW=1876MiB/s (1967MB/s)(518MiB/276msec); 0 zone resets
    slat (nsec): min=6172, max=50576, avg=11992.60, stdev=3027.74
    clat (usec): min=44, max=2834, avg=308.68, stdev=300.41
     lat (usec): min=56, max=2846, avg=320.81, stdev=300.25
    clat percentiles (usec):
     |  1.00th=[   50],  5.00th=[   52], 10.00th=[   56], 20.00th=[   72],
     | 30.00th=[   97], 40.00th=[  125], 50.00th=[  167], 60.00th=[  249],
     | 70.00th=[  396], 80.00th=[  570], 90.00th=[  791], 95.00th=[  922],
     | 99.00th=[ 1156], 99.50th=[ 1237], 99.90th=[ 1696], 99.95th=[ 2278],
     | 99.99th=[ 2835]
  lat (usec)   : 50=1.09%, 100=15.27%, 250=14.40%, 500=9.29%, 750=7.59%
  lat (usec)   : 1000=7.39%
  lat (msec)   : 2=26.40%, 4=17.70%, 10=0.87%
  cpu          : usr=14.18%, sys=26.91%, ctx=4497, majf=0, minf=14
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.6%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=4050,4142,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1834MiB/s (1923MB/s), 1834MiB/s-1834MiB/s (1923MB/s-1923MB/s), io=506MiB (531MB), run=276-276msec
  WRITE: bw=1876MiB/s (1967MB/s), 1876MiB/s-1876MiB/s (1967MB/s-1967MB/s), io=518MiB (543MB), run=276-276msec

Disk stats (read/write):
  nvme0n1: ios=2344/2418, merge=0/0, ticks=3505/959, in_queue=4464, util=61.54%
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB, (T) 256KiB-256KiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process

test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3783322: Wed Sep 22 10:12:23 2021
  read: IOPS=7488, BW=1872MiB/s (1963MB/s)(498MiB/266msec)
    slat (nsec): min=9638, max=77687, avg=12177.98, stdev=3124.11
    clat (usec): min=115, max=7210, avg=2423.52, stdev=1209.42
     lat (usec): min=125, max=7221, avg=2435.79, stdev=1209.14
    clat percentiles (usec):
     |  1.00th=[  510],  5.00th=[  922], 10.00th=[ 1123], 20.00th=[ 1401],
     | 30.00th=[ 1663], 40.00th=[ 1926], 50.00th=[ 2180], 60.00th=[ 2442],
     | 70.00th=[ 2802], 80.00th=[ 3458], 90.00th=[ 4146], 95.00th=[ 4883],
     | 99.00th=[ 5800], 99.50th=[ 5866], 99.90th=[ 7111], 99.95th=[ 7242],
     | 99.99th=[ 7242]
  write: IOPS=7909, BW=1977MiB/s (2073MB/s)(526MiB/266msec); 0 zone resets
    slat (nsec): min=9878, max=79491, avg=18815.35, stdev=4286.38
    clat (usec): min=86, max=6551, avg=1710.45, stdev=1105.67
     lat (usec): min=97, max=6571, avg=1729.37, stdev=1105.86
    clat percentiles (usec):
     |  1.00th=[   89],  5.00th=[  153], 10.00th=[  498], 20.00th=[  889],
     | 30.00th=[ 1139], 40.00th=[ 1319], 50.00th=[ 1565], 60.00th=[ 1795],
     | 70.00th=[ 1975], 80.00th=[ 2278], 90.00th=[ 3163], 95.00th=[ 3818],
     | 99.00th=[ 5604], 99.50th=[ 6259], 99.90th=[ 6390], 99.95th=[ 6390],
     | 99.99th=[ 6521]
  lat (usec)   : 100=1.61%, 250=2.00%, 500=1.98%, 750=4.20%, 1000=5.91%
  lat (msec)   : 2=42.21%, 4=34.28%, 10=7.81%
  cpu          : usr=6.79%, sys=24.15%, ctx=2923, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.4%, 32=99.2%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1992,2104,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1872MiB/s (1963MB/s), 1872MiB/s-1872MiB/s (1963MB/s-1963MB/s), io=498MiB (522MB), run=266-266msec
  WRITE: bw=1977MiB/s (2073MB/s), 1977MiB/s-1977MiB/s (2073MB/s-2073MB/s), io=526MiB (552MB), run=266-266msec

Disk stats (read/write):
  nvme0n1: ios=1255/1275, merge=0/0, ticks=2516/1846, in_queue=4362, util=61.54%
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process

test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3783464: Wed Sep 22 10:12:32 2021
  read: IOPS=3913, BW=1957MiB/s (2052MB/s)(497MiB/254msec)
    slat (nsec): min=16762, max=76575, avg=22144.55, stdev=4027.76
    clat (usec): min=357, max=12940, avg=4116.81, stdev=2289.31
     lat (usec): min=378, max=12960, avg=4139.08, stdev=2288.96
    clat percentiles (usec):
     |  1.00th=[  996],  5.00th=[ 1483], 10.00th=[ 1844], 20.00th=[ 2180],
     | 30.00th=[ 2442], 40.00th=[ 2704], 50.00th=[ 3195], 60.00th=[ 4424],
     | 70.00th=[ 5407], 80.00th=[ 6325], 90.00th=[ 7308], 95.00th=[ 8225],
     | 99.00th=[10552], 99.50th=[11207], 99.90th=[12911], 99.95th=[12911],
     | 99.99th=[12911]
  write: IOPS=4149, BW=2075MiB/s (2176MB/s)(527MiB/254msec); 0 zone resets
    slat (usec): min=22, max=102, avg=37.40, stdev= 6.35
    clat (usec): min=175, max=8553, avg=3721.76, stdev=1466.17
     lat (usec): min=218, max=8585, avg=3759.28, stdev=1466.76
    clat percentiles (usec):
     |  1.00th=[  578],  5.00th=[ 1319], 10.00th=[ 1991], 20.00th=[ 2573],
     | 30.00th=[ 2900], 40.00th=[ 3228], 50.00th=[ 3621], 60.00th=[ 4015],
     | 70.00th=[ 4359], 80.00th=[ 4752], 90.00th=[ 5800], 95.00th=[ 6652],
     | 99.00th=[ 7308], 99.50th=[ 7570], 99.90th=[ 7898], 99.95th=[ 8586],
     | 99.99th=[ 8586]
  lat (usec)   : 250=0.05%, 500=0.39%, 750=1.03%, 1000=0.73%
  lat (msec)   : 2=10.01%, 4=46.14%, 10=40.97%, 20=0.68%
  cpu          : usr=5.14%, sys=23.72%, ctx=1577, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=994,1054,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1957MiB/s (2052MB/s), 1957MiB/s-1957MiB/s (2052MB/s-2052MB/s), io=497MiB (521MB), run=254-254msec
  WRITE: bw=2075MiB/s (2176MB/s), 2075MiB/s-2075MiB/s (2176MB/s-2176MB/s), io=527MiB (553MB), run=254-254msec

Disk stats (read/write):
  nvme0n1: ios=672/653, merge=0/0, ticks=1670/2654, in_queue=4323, util=58.30%
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process

test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3783605: Wed Sep 22 10:12:39 2021
  read: IOPS=1373, BW=1374MiB/s (1441MB/s)(496MiB/361msec)
    slat (nsec): min=33874, max=91533, avg=40402.20, stdev=6347.61
    clat (usec): min=2450, max=29871, avg=20530.36, stdev=4273.93
     lat (usec): min=2543, max=29909, avg=20570.89, stdev=4271.85
    clat percentiles (usec):
     |  1.00th=[ 3687],  5.00th=[12125], 10.00th=[15270], 20.00th=[18220],
     | 30.00th=[19268], 40.00th=[20317], 50.00th=[21103], 60.00th=[21890],
     | 70.00th=[22938], 80.00th=[23987], 90.00th=[24773], 95.00th=[25297],
     | 99.00th=[28705], 99.50th=[29492], 99.90th=[29754], 99.95th=[29754],
     | 99.99th=[29754]
  write: IOPS=1462, BW=1463MiB/s (1534MB/s)(528MiB/361msec); 0 zone resets
    slat (usec): min=34, max=116, avg=65.60, stdev=13.26
    clat (usec): min=298, max=21950, avg=2348.68, stdev=4184.52
     lat (usec): min=348, max=22030, avg=2414.42, stdev=4188.55
    clat percentiles (usec):
     |  1.00th=[  302],  5.00th=[  310], 10.00th=[  310], 20.00th=[  314],
     | 30.00th=[  318], 40.00th=[  322], 50.00th=[  392], 60.00th=[  603],
     | 70.00th=[ 1029], 80.00th=[ 3785], 90.00th=[ 6980], 95.00th=[13304],
     | 99.00th=[18744], 99.50th=[21103], 99.90th=[21890], 99.95th=[21890],
     | 99.99th=[21890]
  lat (usec)   : 500=27.44%, 750=7.03%, 1000=1.37%
  lat (msec)   : 2=2.44%, 4=4.10%, 10=6.05%, 20=21.09%, 50=30.47%
  cpu          : usr=8.89%, sys=8.89%, ctx=910, majf=0, minf=14
  IO depths    : 1=0.1%, 2=0.2%, 4=0.4%, 8=0.8%, 16=1.6%, 32=97.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=496,528,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1374MiB/s (1441MB/s), 1374MiB/s-1374MiB/s (1441MB/s-1441MB/s), io=496MiB (520MB), run=361-361msec
  WRITE: bw=1463MiB/s (1534MB/s), 1463MiB/s-1463MiB/s (1534MB/s-1534MB/s), io=528MiB (554MB), run=361-361msec

Disk stats (read/write):
  nvme0n1: ios=292/298, merge=0/0, ticks=4039/1666, in_queue=5705, util=55.06%
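
The exact command line for that sweep wasn't posted, so treat this as a guess: it assumes the same flags as the earlier file-based runs, with --filename pointed at the raw device and --size=1G (which matches the roughly 512MiB read plus 512MiB written in each run). The function name is made up, and it only prints the commands rather than running them, because a mixed read/write test against a raw device overwrites whatever is on it.

```shell
# Print (not run) one fio command per block size for each device given.
# WARNING: running these against a raw device overwrites its contents.
fio_sweep_cmds() {
    for dev in "$@"; do
        for bs in 4k 8k 16k 32k 64k 128k 256k 512k 1024k; do
            echo "fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw" \
                 "--filename=$dev --bs=$bs --iodepth=32 --size=1G" \
                 "--readwrite=rw --rwmixread=50"
        done
    done
}

fio_sweep_cmds /dev/nvme0n1
```

Passing several devices prints the same sweep for each one, which makes it easier to compare drives like for like.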

As for my postgres test, I was just restoring a database backup and wrapping it in `time` to see how long each restore took.

mdraid
---------------------------

real    1112m16.401s
user    114m47.432s
sys     16m19.817s


zfs 8k 
---------------------

real    1262m27.236s
user    116m14.397s
sys     13m49.149s

By comparison, on the old system with MDRAID+LUKS and the 8x 850 EVO drives:

md + luks
---------------------

real    550m43.996s
user    151m13.206s
sys     26m21.459s
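
Putting the three `real` times side by side, both new-system restores take roughly twice as long as the old box. A quick sketch of the arithmetic (helper names are mine, nothing standard):

```shell
# Convert a `time`-style value such as 550m43.996s to minutes.
to_minutes() {
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 + $2 / 60 }'
}

# How many times longer the first duration is than the second.
slowdown() {
    awk -v a="$(to_minutes "$1")" -v b="$(to_minutes "$2")" \
        'BEGIN { printf "%.2f\n", a / b }'
}

echo "mdraid vs old system: $(slowdown 1112m16.401s 550m43.996s)x"
echo "zfs 8k vs old system: $(slowdown 1262m27.236s 550m43.996s)x"
```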

I remembered some talks from Wendell about Interrupt vs Polling in Linux and a Hybrid approach. Searched for “Interrupt vs polling” and found this:

I’m not experienced enough with this, but I think you need to change how I/O completions are handled, from the default (interrupt-driven) to either polling (which will use a lot of CPU, though maybe your 64 cores will be enough) or hybrid polling.
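
To see what the kernel is doing right now, the block layer exposes its polling knobs in sysfs. A rough sketch, assuming the kernel on this Ubuntu 20.04 box exposes `queue/io_poll` and `queue/io_poll_delay` (newer kernels have changed these); the function name and the optional root argument are made up for illustration:

```shell
# Print the polling-related queue settings for each NVMe namespace.
# The optional argument overrides the sysfs root (handy for testing).
show_poll_settings() {
    root=${1:-/sys/block}
    for q in "$root"/nvme*n1/queue; do
        [ -d "$q" ] || continue
        dev=$(basename "$(dirname "$q")")
        poll=$(cat "$q/io_poll" 2>/dev/null || echo n/a)
        delay=$(cat "$q/io_poll_delay" 2>/dev/null || echo n/a)
        echo "$dev io_poll=$poll io_poll_delay=$delay"
    done
}

show_poll_settings
```

As far as I know, an io_poll_delay of -1 means classic (busy) polling and 0 means hybrid polling. Polling also only applies to I/O submitted as high priority (for example fio's `pvsync2` engine with `--hipri`), and the nvme driver needs poll queues allocated through its `poll_queues` module parameter, so the libaio runs in this thread would not be affected either way.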

OH. Speaking of 64 cores, you’ve got 2 CPUs. If you have physical access to the server, check which PCI-E slots the expansion cards for the NVME drives are plugged into. It may be that 4 drives are connected to one CPU and 4 to the other, so there may be additional latency when data has to cross between the CPUs. Alternatively, I think lspci (or something similar) can show which PCI-E port each card is connected to and which CPU those lanes go to.
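
If you can't get to the hardware, sysfs can answer the same question: each NVMe controller's PCI device reports the NUMA node it hangs off. A minimal sketch, assuming the usual `/sys/class/nvme/<ctrl>/device/numa_node` layout (the function name and the optional root argument are mine):

```shell
# Print the NUMA node each NVMe controller's PCI device reports.
# A value of -1 means the platform did not report a node.
nvme_numa_nodes() {
    root=${1:-/sys/class/nvme}
    for ctrl in "$root"/nvme*; do
        [ -r "$ctrl/device/numa_node" ] || continue
        echo "$(basename "$ctrl") numa_node=$(cat "$ctrl/device/numa_node")"
    done
}

nvme_numa_nodes
```

If every drive reports the same node, then all the drives sit behind one CPU and cross-socket traffic between drives isn't the issue.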

Have you tried a md raid with just 4 drives like I suggested? If you’re using 2 expansion cards, then try doing an array on just 1 of them and check the FIO performance. Ignore ZFS, just do straight XFS, because it’s going to be faster in its default config anyway (less headaches and less things to account for when testing).


lspci -tv seems to indicate they are all on the same CPU

 +-[0000:20]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-01.1-[21-2e]----00.0-[22-2e]--+-00.0-[23]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-01.0-[24]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-02.0-[25]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-03.0-[26]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-04.0-[27]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-05.0-[28]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-06.0-[29]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-07.0-[2a]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-0c.0-[2b]--
 |           |                               +-0d.0-[2c]--
 |           |                               +-0e.0-[2d]--
 |           |                               \-0f.0-[2e]--
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-07.1-[2f]--+-00.0  Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
 |           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
 |           |            \-00.3  Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
 |           +-08.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           \-08.1-[30]--+-00.0  Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
 |                        \-00.1  Advanced Micro Devices, Inc. [AMD] Zeppelin Cryptographic Coprocessor NTBCCP
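
The lspci tree shows the bus topology but not the NUMA node directly. As a rough cross-check, sysfs exposes a numa_node attribute on each PCI device, so something like the sketch below should print one line per NVMe controller. The function name and the optional sysfs-root argument are made up for this example. On a dual EPYC 7601 I would expect several NUMA nodes per socket, so "same CPU" and "same NUMA node" are not necessarily the same thing.

```shell
# Print "<controller> <numa_node>" for every NVMe controller found under a
# sysfs root (default /sys). A value of -1 means the kernel reports no NUMA
# affinity for that PCI device.
nvme_numa_nodes() (
    sysfs="${1:-/sys}"
    for ctrl in "$sysfs"/class/nvme/nvme*; do
        node_file="$ctrl/device/numa_node"
        # Skip unmatched globs and controllers without the attribute.
        [ -r "$node_file" ] || continue
        printf '%s %s\n' "$(basename "$ctrl")" "$(cat "$node_file")"
    done
)

nvme_numa_nodes
```

If every controller reports the same node, all eight drives sit behind one root complex, which would match the single PCIe switch visible in the tree above.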

sudo mdadm --create --verbose /dev/md50 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

sudo mdadm --create --verbose /dev/md51 --level=4 --raid-devices=4 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1

2 disk RAID1


/tmp/raid1# xfs_info /tmp/raid1
meta-data=/dev/md50              isize=512    agcount=4, agsize=97667604 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=390670416, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=190757, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


/tmp/raid1# sync;fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=test_seq_mix_rw --bs=8k --iodepth=32 --size=1G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_seq_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1)
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3785319: Wed Sep 22 14:41:35 2021
  read: IOPS=45.8k, BW=358MiB/s (375MB/s)(513MiB/1435msec)
    slat (nsec): min=3116, max=88257, avg=6067.01, stdev=3375.07
    clat (usec): min=8, max=2955, avg=310.79, stdev=429.85
     lat (usec): min=15, max=2985, avg=317.00, stdev=429.77
    clat percentiles (usec):
     |  1.00th=[   24],  5.00th=[   48], 10.00th=[   73], 20.00th=[  111],
     | 30.00th=[  143], 40.00th=[  174], 50.00th=[  202], 60.00th=[  231],
     | 70.00th=[  260], 80.00th=[  297], 90.00th=[  441], 95.00th=[ 1598],
     | 99.00th=[ 2147], 99.50th=[ 2212], 99.90th=[ 2343], 99.95th=[ 2409],
     | 99.99th=[ 2474]
   bw (  KiB/s): min=303760, max=381312, per=93.50%, avg=342536.00, stdev=54837.55, samples=2
   iops        : min=37970, max=47664, avg=42817.00, stdev=6854.69, samples=2
  write: IOPS=45.5k, BW=356MiB/s (373MB/s)(511MiB/1435msec); 0 zone resets
    slat (usec): min=4, max=227, avg= 9.37, stdev= 4.16
    clat (usec): min=15, max=1825, avg=372.61, stdev=130.20
     lat (usec): min=23, max=1843, avg=382.13, stdev=130.38
    clat percentiles (usec):
     |  1.00th=[   47],  5.00th=[  147], 10.00th=[  217], 20.00th=[  285],
     | 30.00th=[  314], 40.00th=[  343], 50.00th=[  371], 60.00th=[  400],
     | 70.00th=[  437], 80.00th=[  474], 90.00th=[  523], 95.00th=[  562],
     | 99.00th=[  652], 99.50th=[  717], 99.90th=[ 1336], 99.95th=[ 1614],
     | 99.99th=[ 1729]
   bw (  KiB/s): min=303888, max=375808, per=93.27%, avg=339848.00, stdev=50855.12, samples=2
   iops        : min=37986, max=46976, avg=42481.00, stdev=6356.89, samples=2
  lat (usec)   : 10=0.01%, 20=0.40%, 50=2.90%, 100=6.64%, 250=30.31%
  lat (usec)   : 500=48.30%, 750=7.61%, 1000=0.37%
  lat (msec)   : 2=2.20%, 4=1.26%
  cpu          : usr=15.83%, sys=69.11%, ctx=5659, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=358MiB/s (375MB/s), 358MiB/s-358MiB/s (375MB/s-375MB/s), io=513MiB (538MB), run=1435-1435msec
  WRITE: bw=356MiB/s (373MB/s), 356MiB/s-356MiB/s (373MB/s-373MB/s), io=511MiB (535MB), run=1435-1435msec

Disk stats (read/write):
    md50: ios=63757/63484, merge=0/0, ticks=13372/4264, in_queue=17636, util=93.40%, aggrios=33580/66080, aggrmerge=1531/1533, aggrticks=6794/3355, aggrin_queue=10150, aggrutil=90.94%
  nvme0n1: ios=1465/65359, merge=3063/0, ticks=309/2463, in_queue=2773, util=90.94%
  nvme1n1: ios=65695/66802, merge=0/3067, ticks=13279/4247, in_queue=17527, util=90.94%



/tmp/raid1# sync;fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_rand_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1)
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=3785453: Wed Sep 22 14:41:43 2021
  read: IOPS=50.3k, BW=393MiB/s (412MB/s)(513MiB/1306msec)
    slat (nsec): min=3166, max=95210, avg=6234.30, stdev=3573.25
    clat (usec): min=29, max=953, avg=240.94, stdev=94.16
     lat (usec): min=33, max=957, avg=247.31, stdev=94.17
    clat percentiles (usec):
     |  1.00th=[   86],  5.00th=[  106], 10.00th=[  127], 20.00th=[  161],
     | 30.00th=[  188], 40.00th=[  212], 50.00th=[  235], 60.00th=[  255],
     | 70.00th=[  281], 80.00th=[  306], 90.00th=[  351], 95.00th=[  412],
     | 99.00th=[  545], 99.50th=[  594], 99.90th=[  685], 99.95th=[  717],
     | 99.99th=[  791]
   bw (  KiB/s): min=403376, max=407056, per=100.00%, avg=405216.00, stdev=2602.15, samples=2
   iops        : min=50422, max=50882, avg=50652.00, stdev=325.27, samples=2
  write: IOPS=50.0k, BW=391MiB/s (410MB/s)(511MiB/1306msec); 0 zone resets
    slat (usec): min=4, max=132, avg= 9.54, stdev= 4.23
    clat (usec): min=90, max=1107, avg=379.21, stdev=84.15
     lat (usec): min=117, max=1113, avg=388.91, stdev=84.30
    clat percentiles (usec):
     |  1.00th=[  225],  5.00th=[  262], 10.00th=[  281], 20.00th=[  306],
     | 30.00th=[  326], 40.00th=[  347], 50.00th=[  371], 60.00th=[  396],
     | 70.00th=[  420], 80.00th=[  453], 90.00th=[  490], 95.00th=[  529],
     | 99.00th=[  594], 99.50th=[  627], 99.90th=[  766], 99.95th=[  881],
     | 99.99th=[ 1037]
   bw (  KiB/s): min=400496, max=406144, per=100.00%, avg=403320.00, stdev=3993.74, samples=2
   iops        : min=50062, max=50768, avg=50415.00, stdev=499.22, samples=2
  lat (usec)   : 50=0.01%, 100=1.83%, 250=28.55%, 500=64.53%, 750=5.02%
  lat (usec)   : 1000=0.05%
  lat (msec)   : 2=0.01%
  cpu          : usr=18.93%, sys=77.39%, ctx=3644, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=393MiB/s (412MB/s), 393MiB/s-393MiB/s (412MB/s-412MB/s), io=513MiB (538MB), run=1306-1306msec
  WRITE: bw=391MiB/s (410MB/s), 391MiB/s-391MiB/s (410MB/s-410MB/s), io=511MiB (535MB), run=1306-1306msec

Disk stats (read/write):
    md50: ios=55899/55625, merge=0/0, ticks=7380/3472, in_queue=10852, util=92.21%, aggrios=33985/66489, aggrmerge=1237/1237, aggrticks=4464/3472, aggrin_queue=7936, aggrutil=88.65%
  nvme0n1: ios=39355/65361, merge=2475/0, ticks=5310/2705, in_queue=8015, util=88.27%
  nvme1n1: ios=28615/67618, merge=0/2475, ticks=3619/4239, in_queue=7858, util=88.65%


4 disk RAID10

/tmp/raid10-4disk# xfs_info /tmp/raid10-4disk/
meta-data=/dev/md51              isize=512    agcount=32, agsize=36625408 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=1172011008, imaxpct=5
         =                       sunit=128    swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
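
One thing that may be worth double-checking here: the stripe geometry in this xfs_info output. mkfs.xfs normally takes sunit and swidth from what the md device advertises, and swidth divided by sunit gives the number of data-bearing members. The sketch below just does that arithmetic with the numbers above (the helper names are made up). It comes out to 3 data members with a 512 KiB chunk, which is what I would expect from RAID4/5 on 4 devices and would match the --level=4 in the create command. A 4 disk RAID10 would normally show swidth at 2 x sunit. Checking cat /proc/mdstat or mdadm --detail /dev/md51 should settle it.

```shell
# Number of data-bearing members implied by an XFS stripe geometry:
# swidth / sunit, both in filesystem blocks as printed by xfs_info.
data_disks() {
    echo $(( $2 / $1 ))
}

# Chunk size in KiB: sunit (blocks) * block size (bytes) / 1024.
chunk_kib() {
    echo $(( $1 * $2 / 1024 ))
}

data_disks 128 384    # sunit=128, swidth=384 from the output above
chunk_kib 128 4096    # sunit=128, bsize=4096
```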


/tmp/raid10-4disk# sync;fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=test_seq_mix_rw --bs=8k --iodepth=32 --size=1G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_seq_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=125MiB/s,w=124MiB/s][r=16.0k,w=15.8k IOPS][eta 00m:00s]
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3786308: Wed Sep 22 14:43:45 2021
  read: IOPS=18.2k, BW=142MiB/s (149MB/s)(513MiB/3619msec)
    slat (usec): min=3, max=11147, avg=11.68, stdev=43.63
    clat (usec): min=8, max=24293, avg=487.95, stdev=478.94
     lat (usec): min=17, max=35409, avg=499.73, stdev=489.90
    clat percentiles (usec):
     |  1.00th=[   27],  5.00th=[   74], 10.00th=[  121], 20.00th=[  184],
     | 30.00th=[  265], 40.00th=[  367], 50.00th=[  486], 60.00th=[  586],
     | 70.00th=[  685], 80.00th=[  766], 90.00th=[  840], 95.00th=[  898],
     | 99.00th=[ 1029], 99.50th=[ 1090], 99.90th=[ 1582], 99.95th=[ 3032],
     | 99.99th=[23987]
   bw (  KiB/s): min=125168, max=187488, per=100.00%, avg=145858.29, stdev=25504.48, samples=7
   iops        : min=15646, max=23436, avg=18232.29, stdev=3188.06, samples=7
  write: IOPS=18.1k, BW=141MiB/s (148MB/s)(511MiB/3619msec); 0 zone resets
    slat (usec): min=3, max=225, avg=12.53, stdev= 4.01
    clat (usec): min=75, max=43334, avg=1250.08, stdev=797.42
     lat (usec): min=89, max=43560, avg=1262.71, stdev=797.99
    clat percentiles (usec):
     |  1.00th=[  586],  5.00th=[  717], 10.00th=[  799], 20.00th=[  930],
     | 30.00th=[ 1057], 40.00th=[ 1139], 50.00th=[ 1205], 60.00th=[ 1270],
     | 70.00th=[ 1352], 80.00th=[ 1467], 90.00th=[ 1729], 95.00th=[ 1942],
     | 99.00th=[ 2343], 99.50th=[ 2474], 99.90th=[ 3064], 99.95th=[ 4424],
     | 99.99th=[43254]
   bw (  KiB/s): min=125472, max=188464, per=100.00%, avg=145181.71, stdev=25903.31, samples=7
   iops        : min=15684, max=23558, avg=18147.71, stdev=3237.91, samples=7
  lat (usec)   : 10=0.01%, 20=0.20%, 50=1.32%, 100=2.23%, 250=10.47%
  lat (usec)   : 500=11.72%, 750=16.55%, 1000=19.59%
  lat (msec)   : 2=35.79%, 4=2.07%, 10=0.01%, 20=0.01%, 50=0.02%
  cpu          : usr=6.38%, sys=45.11%, ctx=17914, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=142MiB/s (149MB/s), 142MiB/s-142MiB/s (149MB/s-149MB/s), io=513MiB (538MB), run=3619-3619msec
  WRITE: bw=141MiB/s (148MB/s), 141MiB/s-141MiB/s (148MB/s-148MB/s), io=511MiB (535MB), run=3619-3619msec

Disk stats (read/write):
    md51: ios=63312/63022, merge=0/0, ticks=28548/48076, in_queue=76624, util=97.17%, aggrios=4393/4847, aggrmerge=96294/84522, aggrticks=1959/1732, aggrin_queue=3690, aggrutil=97.39%
  nvme3n1: ios=5893/2881, merge=130157/40639, ticks=2727/1103, in_queue=3830, util=94.49%
  nvme2n1: ios=4775/3048, merge=118015/40502, ticks=2302/958, in_queue=3259, util=94.69%
  nvme5n1: ios=1021/10759, merge=7080/216003, ticks=261/3902, in_queue=4162, util=97.39%
  nvme4n1: ios=5884/2701, merge=129924/40947, ticks=2546/965, in_queue=3511, util=94.36%


/tmp/raid10-4disk# sync;fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_rand_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=88.4MiB/s,w=87.5MiB/s][r=11.3k,w=11.2k IOPS][eta 00m:00s]
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=3786441: Wed Sep 22 14:43:58 2021
  read: IOPS=11.1k, BW=87.0MiB/s (91.2MB/s)(513MiB/5900msec)
    slat (usec): min=4, max=31026, avg=14.50, stdev=121.02
    clat (usec): min=16, max=22230, avg=675.78, stdev=364.85
     lat (usec): min=28, max=41709, avg=690.40, stdev=396.38
    clat percentiles (usec):
     |  1.00th=[  239],  5.00th=[  396], 10.00th=[  461], 20.00th=[  545],
     | 30.00th=[  594], 40.00th=[  635], 50.00th=[  676], 60.00th=[  709],
     | 70.00th=[  750], 80.00th=[  799], 90.00th=[  865], 95.00th=[  938],
     | 99.00th=[ 1090], 99.50th=[ 1139], 99.90th=[ 1270], 99.95th=[ 1369],
     | 99.99th=[22152]
   bw (  KiB/s): min=78192, max=92432, per=99.70%, avg=88836.36, stdev=3789.38, samples=11
   iops        : min= 9774, max=11554, avg=11104.55, stdev=473.67, samples=11
  write: IOPS=11.1k, BW=86.5MiB/s (90.7MB/s)(511MiB/5900msec); 0 zone resets
    slat (nsec): min=5580, max=39054, avg=15588.08, stdev=3355.33
    clat (usec): min=466, max=24020, avg=2162.81, stdev=647.04
     lat (usec): min=478, max=24033, avg=2178.51, stdev=647.05
    clat percentiles (usec):
     |  1.00th=[  988],  5.00th=[ 1336], 10.00th=[ 1532], 20.00th=[ 1745],
     | 30.00th=[ 1909], 40.00th=[ 2040], 50.00th=[ 2147], 60.00th=[ 2245],
     | 70.00th=[ 2376], 80.00th=[ 2573], 90.00th=[ 2769], 95.00th=[ 2999],
     | 99.00th=[ 3392], 99.50th=[ 3523], 99.90th=[ 3818], 99.95th=[13042],
     | 99.99th=[23987]
   bw (  KiB/s): min=78560, max=90608, per=99.92%, avg=88549.82, stdev=3422.63, samples=11
   iops        : min= 9820, max=11326, avg=11068.73, stdev=427.83, samples=11
  lat (usec)   : 20=0.01%, 50=0.01%, 100=0.04%, 250=0.54%, 500=6.41%
  lat (usec)   : 750=28.12%, 1000=14.22%
  lat (msec)   : 2=19.08%, 4=31.53%, 10=0.01%, 20=0.02%, 50=0.02%
  cpu          : usr=4.39%, sys=34.96%, ctx=34319, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=87.0MiB/s (91.2MB/s), 87.0MiB/s-87.0MiB/s (91.2MB/s-91.2MB/s), io=513MiB (538MB), run=5900-5900msec
  WRITE: bw=86.5MiB/s (90.7MB/s), 86.5MiB/s-86.5MiB/s (90.7MB/s-90.7MB/s), io=511MiB (535MB), run=5900-5900msec

Disk stats (read/write):
    md51: ios=65580/65230, merge=0/0, ticks=40972/81160, in_queue=122132, util=97.68%, aggrios=54438/36047, aggrmerge=162601/69321, aggrticks=13177/5572, aggrin_queue=18749, aggrutil=98.59%
  nvme3n1: ios=72495/22061, merge=216794/21608, ticks=18560/3591, in_queue=22151, util=96.90%
  nvme2n1: ios=71186/21923, merge=218396/21500, ticks=17365/3220, in_queue=20585, util=96.93%
  nvme5n1: ios=0/77999, merge=0/212751, ticks=0/12375, in_queue=12375, util=98.59%
  nvme4n1: ios=74074/22207, merge=215214/21428, ticks=16783/3104, in_queue=19887, util=97.10%

The scheduler on all these disks is currently set to none

# cat /sys/block/nvme*n1/queue/scheduler
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline

including mdraid devices

# cat /sys/block/md*/queue/scheduler
none
none
none

Polling/Interrupt

Based on Wendell’s post (Fixing Slow NVMe Raid Performance on Epyc), newer kernels (I am on 5.11) should have hybrid polling on by default. That being said, according to that link you sent,

cat /sys/block/nvme0n1/queue/io_poll_delay should output 0, but I’m getting back -1.

So I made the following changes:

GRUB_CMDLINE_LINUX="nvme_core.default_ps_max_latency_us=0 nvme.poll_queues=64"

Then I ran update-initramfs -u -k all and update-grub.
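
To see what the kernel ended up with after a change like this, the sketch below lists io_poll and io_poll_delay for each NVMe namespace. The function name and the optional sysfs-root argument are made up for the example. As far as I understand the kernel's queue sysfs documentation, io_poll_delay of -1 means classic polling and 0 means hybrid polling, so -1 does not by itself mean polling is off. Also, as far as I know, polled completions are only used for I/O submitted as high priority, which fio's libaio engine does not do, so the runs in this thread would be interrupt-driven either way.

```shell
# List the polling-related queue attributes for each NVMe namespace.
# The optional argument is a sysfs root (default /sys), there for testing.
show_poll_settings() (
    sysfs="${1:-/sys}"
    for q in "$sysfs"/block/nvme*n*/queue; do
        # Skip unmatched globs and queues without the attribute.
        [ -r "$q/io_poll" ] || continue
        dev=$(basename "$(dirname "$q")")
        printf '%s io_poll=%s io_poll_delay=%s\n' "$dev" \
            "$(cat "$q/io_poll")" "$(cat "$q/io_poll_delay" 2>/dev/null)"
    done
)

show_poll_settings

# Assumed fio variant that actually requests polled completions
# (io_uring engine plus hipri); the other options mirror the runs above:
# fio --ioengine=io_uring --hipri --direct=1 --name=test_rand_mix_rw \
#     --filename=test_rand --bs=8k --iodepth=32 --size=1G \
#     --readwrite=randrw --rwmixread=50
```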

after fiddling with all that, minimal improvements:

/tmp/raid10# sync;fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=117MiB/s,w=118MiB/s][r=14.9k,w=15.1k IOPS][eta 00m:00s]
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=3963: Wed Sep 22 15:36:24 2021
  read: IOPS=15.4k, BW=120MiB/s (126MB/s)(513MiB/4265msec)
    slat (nsec): min=3787, max=79941, avg=12023.42, stdev=3055.78
    clat (usec): min=12, max=1307, avg=491.25, stdev=120.87
     lat (usec): min=23, max=1317, avg=503.37, stdev=120.95
    clat percentiles (usec):
     |  1.00th=[  188],  5.00th=[  289], 10.00th=[  338], 20.00th=[  396],
     | 30.00th=[  437], 40.00th=[  465], 50.00th=[  494], 60.00th=[  523],
     | 70.00th=[  553], 80.00th=[  586], 90.00th=[  635], 95.00th=[  685],
     | 99.00th=[  783], 99.50th=[  816], 99.90th=[  889], 99.95th=[  914],
     | 99.99th=[ 1004]
   bw (  KiB/s): min=119184, max=141488, per=100.00%, avg=123254.00, stdev=7483.98, samples=8
   iops        : min=14898, max=17686, avg=15406.75, stdev=935.50, samples=8
  write: IOPS=15.3k, BW=120MiB/s (126MB/s)(511MiB/4265msec); 0 zone resets
    slat (usec): min=4, max=295, avg=13.32, stdev= 3.40
    clat (usec): min=415, max=3192, avg=1567.22, stdev=361.43
     lat (usec): min=428, max=3203, avg=1580.64, stdev=361.60
    clat percentiles (usec):
     |  1.00th=[  766],  5.00th=[  979], 10.00th=[ 1090], 20.00th=[ 1254],
     | 30.00th=[ 1385], 40.00th=[ 1500], 50.00th=[ 1565], 60.00th=[ 1647],
     | 70.00th=[ 1729], 80.00th=[ 1860], 90.00th=[ 2024], 95.00th=[ 2180],
     | 99.00th=[ 2474], 99.50th=[ 2573], 99.90th=[ 2737], 99.95th=[ 2835],
     | 99.99th=[ 2999]
   bw (  KiB/s): min=118752, max=141152, per=100.00%, avg=122728.00, stdev=7497.03, samples=8
   iops        : min=14844, max=17644, avg=15341.00, stdev=937.13, samples=8
  lat (usec)   : 20=0.01%, 50=0.04%, 100=0.06%, 250=1.29%, 500=24.61%
  lat (usec)   : 750=23.63%, 1000=3.37%
  lat (msec)   : 2=41.37%, 4=5.62%
  cpu          : usr=5.25%, sys=40.27%, ctx=29656, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=513MiB (538MB), run=4265-4265msec
  WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=511MiB (535MB), run=4265-4265msec

Disk stats (read/write):
    md127: ios=63657/63395, merge=0/0, ticks=30336/58008, in_queue=88344, util=97.71%, aggrios=54186/35805, aggrmerge=153885/66183, aggrticks=10059/3974, aggrin_queue=14034, aggrutil=97.69%
  nvme3n1: ios=72170/22126, merge=205049/21549, ticks=13680/2380, in_queue=16060, util=96.47%
  nvme2n1: ios=70824/21973, merge=206803/21456, ticks=13358/2289, in_queue=15647, util=96.39%
  nvme5n1: ios=0/77010, merge=0/200201, ticks=0/9014, in_queue=9013, util=97.69%
  nvme4n1: ios=73751/22114, merge=203690/21527, ticks=13200/2216, in_queue=15416, util=96.47%

Wow, it’s incredible. Even 2 disk RAID1 has lower IOPS and both slower seq and rand performance than 1 disk.

I’m running out of ideas. I found people online complaining about similar issues with Linux NVME and HDD RAID:

This one here is about HDDs, but it does have some interesting ZFS settings (I should probably bookmark this for myself):

Disabling write cache on HDDs made them faster as well:

But these aren’t real fixes for your case. I’ll post on the other thread, maybe someone will see it and check this thread out.


I thought I was going crazy these last few weeks :joy: Relieved you’ve come to the same conclusion as me.

Thanks so much for all your help, if nothing else for letting me bounce all this off you. I’m going to dive into those links and continue searching.


Now that I think about it, maybe you should try servethehome forums, they also have some people with experience in filesystems (mostly ZFS though).

I’ve reached out to Kioxia support to see if they have any suggestions. I am also going to try these disks in a different (Intel based) system just to see if there is any difference. I’ll update once I have done that, and if that doesn’t lead me anywhere I’ll try the servethehome forums.


So NVME is just as fun as USB-C.

I tested on 2 systems: the DELL R7425 from the previous messages and an HP DL380 Gen 9. The HP is on older firmware (2019). I won’t get into updates being behind a support paywall.

  • The Kioxia drives this thread has been about will not show up on the HP server, in either BIOS or OS.

  • Intel P3500 will not show up on the HP server, in either BIOS or OS. It exhibits the same behavior as the Kioxia drives on the DELL server: RAID1 has lower read speed than a single drive.

  • “Dell Express Flash NVMe 400GB” runs on both DELL and HP. It exhibits the same behavior: RAID1 has lower read speed than a single drive.

  • VO002000KWJSF (HP firmware version of Intel P4500), will not show up in BIOS/OS on DELL. On HP server RAID1 got same performance as a single disk (read + write)

Just for a sanity check, on the HP I ran fio against 2 SATA SSDs, then put them in RAID1 and ran fio directly against the MD device, and that was fine.

Still waiting to hopefully hear from Kioxia :crossed_fingers:


I’d been thinking I may have overlooked something in mdraid tuning, but I don’t think that’s the case. Something “interesting”: I decided to just do 2 simultaneous fio runs against separate disks. The first fio run drops off significantly as soon as the 2nd fio job starts against a separate disk. Similarly, the 2nd fio run speeds up significantly when the first completes. I think this points me in the right direction. I don’t know what that direction is yet, but I feel like things are narrowing down.

iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [r(4)][2.5%][r=2327MiB/s][r=596k IOPS][eta 01m:57s]
Jobs: 4 (f=4): [r(4)][4.2%][r=2318MiB/s][r=593k IOPS][eta 01m:55s]
Jobs: 4 (f=4): [r(4)][5.8%][r=2321MiB/s][r=594k IOPS][eta 01m:53s]
Jobs: 4 (f=4): [r(4)][7.5%][r=2321MiB/s][r=594k IOPS][eta 01m:51s]
Jobs: 4 (f=4): [r(4)][9.2%][r=669MiB/s][r=171k IOPS][eta 01m:49s]  <- I started second fio process
Jobs: 4 (f=4): [r(4)][10.8%][r=667MiB/s][r=171k IOPS][eta 01m:47s]
Jobs: 4 (f=4): [r(4)][12.5%][r=668MiB/s][r=171k IOPS][eta 01m:45s]
Jobs: 4 (f=4): [r(4)][14.2%][r=668MiB/s][r=171k IOPS][eta 01m:43s]
Jobs: 4 (f=4): [r(4)][15.8%][r=668MiB/s][r=171k IOPS][eta 01m:41s]
Jobs: 4 (f=4): [r(4)][17.5%][r=680MiB/s][r=174k IOPS][eta 01m:39s]
Jobs: 4 (f=4): [r(4)][19.2%][r=679MiB/s][r=174k IOPS][eta 01m:37s]
Jobs: 4 (f=4): [r(4)][20.8%][r=681MiB/s][r=174k IOPS][eta 01m:35s]
Jobs: 4 (f=4): [r(4)][22.5%][r=679MiB/s][r=174k IOPS][eta 01m:33s]
Jobs: 4 (f=4): [r(4)][24.2%][r=681MiB/s][r=174k IOPS][eta 01m:31s]
Jobs: 4 (f=4): [r(4)][25.8%][r=681MiB/s][r=174k IOPS][eta 01m:29s]
Jobs: 4 (f=4): [r(4)][27.5%][r=680MiB/s][r=174k IOPS][eta 01m:27s]
Jobs: 4 (f=4): [r(4)][29.2%][r=681MiB/s][r=174k IOPS][eta 01m:25s]
Jobs: 4 (f=4): [r(4)][30.8%][r=681MiB/s][r=174k IOPS][eta 01m:23s]
Jobs: 4 (f=4): [r(4)][32.5%][r=678MiB/s][r=173k IOPS][eta 01m:21s]
Jobs: 4 (f=4): [r(4)][34.2%][r=681MiB/s][r=174k IOPS][eta 01m:19s]
Jobs: 4 (f=4): [r(4)][35.8%][r=681MiB/s][r=174k IOPS][eta 01m:17s]
Jobs: 4 (f=4): [r(4)][37.5%][r=678MiB/s][r=174k IOPS][eta 01m:15s]
Jobs: 4 (f=4): [r(4)][39.2%][r=682MiB/s][r=175k IOPS][eta 01m:13s]
Jobs: 4 (f=4): [r(4)][40.8%][r=682MiB/s][r=175k IOPS][eta 01m:11s]
Jobs: 4 (f=4): [r(4)][42.5%][r=681MiB/s][r=174k IOPS][eta 01m:09s]
Jobs: 4 (f=4): [r(4)][44.2%][r=680MiB/s][r=174k IOPS][eta 01m:07s]
Jobs: 4 (f=4): [r(4)][45.8%][r=682MiB/s][r=175k IOPS][eta 01m:05s]
Jobs: 4 (f=4): [r(4)][47.5%][r=681MiB/s][r=174k IOPS][eta 01m:03s]
Jobs: 4 (f=4): [r(4)][49.2%][r=681MiB/s][r=174k IOPS][eta 01m:01s]
Jobs: 4 (f=4): [r(4)][50.8%][r=682MiB/s][r=174k IOPS][eta 00m:59s]
Jobs: 4 (f=4): [r(4)][52.5%][r=682MiB/s][r=174k IOPS][eta 00m:57s]
Jobs: 4 (f=4): [r(4)][54.2%][r=681MiB/s][r=174k IOPS][eta 00m:55s]
Jobs: 4 (f=4): [r(4)][55.8%][r=681MiB/s][r=174k IOPS][eta 00m:53s]
Jobs: 4 (f=4): [r(4)][57.5%][r=679MiB/s][r=174k IOPS][eta 00m:51s]
Jobs: 4 (f=4): [r(4)][59.2%][r=680MiB/s][r=174k IOPS][eta 00m:49s]
Jobs: 4 (f=4): [r(4)][60.8%][r=683MiB/s][r=175k IOPS][eta 00m:47s]
Jobs: 4 (f=4): [r(4)][62.5%][r=681MiB/s][r=174k IOPS][eta 00m:45s]
Jobs: 4 (f=4): [r(4)][64.2%][r=683MiB/s][r=175k IOPS][eta 00m:43s]
Jobs: 4 (f=4): [r(4)][65.8%][r=681MiB/s][r=174k IOPS][eta 00m:41s]
Jobs: 4 (f=4): [r(4)][67.5%][r=682MiB/s][r=174k IOPS][eta 00m:39s]
Jobs: 4 (f=4): [r(4)][69.2%][r=682MiB/s][r=175k IOPS][eta 00m:37s]
Jobs: 4 (f=4): [r(4)][70.8%][r=681MiB/s][r=174k IOPS][eta 00m:35s]
Jobs: 4 (f=4): [r(4)][72.5%][r=682MiB/s][r=175k IOPS][eta 00m:33s]
Jobs: 4 (f=4): [r(4)][74.2%][r=678MiB/s][r=174k IOPS][eta 00m:31s]
Jobs: 4 (f=4): [r(4)][75.8%][r=679MiB/s][r=174k IOPS][eta 00m:29s]
Jobs: 4 (f=4): [r(4)][77.5%][r=679MiB/s][r=174k IOPS][eta 00m:27s]
Jobs: 4 (f=4): [r(4)][79.2%][r=680MiB/s][r=174k IOPS][eta 00m:25s]
Jobs: 4 (f=4): [r(4)][80.8%][r=680MiB/s][r=174k IOPS][eta 00m:23s]
Jobs: 4 (f=4): [r(4)][82.5%][r=678MiB/s][r=174k IOPS][eta 00m:21s]
Jobs: 4 (f=4): [r(4)][84.2%][r=680MiB/s][r=174k IOPS][eta 00m:19s]
Jobs: 4 (f=4): [r(4)][85.8%][r=681MiB/s][r=174k IOPS][eta 00m:17s]
Jobs: 4 (f=4): [r(4)][87.5%][r=681MiB/s][r=174k IOPS][eta 00m:15s]
Jobs: 4 (f=4): [r(4)][89.2%][r=680MiB/s][r=174k IOPS][eta 00m:13s]
Jobs: 4 (f=4): [r(4)][90.8%][r=680MiB/s][r=174k IOPS][eta 00m:11s]
Jobs: 4 (f=4): [r(4)][92.5%][r=680MiB/s][r=174k IOPS][eta 00m:09s]
Jobs: 4 (f=4): [r(4)][94.2%][r=681MiB/s][r=174k IOPS][eta 00m:07s]
Jobs: 4 (f=4): [r(4)][95.8%][r=679MiB/s][r=174k IOPS][eta 00m:05s]
Jobs: 4 (f=4): [r(4)][97.5%][r=679MiB/s][r=174k IOPS][eta 00m:03s]
Jobs: 4 (f=4): [r(4)][99.2%][r=680MiB/s][r=174k IOPS][eta 00m:01s]
Jobs: 4 (f=4): [r(4)][100.0%][r=679MiB/s][r=174k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=1336169: Sun Sep 26 18:50:20 2021
  read: IOPS=206k, BW=805MiB/s (844MB/s)(94.3GiB/120001msec)
    slat (usec): min=2, max=4029, avg= 4.00, stdev= 2.00
    clat (usec): min=29, max=4395, avg=616.16, stdev=226.57
     lat (usec): min=35, max=4401, avg=620.27, stdev=226.03
    clat percentiles (usec):
     |  1.00th=[  133],  5.00th=[  180], 10.00th=[  212], 20.00th=[  260],
     | 30.00th=[  644], 40.00th=[  693], 50.00th=[  717], 60.00th=[  734],
     | 70.00th=[  750], 80.00th=[  775], 90.00th=[  807], 95.00th=[  832],
     | 99.00th=[  881], 99.50th=[  906], 99.90th=[  963], 99.95th=[  988],
     | 99.99th=[ 1045]
   bw (  KiB/s): min=681992, max=2398808, per=100.00%, avg=825255.77, stdev=111114.97, samples=956
   iops        : min=170498, max=599702, avg=206313.95, stdev=27778.75, samples=956
  lat (usec)   : 50=0.01%, 100=0.17%, 250=19.00%, 500=3.30%, 750=45.74%
  lat (usec)   : 1000=31.75%
  lat (msec)   : 2=0.03%, 4=0.01%, 10=0.01%
  cpu          : usr=9.47%, sys=27.29%, ctx=10409610, majf=0, minf=8830
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=24733125,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Update: if I run fio against a disk in bay 1 while running fio against a drive in bay 2, IOPS/speed are unaffected.


Progress kind of… I “fixed” fio.

This post led me to start looking at NUMA. More googling led me to this post for locking fio to the cores that control the NUMA nodes the individual NVME drives are on.

There is still a hit to the overall throughput of 1 disk when I kick off the second fio job, but not nearly the degradation I’ve been seeing. My problem now is how to apply this to mdraid, XFS, and further to Postgres if needed. The search continues…
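
For anyone wanting to try the same kind of pinning, here is a minimal sketch of the idea, not the exact script used in this thread. It reads the NUMA node of one NVMe controller from sysfs and prints a fio command wrapped in numactl, so fio's CPUs and memory stay on that node. The function name, the SYSFS_ROOT override and the fio arguments in the example call are assumptions for illustration. fio also has its own numa_cpu_nodes and numa_mem_policy options when it is built with libnuma, which may be a cleaner route.

```shell
# Print (not run) a fio command pinned to the NUMA node of one NVMe controller.
# Usage: fio_numa_cmd <controller, e.g. nvme8> <fio arguments...>
# SYSFS_ROOT is a hypothetical override used only to make this testable.
fio_numa_cmd() (
    ctrl="$1"
    shift
    node_file="${SYSFS_ROOT:-/sys}/class/nvme/$ctrl/device/numa_node"
    node=$(cat "$node_file" 2>/dev/null)
    # A missing file or a negative value means no NUMA affinity is reported;
    # node 0 is an arbitrary fallback chosen for this sketch.
    case "$node" in
        ''|-*) node=0 ;;
    esac
    echo "numactl --cpunodebind=$node --membind=$node fio $*"
)

fio_numa_cmd nvme8 --filename=/dev/nvme8n1 --rw=randread --bs=4k \
    --iodepth=32 --ioengine=libaio --direct=1 --numjobs=4 --name=iops-test-job
```

The function only echoes the command so it can be inspected first. Running the printed line by hand, once it looks right, does the actual pinned run.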

/home/angeltrax/fio# ./rand-read-4k.sh /dev/nvme8n1p1
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [r(4)][2.5%][r=3385MiB/s][r=866k IOPS][eta 01m:57s]
Jobs: 4 (f=4): [r(4)][4.2%][r=3384MiB/s][r=866k IOPS][eta 01m:55s]
Jobs: 4 (f=4): [r(4)][5.8%][r=3384MiB/s][r=866k IOPS][eta 01m:53s]
Jobs: 4 (f=4): [r(4)][7.5%][r=2505MiB/s][r=641k IOPS][eta 01m:51s]
Jobs: 4 (f=4): [r(4)][9.2%][r=2317MiB/s][r=593k IOPS][eta 01m:49s] <-- started 2nd fio job
Jobs: 4 (f=4): [r(4)][10.8%][r=2317MiB/s][r=593k IOPS][eta 01m:47s]
Jobs: 4 (f=4): [r(4)][12.5%][r=2317MiB/s][r=593k IOPS][eta 01m:45s]
Jobs: 4 (f=4): [r(4)][14.2%][r=2317MiB/s][r=593k IOPS][eta 01m:43s]
Jobs: 4 (f=4): [r(4)][15.8%][r=2318MiB/s][r=593k IOPS][eta 01m:41s]
Jobs: 4 (f=4): [r(4)][17.5%][r=2317MiB/s][r=593k IOPS][eta 01m:39s]
Jobs: 4 (f=4): [r(4)][19.2%][r=2317MiB/s][r=593k IOPS][eta 01m:37s]
Jobs: 4 (f=4): [r(4)][20.8%][r=2318MiB/s][r=593k IOPS][eta 01m:35s]
Jobs: 4 (f=4): [r(4)][22.5%][r=2317MiB/s][r=593k IOPS][eta 01m:33s]
Jobs: 4 (f=4): [r(4)][24.2%][r=2317MiB/s][r=593k IOPS][eta 01m:31s]
Jobs: 4 (f=4): [r(4)][25.8%][r=2317MiB/s][r=593k IOPS][eta 01m:29s]
Jobs: 4 (f=4): [r(4)][27.5%][r=2318MiB/s][r=593k IOPS][eta 01m:27s]
Jobs: 4 (f=4): [r(4)][29.2%][r=2317MiB/s][r=593k IOPS][eta 01m:25s]
Jobs: 4 (f=4): [r(4)][30.8%][r=2316MiB/s][r=593k IOPS][eta 01m:23s]
Jobs: 4 (f=4): [r(4)][32.5%][r=2317MiB/s][r=593k IOPS][eta 01m:21s]
Jobs: 4 (f=4): [r(4)][34.2%][r=2317MiB/s][r=593k IOPS][eta 01m:19s]
^Cbs: 4 (f=4): [r(4)][35.0%][r=2317MiB/s][r=593k IOPS][eta 01m:18s]
fio: terminating on signal 2

iops-test-job: (groupid=0, jobs=4): err= 0: pid=5255: Sun Sep 26 20:37:07 2021
  read: IOPS=647k, BW=2529MiB/s (2652MB/s)(105GiB/42576msec)
    slat (nsec): min=1903, max=4023.5k, avg=4080.93, stdev=2200.87
    clat (usec): min=26, max=4439, avg=192.47, stdev=50.60
     lat (usec): min=28, max=4445, avg=196.71, stdev=51.04
    clat percentiles (usec):
     |  1.00th=[   68],  5.00th=[   89], 10.00th=[  135], 20.00th=[  149],
     | 30.00th=[  159], 40.00th=[  186], 50.00th=[  206], 60.00th=[  215],
     | 70.00th=[  223], 80.00th=[  231], 90.00th=[  245], 95.00th=[  265],
     | 99.00th=[  302], 99.50th=[  318], 99.90th=[  359], 99.95th=[  375],
     | 99.99th=[  424]
   bw (  MiB/s): min= 2312, max= 3462, per=100.00%, avg=2528.84, stdev=106.54, samples=340
   iops        : min=592104, max=886330, avg=647384.02, stdev=27275.42, samples=340
  lat (usec)   : 50=0.01%, 100=5.89%, 250=85.81%, 500=8.28%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=31.83%, sys=67.68%, ctx=170292, majf=0, minf=248
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=27561695,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=2529MiB/s (2652MB/s), 2529MiB/s-2529MiB/s (2652MB/s-2652MB/s), io=105GiB (113GB), run=42576-42576msec

Disk stats (read/write):
  nvme8n1: ios=27555865/0, merge=0/0, ticks=2507783/0, in_queue=2507783, util=99.78%

Alright, I’m at a complete standstill now. I’m not sure whether, with just these 2 disks in RAID 1, I’m simply hitting an expected bottleneck. With RAID10 and 8 disks I’m still in the same boat, with only about 50k IOPS on the same 4k random read test.

This is where I’m at:

lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
NUMA node(s):                    8
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           1
Model name:                      AMD EPYC 7601 32-Core Processor
Stepping:                        2
CPU MHz:                         2195.930
BogoMIPS:                        4391.86
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       4 MiB
L2 cache:                        32 MiB
L3 cache:                        128 MiB
NUMA node0 CPU(s):               0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120
NUMA node1 CPU(s):               2,10,18,26,34,42,50,58,66,74,82,90,98,106,114,122
NUMA node2 CPU(s):               4,12,20,28,36,44,52,60,68,76,84,92,100,108,116,124
NUMA node3 CPU(s):               6,14,22,30,38,46,54,62,70,78,86,94,102,110,118,126
NUMA node4 CPU(s):               1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121
NUMA node5 CPU(s):               3,11,19,27,35,43,51,59,67,75,83,91,99,107,115,123
NUMA node6 CPU(s):               5,13,21,29,37,45,53,61,69,77,85,93,101,109,117,125
NUMA node7 CPU(s):               7,15,23,31,39,47,55,63,71,79,87,95,103,111,119,127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid a
                                 md_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tc
                                 e topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsave
                                 erptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev sev_es

lstopo

# lstopo

Machine (378GB total)
  Package L#0
     # redacted for brevity
  Package L#1
    # redacted for brevity
    Die L#5
      NUMANode L#5 (P#5 63GB)
      L3 L#10 (8192KB)
        L2 L#40 (512KB) + L1d L#40 (32KB) + L1i L#40 (64KB) + Core L#40
          PU L#80 (P#3)
          PU L#81 (P#67)
        L2 L#41 (512KB) + L1d L#41 (32KB) + L1i L#41 (64KB) + Core L#41
          PU L#82 (P#19)
          PU L#83 (P#83)
        L2 L#42 (512KB) + L1d L#42 (32KB) + L1i L#42 (64KB) + Core L#42
          PU L#84 (P#35)
          PU L#85 (P#99)
        L2 L#43 (512KB) + L1d L#43 (32KB) + L1i L#43 (64KB) + Core L#43
          PU L#86 (P#51)
          PU L#87 (P#115)
      L3 L#11 (8192KB)
        L2 L#44 (512KB) + L1d L#44 (32KB) + L1i L#44 (64KB) + Core L#44
          PU L#88 (P#11)
          PU L#89 (P#75)
        L2 L#45 (512KB) + L1d L#45 (32KB) + L1i L#45 (64KB) + Core L#45
          PU L#90 (P#27)
          PU L#91 (P#91)
        L2 L#46 (512KB) + L1d L#46 (32KB) + L1i L#46 (64KB) + Core L#46
          PU L#92 (P#43)
          PU L#93 (P#107)
        L2 L#47 (512KB) + L1d L#47 (32KB) + L1i L#47 (64KB) + Core L#47
          PU L#94 (P#59)
          PU L#95 (P#123)
      HostBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI a5:00.0 (NVMExp)
                Block(Disk) "nvme6n1"
            PCIBridge
              PCI a6:00.0 (NVMExp)
                Block(Disk) "nvme7n1"
            PCIBridge
              PCI a7:00.0 (NVMExp)
                Block(Disk) "nvme8n1"            <--- targeting this one
            PCIBridge
              PCI a8:00.0 (NVMExp)
                Block(Disk) "nvme9n1"            <--- and this one
            PCIBridge
              PCI ad:00.0 (NVMExp)
                Block(Disk) "nvme10n1"
            PCIBridge
              PCI ae:00.0 (NVMExp)
                Block(Disk) "nvme11n1"
    Die L#6
    # redacted for brevity

rand-read-4k.sh

fio --numa_mem_policy=local --cpus_allowed=3,19,35,51,67,83,99,115 --filename=$DISK --size=500GB --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=32 --runtime=120 --numjobs=8 --time_based --group_reporting --na
mdadm --create --assume-clean --verbose /dev/md99 --level=1 --raid-devices=2 /dev/nvme9n1p1 /dev/nvme8n1p1

./rand-read-4k.sh /dev/md99

iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 8 processes
Jobs: 8 (f=8): [r(8)][2.5%][r=2845MiB/s][r=728k IOPS][eta 01m:57s]
Jobs: 8 (f=8): [r(8)][4.2%][r=2849MiB/s][r=729k IOPS][eta 01m:55s]
Jobs: 8 (f=8): [r(8)][5.8%][r=2843MiB/s][r=728k IOPS][eta 01m:53s]
Jobs: 8 (f=8): [r(8)][7.5%][r=2844MiB/s][r=728k IOPS][eta 01m:51s]
Jobs: 8 (f=8): [r(8)][9.2%][r=2845MiB/s][r=728k IOPS][eta 01m:49s]
Jobs: 8 (f=8): [r(8)][10.8%][r=2846MiB/s][r=729k IOPS][eta 01m:47s]
Jobs: 8 (f=8): [r(8)][12.5%][r=2846MiB/s][r=728k IOPS][eta 01m:45s]
Jobs: 8 (f=8): [r(8)][14.2%][r=2843MiB/s][r=728k IOPS][eta 01m:43s]
Jobs: 8 (f=8): [r(8)][15.8%][r=2843MiB/s][r=728k IOPS][eta 01m:41s]
Jobs: 8 (f=8): [r(8)][17.5%][r=2846MiB/s][r=729k IOPS][eta 01m:39s]
Jobs: 8 (f=8): [r(8)][19.2%][r=2840MiB/s][r=727k IOPS][eta 01m:37s]
Jobs: 8 (f=8): [r(8)][20.8%][r=2843MiB/s][r=728k IOPS][eta 01m:35s]
Jobs: 8 (f=8): [r(8)][22.5%][r=2843MiB/s][r=728k IOPS][eta 01m:33s]
Jobs: 8 (f=8): [r(8)][24.2%][r=2846MiB/s][r=728k IOPS][eta 01m:31s]
Jobs: 8 (f=8): [r(8)][25.8%][r=2843MiB/s][r=728k IOPS][eta 01m:29s]
Jobs: 8 (f=8): [r(8)][27.5%][r=2845MiB/s][r=728k IOPS][eta 01m:27s]
Jobs: 8 (f=8): [r(8)][29.2%][r=2843MiB/s][r=728k IOPS][eta 01m:25s]
Jobs: 8 (f=8): [r(8)][30.8%][r=2844MiB/s][r=728k IOPS][eta 01m:23s]
Jobs: 8 (f=8): [r(8)][32.5%][r=2841MiB/s][r=727k IOPS][eta 01m:21s]
Jobs: 8 (f=8): [r(8)][34.2%][r=2843MiB/s][r=728k IOPS][eta 01m:19s]
Jobs: 8 (f=8): [r(8)][35.8%][r=2845MiB/s][r=728k IOPS][eta 01m:17s]
Jobs: 8 (f=8): [r(8)][37.5%][r=2846MiB/s][r=728k IOPS][eta 01m:15s]
Jobs: 8 (f=8): [r(8)][39.2%][r=2840MiB/s][r=727k IOPS][eta 01m:13s]
Jobs: 8 (f=8): [r(8)][40.8%][r=2844MiB/s][r=728k IOPS][eta 01m:11s]
Jobs: 8 (f=8): [r(8)][42.5%][r=2844MiB/s][r=728k IOPS][eta 01m:09s]
Jobs: 8 (f=8): [r(8)][44.2%][r=2847MiB/s][r=729k IOPS][eta 01m:07s]
Jobs: 8 (f=8): [r(8)][45.8%][r=2845MiB/s][r=728k IOPS][eta 01m:05s]
Jobs: 8 (f=8): [r(8)][47.5%][r=2845MiB/s][r=728k IOPS][eta 01m:03s]
Jobs: 8 (f=8): [r(8)][49.2%][r=2842MiB/s][r=728k IOPS][eta 01m:01s]
Jobs: 8 (f=8): [r(8)][50.8%][r=2845MiB/s][r=728k IOPS][eta 00m:59s]
Jobs: 8 (f=8): [r(8)][52.5%][r=2844MiB/s][r=728k IOPS][eta 00m:57s]
Jobs: 8 (f=8): [r(8)][54.2%][r=2842MiB/s][r=728k IOPS][eta 00m:55s]
Jobs: 8 (f=8): [r(8)][55.8%][r=2842MiB/s][r=727k IOPS][eta 00m:53s]
Jobs: 8 (f=8): [r(8)][57.5%][r=2844MiB/s][r=728k IOPS][eta 00m:51s]
Jobs: 8 (f=8): [r(8)][59.2%][r=2842MiB/s][r=728k IOPS][eta 00m:49s]
Jobs: 8 (f=8): [r(8)][60.8%][r=2846MiB/s][r=729k IOPS][eta 00m:47s]
Jobs: 8 (f=8): [r(8)][62.5%][r=2840MiB/s][r=727k IOPS][eta 00m:45s]
Jobs: 8 (f=8): [r(8)][64.2%][r=2844MiB/s][r=728k IOPS][eta 00m:43s]
Jobs: 8 (f=8): [r(8)][65.8%][r=2843MiB/s][r=728k IOPS][eta 00m:41s]
Jobs: 8 (f=8): [r(8)][67.5%][r=2844MiB/s][r=728k IOPS][eta 00m:39s]
Jobs: 8 (f=8): [r(8)][69.2%][r=2842MiB/s][r=727k IOPS][eta 00m:37s]
Jobs: 8 (f=8): [r(8)][70.8%][r=2844MiB/s][r=728k IOPS][eta 00m:35s]
Jobs: 8 (f=8): [r(8)][72.5%][r=2839MiB/s][r=727k IOPS][eta 00m:33s]
Jobs: 8 (f=8): [r(8)][74.2%][r=2844MiB/s][r=728k IOPS][eta 00m:31s]
Jobs: 8 (f=8): [r(8)][75.8%][r=2838MiB/s][r=726k IOPS][eta 00m:29s]
Jobs: 8 (f=8): [r(8)][77.5%][r=2847MiB/s][r=729k IOPS][eta 00m:27s]
Jobs: 8 (f=8): [r(8)][79.2%][r=2841MiB/s][r=727k IOPS][eta 00m:25s]
Jobs: 8 (f=8): [r(8)][80.8%][r=2846MiB/s][r=729k IOPS][eta 00m:23s]
Jobs: 8 (f=8): [r(8)][82.5%][r=2844MiB/s][r=728k IOPS][eta 00m:21s]
Jobs: 8 (f=8): [r(8)][84.2%][r=2845MiB/s][r=728k IOPS][eta 00m:19s]
Jobs: 8 (f=8): [r(8)][85.8%][r=2843MiB/s][r=728k IOPS][eta 00m:17s]
Jobs: 8 (f=8): [r(8)][87.5%][r=2847MiB/s][r=729k IOPS][eta 00m:15s]
Jobs: 8 (f=8): [r(8)][89.2%][r=2844MiB/s][r=728k IOPS][eta 00m:13s]
Jobs: 8 (f=8): [r(8)][90.8%][r=2844MiB/s][r=728k IOPS][eta 00m:11s]
Jobs: 8 (f=8): [r(8)][92.5%][r=2847MiB/s][r=729k IOPS][eta 00m:09s]
Jobs: 8 (f=8): [r(8)][94.2%][r=2846MiB/s][r=729k IOPS][eta 00m:07s]
Jobs: 8 (f=8): [r(8)][95.8%][r=2842MiB/s][r=728k IOPS][eta 00m:05s]
Jobs: 8 (f=8): [r(8)][97.5%][r=2847MiB/s][r=729k IOPS][eta 00m:03s]
Jobs: 8 (f=8): [r(8)][99.2%][r=2840MiB/s][r=727k IOPS][eta 00m:01s]
Jobs: 8 (f=8): [r(8)][100.0%][r=2845MiB/s][r=728k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=8): err= 0: pid=5864: Mon Sep 27 14:13:45 2021
  read: IOPS=728k, BW=2842MiB/s (2980MB/s)(333GiB/120001msec)
    slat (usec): min=2, max=740, avg= 8.13, stdev= 5.90
    clat (usec): min=78, max=1874, avg=342.05, stdev=46.61
     lat (usec): min=82, max=1895, avg=350.45, stdev=47.06
    clat percentiles (usec):
     |  1.00th=[  235],  5.00th=[  262], 10.00th=[  281], 20.00th=[  306],
     | 30.00th=[  322], 40.00th=[  334], 50.00th=[  347], 60.00th=[  355],
     | 70.00th=[  367], 80.00th=[  379], 90.00th=[  400], 95.00th=[  416],
     | 99.00th=[  449], 99.50th=[  465], 99.90th=[  529], 99.95th=[  578],
     | 99.99th=[  742]
   bw (  MiB/s): min= 2789, max= 2880, per=100.00%, avg=2844.39, stdev= 1.34, samples=1912
   iops        : min=714158, max=737344, avg=728163.68, stdev=342.86, samples=1912
  lat (usec)   : 100=0.01%, 250=2.68%, 500=97.14%, 750=0.17%, 1000=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=13.73%, sys=55.29%, ctx=57154950, majf=0, minf=464
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=87315555,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=2842MiB/s (2980MB/s), 2842MiB/s-2842MiB/s (2980MB/s-2980MB/s), io=333GiB (358GB), run=120001-120001msec

Disk stats (read/write):
    md99: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=43657777/1, aggrmerge=0/0, aggrticks=3164192/0, aggrin_queue=3164192, aggrutil=99.94%
  nvme8n1: ios=43321141/1, merge=0/0, ticks=3134555/0, in_queue=3134555, util=99.94%
  nvme9n1: ios=43994414/1, merge=0/0, ticks=3193830/0, in_queue=3193829, util=99.94%

Stopped the md array for a single-disk check

mdadm --stop md99

./rand-read-4k.sh /dev/nvme8n1p1

iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 8 processes
Jobs: 8 (f=8): [r(8)][2.5%][r=3386MiB/s][r=867k IOPS][eta 01m:57s]
Jobs: 8 (f=8): [r(8)][4.2%][r=3386MiB/s][r=867k IOPS][eta 01m:55s]
Jobs: 8 (f=8): [r(8)][5.8%][r=3386MiB/s][r=867k IOPS][eta 01m:53s]
Jobs: 8 (f=8): [r(8)][7.5%][r=3386MiB/s][r=867k IOPS][eta 01m:51s]
Jobs: 8 (f=8): [r(8)][9.2%][r=3387MiB/s][r=867k IOPS][eta 01m:49s]
Jobs: 8 (f=8): [r(8)][10.8%][r=3386MiB/s][r=867k IOPS][eta 01m:47s]
Jobs: 8 (f=8): [r(8)][12.5%][r=3194MiB/s][r=818k IOPS][eta 01m:45s]
Jobs: 8 (f=8): [r(8)][14.2%][r=1758MiB/s][r=450k IOPS][eta 01m:43s]  <-- where fio for /dev/nvme9n1p1 started
Jobs: 8 (f=8): [r(8)][15.8%][r=1775MiB/s][r=455k IOPS][eta 01m:41s]
Jobs: 8 (f=8): [r(8)][17.5%][r=1792MiB/s][r=459k IOPS][eta 01m:39s]
Jobs: 8 (f=8): [r(8)][19.2%][r=1819MiB/s][r=466k IOPS][eta 01m:37s]
Jobs: 8 (f=8): [r(8)][20.8%][r=1805MiB/s][r=462k IOPS][eta 01m:35s]
Jobs: 8 (f=8): [r(8)][22.5%][r=1832MiB/s][r=469k IOPS][eta 01m:33s]
Jobs: 8 (f=8): [r(8)][24.2%][r=1732MiB/s][r=443k IOPS][eta 01m:31s]
Jobs: 8 (f=8): [r(8)][25.8%][r=1702MiB/s][r=436k IOPS][eta 01m:29s]
Jobs: 8 (f=8): [r(8)][27.5%][r=1698MiB/s][r=435k IOPS][eta 01m:27s]
Jobs: 8 (f=8): [r(8)][29.2%][r=1711MiB/s][r=438k IOPS][eta 01m:25s]
Jobs: 8 (f=8): [r(8)][30.8%][r=1733MiB/s][r=444k IOPS][eta 01m:23s]
Jobs: 8 (f=8): [r(8)][32.5%][r=1731MiB/s][r=443k IOPS][eta 01m:21s]
Jobs: 8 (f=8): [r(8)][34.2%][r=1724MiB/s][r=441k IOPS][eta 01m:19s]
Jobs: 8 (f=8): [r(8)][35.8%][r=1705MiB/s][r=436k IOPS][eta 01m:17s]
Jobs: 8 (f=8): [r(8)][37.5%][r=1707MiB/s][r=437k IOPS][eta 01m:15s]
Jobs: 8 (f=8): [r(8)][39.2%][r=1718MiB/s][r=440k IOPS][eta 01m:13s]
Jobs: 8 (f=8): [r(8)][40.8%][r=1722MiB/s][r=441k IOPS][eta 01m:11s]
Jobs: 8 (f=8): [r(8)][42.5%][r=1752MiB/s][r=449k IOPS][eta 01m:09s]
Jobs: 8 (f=8): [r(8)][44.2%][r=1723MiB/s][r=441k IOPS][eta 01m:07s]
Jobs: 8 (f=8): [r(8)][45.8%][r=1713MiB/s][r=439k IOPS][eta 01m:05s]
Jobs: 8 (f=8): [r(8)][47.5%][r=1718MiB/s][r=440k IOPS][eta 01m:03s]
Jobs: 8 (f=8): [r(8)][49.2%][r=1729MiB/s][r=443k IOPS][eta 01m:01s]
Jobs: 8 (f=8): [r(8)][50.8%][r=1694MiB/s][r=434k IOPS][eta 00m:59s]
Jobs: 8 (f=8): [r(8)][52.5%][r=1684MiB/s][r=431k IOPS][eta 00m:57s]
Jobs: 8 (f=8): [r(8)][54.2%][r=1710MiB/s][r=438k IOPS][eta 00m:55s]
Jobs: 8 (f=8): [r(8)][55.8%][r=1709MiB/s][r=437k IOPS][eta 00m:53s]
Jobs: 8 (f=8): [r(8)][57.5%][r=1706MiB/s][r=437k IOPS][eta 00m:51s]
Jobs: 8 (f=8): [r(8)][59.2%][r=1714MiB/s][r=439k IOPS][eta 00m:49s]
Jobs: 8 (f=8): [r(8)][60.8%][r=1699MiB/s][r=435k IOPS][eta 00m:47s]
Jobs: 8 (f=8): [r(8)][62.5%][r=1693MiB/s][r=433k IOPS][eta 00m:45s]
Jobs: 8 (f=8): [r(8)][64.2%][r=1709MiB/s][r=437k IOPS][eta 00m:43s]
Jobs: 8 (f=8): [r(8)][65.8%][r=1685MiB/s][r=431k IOPS][eta 00m:41s]
Jobs: 8 (f=8): [r(8)][67.5%][r=1640MiB/s][r=420k IOPS][eta 00m:39s]
Jobs: 8 (f=8): [r(8)][69.2%][r=1644MiB/s][r=421k IOPS][eta 00m:37s]
Jobs: 8 (f=8): [r(8)][70.8%][r=1643MiB/s][r=421k IOPS][eta 00m:35s]
Jobs: 8 (f=8): [r(8)][72.5%][r=1641MiB/s][r=420k IOPS][eta 00m:33s]
Jobs: 8 (f=8): [r(8)][74.2%][r=1642MiB/s][r=420k IOPS][eta 00m:31s]
Jobs: 8 (f=8): [r(8)][75.8%][r=1642MiB/s][r=420k IOPS][eta 00m:29s]
Jobs: 8 (f=8): [r(8)][77.5%][r=1641MiB/s][r=420k IOPS][eta 00m:27s]
Jobs: 8 (f=8): [r(8)][79.2%][r=1641MiB/s][r=420k IOPS][eta 00m:25s]
Jobs: 8 (f=8): [r(8)][80.8%][r=1640MiB/s][r=420k IOPS][eta 00m:23s]
Jobs: 8 (f=8): [r(8)][82.5%][r=1640MiB/s][r=420k IOPS][eta 00m:21s]
Jobs: 8 (f=8): [r(8)][84.2%][r=1641MiB/s][r=420k IOPS][eta 00m:19s]
Jobs: 8 (f=8): [r(8)][85.8%][r=1641MiB/s][r=420k IOPS][eta 00m:17s]
Jobs: 8 (f=8): [r(8)][87.5%][r=1642MiB/s][r=420k IOPS][eta 00m:15s]
Jobs: 8 (f=8): [r(8)][89.2%][r=1642MiB/s][r=420k IOPS][eta 00m:13s]
Jobs: 8 (f=8): [r(8)][90.8%][r=1641MiB/s][r=420k IOPS][eta 00m:11s]
Jobs: 8 (f=8): [r(8)][92.5%][r=1641MiB/s][r=420k IOPS][eta 00m:09s]
Jobs: 8 (f=8): [r(8)][94.2%][r=1643MiB/s][r=421k IOPS][eta 00m:07s]
Jobs: 8 (f=8): [r(8)][95.8%][r=1641MiB/s][r=420k IOPS][eta 00m:05s]
Jobs: 8 (f=8): [r(8)][97.5%][r=1641MiB/s][r=420k IOPS][eta 00m:03s]
Jobs: 8 (f=8): [r(8)][99.2%][r=1642MiB/s][r=420k IOPS][eta 00m:01s]
Jobs: 8 (f=8): [r(8)][100.0%][r=1641MiB/s][r=420k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=8): err= 0: pid=6334: Mon Sep 27 14:19:53 2021
  read: IOPS=487k, BW=1900MiB/s (1993MB/s)(223GiB/120002msec)
    slat (nsec): min=1923, max=50273k, avg=10551.44, stdev=140546.84
    clat (usec): min=26, max=51478, avg=512.29, stdev=977.05
     lat (usec): min=29, max=51481, avg=523.27, stdev=988.61
    clat percentiles (usec):
     |  1.00th=[   84],  5.00th=[  131], 10.00th=[  155], 20.00th=[  202],
     | 30.00th=[  281], 40.00th=[  457], 50.00th=[  529], 60.00th=[  570],
     | 70.00th=[  603], 80.00th=[  635], 90.00th=[  676], 95.00th=[  725],
     | 99.00th=[ 1500], 99.50th=[ 1844], 99.90th=[17957], 99.95th=[21627],
     | 99.99th=[36963]
   bw (  MiB/s): min= 1276, max= 4901, per=100.00%, avg=1902.29, stdev=83.47, samples=1912
   iops        : min=326910, max=1254692, avg=486987.04, stdev=21369.23, samples=1912
  lat (usec)   : 50=0.01%, 100=2.34%, 250=25.21%, 500=16.74%, 750=52.05%
  lat (usec)   : 1000=1.75%
  lat (msec)   : 2=1.48%, 4=0.07%, 10=0.14%, 20=0.16%, 50=0.06%
  lat (msec)   : 100=0.01%
  cpu          : usr=10.48%, sys=28.23%, ctx=32066705, majf=0, minf=466
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=58381579,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1900MiB/s (1993MB/s), 1900MiB/s-1900MiB/s (1993MB/s-1993MB/s), io=223GiB (239GB), run=120002-120002msec

Disk stats (read/write):
  nvme8n1: ios=58304320/0, merge=0/0, ticks=5802867/0, in_queue=5802867, util=99.94%

./rand-read-4k.sh /dev/nvme9n1p1

iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 8 processes
Jobs: 8 (f=8): [r(8)][2.5%][r=1774MiB/s][r=454k IOPS][eta 01m:57s]
Jobs: 8 (f=8): [r(8)][4.2%][r=1826MiB/s][r=468k IOPS][eta 01m:55s]
Jobs: 8 (f=8): [r(8)][5.8%][r=1799MiB/s][r=461k IOPS][eta 01m:53s]
Jobs: 8 (f=8): [r(8)][7.5%][r=1767MiB/s][r=452k IOPS][eta 01m:51s]
Jobs: 8 (f=8): [r(8)][9.2%][r=1789MiB/s][r=458k IOPS][eta 01m:49s]
Jobs: 8 (f=8): [r(8)][10.8%][r=1700MiB/s][r=435k IOPS][eta 01m:47s]
Jobs: 8 (f=8): [r(8)][12.5%][r=1731MiB/s][r=443k IOPS][eta 01m:45s]
Jobs: 8 (f=8): [r(8)][14.2%][r=1710MiB/s][r=438k IOPS][eta 01m:43s]
Jobs: 8 (f=8): [r(8)][15.8%][r=1732MiB/s][r=443k IOPS][eta 01m:41s]
Jobs: 8 (f=8): [r(8)][17.5%][r=1711MiB/s][r=438k IOPS][eta 01m:39s]
Jobs: 8 (f=8): [r(8)][19.2%][r=1706MiB/s][r=437k IOPS][eta 01m:37s]
Jobs: 8 (f=8): [r(8)][20.8%][r=1705MiB/s][r=436k IOPS][eta 01m:35s]
Jobs: 8 (f=8): [r(8)][22.5%][r=1720MiB/s][r=440k IOPS][eta 01m:33s]
Jobs: 8 (f=8): [r(8)][24.2%][r=1739MiB/s][r=445k IOPS][eta 01m:31s]
Jobs: 8 (f=8): [r(8)][25.8%][r=1722MiB/s][r=441k IOPS][eta 01m:29s]
Jobs: 8 (f=8): [r(8)][27.5%][r=1715MiB/s][r=439k IOPS][eta 01m:27s]
Jobs: 8 (f=8): [r(8)][29.2%][r=1717MiB/s][r=439k IOPS][eta 01m:25s]
Jobs: 8 (f=8): [r(8)][30.8%][r=1728MiB/s][r=442k IOPS][eta 01m:23s]
Jobs: 8 (f=8): [r(8)][32.5%][r=1715MiB/s][r=439k IOPS][eta 01m:21s]
Jobs: 8 (f=8): [r(8)][34.2%][r=1716MiB/s][r=439k IOPS][eta 01m:19s]
Jobs: 8 (f=8): [r(8)][35.8%][r=1718MiB/s][r=440k IOPS][eta 01m:17s]
Jobs: 8 (f=8): [r(8)][37.5%][r=1697MiB/s][r=434k IOPS][eta 01m:15s]
Jobs: 8 (f=8): [r(8)][39.2%][r=1728MiB/s][r=442k IOPS][eta 01m:13s]
Jobs: 8 (f=8): [r(8)][40.8%][r=1724MiB/s][r=441k IOPS][eta 01m:11s]
Jobs: 8 (f=8): [r(8)][42.5%][r=1729MiB/s][r=443k IOPS][eta 01m:09s]
Jobs: 8 (f=8): [r(8)][44.2%][r=1716MiB/s][r=439k IOPS][eta 01m:07s]
Jobs: 8 (f=8): [r(8)][45.8%][r=1722MiB/s][r=441k IOPS][eta 01m:05s]
Jobs: 8 (f=8): [r(8)][47.5%][r=1709MiB/s][r=437k IOPS][eta 01m:03s]
Jobs: 8 (f=8): [r(8)][49.2%][r=1728MiB/s][r=442k IOPS][eta 01m:01s]
Jobs: 8 (f=8): [r(8)][50.8%][r=1744MiB/s][r=447k IOPS][eta 00m:59s]
Jobs: 8 (f=8): [r(8)][52.5%][r=1719MiB/s][r=440k IOPS][eta 00m:57s]
Jobs: 8 (f=8): [r(8)][54.2%][r=1701MiB/s][r=435k IOPS][eta 00m:55s]
Jobs: 8 (f=8): [r(8)][55.8%][r=1641MiB/s][r=420k IOPS][eta 00m:53s]
Jobs: 8 (f=8): [r(8)][57.5%][r=1643MiB/s][r=420k IOPS][eta 00m:51s]
Jobs: 8 (f=8): [r(8)][59.2%][r=1642MiB/s][r=420k IOPS][eta 00m:49s]
Jobs: 8 (f=8): [r(8)][60.8%][r=1642MiB/s][r=420k IOPS][eta 00m:47s]
Jobs: 8 (f=8): [r(8)][62.5%][r=1639MiB/s][r=420k IOPS][eta 00m:45s]
Jobs: 8 (f=8): [r(8)][64.2%][r=1640MiB/s][r=420k IOPS][eta 00m:43s]
Jobs: 8 (f=8): [r(8)][65.8%][r=1641MiB/s][r=420k IOPS][eta 00m:41s]
Jobs: 8 (f=8): [r(8)][67.5%][r=1641MiB/s][r=420k IOPS][eta 00m:39s]
Jobs: 8 (f=8): [r(8)][69.2%][r=1642MiB/s][r=420k IOPS][eta 00m:37s]
Jobs: 8 (f=8): [r(8)][70.8%][r=1638MiB/s][r=419k IOPS][eta 00m:35s]
Jobs: 8 (f=8): [r(8)][72.5%][r=1642MiB/s][r=420k IOPS][eta 00m:33s]
Jobs: 8 (f=8): [r(8)][74.2%][r=1642MiB/s][r=420k IOPS][eta 00m:31s]
Jobs: 8 (f=8): [r(8)][75.8%][r=1642MiB/s][r=420k IOPS][eta 00m:29s]
Jobs: 8 (f=8): [r(8)][77.5%][r=1643MiB/s][r=421k IOPS][eta 00m:27s]
Jobs: 8 (f=8): [r(8)][79.2%][r=1640MiB/s][r=420k IOPS][eta 00m:25s]
Jobs: 8 (f=8): [r(8)][80.8%][r=1641MiB/s][r=420k IOPS][eta 00m:23s]
Jobs: 8 (f=8): [r(8)][82.5%][r=1641MiB/s][r=420k IOPS][eta 00m:21s]
Jobs: 8 (f=8): [r(8)][84.2%][r=1641MiB/s][r=420k IOPS][eta 00m:19s]
Jobs: 8 (f=8): [r(8)][85.8%][r=1639MiB/s][r=420k IOPS][eta 00m:17s]
Jobs: 8 (f=8): [r(8)][87.5%][r=1641MiB/s][r=420k IOPS][eta 00m:15s]
Jobs: 8 (f=8): [r(8)][89.2%][r=3387MiB/s][r=867k IOPS][eta 00m:13s] <-- where fio for /dev/nvme8n1p1 finished
Jobs: 8 (f=8): [r(8)][90.8%][r=3387MiB/s][r=867k IOPS][eta 00m:11s]
Jobs: 8 (f=8): [r(8)][92.5%][r=3387MiB/s][r=867k IOPS][eta 00m:09s]
Jobs: 8 (f=8): [r(8)][94.2%][r=3387MiB/s][r=867k IOPS][eta 00m:07s]
Jobs: 8 (f=8): [r(8)][95.8%][r=3387MiB/s][r=867k IOPS][eta 00m:05s]
Jobs: 8 (f=8): [r(8)][97.5%][r=3387MiB/s][r=867k IOPS][eta 00m:03s]
Jobs: 8 (f=8): [r(8)][99.2%][r=3387MiB/s][r=867k IOPS][eta 00m:01s]
Jobs: 8 (f=8): [r(8)][100.0%][r=3387MiB/s][r=867k IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=8): err= 0: pid=6474: Mon Sep 27 14:20:07 2021
  read: IOPS=487k, BW=1902MiB/s (1995MB/s)(223GiB/120009msec)
    slat (nsec): min=1883, max=50616k, avg=10593.65, stdev=142749.85
    clat (usec): min=31, max=51625, avg=511.80, stdev=998.05
     lat (usec): min=34, max=51628, avg=522.82, stdev=1009.70
    clat percentiles (usec):
     |  1.00th=[   83],  5.00th=[  129], 10.00th=[  153], 20.00th=[  200],
     | 30.00th=[  277], 40.00th=[  453], 50.00th=[  529], 60.00th=[  570],
     | 70.00th=[  603], 80.00th=[  635], 90.00th=[  676], 95.00th=[  725],
     | 99.00th=[ 1532], 99.50th=[ 1860], 99.90th=[18220], 99.95th=[22152],
     | 99.99th=[38536]
   bw (  MiB/s): min= 1271, max= 4846, per=99.76%, avg=1897.63, stdev=79.53, samples=1912
   iops        : min=325509, max=1240592, avg=485792.95, stdev=20359.40, samples=1912
  lat (usec)   : 50=0.01%, 100=2.47%, 250=25.45%, 500=16.49%, 750=51.90%
  lat (usec)   : 1000=1.80%
  lat (msec)   : 2=1.46%, 4=0.07%, 10=0.13%, 20=0.15%, 50=0.07%
  lat (msec)   : 100=0.01%
  cpu          : usr=10.38%, sys=28.29%, ctx=32037939, majf=0, minf=487
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=58439698,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1902MiB/s (1995MB/s), 1902MiB/s-1902MiB/s (1995MB/s-1995MB/s), io=223GiB (239GB), run=120009-120009msec

Disk stats (read/write):
  nvme9n1: ios=58285250/0, merge=0/0, ticks=5768332/0, in_queue=5768332, util=99.95%

Just for fun I tested out CrystalDiskMark with Windows Server

Running 2 independent disks at same time


Running a single disk


Storage Spaces with a 2-disk mirror


Storage Spaces with 8 disks in mirrored pairs


Just wanted to update here in case anyone else happens to come along and read this. I ended up posting over at ServeTheHome as well.

  • With CPU pinning in fio, I max out somewhere between 1739k and 2100k 4k read IOPS across all 8 disks at once. I’m not sure whether I’m just hitting an overall CPU/PCIe 3.0 bottleneck. Any ideas on what I should expect?
  • Without CPU pinning, I get WORSE throughput than with a single disk. How much worse depends on how many disks I run fio against at the same time. I even took one of the physical CPUs out to see whether cross-socket communication was the issue, and I still have the same problem.
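
To put those aggregate numbers in bandwidth terms, IOPS can be converted to MiB/s at the 4k block size. A small sketch (the helper name `iops_to_mibs` is made up here, and integer division drops the fraction):

```shell
#!/bin/sh
# Convert an IOPS figure at a given block size (bytes) to whole MiB/s.
# Hypothetical helper; integer division truncates the fractional part.
iops_to_mibs() {
    echo $(( $1 * $2 / 1048576 ))
}

iops_to_mibs 867000 4096    # single drive alone, per the fio runs above -> 3386
iops_to_mibs 1739000 4096   # low end of the 8-drive aggregate -> 6792
iops_to_mibs 2100000 4096   # high end of the 8-drive aggregate -> 8203
```

So the pinned 8-drive aggregate is roughly 6.6 to 8.0 GiB/s, or about two to two and a half drives' worth at the 867k IOPS a single drive reached on its own. Whether that lines up with a shared uplink depends on the link width behind the drives, which is not stated here.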

I hate when I find issues similar to mine with no resolution, so I wanted to come back and update things a bit. I ended up moving the drives to the newer R7525 servers and can get full throughput without any issues.

I believe my issue came down to 2 things:

  1. PCIe switches
  2. Poor NUMA memory optimizations on gen 1 EPYCs

To recap a bit, I had determined that I could get the full throughput of 2 disks simultaneously if I pinned fio to the proper sockets/cores. As soon as I tried to run a third fio against a separate drive, the throughput of all 3 would decline, and the combined throughput never exceeded that of 2 drives. The more jobs I ran simultaneously, the worse things got, until the combined throughput of 8 disks was about 10% of a single drive. I had found a document from Dell with diagrams for various configurations. The one in these servers was their “storage optimized” configuration, not “performance optimized”, which allows for more NVMe drives by using PCIe switches. I believe that when those were pushed too far, the entire system would start to crumble.
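
For anyone checking their own chassis, one rough way to spot a switch is to count how many PCI functions sit in the sysfs path of an NVMe controller. This is a sketch under assumptions, not something I ran on these servers: the helper name `pci_depth` is made up, and the interpretation of the count is a rule of thumb.

```shell
#!/bin/sh
# Count the PCI functions (domain:bus:dev.fn components) in a resolved sysfs path.
# Hypothetical helper: 2 suggests root port + endpoint; a higher count suggests
# extra bridges, such as a PCIe switch, sitting in between.
pci_depth() {
    printf '%s\n' "$1" | tr '/' '\n' |
        grep -c -E '^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$'
}

# Example, run on the machine itself:
#   pci_depth "$(readlink -f /sys/class/nvme/nvme8/device)"
```

The lstopo output earlier in the thread shows three PCIBridge levels above each NVMe device, which would correspond to a depth of 4 with this helper, consistent with drives hanging off a switch instead of directly off a root port.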

Thanks @Biky for letting me bounce my issues off you and for your suggestions.

If I can help with anything, then it’s always my pleasure.