Linux mdraid NVMe issues

lspci -tv seems to indicate that the NVMe drives are all on the same CPU:

 +-[0000:20]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-01.1-[21-2e]----00.0-[22-2e]--+-00.0-[23]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-01.0-[24]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-02.0-[25]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-03.0-[26]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-04.0-[27]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-05.0-[28]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-06.0-[29]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-07.0-[2a]----00.0  KIOXIA Corporation NVMe SSD Controller Cx6
 |           |                               +-0c.0-[2b]--
 |           |                               +-0d.0-[2c]--
 |           |                               +-0e.0-[2d]--
 |           |                               \-0f.0-[2e]--
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           +-07.1-[2f]--+-00.0  Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
 |           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
 |           |            \-00.3  Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
 |           +-08.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
 |           \-08.1-[30]--+-00.0  Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
 |                        \-00.1  Advanced Micro Devices, Inc. [AMD] Zeppelin Cryptographic Coprocessor NTBCCP

sudo mdadm --create --verbose /dev/md50 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

sudo mdadm --create --verbose /dev/md51 --level=4 --raid-devices=4 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
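
Since the level passed to --create decides the layout (--level=4 asks mdadm for RAID4, while RAID10 would be --level=10), it can be worth confirming what md actually built before benchmarking. A minimal sketch, assuming the usual /proc/mdstat line format of "mdN : active <level> <members...>"; md_levels is a hypothetical helper name, not part of mdadm.

```shell
# md_levels: read /proc/mdstat-style text on stdin and print
# "<array> <personality>" for every active array.
md_levels() {
    awk '$1 ~ /^md/ && $2 == ":" && $3 == "active" {
        # Skip an optional state such as "(auto-read-only)".
        lvl = ($4 ~ /^\(/) ? $5 : $4
        print $1, lvl
    }'
}

# Typical use on a live system:
#   md_levels < /proc/mdstat
```

mdadm --detail /dev/md51 reports the same information on its "Raid Level" line.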

2 disk RAID1


/tmp/raid1# xfs_info /tmp/raid1
meta-data=/dev/md50              isize=512    agcount=4, agsize=97667604 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=390670416, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=190757, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


/tmp/raid1# sync;fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=test_seq_mix_rw --bs=8k --iodepth=32 --size=1G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_seq_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1)
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3785319: Wed Sep 22 14:41:35 2021
  read: IOPS=45.8k, BW=358MiB/s (375MB/s)(513MiB/1435msec)
    slat (nsec): min=3116, max=88257, avg=6067.01, stdev=3375.07
    clat (usec): min=8, max=2955, avg=310.79, stdev=429.85
     lat (usec): min=15, max=2985, avg=317.00, stdev=429.77
    clat percentiles (usec):
     |  1.00th=[   24],  5.00th=[   48], 10.00th=[   73], 20.00th=[  111],
     | 30.00th=[  143], 40.00th=[  174], 50.00th=[  202], 60.00th=[  231],
     | 70.00th=[  260], 80.00th=[  297], 90.00th=[  441], 95.00th=[ 1598],
     | 99.00th=[ 2147], 99.50th=[ 2212], 99.90th=[ 2343], 99.95th=[ 2409],
     | 99.99th=[ 2474]
   bw (  KiB/s): min=303760, max=381312, per=93.50%, avg=342536.00, stdev=54837.55, samples=2
   iops        : min=37970, max=47664, avg=42817.00, stdev=6854.69, samples=2
  write: IOPS=45.5k, BW=356MiB/s (373MB/s)(511MiB/1435msec); 0 zone resets
    slat (usec): min=4, max=227, avg= 9.37, stdev= 4.16
    clat (usec): min=15, max=1825, avg=372.61, stdev=130.20
     lat (usec): min=23, max=1843, avg=382.13, stdev=130.38
    clat percentiles (usec):
     |  1.00th=[   47],  5.00th=[  147], 10.00th=[  217], 20.00th=[  285],
     | 30.00th=[  314], 40.00th=[  343], 50.00th=[  371], 60.00th=[  400],
     | 70.00th=[  437], 80.00th=[  474], 90.00th=[  523], 95.00th=[  562],
     | 99.00th=[  652], 99.50th=[  717], 99.90th=[ 1336], 99.95th=[ 1614],
     | 99.99th=[ 1729]
   bw (  KiB/s): min=303888, max=375808, per=93.27%, avg=339848.00, stdev=50855.12, samples=2
   iops        : min=37986, max=46976, avg=42481.00, stdev=6356.89, samples=2
  lat (usec)   : 10=0.01%, 20=0.40%, 50=2.90%, 100=6.64%, 250=30.31%
  lat (usec)   : 500=48.30%, 750=7.61%, 1000=0.37%
  lat (msec)   : 2=2.20%, 4=1.26%
  cpu          : usr=15.83%, sys=69.11%, ctx=5659, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=358MiB/s (375MB/s), 358MiB/s-358MiB/s (375MB/s-375MB/s), io=513MiB (538MB), run=1435-1435msec
  WRITE: bw=356MiB/s (373MB/s), 356MiB/s-356MiB/s (373MB/s-373MB/s), io=511MiB (535MB), run=1435-1435msec

Disk stats (read/write):
    md50: ios=63757/63484, merge=0/0, ticks=13372/4264, in_queue=17636, util=93.40%, aggrios=33580/66080, aggrmerge=1531/1533, aggrticks=6794/3355, aggrin_queue=10150, aggrutil=90.94%
  nvme0n1: ios=1465/65359, merge=3063/0, ticks=309/2463, in_queue=2773, util=90.94%
  nvme1n1: ios=65695/66802, merge=0/3067, ticks=13279/4247, in_queue=17527, util=90.94%



/tmp/raid1# sync;fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_rand_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1)
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=3785453: Wed Sep 22 14:41:43 2021
  read: IOPS=50.3k, BW=393MiB/s (412MB/s)(513MiB/1306msec)
    slat (nsec): min=3166, max=95210, avg=6234.30, stdev=3573.25
    clat (usec): min=29, max=953, avg=240.94, stdev=94.16
     lat (usec): min=33, max=957, avg=247.31, stdev=94.17
    clat percentiles (usec):
     |  1.00th=[   86],  5.00th=[  106], 10.00th=[  127], 20.00th=[  161],
     | 30.00th=[  188], 40.00th=[  212], 50.00th=[  235], 60.00th=[  255],
     | 70.00th=[  281], 80.00th=[  306], 90.00th=[  351], 95.00th=[  412],
     | 99.00th=[  545], 99.50th=[  594], 99.90th=[  685], 99.95th=[  717],
     | 99.99th=[  791]
   bw (  KiB/s): min=403376, max=407056, per=100.00%, avg=405216.00, stdev=2602.15, samples=2
   iops        : min=50422, max=50882, avg=50652.00, stdev=325.27, samples=2
  write: IOPS=50.0k, BW=391MiB/s (410MB/s)(511MiB/1306msec); 0 zone resets
    slat (usec): min=4, max=132, avg= 9.54, stdev= 4.23
    clat (usec): min=90, max=1107, avg=379.21, stdev=84.15
     lat (usec): min=117, max=1113, avg=388.91, stdev=84.30
    clat percentiles (usec):
     |  1.00th=[  225],  5.00th=[  262], 10.00th=[  281], 20.00th=[  306],
     | 30.00th=[  326], 40.00th=[  347], 50.00th=[  371], 60.00th=[  396],
     | 70.00th=[  420], 80.00th=[  453], 90.00th=[  490], 95.00th=[  529],
     | 99.00th=[  594], 99.50th=[  627], 99.90th=[  766], 99.95th=[  881],
     | 99.99th=[ 1037]
   bw (  KiB/s): min=400496, max=406144, per=100.00%, avg=403320.00, stdev=3993.74, samples=2
   iops        : min=50062, max=50768, avg=50415.00, stdev=499.22, samples=2
  lat (usec)   : 50=0.01%, 100=1.83%, 250=28.55%, 500=64.53%, 750=5.02%
  lat (usec)   : 1000=0.05%
  lat (msec)   : 2=0.01%
  cpu          : usr=18.93%, sys=77.39%, ctx=3644, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=393MiB/s (412MB/s), 393MiB/s-393MiB/s (412MB/s-412MB/s), io=513MiB (538MB), run=1306-1306msec
  WRITE: bw=391MiB/s (410MB/s), 391MiB/s-391MiB/s (410MB/s-410MB/s), io=511MiB (535MB), run=1306-1306msec

Disk stats (read/write):
    md50: ios=55899/55625, merge=0/0, ticks=7380/3472, in_queue=10852, util=92.21%, aggrios=33985/66489, aggrmerge=1237/1237, aggrticks=4464/3472, aggrin_queue=7936, aggrutil=88.65%
  nvme0n1: ios=39355/65361, merge=2475/0, ticks=5310/2705, in_queue=8015, util=88.27%
  nvme1n1: ios=28615/67618, merge=0/2475, ticks=3619/4239, in_queue=7858, util=88.65%


4 disk RAID10 (md51, created with --level=4 above, which mdadm treats as RAID4 rather than RAID10)

/tmp/raid10-4disk# xfs_info /tmp/raid10-4disk/
meta-data=/dev/md51              isize=512    agcount=32, agsize=36625408 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=1172011008, imaxpct=5
         =                       sunit=128    swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
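
The sunit/swidth values are a quick cross-check on the geometry that mkfs.xfs picked up from md. A rough sketch (both helper names are hypothetical): swidth divided by sunit gives the number of data-bearing members per stripe. For the values above that is 384 / 128 = 3, which is what a 4-device single-parity layout would be expected to report; as far as I know, a 4-device RAID10 with the default near-2 layout would report 2 instead.

```shell
# stripe_members SUNIT SWIDTH
# Number of data-bearing members per stripe, from xfs_info's sunit/swidth
# (both in filesystem blocks). Prints 0 when no stripe geometry is set,
# as in the RAID1 xfs_info output (sunit=0, swidth=0).
stripe_members() {
    local sunit=$1 swidth=$2
    if [ "$sunit" -eq 0 ]; then
        echo 0
        return
    fi
    echo $(( swidth / sunit ))
}

# chunk_kib SUNIT BSIZE
# Stripe unit (md chunk) in KiB, given sunit in blocks and bsize in bytes.
chunk_kib() {
    echo $(( $1 * $2 / 1024 ))
}

# Values from the md51 xfs_info output:
#   stripe_members 128 384
#   chunk_kib 128 4096
```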


/tmp/raid10-4disk# sync;fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=test_seq_mix_rw --bs=8k --iodepth=32 --size=1G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_seq_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=125MiB/s,w=124MiB/s][r=16.0k,w=15.8k IOPS][eta 00m:00s]
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3786308: Wed Sep 22 14:43:45 2021
  read: IOPS=18.2k, BW=142MiB/s (149MB/s)(513MiB/3619msec)
    slat (usec): min=3, max=11147, avg=11.68, stdev=43.63
    clat (usec): min=8, max=24293, avg=487.95, stdev=478.94
     lat (usec): min=17, max=35409, avg=499.73, stdev=489.90
    clat percentiles (usec):
     |  1.00th=[   27],  5.00th=[   74], 10.00th=[  121], 20.00th=[  184],
     | 30.00th=[  265], 40.00th=[  367], 50.00th=[  486], 60.00th=[  586],
     | 70.00th=[  685], 80.00th=[  766], 90.00th=[  840], 95.00th=[  898],
     | 99.00th=[ 1029], 99.50th=[ 1090], 99.90th=[ 1582], 99.95th=[ 3032],
     | 99.99th=[23987]
   bw (  KiB/s): min=125168, max=187488, per=100.00%, avg=145858.29, stdev=25504.48, samples=7
   iops        : min=15646, max=23436, avg=18232.29, stdev=3188.06, samples=7
  write: IOPS=18.1k, BW=141MiB/s (148MB/s)(511MiB/3619msec); 0 zone resets
    slat (usec): min=3, max=225, avg=12.53, stdev= 4.01
    clat (usec): min=75, max=43334, avg=1250.08, stdev=797.42
     lat (usec): min=89, max=43560, avg=1262.71, stdev=797.99
    clat percentiles (usec):
     |  1.00th=[  586],  5.00th=[  717], 10.00th=[  799], 20.00th=[  930],
     | 30.00th=[ 1057], 40.00th=[ 1139], 50.00th=[ 1205], 60.00th=[ 1270],
     | 70.00th=[ 1352], 80.00th=[ 1467], 90.00th=[ 1729], 95.00th=[ 1942],
     | 99.00th=[ 2343], 99.50th=[ 2474], 99.90th=[ 3064], 99.95th=[ 4424],
     | 99.99th=[43254]
   bw (  KiB/s): min=125472, max=188464, per=100.00%, avg=145181.71, stdev=25903.31, samples=7
   iops        : min=15684, max=23558, avg=18147.71, stdev=3237.91, samples=7
  lat (usec)   : 10=0.01%, 20=0.20%, 50=1.32%, 100=2.23%, 250=10.47%
  lat (usec)   : 500=11.72%, 750=16.55%, 1000=19.59%
  lat (msec)   : 2=35.79%, 4=2.07%, 10=0.01%, 20=0.01%, 50=0.02%
  cpu          : usr=6.38%, sys=45.11%, ctx=17914, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=142MiB/s (149MB/s), 142MiB/s-142MiB/s (149MB/s-149MB/s), io=513MiB (538MB), run=3619-3619msec
  WRITE: bw=141MiB/s (148MB/s), 141MiB/s-141MiB/s (148MB/s-148MB/s), io=511MiB (535MB), run=3619-3619msec

Disk stats (read/write):
    md51: ios=63312/63022, merge=0/0, ticks=28548/48076, in_queue=76624, util=97.17%, aggrios=4393/4847, aggrmerge=96294/84522, aggrticks=1959/1732, aggrin_queue=3690, aggrutil=97.39%
  nvme3n1: ios=5893/2881, merge=130157/40639, ticks=2727/1103, in_queue=3830, util=94.49%
  nvme2n1: ios=4775/3048, merge=118015/40502, ticks=2302/958, in_queue=3259, util=94.69%
  nvme5n1: ios=1021/10759, merge=7080/216003, ticks=261/3902, in_queue=4162, util=97.39%
  nvme4n1: ios=5884/2701, merge=129924/40947, ticks=2546/965, in_queue=3511, util=94.36%


/tmp/raid10-4disk# sync;fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_rand_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=88.4MiB/s,w=87.5MiB/s][r=11.3k,w=11.2k IOPS][eta 00m:00s]
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=3786441: Wed Sep 22 14:43:58 2021
  read: IOPS=11.1k, BW=87.0MiB/s (91.2MB/s)(513MiB/5900msec)
    slat (usec): min=4, max=31026, avg=14.50, stdev=121.02
    clat (usec): min=16, max=22230, avg=675.78, stdev=364.85
     lat (usec): min=28, max=41709, avg=690.40, stdev=396.38
    clat percentiles (usec):
     |  1.00th=[  239],  5.00th=[  396], 10.00th=[  461], 20.00th=[  545],
     | 30.00th=[  594], 40.00th=[  635], 50.00th=[  676], 60.00th=[  709],
     | 70.00th=[  750], 80.00th=[  799], 90.00th=[  865], 95.00th=[  938],
     | 99.00th=[ 1090], 99.50th=[ 1139], 99.90th=[ 1270], 99.95th=[ 1369],
     | 99.99th=[22152]
   bw (  KiB/s): min=78192, max=92432, per=99.70%, avg=88836.36, stdev=3789.38, samples=11
   iops        : min= 9774, max=11554, avg=11104.55, stdev=473.67, samples=11
  write: IOPS=11.1k, BW=86.5MiB/s (90.7MB/s)(511MiB/5900msec); 0 zone resets
    slat (nsec): min=5580, max=39054, avg=15588.08, stdev=3355.33
    clat (usec): min=466, max=24020, avg=2162.81, stdev=647.04
     lat (usec): min=478, max=24033, avg=2178.51, stdev=647.05
    clat percentiles (usec):
     |  1.00th=[  988],  5.00th=[ 1336], 10.00th=[ 1532], 20.00th=[ 1745],
     | 30.00th=[ 1909], 40.00th=[ 2040], 50.00th=[ 2147], 60.00th=[ 2245],
     | 70.00th=[ 2376], 80.00th=[ 2573], 90.00th=[ 2769], 95.00th=[ 2999],
     | 99.00th=[ 3392], 99.50th=[ 3523], 99.90th=[ 3818], 99.95th=[13042],
     | 99.99th=[23987]
   bw (  KiB/s): min=78560, max=90608, per=99.92%, avg=88549.82, stdev=3422.63, samples=11
   iops        : min= 9820, max=11326, avg=11068.73, stdev=427.83, samples=11
  lat (usec)   : 20=0.01%, 50=0.01%, 100=0.04%, 250=0.54%, 500=6.41%
  lat (usec)   : 750=28.12%, 1000=14.22%
  lat (msec)   : 2=19.08%, 4=31.53%, 10=0.01%, 20=0.02%, 50=0.02%
  cpu          : usr=4.39%, sys=34.96%, ctx=34319, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=87.0MiB/s (91.2MB/s), 87.0MiB/s-87.0MiB/s (91.2MB/s-91.2MB/s), io=513MiB (538MB), run=5900-5900msec
  WRITE: bw=86.5MiB/s (90.7MB/s), 86.5MiB/s-86.5MiB/s (90.7MB/s-90.7MB/s), io=511MiB (535MB), run=5900-5900msec

Disk stats (read/write):
    md51: ios=65580/65230, merge=0/0, ticks=40972/81160, in_queue=122132, util=97.68%, aggrios=54438/36047, aggrmerge=162601/69321, aggrticks=13177/5572, aggrin_queue=18749, aggrutil=98.59%
  nvme3n1: ios=72495/22061, merge=216794/21608, ticks=18560/3591, in_queue=22151, util=96.90%
  nvme2n1: ios=71186/21923, merge=218396/21500, ticks=17365/3220, in_queue=20585, util=96.93%
  nvme5n1: ios=0/77999, merge=0/212751, ticks=0/12375, in_queue=12375, util=98.59%
  nvme4n1: ios=74074/22207, merge=215214/21428, ticks=16783/3104, in_queue=19887, util=97.10%
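
The per-member lines above are easier to compare once reduced to read and write counts. A small sketch, assuming fio's "Disk stats" line format as pasted here; member_ios and write_only_members are hypothetical helper names. In this random run nvme5n1 shows ios=0/77999, and a member that only ever receives writes would be consistent with a dedicated parity device rather than a mirror half.

```shell
# member_ios: read fio "Disk stats" lines on stdin and print
# "<device> <reads> <writes>" for each line that carries an ios= field.
member_ios() {
    sed -n 's|^ *\([a-z0-9]*\): ios=\([0-9]*\)/\([0-9]*\),.*|\1 \2 \3|p'
}

# write_only_members: list devices that saw writes but no reads at all.
write_only_members() {
    member_ios | awk '$2 == 0 && $3 > 0 { print $1 }'
}

# Typical use: paste the disk stats section of a fio run into a file, then
#   write_only_members < diskstats.txt
```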

The scheduler on all these disks is currently set to none

# cat /sys/block/nvme*n1/queue/scheduler
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline
[none] mq-deadline

The same is true of the mdraid devices:

# cat /sys/block/md*/queue/scheduler
none
none
none
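
cat with a glob prints the values without saying which device each one belongs to. A minimal sketch of a loop that labels each line; show_schedulers is a hypothetical helper, and the optional argument only exists so the sysfs root can be overridden.

```shell
# show_schedulers [ROOT]
# Print "<device>: <scheduler line>" for every block device under ROOT
# (default /sys/block) that exposes queue/scheduler.
show_schedulers() {
    local root=${1:-/sys/block} f dev
    for f in "$root"/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f#"$root"/}     # strip the leading root directory
        dev=${dev%%/*}        # keep only the device name
        printf '%s: %s\n' "$dev" "$(cat "$f")"
    done
}

# Typical use:
#   show_schedulers
```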

Polling/Interrupt

Based on Wendell’s post (Fixing Slow NVMe Raid Performance on Epyc), a newer kernel (I am on 5.11) should have hybrid polling on by default. That being said, according to that link you sent,

cat /sys/block/nvme0n1/queue/io_poll_delay should output 0, but I’m getting back -1.
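
For reference, here is how I read the io_poll_delay values, going by the kernel's block queue sysfs documentation for kernels of this era: -1 is classic polling (the documented default), 0 is hybrid polling with an adaptive sleep, and a positive number is hybrid polling with a fixed sleep in microseconds. On that reading, -1 would just mean the default has not been changed. A small sketch that spells this out; describe_io_poll_delay is a hypothetical helper, and the mapping should be checked against the documentation for the running kernel.

```shell
# describe_io_poll_delay VALUE
# Translate a queue/io_poll_delay value into a short description.
describe_io_poll_delay() {
    case "$1" in
        -1) echo "classic polling" ;;
        0)  echo "hybrid polling (adaptive sleep)" ;;
        ''|*[!0-9]*) echo "unrecognized value" ;;
        *)  echo "hybrid polling (fixed ${1}us sleep)" ;;
    esac
}

# Typical use:
#   describe_io_poll_delay "$(cat /sys/block/nvme0n1/queue/io_poll_delay)"
# Switching a device to hybrid mode would then be (as root):
#   echo 0 > /sys/block/nvme0n1/queue/io_poll_delay
```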

So I made the following changes:

GRUB_CMDLINE_LINUX="nvme_core.default_ps_max_latency_us=0 nvme.poll_queues=64"

then ran update-initramfs -u -k all and update-grub.
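
After a reboot it is worth checking that the parameters really reached the kernel. A minimal sketch, assuming the parameters show up in /proc/cmdline as space-separated name=value tokens; kernel_param_value is a hypothetical helper, and the second argument only exists so another file can be substituted.

```shell
# kernel_param_value NAME [CMDLINE_FILE]
# Print the value of NAME=value from the kernel command line
# (default /proc/cmdline); prints nothing if NAME is absent.
kernel_param_value() {
    local name=$1 file=${2:-/proc/cmdline}
    tr ' ' '\n' < "$file" |
        awk -F= -v n="$name" '$1 == n { print substr($0, length(n) + 2); exit }'
}

# Typical use:
#   kernel_param_value nvme.poll_queues
#   kernel_param_value nvme_core.default_ps_max_latency_us
# The live module values can also be read directly:
#   cat /sys/module/nvme/parameters/poll_queues
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```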

After fiddling with all that, only minimal improvements (random mixed read/write went from about 11.1k to about 15.4k IOPS in each direction):

/tmp/raid10# sync;fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=117MiB/s,w=118MiB/s][r=14.9k,w=15.1k IOPS][eta 00m:00s]
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=3963: Wed Sep 22 15:36:24 2021
  read: IOPS=15.4k, BW=120MiB/s (126MB/s)(513MiB/4265msec)
    slat (nsec): min=3787, max=79941, avg=12023.42, stdev=3055.78
    clat (usec): min=12, max=1307, avg=491.25, stdev=120.87
     lat (usec): min=23, max=1317, avg=503.37, stdev=120.95
    clat percentiles (usec):
     |  1.00th=[  188],  5.00th=[  289], 10.00th=[  338], 20.00th=[  396],
     | 30.00th=[  437], 40.00th=[  465], 50.00th=[  494], 60.00th=[  523],
     | 70.00th=[  553], 80.00th=[  586], 90.00th=[  635], 95.00th=[  685],
     | 99.00th=[  783], 99.50th=[  816], 99.90th=[  889], 99.95th=[  914],
     | 99.99th=[ 1004]
   bw (  KiB/s): min=119184, max=141488, per=100.00%, avg=123254.00, stdev=7483.98, samples=8
   iops        : min=14898, max=17686, avg=15406.75, stdev=935.50, samples=8
  write: IOPS=15.3k, BW=120MiB/s (126MB/s)(511MiB/4265msec); 0 zone resets
    slat (usec): min=4, max=295, avg=13.32, stdev= 3.40
    clat (usec): min=415, max=3192, avg=1567.22, stdev=361.43
     lat (usec): min=428, max=3203, avg=1580.64, stdev=361.60
    clat percentiles (usec):
     |  1.00th=[  766],  5.00th=[  979], 10.00th=[ 1090], 20.00th=[ 1254],
     | 30.00th=[ 1385], 40.00th=[ 1500], 50.00th=[ 1565], 60.00th=[ 1647],
     | 70.00th=[ 1729], 80.00th=[ 1860], 90.00th=[ 2024], 95.00th=[ 2180],
     | 99.00th=[ 2474], 99.50th=[ 2573], 99.90th=[ 2737], 99.95th=[ 2835],
     | 99.99th=[ 2999]
   bw (  KiB/s): min=118752, max=141152, per=100.00%, avg=122728.00, stdev=7497.03, samples=8
   iops        : min=14844, max=17644, avg=15341.00, stdev=937.13, samples=8
  lat (usec)   : 20=0.01%, 50=0.04%, 100=0.06%, 250=1.29%, 500=24.61%
  lat (usec)   : 750=23.63%, 1000=3.37%
  lat (msec)   : 2=41.37%, 4=5.62%
  cpu          : usr=5.25%, sys=40.27%, ctx=29656, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=513MiB (538MB), run=4265-4265msec
  WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=511MiB (535MB), run=4265-4265msec

Disk stats (read/write):
    md127: ios=63657/63395, merge=0/0, ticks=30336/58008, in_queue=88344, util=97.71%, aggrios=54186/35805, aggrmerge=153885/66183, aggrticks=10059/3974, aggrin_queue=14034, aggrutil=97.69%
  nvme3n1: ios=72170/22126, merge=205049/21549, ticks=13680/2380, in_queue=16060, util=96.47%
  nvme2n1: ios=70824/21973, merge=206803/21456, ticks=13358/2289, in_queue=15647, util=96.39%
  nvme5n1: ios=0/77010, merge=0/200201, ticks=0/9014, in_queue=9013, util=97.69%
  nvme4n1: ios=73751/22114, merge=203690/21527, ticks=13200/2216, in_queue=15416, util=96.47%