I have had a plethora of NVME related performance issues with RAID and Linux. I have been following various issues as far back as Wendell helping Linus with his server NVME/ZFS/Intel issues. At the time I had 8 of the same Intel model.
I am trying to tune this system for PostgreSQL. Since all my issues appear storage related, I am starting from scratch, so there is nothing Postgres-specific in here. I am moving from a dual Xeon Gold 5120 system with 8 x 850 EVO. In things like pg_dump, replication, etc. I am experiencing performance degradation: a pg_dump of a specific database takes 2 hours on the old system vs 8 hours on the new system. At this point, I get better (fio) performance with a single drive than with 8 in RAID10 with mdadm.
specs
Ubuntu 20.04
2 x AMD EPYC 7601
384GB of DDR4 2666 MT/s, 16GB sticks.
8 x KCM6XVUL1T60 Kioxia NVME drives.
I have formatted the disks with nvme format; LBA format 3 (4096-byte data size) is in use:
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 1 : Metadata Size: 8 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 2 : Metadata Size: 0 bytes - Data Size: 1 bytes - Relative Performance: 0 Best
LBA Format 3 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
LBA Format 4 : Metadata Size: 8 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format 5 : Metadata Size: 64 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
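To double-check which LBA format a namespace is actually using, the "(in use)" line can be pulled out of this listing. The sketch below parses two sample lines copied from above; on a live system the text would presumably come from `nvme id-ns -H /dev/nvme0n1` (nvme-cli), and the variable names are just illustrative.

```shell
#!/bin/sh
# Sample lines copied from the listing above. On a real system this text is
# assumed to come from: nvme id-ns -H /dev/nvme0n1
lbaf_listing='LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 3 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)'

# Keep only the format flagged as active.
in_use=$(printf '%s\n' "$lbaf_listing" | grep '(in use)')

# Third whitespace-separated field is the LBA format index.
lbaf_index=$(printf '%s\n' "$in_use" | awk '{print $3}')

# Extract the number that follows "Data Size:".
data_size=$(printf '%s\n' "$in_use" | sed 's/.*Data Size: \([0-9]*\) bytes.*/\1/')

echo "LBA format ${lbaf_index} in use, ${data_size}-byte sectors"
```

Running the same check against every member drive would confirm they all report the same sector size.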
Single drive setup
mkfs.xfs /dev/nvme8n1
mount /dev/nvme8n1 /mnt/single
# xfs_info /dev/nvme8n1
meta-data=/dev/nvme8n1 isize=512 agcount=4, agsize=97675862 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1
data = bsize=4096 blocks=390703446, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=190773, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
8 in RAID10 setup
sudo mdadm --create --verbose /dev/md127 --level=10 --raid-devices=8 \
/dev/nvme0n1 \
/dev/nvme1n1 \
/dev/nvme2n1 \
/dev/nvme3n1 \
/dev/nvme4n1 \
/dev/nvme5n1 \
/dev/nvme6n1 \
/dev/nvme7n1
mkfs.xfs /dev/md127
mount /dev/md127 /mnt/raid10
# xfs_info /dev/md127
meta-data=/dev/md127 isize=512 agcount=32, agsize=48833792 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1
data = bsize=4096 blocks=1562681344, imaxpct=5
= sunit=128 swidth=512 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
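The sunit/swidth values that mkfs.xfs picked up from md can be decoded with a bit of arithmetic. This is only a sketch: it assumes the mdadm defaults apply (the create command above passes neither --chunk nor --layout), and the variable names are illustrative.

```shell
#!/bin/sh
# Values copied from the xfs_info output above (sunit/swidth are in
# filesystem blocks, bsize is in bytes).
bsize=4096
sunit=128
swidth=512

# Per-device chunk size in KiB.
chunk_kib=$(( sunit * bsize / 1024 ))

# Number of data chunks that make up one full stripe.
data_stripes=$(( swidth / sunit ))

# Size of one full stripe in KiB.
full_stripe_kib=$(( swidth * bsize / 1024 ))

echo "chunk size:   ${chunk_kib} KiB"
echo "data stripes: ${data_stripes}"
echo "full stripe:  ${full_stripe_kib} KiB"
```

That works out to a 512 KiB chunk and 4 data stripes, which is what 8 devices with 2 copies of each block would give, so XFS appears to be aligned to the array geometry.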
Using fio commands from a Packt book on PostgreSQL tuning:
single disk fio results
/mnt/single# fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=test_seq --bs=8k --iodepth=32 --size=1G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=1)
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=10955: Mon Sep 20 19:19:18 2021
read: IOPS=69.2k, BW=540MiB/s (567MB/s)(513MiB/950msec)
slat (nsec): min=2826, max=58801, avg=4760.75, stdev=1984.89
clat (usec): min=5, max=2378, avg=205.56, stdev=370.88
lat (usec): min=16, max=2384, avg=210.46, stdev=370.75
clat percentiles (usec):
| 1.00th=[ 14], 5.00th=[ 28], 10.00th=[ 41], 20.00th=[ 65],
| 30.00th=[ 88], 40.00th=[ 109], 50.00th=[ 129], 60.00th=[ 151],
| 70.00th=[ 174], 80.00th=[ 198], 90.00th=[ 249], 95.00th=[ 371],
| 99.00th=[ 2089], 99.50th=[ 2147], 99.90th=[ 2212], 99.95th=[ 2245],
| 99.99th=[ 2278]
bw ( KiB/s): min=520622, max=520622, per=94.08%, avg=520622.00, stdev= 0.00, samples=1
iops : min=65077, max=65077, avg=65077.00, stdev= 0.00, samples=1
write: IOPS=68.8k, BW=537MiB/s (564MB/s)(511MiB/950msec); 0 zone resets
slat (nsec): min=2895, max=48391, avg=4970.04, stdev=2028.48
clat (usec): min=13, max=780, avg=246.87, stdev=87.14
lat (usec): min=18, max=783, avg=251.97, stdev=87.27
clat percentiles (usec):
| 1.00th=[ 32], 5.00th=[ 82], 10.00th=[ 133], 20.00th=[ 188],
| 30.00th=[ 208], 40.00th=[ 227], 50.00th=[ 247], 60.00th=[ 269],
| 70.00th=[ 293], 80.00th=[ 318], 90.00th=[ 355], 95.00th=[ 383],
| 99.00th=[ 445], 99.50th=[ 478], 99.90th=[ 562], 99.95th=[ 603],
| 99.99th=[ 725]
bw ( KiB/s): min=521309, max=521309, per=94.72%, avg=521309.00, stdev= 0.00, samples=1
iops : min=65163, max=65163, avg=65163.00, stdev= 0.00, samples=1
lat (usec) : 10=0.01%, 20=1.42%, 50=6.70%, 100=13.20%, 250=49.49%
lat (usec) : 500=27.01%, 750=0.18%, 1000=0.01%
lat (msec) : 2=0.97%, 4=1.01%
cpu : usr=17.18%, sys=69.65%, ctx=3799, majf=0, minf=14
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=540MiB/s (567MB/s), 540MiB/s-540MiB/s (567MB/s-567MB/s), io=513MiB (538MB), run=950-950msec
WRITE: bw=537MiB/s (564MB/s), 537MiB/s-537MiB/s (564MB/s-564MB/s), io=511MiB (535MB), run=950-950msec
Disk stats (read/write):
nvme11n1: ios=60446/60321, merge=0/0, ticks=7766/892, in_queue=8658, util=90.11%
/mnt/single# fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=0)
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=11288: Mon Sep 20 19:20:29 2021
read: IOPS=76.6k, BW=598MiB/s (627MB/s)(513MiB/858msec)
slat (usec): min=2, max=223, avg= 4.72, stdev= 2.32
clat (usec): min=30, max=1122, avg=193.68, stdev=90.37
lat (usec): min=37, max=1125, avg=198.54, stdev=90.36
clat percentiles (usec):
| 1.00th=[ 82], 5.00th=[ 93], 10.00th=[ 104], 20.00th=[ 123],
| 30.00th=[ 141], 40.00th=[ 157], 50.00th=[ 176], 60.00th=[ 192],
| 70.00th=[ 212], 80.00th=[ 243], 90.00th=[ 318], 95.00th=[ 383],
| 99.00th=[ 515], 99.50th=[ 545], 99.90th=[ 627], 99.95th=[ 652],
| 99.99th=[ 857]
bw ( KiB/s): min=610848, max=610848, per=99.70%, avg=610848.00, stdev= 0.00, samples=1
iops : min=76356, max=76356, avg=76356.00, stdev= 0.00, samples=1
write: IOPS=76.2k, BW=595MiB/s (624MB/s)(511MiB/858msec); 0 zone resets
slat (nsec): min=2916, max=72217, avg=4857.61, stdev=2154.02
clat (usec): min=14, max=699, avg=213.88, stdev=56.06
lat (usec): min=27, max=702, avg=218.88, stdev=56.22
clat percentiles (usec):
| 1.00th=[ 101], 5.00th=[ 135], 10.00th=[ 151], 20.00th=[ 169],
| 30.00th=[ 184], 40.00th=[ 196], 50.00th=[ 208], 60.00th=[ 223],
| 70.00th=[ 239], 80.00th=[ 258], 90.00th=[ 285], 95.00th=[ 306],
| 99.00th=[ 355], 99.50th=[ 408], 99.90th=[ 594], 99.95th=[ 635],
| 99.99th=[ 668]
bw ( KiB/s): min=610528, max=610528, per=100.00%, avg=610528.00, stdev= 0.00, samples=1
iops : min=76316, max=76316, avg=76316.00, stdev= 0.00, samples=1
lat (usec) : 20=0.01%, 50=0.02%, 100=4.60%, 250=74.33%, 500=20.32%
lat (usec) : 750=0.72%, 1000=0.01%
lat (msec) : 2=0.01%
cpu : usr=20.42%, sys=75.50%, ctx=3198, majf=0, minf=13
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=598MiB/s (627MB/s), 598MiB/s-598MiB/s (627MB/s-627MB/s), io=513MiB (538MB), run=858-858msec
WRITE: bw=595MiB/s (624MB/s), 595MiB/s-595MiB/s (624MB/s-624MB/s), io=511MiB (535MB), run=858-858msec
Disk stats (read/write):
nvme11n1: ios=48415/48094, merge=0/0, ticks=6179/678, in_queue=6856, util=86.46%
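Both single-disk runs above finish in under a second (run=950msec and run=858msec) with one job and sys CPU at roughly 70-75%, so they may say more about one submitting thread than about the drive. A minimal sketch of a longer, multi-job variant of the same randrw test follows. NUMJOBS, RUNTIME, TARGET and RUN_FIO are hypothetical names with illustrative values, and --numjobs, --time_based, --runtime and --group_reporting are standard fio options that were not part of the original commands.

```shell
#!/bin/sh
# Illustrative values only, not tuned recommendations.
NUMJOBS=${NUMJOBS:-8}                    # parallel submitting jobs
RUNTIME=${RUNTIME:-60}                   # seconds per run
TARGET=${TARGET:-/mnt/single/test_rand}  # test file; point at /mnt/raid10 for the array

# Same 8k block size, queue depth and 50/50 mix as the original command,
# but time-based and spread over several jobs with one combined report.
set -- fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw \
  --filename="$TARGET" --bs=8k --iodepth=32 --size=1G \
  --readwrite=randrw --rwmixread=50 \
  --numjobs="$NUMJOBS" --time_based --runtime="$RUNTIME" --group_reporting

# Print the command; execute it only when RUN_FIO=1 and fio is installed.
echo "$@"
if [ "${RUN_FIO:-0}" = 1 ] && command -v fio >/dev/null 2>&1; then
  "$@"
fi
```

Running it once with TARGET on the single drive and once on the array would give a comparison that is less sensitive to startup effects than a one-second run.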
8 in RAID10 fio results
/mnt/raid10# fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=test_seq --bs=8k --iodepth=32 --size=1G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_seq_mix_rw: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1)
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=11602: Mon Sep 20 19:21:42 2021
read: IOPS=50.2k, BW=392MiB/s (411MB/s)(513MiB/1309msec)
slat (usec): min=3, max=249, avg= 6.38, stdev= 3.42
clat (usec): min=8, max=2362, avg=198.42, stdev=189.93
lat (usec): min=16, max=2371, avg=204.96, stdev=190.05
clat percentiles (usec):
| 1.00th=[ 19], 5.00th=[ 40], 10.00th=[ 62], 20.00th=[ 98],
| 30.00th=[ 126], 40.00th=[ 153], 50.00th=[ 182], 60.00th=[ 210],
| 70.00th=[ 237], 80.00th=[ 265], 90.00th=[ 302], 95.00th=[ 334],
| 99.00th=[ 1037], 99.50th=[ 1909], 99.90th=[ 2180], 99.95th=[ 2212],
| 99.99th=[ 2311]
bw ( KiB/s): min=399424, max=402032, per=99.78%, avg=400728.00, stdev=1844.13, samples=2
iops : min=49928, max=50254, avg=50091.00, stdev=230.52, samples=2
write: IOPS=49.9k, BW=390MiB/s (409MB/s)(511MiB/1309msec); 0 zone resets
slat (usec): min=5, max=220, avg= 9.53, stdev= 3.73
clat (usec): min=16, max=1999, avg=423.26, stdev=101.90
lat (usec): min=28, max=2011, avg=432.97, stdev=102.21
clat percentiles (usec):
| 1.00th=[ 239], 5.00th=[ 293], 10.00th=[ 310], 20.00th=[ 334],
| 30.00th=[ 359], 40.00th=[ 388], 50.00th=[ 416], 60.00th=[ 445],
| 70.00th=[ 474], 80.00th=[ 506], 90.00th=[ 553], 95.00th=[ 586],
| 99.00th=[ 668], 99.50th=[ 717], 99.90th=[ 979], 99.95th=[ 1156],
| 99.99th=[ 1811]
bw ( KiB/s): min=395264, max=401728, per=99.76%, avg=398496.00, stdev=4570.74, samples=2
iops : min=49408, max=50216, avg=49812.00, stdev=571.34, samples=2
lat (usec) : 10=0.01%, 20=0.60%, 50=3.18%, 100=6.79%, 250=27.42%
lat (usec) : 500=50.47%, 750=10.83%, 1000=0.16%
lat (msec) : 2=0.35%, 4=0.21%
cpu : usr=14.91%, sys=82.26%, ctx=2428, majf=0, minf=14
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=392MiB/s (411MB/s), 392MiB/s-392MiB/s (411MB/s-411MB/s), io=513MiB (538MB), run=1309-1309msec
WRITE: bw=390MiB/s (409MB/s), 390MiB/s-390MiB/s (409MB/s-409MB/s), io=511MiB (535MB), run=1309-1309msec
Disk stats (read/write):
md127: ios=59512/59363, merge=0/0, ticks=3652/1052, in_queue=4704, util=92.40%, aggrios=8214/16339, aggrmerge=0/0, aggrticks=511/238, aggrin_queue=749, aggrutil=90.55%
nvme0n1: ios=9744/16320, merge=0/0, ticks=553/222, in_queue=775, util=90.21%
nvme3n1: ios=6722/16320, merge=0/0, ticks=495/255, in_queue=750, util=90.21%
nvme6n1: ios=9485/16335, merge=0/0, ticks=553/225, in_queue=777, util=90.21%
nvme2n1: ios=9662/16320, merge=0/0, ticks=600/224, in_queue=823, util=90.21%
nvme5n1: ios=6863/16384, merge=0/0, ticks=447/253, in_queue=700, util=90.48%
nvme1n1: ios=6689/16320, merge=0/0, ticks=453/254, in_queue=707, util=90.27%
nvme4n1: ios=9585/16384, merge=0/0, ticks=564/225, in_queue=789, util=90.55%
nvme7n1: ios=6963/16335, merge=0/0, ticks=423/253, in_queue=676, util=90.27%
/mnt/raid10# fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
Jobs: 1 (f=1)
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=12089: Mon Sep 20 19:22:25 2021
read: IOPS=49.3k, BW=385MiB/s (404MB/s)(513MiB/1333msec)
slat (usec): min=3, max=286, avg= 6.50, stdev= 3.66
clat (usec): min=32, max=1247, avg=220.02, stdev=85.51
lat (usec): min=36, max=1312, avg=226.67, stdev=85.60
clat percentiles (usec):
| 1.00th=[ 82], 5.00th=[ 96], 10.00th=[ 110], 20.00th=[ 137],
| 30.00th=[ 163], 40.00th=[ 190], 50.00th=[ 217], 60.00th=[ 245],
| 70.00th=[ 273], 80.00th=[ 297], 90.00th=[ 326], 95.00th=[ 347],
| 99.00th=[ 433], 99.50th=[ 490], 99.90th=[ 635], 99.95th=[ 701],
| 99.99th=[ 889]
bw ( KiB/s): min=394224, max=395840, per=100.00%, avg=395032.00, stdev=1142.68, samples=2
iops : min=49278, max=49480, avg=49379.00, stdev=142.84, samples=2
write: IOPS=49.0k, BW=383MiB/s (402MB/s)(511MiB/1333msec); 0 zone resets
slat (usec): min=5, max=357, avg= 9.42, stdev= 4.39
clat (usec): min=45, max=1018, avg=412.76, stdev=87.73
lat (usec): min=63, max=1246, avg=422.33, stdev=88.03
clat percentiles (usec):
| 1.00th=[ 273], 5.00th=[ 293], 10.00th=[ 310], 20.00th=[ 330],
| 30.00th=[ 351], 40.00th=[ 379], 50.00th=[ 404], 60.00th=[ 433],
| 70.00th=[ 457], 80.00th=[ 490], 90.00th=[ 529], 95.00th=[ 562],
| 99.00th=[ 635], 99.50th=[ 685], 99.90th=[ 807], 99.95th=[ 889],
| 99.99th=[ 971]
bw ( KiB/s): min=392128, max=392816, per=100.00%, avg=392472.00, stdev=486.49, samples=2
iops : min=49016, max=49102, avg=49059.00, stdev=60.81, samples=2
lat (usec) : 50=0.02%, 100=3.26%, 250=28.03%, 500=60.06%, 750=8.53%
lat (usec) : 1000=0.10%
lat (msec) : 2=0.01%
cpu : usr=19.74%, sys=77.48%, ctx=2485, majf=0, minf=12
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=65713,65359,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=385MiB/s (404MB/s), 385MiB/s-385MiB/s (404MB/s-404MB/s), io=513MiB (538MB), run=1333-1333msec
WRITE: bw=383MiB/s (402MB/s), 383MiB/s-383MiB/s (402MB/s-402MB/s), io=511MiB (535MB), run=1333-1333msec
Disk stats (read/write):
md127: ios=55970/55695, merge=0/0, ticks=4460/960, in_queue=5420, util=92.81%, aggrios=8214/16358, aggrmerge=0/0, aggrticks=686/234, aggrin_queue=921, aggrutil=89.78%
nvme0n1: ios=10445/16350, merge=0/0, ticks=876/219, in_queue=1096, util=89.72%
nvme3n1: ios=5903/16470, merge=0/0, ticks=492/253, in_queue=744, util=89.72%
nvme6n1: ios=10538/16278, merge=0/0, ticks=880/217, in_queue=1097, util=89.72%
nvme2n1: ios=10414/16470, merge=0/0, ticks=870/221, in_queue=1090, util=89.72%
nvme5n1: ios=5972/16337, merge=0/0, ticks=501/248, in_queue=749, util=89.72%
nvme1n1: ios=5992/16350, merge=0/0, ticks=503/252, in_queue=756, util=89.72%
nvme4n1: ios=10478/16337, merge=0/0, ticks=875/222, in_queue=1097, util=89.72%
nvme7n1: ios=5971/16278, merge=0/0, ticks=497/247, in_queue=745, util=89.78%
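To put the gap in one number, the combined read and write IOPS from the four fio summaries above can be compared directly. A small sketch using integer arithmetic, with the figures copied by hand from the output and variable names that are just illustrative:

```shell
#!/bin/sh
# IOPS (read + write) copied from the fio summaries above.
single_seq=$(( 69200 + 68800 ))
raid_seq=$(( 50200 + 49900 ))
single_rand=$(( 76600 + 76200 ))
raid_rand=$(( 49300 + 49000 ))

# Integer percentage of single-drive IOPS that the array reaches.
seq_pct=$(( raid_seq * 100 / single_seq ))
rand_pct=$(( raid_rand * 100 / single_rand ))

echo "sequential mix: RAID10 reaches ${seq_pct}% of single-drive IOPS"
echo "random mix:     RAID10 reaches ${rand_pct}% of single-drive IOPS"
```

With these figures the 8-drive array lands at about 72% of the single drive for the sequential mix and about 64% for the random mix.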
I’m kind of at a loss here. After being so far in the weeds for a while now, I am hoping there is just something simple I overlooked. If there is any other info I can provide, just let me know. I’d appreciate any advice.