Threadripper Pro + Optane (Asus WRX80E) -- not getting full performance?

I've got an Optane P5800X in a Threadripper Pro / Asus WRX80E setup. I saw the YouTube video noting that this Asus board only runs the U.2 ports at PCIe 3.0 speeds, so I took the recommendation to get an M.2 → U.2 adapter.

I think I got the one that y'all even recommended: https://www.amazon.com/Ableconn-M2-U2-131-SFF-8643-Connector-PCIe-NVMe/dp/B01D2PXUQA

Well, I did some basic tests with fio's t/io_uring tool and I'm not seeing a ton of improvement:

Using the U.2 port on the mobo:

sudo tsc=$(journalctl -k | grep 'tsc: Refined TSC clocksource calibration:' | awk '{print $11}')  taskset -c 0 t/io_uring -r30 -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n1 -t1 -T${tsc}000 /dev/nvme8n1  >> ./optane_u2_direct_pcie3
Added file /dev/nvme8n1 (submitter 0)
polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
submitter=0, tid=13134
IOPS=898K, BW=438MiB/s, IOS/call=32/31, inflight=(48)
IOPS=901K, BW=440MiB/s, IOS/call=32/32, inflight=(48)
IOPS=898K, BW=438MiB/s, IOS/call=32/32, inflight=(48)
IOPS=897K, BW=438MiB/s, IOS/call=32/31, inflight=(80)
IOPS=900K, BW=439MiB/s, IOS/call=32/32, inflight=(80)
IOPS=897K, BW=438MiB/s, IOS/call=32/32, inflight=(48)
IOPS=895K, BW=437MiB/s, IOS/call=32/31, inflight=(80)
IOPS=895K, BW=437MiB/s, IOS/call=32/31, inflight=(112)
IOPS=897K, BW=438MiB/s, IOS/call=32/32, inflight=(48)
IOPS=898K, BW=438MiB/s, IOS/call=32/31, inflight=(80)
IOPS=898K, BW=438MiB/s, IOS/call=32/32, inflight=(48)
IOPS=899K, BW=439MiB/s, IOS/call=32/31, inflight=(112)
IOPS=900K, BW=439MiB/s, IOS/call=32/32, inflight=(112)
IOPS=899K, BW=439MiB/s, IOS/call=32/32, inflight=(80)
IOPS=899K, BW=439MiB/s, IOS/call=32/31, inflight=(112)
IOPS=898K, BW=438MiB/s, IOS/call=32/32, inflight=(112)
IOPS=898K, BW=438MiB/s, IOS/call=32/32, inflight=(112)
IOPS=898K, BW=438MiB/s, IOS/call=32/32, inflight=(80)
IOPS=899K, BW=439MiB/s, IOS/call=32/32, inflight=(48)
IOPS=896K, BW=437MiB/s, IOS/call=32/31, inflight=(80)
IOPS=894K, BW=436MiB/s, IOS/call=32/31, inflight=(112)
IOPS=894K, BW=436MiB/s, IOS/call=32/32, inflight=(112)
IOPS=896K, BW=437MiB/s, IOS/call=32/32, inflight=(80)
IOPS=896K, BW=437MiB/s, IOS/call=32/32, inflight=(48)
IOPS=896K, BW=437MiB/s, IOS/call=32/31, inflight=(80)
IOPS=899K, BW=439MiB/s, IOS/call=31/32, inflight=(48)
IOPS=898K, BW=438MiB/s, IOS/call=32/31, inflight=(80)
IOPS=897K, BW=438MiB/s, IOS/call=32/32, inflight=(80)
IOPS=899K, BW=439MiB/s, IOS/call=32/32, inflight=(48)
Exiting on timeout
Maximum IOPS=901K
13134: Latency percentiles:
    percentiles (usec):
     |  1.0000th=[ 35159614],  5.0000th=[ 35539718], 10.0000th=[ 35919822],
     | 20.0000th=[ 68798813], 30.0000th=[ 68798813], 40.0000th=[ 70319228],
     | 50.0000th=[ 98066816], 60.0000th=[102628063], 70.0000th=[107189310],
     | 80.0000th=[113270973], 90.0000th=[139118041], 95.0000th=[148240535],
     | 99.0000th=[151281366], 99.5000th=[152801782], 99.9000th=[157363029],
     | 99.9500th=[160403861], 99.9900th=[169526355]

Using the M.2 adapter:

sudo tsc=$(journalctl -k | grep 'tsc: Refined TSC clocksource calibration:' | awk '{print $11}')  taskset -c 0 t/io_uring -r30 -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n1 -t1 -T${tsc}000 /dev/nvme9n1 >> ./optane_m2_adapter_pci4
Using TSC rate 2694Hz
Added file /dev/nvme9n1 (submitter 0)
polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
submitter=0, tid=5335
IOPS=915K, BW=446MiB/s, IOS/call=32/31, inflight=(80)
IOPS=918K, BW=448MiB/s, IOS/call=32/31, inflight=(112)
IOPS=914K, BW=446MiB/s, IOS/call=32/32, inflight=(112)
IOPS=916K, BW=447MiB/s, IOS/call=32/32, inflight=(112)
IOPS=918K, BW=448MiB/s, IOS/call=32/32, inflight=(80)
IOPS=915K, BW=447MiB/s, IOS/call=32/31, inflight=(112)
IOPS=917K, BW=448MiB/s, IOS/call=32/32, inflight=(80)
IOPS=919K, BW=448MiB/s, IOS/call=32/31, inflight=(112)
IOPS=919K, BW=448MiB/s, IOS/call=32/32, inflight=(112)
IOPS=918K, BW=448MiB/s, IOS/call=32/32, inflight=(80)
IOPS=917K, BW=447MiB/s, IOS/call=32/31, inflight=(112)
IOPS=915K, BW=447MiB/s, IOS/call=31/32, inflight=(80)
IOPS=915K, BW=447MiB/s, IOS/call=32/31, inflight=(112)
IOPS=918K, BW=448MiB/s, IOS/call=32/32, inflight=(80)
IOPS=917K, BW=447MiB/s, IOS/call=32/32, inflight=(48)
IOPS=922K, BW=450MiB/s, IOS/call=31/31, inflight=(112)
IOPS=926K, BW=452MiB/s, IOS/call=32/32, inflight=(80)
IOPS=921K, BW=449MiB/s, IOS/call=32/32, inflight=(48)
IOPS=922K, BW=450MiB/s, IOS/call=32/31, inflight=(80)
IOPS=922K, BW=450MiB/s, IOS/call=32/31, inflight=(112)
IOPS=920K, BW=449MiB/s, IOS/call=32/32, inflight=(112)
IOPS=921K, BW=450MiB/s, IOS/call=32/32, inflight=(112)
IOPS=921K, BW=450MiB/s, IOS/call=32/32, inflight=(48)
IOPS=923K, BW=450MiB/s, IOS/call=31/31, inflight=(80)
IOPS=928K, BW=453MiB/s, IOS/call=32/32, inflight=(80)
IOPS=923K, BW=450MiB/s, IOS/call=31/31, inflight=(20)
IOPS=922K, BW=450MiB/s, IOS/call=32/32, inflight=(112)
IOPS=921K, BW=449MiB/s, IOS/call=32/32, inflight=(80)
IOPS=922K, BW=450MiB/s, IOS/call=31/31, inflight=(80)
Exiting on timeout
Maximum IOPS=928K
5335: Latency percentiles:
    percentiles (usec):
     |  1.0000th=[ 34399407],  5.0000th=[ 34779511], 10.0000th=[ 34779511],
     | 20.0000th=[ 66518189], 30.0000th=[ 67278397], 40.0000th=[ 68798813],
     | 50.0000th=[ 95406088], 60.0000th=[ 99587231], 70.0000th=[104148479],
     | 80.0000th=[110230142], 90.0000th=[136077209], 95.0000th=[145199704],
     | 99.0000th=[146720119], 99.5000th=[148240535], 99.9000th=[152801782],
     | 99.9500th=[155842614], 99.9900th=[175608018]

Has anybody else gotten over 1M IOPS on this mobo?

Ubuntu 20.04.1

For comparison, a single 980 Pro in a Hyper M.2 PCIe card (no RAID yet):

sudo tsc=$(journalctl -k | grep 'tsc: Refined TSC clocksource calibration:' | awk '{print $11}')  taskset -c 0 t/io_uring -r30 -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n1 -t1 -T${tsc}000 /dev/nvme3n1  >> ./980_pro_no_raid_yet 
Added file /dev/nvme3n1 (submitter 0)
polled=1, fixedbufs=1/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
submitter=0, tid=13189
IOPS=656K, BW=320MiB/s, IOS/call=31/31, inflight=(112)
IOPS=660K, BW=322MiB/s, IOS/call=32/32, inflight=(48)
IOPS=661K, BW=323MiB/s, IOS/call=32/31, inflight=(80)
IOPS=662K, BW=323MiB/s, IOS/call=32/32, inflight=(48)
IOPS=663K, BW=323MiB/s, IOS/call=31/31, inflight=(112)
IOPS=663K, BW=324MiB/s, IOS/call=32/32, inflight=(48)
IOPS=662K, BW=323MiB/s, IOS/call=32/31, inflight=(116)
IOPS=663K, BW=323MiB/s, IOS/call=32/32, inflight=(80)
IOPS=663K, BW=323MiB/s, IOS/call=32/31, inflight=(112)
IOPS=663K, BW=324MiB/s, IOS/call=32/31, inflight=(116)
IOPS=661K, BW=322MiB/s, IOS/call=32/32, inflight=(80)
IOPS=659K, BW=322MiB/s, IOS/call=32/31, inflight=(116)
IOPS=659K, BW=321MiB/s, IOS/call=32/32, inflight=(48)
IOPS=658K, BW=321MiB/s, IOS/call=32/31, inflight=(80)
IOPS=658K, BW=321MiB/s, IOS/call=32/32, inflight=(48)
IOPS=657K, BW=321MiB/s, IOS/call=32/32, inflight=(48)
IOPS=657K, BW=321MiB/s, IOS/call=31/31, inflight=(112)
IOPS=658K, BW=321MiB/s, IOS/call=32/32, inflight=(48)
IOPS=657K, BW=321MiB/s, IOS/call=32/31, inflight=(116)
IOPS=658K, BW=321MiB/s, IOS/call=32/32, inflight=(112)
IOPS=658K, BW=321MiB/s, IOS/call=32/32, inflight=(80)
IOPS=658K, BW=321MiB/s, IOS/call=32/31, inflight=(112)
IOPS=658K, BW=321MiB/s, IOS/call=32/32, inflight=(48)
IOPS=658K, BW=321MiB/s, IOS/call=32/31, inflight=(52)
IOPS=659K, BW=321MiB/s, IOS/call=32/31, inflight=(80)
IOPS=658K, BW=321MiB/s, IOS/call=32/31, inflight=(112)
IOPS=658K, BW=321MiB/s, IOS/call=32/31, inflight=(116)
IOPS=658K, BW=321MiB/s, IOS/call=32/32, inflight=(80)
IOPS=657K, BW=321MiB/s, IOS/call=32/32, inflight=(48)
Exiting on timeout
Maximum IOPS=663K
13189: Latency percentiles:
    percentiles (usec):
     |  1.0000th=[ 47322940],  5.0000th=[ 49033408], 10.0000th=[ 50553824],
     | 20.0000th=[ 93885672], 30.0000th=[ 96166296], 40.0000th=[ 99587231],
     | 50.0000th=[137597625], 60.0000th=[142158872], 70.0000th=[146720119],
     | 80.0000th=[152801782], 90.0000th=[193853007], 95.0000th=[199174462],
     | 99.0000th=[205256125], 99.5000th=[205256125], 99.9000th=[211337788],
     | 99.9500th=[214378620], 99.9900th=[226541946]

I don't have much to say, but I'm lurking the thread since I've ordered a P5800X for my TR 3945. I hope you can find a fix!

I may be speaking out of ignorance, but I thought there was some Intel secret sauce that allowed this to work well, and that in the non-Windows world it's just seen as a normal disk, nothing special.

I've heard that too. I also have my Xeon Ice Lake parts arriving within a few weeks, and I want to test Optane on both systems.

If you want to make sure the PCIe link is error-free, do a sequential test. If it's PCIe 3.0 it'll cap out at about 4 GB/s.
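You can also read the negotiated link straight from sysfs or lspci; a quick sketch, assuming the Optane enumerates as nvme9 here (adjust to your controller):

# 8 GT/s = PCIe 3.0, 16 GT/s = PCIe 4.0
cat /sys/class/nvme/nvme9/device/current_link_speed
cat /sys/class/nvme/nvme9/device/current_link_width
# Or via lspci: LnkSta is the live link, LnkCap the device maximum.
sudo lspci -vv -s "$(basename "$(readlink -f /sys/class/nvme/nvme9/device)")" | grep -E 'LnkCap:|LnkSta:'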

The thread here has the command I ran on the P5800X. You may need to add more threads or something.

Thanks @wendell! But …

The thread here has the command I ran on the P5800X. You may need to add more threads or something.

Doh, which thread / command? I'm interested to try your benchmark. I also stood up a RAID of M.2s to compare, inspired by that Optane guy on the YouTubes.

OK, I found a random benchmark online, and it looks like perhaps the io_uring engine just isn't working right on my Ubuntu install? One of the fio maintainers suggested using that specific benchmark, but when I run the one below with libaio I'm getting about 5.6 GB/s sequential, and while many of the reads are under 100 microseconds, I thought it was supposed to be closer to 10-15 microseconds for some of these. (An io_uring-engine variant to compare against is sketched after the output below.)

sudo fio --loops=5 --size=1000m --filename=/media/optane/tasty.tmp --stonewall --ioengine=libaio --direct=1 --name=Seqread --bs=1m --rw=read   --name=Seqwrite --bs=1m --rw=write   --name=512Kread --bs=512k --rw=randread   --name=512Kwrite --bs=512k --rw=randwrite   --name=4kQD32read --bs=4k --iodepth=32 --rw=randread   --name=4kQD32write --bs=4k --iodepth=32 --rw=randwrite
Seqread: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
Seqwrite: (g=1): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
512Kread: (g=2): rw=randread, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=1
512Kwrite: (g=3): rw=randwrite, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=1
4kQD32read: (g=4): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
4kQD32write: (g=5): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.29-31-g2b3d
Starting 6 processes
Seqread: Laying out IO file (1 file / 1000MiB)
Jobs: 1 (f=1): [_(5),w(1)][100.0%][w=870MiB/s][w=223k IOPS][eta 00m:00s]                        
Seqread: (groupid=0, jobs=1): err= 0: pid=1108700: Thu Jan 27 18:19:23 2022
  read: IOPS=5376, BW=5376MiB/s (5638MB/s)(5000MiB/930msec)
    slat (usec): min=24, max=492, avg=25.54, stdev= 6.89
    clat (usec): min=123, max=187, avg=159.98, stdev= 1.82
     lat (usec): min=182, max=632, avg=185.59, stdev= 6.56
    clat percentiles (usec):
     |  1.00th=[  157],  5.00th=[  159], 10.00th=[  159], 20.00th=[  159],
     | 30.00th=[  159], 40.00th=[  159], 50.00th=[  159], 60.00th=[  161],
     | 70.00th=[  161], 80.00th=[  161], 90.00th=[  161], 95.00th=[  163],
     | 99.00th=[  165], 99.50th=[  169], 99.90th=[  176], 99.95th=[  182],
     | 99.99th=[  188]
   bw (  MiB/s): min= 5381, max= 5381, per=100.00%, avg=5381.90, stdev= 0.00, samples=1
   iops        : min= 5381, max= 5381, avg=5381.00, stdev= 0.00, samples=1
  lat (usec)   : 250=100.00%
  cpu          : usr=0.00%, sys=15.72%, ctx=5002, majf=0, minf=268
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=5000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
Seqwrite: (groupid=1, jobs=1): err= 0: pid=1108707: Thu Jan 27 18:19:23 2022
  write: IOPS=4690, BW=4690MiB/s (4918MB/s)(5000MiB/1066msec); 0 zone resets
    slat (usec): min=25, max=235, avg=28.54, stdev= 4.03
    clat (usec): min=126, max=236, avg=183.95, stdev=19.11
     lat (usec): min=174, max=427, avg=212.57, stdev=19.36
    clat percentiles (usec):
     |  1.00th=[  149],  5.00th=[  159], 10.00th=[  163], 20.00th=[  167],
     | 30.00th=[  169], 40.00th=[  174], 50.00th=[  180], 60.00th=[  192],
     | 70.00th=[  198], 80.00th=[  202], 90.00th=[  208], 95.00th=[  219],
     | 99.00th=[  229], 99.50th=[  231], 99.90th=[  233], 99.95th=[  235],
     | 99.99th=[  237]
   bw (  MiB/s): min= 4701, max= 4701, per=100.00%, avg=4701.43, stdev= 0.00, samples=1
   iops        : min= 4701, max= 4701, avg=4701.00, stdev= 0.00, samples=1
  lat (usec)   : 250=100.00%
  cpu          : usr=2.16%, sys=12.02%, ctx=5004, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
512Kread: (groupid=2, jobs=1): err= 0: pid=1108712: Thu Jan 27 18:19:23 2022
  read: IOPS=9049, BW=4525MiB/s (4745MB/s)(5000MiB/1105msec)
    slat (usec): min=13, max=272, avg=13.59, stdev= 2.97
    clat (usec): min=71, max=131, avg=96.43, stdev= 1.47
     lat (usec): min=105, max=366, avg=110.09, stdev= 3.10
    clat percentiles (usec):
     |  1.00th=[   94],  5.00th=[   96], 10.00th=[   96], 20.00th=[   96],
     | 30.00th=[   96], 40.00th=[   96], 50.00th=[   96], 60.00th=[   96],
     | 70.00th=[   97], 80.00th=[   97], 90.00th=[   97], 95.00th=[   99],
     | 99.00th=[  102], 99.50th=[  105], 99.90th=[  111], 99.95th=[  115],
     | 99.99th=[  130]
   bw (  MiB/s): min= 4528, max= 4531, per=100.00%, avg=4529.62, stdev= 1.96, samples=2
   iops        : min= 9056, max= 9062, avg=9059.00, stdev= 4.24, samples=2
  lat (usec)   : 100=97.83%, 250=2.17%
  cpu          : usr=0.63%, sys=14.22%, ctx=10002, majf=0, minf=139
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=10000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
512Kwrite: (groupid=3, jobs=1): err= 0: pid=1108725: Thu Jan 27 18:19:23 2022
  write: IOPS=8680, BW=4340MiB/s (4551MB/s)(5000MiB/1152msec); 0 zone resets
    slat (nsec): min=12514, max=69411, avg=14155.44, stdev=1903.93
    clat (usec): min=68, max=150, avg=100.51, stdev=11.86
     lat (usec): min=95, max=179, avg=114.74, stdev=12.00
    clat percentiles (usec):
     |  1.00th=[   83],  5.00th=[   89], 10.00th=[   91], 20.00th=[   92],
     | 30.00th=[   93], 40.00th=[   94], 50.00th=[   95], 60.00th=[   97],
     | 70.00th=[  103], 80.00th=[  114], 90.00th=[  122], 95.00th=[  125],
     | 99.00th=[  128], 99.50th=[  129], 99.90th=[  133], 99.95th=[  135],
     | 99.99th=[  143]
   bw (  MiB/s): min= 4329, max= 4356, per=100.00%, avg=4342.91, stdev=19.67, samples=2
   iops        : min= 8658, max= 8713, avg=8685.50, stdev=38.89, samples=2
  lat (usec)   : 100=65.92%, 250=34.08%
  cpu          : usr=2.00%, sys=11.56%, ctx=10001, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
4kQD32read: (groupid=4, jobs=1): err= 0: pid=1108729: Thu Jan 27 18:19:23 2022
  read: IOPS=322k, BW=1258MiB/s (1320MB/s)(5000MiB/3973msec)
    slat (nsec): min=1322, max=163529, avg=2530.62, stdev=954.33
    clat (usec): min=6, max=262, avg=96.44, stdev= 6.85
     lat (usec): min=9, max=267, avg=99.04, stdev= 6.97
    clat percentiles (usec):
     |  1.00th=[   90],  5.00th=[   90], 10.00th=[   91], 20.00th=[   91],
     | 30.00th=[   92], 40.00th=[   95], 50.00th=[   97], 60.00th=[   98],
     | 70.00th=[   98], 80.00th=[   99], 90.00th=[  101], 95.00th=[  109],
     | 99.00th=[  117], 99.50th=[  133], 99.90th=[  167], 99.95th=[  174],
     | 99.99th=[  186]
   bw (  MiB/s): min= 1244, max= 1264, per=100.00%, avg=1259.66, stdev= 6.81, samples=7
   iops        : min=318615, max=323708, avg=322471.86, stdev=1743.53, samples=7
  lat (usec)   : 10=0.01%, 20=0.01%, 50=0.01%, 100=88.12%, 250=11.87%
  lat (usec)   : 500=0.01%
  cpu          : usr=17.80%, sys=82.10%, ctx=382, majf=0, minf=44
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1280000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
4kQD32write: (groupid=5, jobs=1): err= 0: pid=1108751: Thu Jan 27 18:19:23 2022
  write: IOPS=222k, BW=869MiB/s (911MB/s)(5000MiB/5757msec); 0 zone resets
    slat (nsec): min=1392, max=218863, avg=1979.21, stdev=1364.32
    clat (nsec): min=632, max=386831, avg=140038.81, stdev=18260.31
     lat (usec): min=7, max=388, avg=142.11, stdev=18.49
    clat percentiles (usec):
     |  1.00th=[  119],  5.00th=[  128], 10.00th=[  133], 20.00th=[  135],
     | 30.00th=[  135], 40.00th=[  137], 50.00th=[  137], 60.00th=[  141],
     | 70.00th=[  141], 80.00th=[  143], 90.00th=[  145], 95.00th=[  151],
     | 99.00th=[  255], 99.50th=[  260], 99.90th=[  273], 99.95th=[  277],
     | 99.99th=[  314]
   bw (  KiB/s): min=876816, max=904656, per=100.00%, avg=890629.09, stdev=11169.13, samples=11
   iops        : min=219206, max=226164, avg=222657.27, stdev=2791.80, samples=11
  lat (nsec)   : 750=0.01%
  lat (usec)   : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=98.71%
  lat (usec)   : 500=1.28%
  cpu          : usr=13.20%, sys=58.44%, ctx=1140823, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1280000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=5376MiB/s (5638MB/s), 5376MiB/s-5376MiB/s (5638MB/s-5638MB/s), io=5000MiB (5243MB), run=930-930msec

Run status group 1 (all jobs):
  WRITE: bw=4690MiB/s (4918MB/s), 4690MiB/s-4690MiB/s (4918MB/s-4918MB/s), io=5000MiB (5243MB), run=1066-1066msec

Run status group 2 (all jobs):
   READ: bw=4525MiB/s (4745MB/s), 4525MiB/s-4525MiB/s (4745MB/s-4745MB/s), io=5000MiB (5243MB), run=1105-1105msec

Run status group 3 (all jobs):
  WRITE: bw=4340MiB/s (4551MB/s), 4340MiB/s-4340MiB/s (4551MB/s-4551MB/s), io=5000MiB (5243MB), run=1152-1152msec

Run status group 4 (all jobs):
   READ: bw=1258MiB/s (1320MB/s), 1258MiB/s-1258MiB/s (1320MB/s-1320MB/s), io=5000MiB (5243MB), run=3973-3973msec

Run status group 5 (all jobs):
  WRITE: bw=869MiB/s (911MB/s), 869MiB/s-869MiB/s (911MB/s-911MB/s), io=5000MiB (5243MB), run=5757-5757msec

Disk stats (read/write):
  nvme9n1: ios=1360000/1324507, merge=0/3, ticks=14360/16044, in_queue=30403, util=89.31%
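For comparison, the same 4k QD32 random-read job can be pointed at fio's io_uring engine directly, which is a quicker way to tell whether the engine (rather than the drive or the link) is the limit. A rough sketch reusing the same test file; note that --hipri (polled completions, like t/io_uring's -p1) needs NVMe poll queues enabled (nvme.poll_queues on the kernel command line), so drop it if the run errors out:

# 4k QD32 random read with the io_uring engine instead of libaio;
# --registerfiles / --fixedbufs mirror t/io_uring's -F1 / -B1.
sudo fio --name=4kQD32read-uring --filename=/media/optane/tasty.tmp --size=1000m \
    --direct=1 --ioengine=io_uring --registerfiles --fixedbufs --hipri \
    --bs=4k --iodepth=32 --rw=randread --loops=5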

I tried the same benchmark with a couple of quad-NVMe RAID0s, but I got some suspect results; going to check the BIOS and run again.
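For reference, a minimal sketch of how a quad-NVMe RAID0 like that might be stood up with mdadm; the member device names and the chunk size here are placeholders, not the exact array from this system:

# Four NVMe drives striped into one md device (128 KiB chunk); adjust devices to match.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=128 \
    /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1
cat /proc/mdstat   # confirm the array is up before pointing the benchmark at /dev/md0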

t/io_uring

I'm not quite sure what is going on here. This looks like a program that taskset is launching, but where are you getting it from? I don't really understand the t/ part with my limited Linux knowledge; that doesn't look like any location I've seen.

Ah, figured it out

git clone https://github.com/axboe/fio.git
cd fio
./configure
make
make install

Then ~/fio/t/io_uring works
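For anyone else decoding the invocation used earlier in the thread, here's the same command annotated. This is my reading of the flags (double-check against t/io_uring's own usage output), with the target device as a placeholder:

# -r30: run for 30 seconds        -b512: 512-byte blocks
# -d128: queue depth 128          -s32 / -c32: submit/reap 32 IOs per call
# -p1: polled I/O                 -F1: registered files   -B1: fixed buffers
# -n1: one submitter thread       -t1: track latencies    -T<rate>: TSC rate for the latency conversion
sudo taskset -c 0 t/io_uring -r30 -b512 -d128 -s32 -c32 -p1 -F1 -B1 -n1 -t1 /dev/nvme9n1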

To put the "not getting full Optane performance" question to bed: I was able to reproduce the 1.5M IOPS 4KB Q8T8 result that @wendell posted here: How fast is a P5800X Optane?

I used this script: https://gist.github.com/snizovtsev/1e57da4bf1cdae16599d59619600ee07 with SIZE restored to 1024. fio --version reports fio-3.29-31-g2b3d. I think Wendell might have a different version of fio, but I got comparable results.

It looks like the issue was mostly just reproducing the right test for apples-to-apples numbers, though the correct cable did help get full PCIe 4.0 performance from the P5800X.
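The Q8T8 shorthand is just eight jobs at queue depth 8, so the read leg of that test boils down to roughly the job below (the gist's exact engine and flags may differ, and the device name is a placeholder; --readonly keeps it from ever writing):

# 4 KiB random read, QD8 x 8 jobs, aggregated with group_reporting.
sudo fio --name=4kQ8T8read --filename=/dev/nvme9n1 --readonly --direct=1 \
    --ioengine=libaio --bs=4k --iodepth=8 --numjobs=8 --rw=randread \
    --runtime=30 --time_based --group_reporting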

I'm not sure if Allyn from Intel would like this (Intel Optane and a Whole lotta IOPS: A Chat with Allyn Malventano - YouTube), but I'm really seeing the value of NVMe RAID with consumer drives over Optane. 4KB Q8T8 for NVMe RAID is at least the same order of magnitude, and it's hard for me to think of even a modern database workload that couldn't benefit from that parallelism. Latency can definitely be a big deal for niche stuff (e.g. financials), and Optane's durability is unrivaled.

I think for Optane to be more competitive, they need to release the drives in the M.2 form factor and in smaller sizes. You need to be able to get a higher ratio of Optane per PCIe lane. Maybe this is in the works and Intel was waiting to release CPUs / chipsets for PCIe 5.0? Maybe the VROC team is a little behind?

The Samsung consumer 980 Pro / 970 EVO Plus have also fallen quite a bit in price lately despite the NAND shortage. For the retail price of a 1.6TB P5800X, or about 4x 400GB P5800Xs, today you can get roughly 14TB of Samsung 980 Pro at retail. It's hard to argue with ~10x the space when the performance (in RAID0) would be pretty close for most software.

I have a few other machine learning workloads to try on the P5800X, but for a lot of use cases (most deep learning, NAS, databases, even large-scale virtualization?) it seems like NVMe is the sweet spot. I'm hoping to see opportunities to put more PCIe lanes behind Optane; it would become a much more appealing expansion versus memory.

Interesting Optane vs DDR4 RAM disk comparison:

  • 4KB IOPS is roughly the same order of magnitude as the RAM disk, unlike NVMe.
  • Sequential read is actually higher than the RAM disk, but maybe that's just fio-on-tmpfs weirdness.

Optane 800GB P5800X

Sequential Read: 6368MB/s IOPS=6
Sequential Write: 4834MB/s IOPS=4

512KB Read: 4571MB/s IOPS=9142
512KB Write: 4346MB/s IOPS=8692

Sequential Q32T1 Read: 6320MB/s IOPS=197
Sequential Q32T1 Write: 5300MB/s IOPS=165

4KB Read: 463MB/s IOPS=118746
4KB Write: 388MB/s IOPS=99538

4KB Q32T1 Read: 1257MB/s IOPS=321965
4KB Q32T1 Write: 874MB/s IOPS=223939

4KB Q8T8 Read: 6097MB/s IOPS=1560904
4KB Q8T8 Write: 4954MB/s IOPS=1268395

For comparison, 4x Samsung 970 Evo Plus in mdadm RAID0:

Sequential Read: 9481MB/s IOPS=9
Sequential Write: 9922MB/s IOPS=9

512KB Read: 2600MB/s IOPS=5200
512KB Write: 2587MB/s IOPS=5174

Sequential Q32T1 Read: 9715MB/s IOPS=303
Sequential Q32T1 Write: 12337MB/s IOPS=385

4KB Read: 65MB/s IOPS=16734
4KB Write: 250MB/s IOPS=64068

4KB Q32T1 Read: 1328MB/s IOPS=340181
4KB Q32T1 Write: 1116MB/s IOPS=285933

4KB Q8T8 Read: 3665MB/s IOPS=938364
4KB Q8T8 Write: 5595MB/s IOPS=1432506

4x 980 Pro RAID 0:

Sequential Read: 11962MB/s IOPS=11
Sequential Write: 13726MB/s IOPS=13

512KB Read: 2831MB/s IOPS=5663
512KB Write: 3610MB/s IOPS=7221

Sequential Q32T1 Read: 11130MB/s IOPS=347
Sequential Q32T1 Write: 12736MB/s IOPS=398

4KB Read: 90MB/s IOPS=23189
4KB Write: 291MB/s IOPS=74650

4KB Q32T1 Read: 1153MB/s IOPS=295340
4KB Q32T1 Write: 790MB/s IOPS=202365

4KB Q8T8 Read: 4134MB/s IOPS=1058490
4KB Q8T8 Write: 4693MB/s IOPS=1201456

Ramdisk (8x DDR4):

Sequential Read: 4413MB/s IOPS=4
Sequential Write: 4684MB/s IOPS=4

512KB Read: 8722MB/s IOPS=17444
512KB Write: 5412MB/s IOPS=10824

Sequential Q32T1 Read: 4417MB/s IOPS=138
Sequential Q32T1 Write: 4688MB/s IOPS=146

4KB Read: 3080MB/s IOPS=788640
4KB Write: 2583MB/s IOPS=661311

4KB Q32T1 Read: 3108MB/s IOPS=795822
4KB Q32T1 Write: 2442MB/s IOPS=625343

4KB Q8T8 Read: 23644MB/s IOPS=6052940
4KB Q8T8 Write: 3015MB/s IOPS=771917
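For anyone wanting to repeat the RAM-disk leg: those numbers appear to come from running the same test against a tmpfs mount (per the fio-on-tmpfs note above). A minimal sketch of that setup; the size and mountpoint are placeholders:

# Create a tmpfs-backed RAM disk and point the benchmark's test file at it.
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=16G tmpfs /mnt/ramdisk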
