Toward 20 Million I/Ops - Part 1: Threadripper Pro - Work in Progress DRAFT

wendell · April 30, 2021, 3:11am

Is 20 million I/Ops possible on a desktop?

This is a write up for a future video from Level1techs and currently a place I’m gathering my numbers and tests. Work in process. Video soon! Probably around the first week of May 2021

Threadripper Pro to the rescue! Maybe!

We’re talking about random reads. Just a few years ago, regular NAND flash would top out at about 400,000 I/Ops per device, and most devices were more in the 250-350k range.

These days with top-end drives we’re seeing more. Much More.

Of course Intel is at the top end with it’s just recently released P5800x. However the cost per unit storage is astronomical as compared with NAND (and they do not use NAND storage under the hood).

# Intel Optane P5800X, 6 Threads, I/O Monster
...
fio-3.21
Starting 6 processes
Jobs: 6 (f=6): [r(6)][100.0%][r=1264MiB/s][r=2588k IOPS][eta 00m:00s]
onessd: (groupid=0, jobs=6): err= 0: pid=210497: Thu Apr 29 22:30:56 2021
  read: IOPS=2589k, BW=1264MiB/s (1326MB/s)(37.0GiB/30001msec)
   bw (  MiB/s): min= 1244, max= 1282, per=100.00%, avg=1266.10, stdev= 1.47, samples=354
   iops        : min=2547804, max=2626604, avg=2592965.76, stdev=3016.38, samples=354
  cpu          : usr=18.51%, sys=60.26%, ctx=504, majf=0, minf=36
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=77674292,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=1264MiB/s (1326MB/s), 1264MiB/s-1264MiB/s (1326MB/s-1326MB/s), io=37.0GiB (39.8GB), run=30001-30001msec

Disk stats (read/write):
  nvme0n1: ios=77343935/0, merge=0/0, ticks=375255/0, in_queue=375255, util=99.98%

It has sequential read speeds of 7.8 gigabytes per second and random queue depth 1 reads are about an order of magnitude faster than the fastest nand-flash based device (under heavy loads). That translates to about 2.5 million IOPs in the best case scenario (!!!) on the 800gb model that I ordered.

Of course, it was also pricey, at around $1.50 per gigabyte.

If just one drive gets us 10% of the way there – imagine what could be done with ten of these drives! But (nervous chuckles) Level1 isn’t doing that well! This is CrazyTown Expensive. Use Sparingly!

Great for SLOGS, great for write cache, great for SQL transaction logs. But, like salt, don’t over do it or you’ll have a heart attack.

Enter Kioxia.

# Kioxia CM6-v

onessd: (g=0): rw=randread, bs=(R) 512B-512B, (W) 512B-512B, (T) 512B-512B, ioengine=io_uring, iodepth=32
...
fio-3.21
Starting 4 processes
Jobs: 4 (f=4): [r(4)][100.0%][r=695MiB/s][r=1424k IOPS][eta 00m:00s]
onessd: (groupid=0, jobs=4): err= 0: pid=230113: Thu Apr 29 23:00:46 2021
  read: IOPS=1423k, BW=695MiB/s (729MB/s)(20.4GiB/30001msec)
   bw (  KiB/s): min=707939, max=715554, per=100.00%, avg=712508.44, stdev=488.95, samples=236
   iops        : min=1415878, max=1431108, avg=1425016.85, stdev=977.89, samples=236
  cpu          : usr=14.22%, sys=65.89%, ctx=375, majf=0, minf=22
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=42693560,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=695MiB/s (729MB/s), 695MiB/s-695MiB/s (729MB/s-729MB/s), io=20.4GiB (21.9GB), run=30001-30001msec

Disk stats (read/write):
  nvme11n1: ios=42501644/0, merge=0/0, ticks=3614651/0, in_queue=3614651, util=99.87%

I only have exactly one CM6 drive from Kioxia. It’s fast.

They can manage about 1.3-1.4 million random read IOPs, per drive (depending on capacity), under the most ideal conditions. This is class-leading speed for NAND. And they’re around 40-50 cents per gigabyte (in large capacities), which makes them ideal for bulk storage.

I also have several CD6 and CM5 from Kioxa – these are more bulk-storage oriented parts. I borrowed a few of these drives from Kioxia. We can expect them to operate around 770k iops.

We might not be able to hit 2 million i/ops per device, but we can make up for that with the sheer number of devices.

And, just for Giggles, I’m also testing the Samsung 980 Pro. These NAND-based devices can manage sequential I/O parity with the Intel P5800x, and their queue depth 1 performance is among the best of nand-based devices. (But even that still can’t touch the Optane P5800x. )

I have 8 of these to throw into the mix as needed.

# Samsung 980 Pro, following "warmup" prior to initial drive use. 
...
fio-3.21
Starting 6 processes
Jobs: 6 (f=6): [r(6)][100.0%][r=538MiB/s][r=1103k IOPS][eta 00m:00s]
onessd: (groupid=0, jobs=6): err= 0: pid=233494: Thu Apr 29 23:06:54 2021
  read: IOPS=1095k, BW=535MiB/s (561MB/s)(15.7GiB/30001msec)
   bw (  KiB/s): min=538404, max=552030, per=100.00%, avg=548130.22, stdev=366.06, samples=354
   iops        : min=1076810, max=1104060, avg=1096260.68, stdev=732.08, samples=354
  cpu          : usr=6.65%, sys=83.98%, ctx=270, majf=0, minf=36
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=32843188,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=15.7GiB (16.8GB), run=30001-30001msec

Disk stats (read/write):
  nvme16n1: ios=32678677/0, merge=0/0, ticks=5566667/0, in_queue=5566667, util=99.77%

Still, at 1.1 million iops, these Samsung drives still offer class-leading nand performance. They’ll hurt your pocket book a bit more at north of 45 cents/gigabyte, though. For the system configuration, I’ve gone as high as 29 NVMe – 116 PCIe lanes dedicated to storage. This system is DENSE. In reality, no one would ever do this, and it is stretching the limits of what Infinity Fabric is capable of.

There is specialty equipment out there like the Liqid Honeybadger which makes more sense – 8 m.2 NVMe slots behind a high-speed PLX bridge. One or Two of these x16 monsters is all you’d ever really want to put in a system.
For me? I’m just pushing the envelope to see what happens.

Process - Working out the Initial Config

With the system, one of the initial configurations we had was:

1x Intel P5800x
3x Intel P4801

1x Kioxia CM6-v
7x Kioxia CD6

In reality, I went through many many permutations of this hardware. Initially, I was having trouble Booting TR Pro with more than 10 NVMe installed. However, after an update from Asus on the SAGE E WX80, I was was able to boot and troubleshoot further. All of my Aorus-branded Add-in Cards that split one x16 PCIe slot to 4 * x4 PCIe slots were causing tons of PCIe errors. This was somehow related to the hard system locks.

In and of itself this is very curious – The Aorus-branded cards feature PCIe4 redrivers. The Asus AIC (which worked fine) does not have redrivers and works fine. It figures Asus would have tested their motherboard with their own add-in card (and this motherboard comes with one). However this took some time to troubleshoot.

I also have a limited number of PCIe4 to U.2 adapters (that work at PCIE4 speeds). PCIE4 cables, risers and splitters are still uncommon and it can lead to headache.

In testing the Kioxia CM6 SSD, PCIE4 vs PCI3 made the difference betwen ~800k iops and ~1.35 million iops, which is quite a large difference. Of course maximum theoretical sequential transfers are also cut from 8 gigabytes/sec on PCIE4 to 4 gigabytes/sec on PCIE3.

As such, I went through several iterations and configs. I shared some of my progress with our subscribers on on Patreon and Floatplane.

Process – Working, but slower than expected

In setting this up, I settled on the fio utility. I have some scripts for it that I have posted here which exactly re-create the numbers one can expect from common utlities like CrytalDiskMark and I find it useful for testing these kinds of storage systems at scale. Building Actual Real Things one should always test to be sure exactly what reality is. Assume nothing.

The tests are posted above – but restarted

1x Intel P5800x – 2.5 million iops (wow!)
3x Intel P4801 – ~800k iops

1x Kioxia CM6-v – ~1.4 million iops
7x Kioxia CD6 – ~770k iops, each

Testing each drive individually in the working setup (some PCIE4 errors detected, but not a huge amount, and they were all corrected. PCIE4 is fast, ok?)

As I said in the video, if we do the math, that’s 11.6 million iops. But with the fio test? The output was 6.8 million.

What’s going on? This has already been a black hole of time. Ulgh.

Process – Troubleshooting

The first clue was in the output of lstopo. This is a handy command that will graphically show you how stuff in your system is connected. It’s a logical diagram (not a physical diagram) so you can understand which devices are tied to what pcie root complexes and this will help visualize the arrangement of things in the system.

Here we can see that all my NVMe is hanging off one PCIe root complex. This isn’t good because it can be a bit of a bottleneck. It takes so many fio threads to fully saturate each device that many i/o requests end up having to traverse the infinity fabric and everything gets all jammed up on this one node. Think hallway analogy from the video.

So, just physically re-arranging the cards gives me a little more headroom by redistributing the traffic across multiple root complexes. (Instead of everyone going into and out of the hallway at one end, we can now go in and out of the hallway at both ends and in the middle.)

AMD/Asus also provide two important tunables in the system bios:

Tunable - Preferred I/O

This one I don’t see documented very much but these root complexes have numbers. That’s the I/O bus number. You could actually pick one to priortize the traffic from this root complex will be prioritized across infinity fabric. In the liqid honey badger use case, this would be the fix for you. The reason is because you have this one single device that can deliver the full PCIE4 bandwidth and shuffling the card around the system isn’t going to help you (unless you have other devices also generating traffic on that node). In our case I will leave this alone for now (though one could argue I should prioritize the node that has all the Optane on it just because Optane is So Darn Fast).

Tunable – NPS1, NPS2, NPS3, NPS4.

Buried in the memory controller options is a very important tunable. By default TR Pro presents to the system as a single unified NUMA node. This is the best way. Astute thinkers may realize that the chiplet architecture of Threadripper (Pro, Epyc and even most Ryzen desktop CPUs) that this isn’t technically true. Chiplets that are farther from the I/O die are going to have slightly different performance. The time it takes to get data from CPU complex to CPU complex (and from I/O complex to I/O complex and from memory controller to memory controller (!!)) is also going to be somewhat variable. Imagine the scenario where a chiplet is doing I/O on the farthest possible memory controller and I/O node on the I/O die. Not only is this sub-optimal, it potentially creates a storm of traffic on infinity fabric which would negatively impact other data traffic that needs to get from A to B. If the system were smarter about which core, memory controller and I/O node were closer and which were father, the process scheduler could make smarter choices about where to run code.

I set NPS4 chasing more than 14 million I/OPs. With that, and careful CPU pinning, I was able to get from ~14.1 million iops to 15 million I/OPs, but overall I don’t think it would be worth it to change from the default NPS1 (even for this scenario).

From 6.8 million iops to 10 million to 15 million iops

TODO

Isolating just NPS4 Benefits

So here’s the same system and benchmark with NPS1:

--procs--- ----total-usage---- ------memory-usage----- --io/total- ----system----
run blk new|usr sys idl wai stl| used  free  buf   cach| read  writ|     time     
 67   0    |                   |  22G  477G   63M   21G|           |09-05 12:42:59
 67   0 9.0| 32  68   0   0   0|  22G  477G   63M   21G|13.6M    0 |09-05 12:43:00
 67   0 6.5| 32  68   0   0   0|  22G  477G   63M   21G|13.6M 3.00 |09-05 12:43:01
 67   0  10| 32  68   0   0   0|  22G  477G   63M   21G|13.6M    0 |09-05 12:43:02
 68   0 4.0| 31  68   0   0   0|  22G  477G   63M   21G|13.8M    0 |09-05 12:43:03
 67   0 9.0| 31  69   0   0   0|  22G  477G   63M   21G|13.9M    0 |09-05 12:43:04
 68   0 5.0| 32  68   0   0   0|  22G  477G   63M   21G|13.9M    0 |09-05 12:43:05
3.0   0  72| 27  63   9   0   0|  20G  480G   63M   20G|12.1M    0 |09-05 12:43:06

vs. the 14.8-15.2 million IOPs we saw in NPS4. In reality what’s happening here is the kernel and interrupt machinery is tending to have cores physically more near the SSDs service the interrupts. That better distributes the I/O load across infinity fabric, and that’s good for another million IOPs or so. Your specific application may suffer in other ways from NPS1 vs NPS4, so it’s up to you to test. Generally I recommend sticking with NPS1 in the real world. Still, this is useful knowledge in understanding how it works behind the curtain.

What else is there for tuning? What about _______ ?

Look, I am still chasing that 20 million iops white whale. I haven’t given up. Will Zen3 do it? Epyc Milan? I’m 95% sure it’ll be no problem on Dual Socket Epyc (whether that’s Rome or Milan, sounds like a future video…)… but one socket? I’m not so sure. It is true there will be less infinity fabric traffic for core-to-core communications on the same chiplet. Hmmm…

For the software end of things we have here, this is kernel 5.12. And a recent distro (Fedora 33/34 for this testing). It is already pretty well-tuned out of the box. I love that! Linux has really moved as fast as the state of the art here and that’s great for users.

It uses the new hybrid polling right out of the box (low cpu overhead, low latency, and less overhead than a storm of interrupts in this kind of workload). Technically that is ‘none’ for the i/o scheduler.

The reason is that a proper i/o scheduler has cpu overhead. We’re on the order of about 10k cpu cycles to service a 4k I/O request with any of these devices. At ~4.2GHz this is a vanishingly small amount of time to worry about anything. ‘None’ really does make the most sense here.

It is possible to go straight polling and maybe improve the overall latency, but the CPU usage would then be astronomical. It doesn’t make good logical sense to do this for real-world usage. Hybrid polling in the Linux Kernel really is the cleanest and most brilliant solution here.

Read-ahead is, maybe, another thing that could go away, or not. In reality I think you’d want it.
smartctl -a /dev/nvmeX and read ahead can be disabled. Conventional wisdom is that it isn’t needed for SSD-based systems but I’ve found there is so little penalty from reading “a little bit more” on fast disks that it mostly doesn’t matter. Still, for benchmarking purposes, we can disable it here and maybe squeeze out a few extra i/ops.

Of course the CPU performance governor should be set to performance. It’s probably not worth checking, but be aware that nvme have different power modes. In at least one instance in my career I’ve seen either the backplane or bios of the system set the NVMe to a lower power teir (which was worse performance). The NVME CLI utilities on Linux will unambiguously tell you about those variables.

Conculsion

As for now, the question remains unanswered. 10+ million? E-Z. 15 Million? Achieved, but we’re starting to see that exponential fall-off curve and we’re having to look at the physical system topology to make sure NVMe devices are evenly distributed across all our possible PCIe resources. The CPU usage is kind of high and we might be looking at excessive spinlock contention within the kernel. Further investigation is required. Stay tuned for part 2! And if you have any ideas for performance tuning, let me know.

# Fifteen high-speed PCIE4 NVMe devices + One P5800X Optane

# dstat -pcmrt 

---procs--- ----total-usage---- ------memory-usage----- --io/total- ----system----
run blk new|usr sys idl wai stl| used  free  buf   cach| read  writ|     time
 85   0 9.0| 32  67   0   0   0|  23G  476G  168M   21G|14.1M  142 |29-04 22:53:30
 85   0 4.0| 33  68   0   0   0|  23G  476G  168M   21G|14.1M    0 |29-04 22:53:31
 85   0 9.0| 33  67   0   0   0|  23G  476G  168M   21G|14.1M    0 |29-04 22:53:32
 85   0 3.0| 33  67   0   0   0|  23G  476G  168M   21G|14.1M    0 |29-04 22:53:33

14.1 million IOPs! wow!

Of course, I’m not doing anything useful. Yet.

I need to know what I’m up against, in terms of main memory bandwidth. Do I have enough storage to saturate how fast the processor can do stuff with main memory? Let’s find out:

# sysbench --test=memory --memory-block-size=8M --memory-total-size=400G --num-threads=32

409600.00 MiB transferred (111097.22 MiB/sec)

General statistics:
    total time:                          3.6861s
    total number of events:              51200

Latency (ms):
         min:                                    0.29
         avg:                                    2.24
         max:                                   15.62
         95th percentile:                        3.07
         sum:                               114538.23

Threads fairness:
    events (avg/stddev):           1600.0000/0.00
    execution time (avg/stddev):   3.5793/0.08

So it looks like yes, that’s something to worry about. While on the EPYC server platform I could potentially have two sockets and 16 memory channels to work with, Threadripper Pro is 8 memory channels. That’s a theoretical 200 gigabytes/sec at 3200MT/s, but real world? This is what you see from sysbench. ~115 gigabytes/sec. I can work with this, though.

Update: After a bit of tuning

BigBodZod · April 30, 2021, 3:20am

Wendell are you saturating the PCIe Gen4 bandwidth with these devices or will more specialized devices be needed ?

And what is the best use case for even the Kioxia devices, DB or some combination of workloads ?

And yes the Threadripper Pro with 8 memory channels looks mighty impressive.

wendell · April 30, 2021, 3:21am

If I were building a “big” system like this, I’d probably use a max of 8 of the fastest NVMe. Or possibly two Liqid Honeybadgers (that’s the speed equivalent of 16 regular m.2 nvme). For VFX and some kinds of one-off data processing, this much insanity can make sense. The real-world use cases are limited.

Once I built a system to process transactions and when we changed how we computed something it was necessary to load the entire prior year’s data and re-process using the new algorithm. Something like this for that makes sense vs a server. Right now a lot of companies are doing that kind of one-shot jobs in the cloud and paying Amazon for the one-time compute. They load the result set into a smaller cheaper DB and go with life. That’s a valid strategy, too.

GigaBusterEXE · April 30, 2021, 4:19am

How did that old pci-e card compare when you tested it a few weeks ago I remember you said it got faster with newer processors

wendell · April 30, 2021, 4:32am

On v1 xeons it’s like 220k iops. On those skylake xeons is like 380k iops. Not bad

Dutch_Master · April 30, 2021, 9:20am

Provide and they will come. Just give it time

Anyway, I have flashbacks to Linus’ NVMe-based render server you guys couldn’t really get to work, despite best efforts. But he made a video (or more) so he made money on it anyway

Log · May 27, 2021, 4:51am

@wendell

I came across an interesting comment the other day which I found intriguing but is quite a bit beyond the hardware I have to test it out with. The guy seems to have some familiarity with big systems

AnotherAnonGringo
There is nothing special you need to do for NVMe drives, until you get into many of them where you actually need to start slowing things like the PCIe bus down.
…

Glix_1H
Now I’m curious, does this mean dropping from pcie3 to pcie2 speeds or something else? And what does slowing the bus accomplish?

AnotherAnonGringo
Yep, exactly! Usually, when you have that many, IOPS is what you want, not raw throughput. So when bus contention starts killing all communication, lowering the max throughput of the NVMe isn’t a big deal.

So for your IOPS goal, you might try actually forcing a reduction on the level of PCIe, and see if that helps your scenario.

GigaBusterEXE · May 27, 2021, 5:04am

I think dedicated silicon for flash controllers on the CPU will be the next big thing just like what’s in the PS5 and it’s got like 20 que depths

Ivan_Here · May 30, 2021, 7:58pm

did you keep track of how much is current config cost est is like for someone other than madman who would like to build one …

Chuntzu · September 1, 2021, 4:45am

This video is the reason I got back I to storage benchmarking. Thank you! It’s been a few years but “dusted off” the old servers and got right to it. Couple of e5-2680v2 and 128 go of ram. It’s a super micro x9drx, 11 pci 8x slots. Scaled fio using libaio with 10 intel 750 drives just fine and then started testing ram drives got up to 10 million 4K read iops and decided to try spdk! Boy that’s when it got interesting. Once I sorted out how to work with it is really awesome. Single core in pool mode millions of iops! The big news was once I sorted out the Malloc bdevs (ram drive) I tested against that and hit 25.4 million 4K read iops. I will post and image tomorrow. It took 10 cores…I think I can do it with less but just got done doing it and haven’t tried scaling it down. It was 94ish gigabytes per second speeds at 4K blocks! That bonkers!

wendell · September 1, 2021, 4:54am

Wait till you see what just got merged into kernel 5.15. I have been working on a video with Intel Platinum 8380s for two months chasing some bizzare issues and things like regression if you add more devices which should be able to service more iops… but then its fixed

bio can be re-used in pure polled mode is the big change on 5.15+ and you can do ~2-3 million iops on a single 8380 core.

I really wanna see what youre up to adn if you’ve got a step by step of what you did I’ll include it in my vid

Chuntzu · September 1, 2021, 1:47pm

Well now I need to go find out what you have been up to with Kernel 5.15!

I was surprised the 3rd gen scalable cpus tested so poorly when you were testing. Again I am aware of the fact you wanted to high 20million with just stock Linux and fio. But man the big with intel really capped you out, glad you figured it out! Good job and it must have been a ton of work!

I was reading on spdk or something from intel that they have special io queuing on the scale able cpus to help with spdk and other stuff, I am looking forward to this. But since I didn’t have any of those on hand I wanted to see how well the older/cheaper e5 cpus could do with these. Pretty glad that spdk works so well.

I just saw benchmarking white papers from the spdk team comparing spdk to io_uring and io_uring keeps up well but has problems with the fio implementation.

With intel ocf/opencas they are using spdk as well. Just think of 4 to 8 optane drives caching everything with super low cpu usage and crazy iops!

A lot of me trying spdk and testing the Linux storage stack stems from windows not handling io properly compared to Linux. I was peaking at 800,000 iops with windows using 2x nv1616 pcie cards. Linux maxes those out well north of 2.2 million. What bothers me is that companies like start wind have started implementing spdk and nvmeof for windows but not Microsoft so the windows performance gap is huge.

I originally started working with hyperv vs kvm due to samba 3.0 due to rdma and how crazy fast it was with windows and server2012 r2 back in the day. But this is is mostly moot with nfs over rdma, nvmeof rdma/tcp, ceph rdma, Gluster rdma, samba (they are still working on the rdma) multichannel, srp, etc. Now Linux has a superior io and networking. Kind of makes me regret how much time I poured into hyperv and storage spaces. But it’s all learning.

I haven’t worked on this stuff for 5 years while I was really active on the servethehome forums. It’s good to be back into it. Best part was after a couple weeks trying all this out I told my wife about my 25.5million iops and saw her eyes roll into the back of her head. But she was sweet and said “thats nice” after I gave her a 15 minute explanation of why it was awesome and what it all means!

I will post up info probably tomorrow, it will take me a bit to gather info and put it together in something understandable.

Chuntzu · September 2, 2021, 6:05pm

**New user issues can not post more than 2 embeded images. Maybe someone could unlock that so I can add those images? **

We will start with the numbers and then show how to set it up.

Random read 4k

Random write 4k
Random read 512 bytes - 34 Million
Random write 512 bytes - 26.4 Million
CPU usage htop - 20 cores being used.

git clone https://github.com/spdk/spdk 
cd spdk
git submodule update --init
sudo scripts/pkgdep.sh --all
./configure  --with-rdma
make
./test/unit/unittest.sh

This is the command to get everything functioning once you have made spdk. This set up script lets you then run things in user space using spdk.

sudo HUGEMEM=4096 scripts/setup.sh

To clean up and get everything back to normal use

sudo scripts/setup.sh reset

This first test works nicely by automatically using all nvme attached to the system and running the very highly optimized spdk perf app

sudo ~/spdk/build/examples/perf -q 128 -s 4096 -o 4096 -w read -t 10

Next is to build fio

git clone https://github.com/axboe/fio
cd fio
make
install

Now remake spdk with fio

cd spdk
git submodule update --init
./configure --with-fio=/path/to/fio/repo <other configuration options>
make

path to fio would be something like ~/fio

This is the part that I struggled with due to needing a fio config and inside a fio config you need a bdev config. That and there is a fio spdk nvme io module and and fio spdk_bdev module and different paths and setups for each.

The fio spdk io module is easy to set up.

sudo LD_PRELOAD=~/spdk/build/fio/spdk_nvme ~/fio/./fio ~/spdk/examples/nvme/fio_plugin/example_config.fio '--filename=trtype=PCIe traddr=0000.XX.00.0 ns=1'

So using this command set up you use LD_PRELOAD to add the spdk io module spdk_nvme. Next command is the fio you made. last command is the fio config, which you could use just the command line for fio for this but the config file is nice to use.

Here is what my config fire looks like and I just edited the example one that came with
spdk.

[global]
ioengine=spdk
thread=1
group_reporting=1
direct=1
verify=0
time_based=1
ramp_time=0
runtime=30
iodepth=128
rw=randread'
[test]
numjobs=1

Last part of the command to the pcie address for an nvme drive.

Now here is the big one using spdk_bdev with fio which I used to test my dual e5-2680 v2 set up too see what the max speed I could get in the system with no limitations.

sudo LD_PRELOAD=~/spdk/build/fio/spdk_bdev ~/fio/./fio ~/spdk/examples/bdev/fio_plugin/bdev_config.fio

This looks like the same but its a bit different were are now using spdk/build/fio/spdk_bdev vs spdk_nvme. Next were are using a new fio config which is in a slightly different location.

[global]
direct=1
thread=1
time_based=1
norandommap=1
group_reporting=1
rw=randread
bs=4096
runtime=300
ramp_time=0
numjobs=20
log_avg_msec=15
write_lat_log=/tmp/tc3_lat.log

[global]
ioengine=spdk_bdev
spdk_conf=/home/user/spdk/test/external_code/hello_world/bdev.json
[filename0]
iodepth=128
cpus_allowed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
#filename=Nvme1n1
#filename=Nvme0n1
#filename=Nvme2n1
#filename=Nvme3n1
filename=Malloc0

The things to take note of here is that in global that ioengine is now spdk_bdev, there is a spdk_conf which is pointing to a bdev.json file that needs to be created with a list all “drives” or other bdevs you created or want to use. I also am using cpus allowed to select which processor cores to use and I have number of jobs as 20 vs 1.

The bdevs if they are nvme drives can easily be created and formated properly using this handy script they include. sudo ~/spdk/scripts/gen_nvme.sh --json-with-subsystems | jq
then copy and past that into a file and point the script to it.

But to use Malloc as a bdev they actually have a premade json file in that location in the fio file, /home/user/spdk/test/external_code/hello_world/bdev.json. The Malloc bdev is a ram disk which can be used to test the limits of you system. The file will look like this except I cant get this formated properly so I will post it as an image next.

 "subsystems": [
    {
      "subsystem": "bdev",
      "config": [
        {
          "params": {
            "name": "Malloc0",
            "block_size": 512,
            "num_blocks": 12000
          },
          "method": "bdev_malloc_create"
        }
      ]
    }
  ]
}

Its been fun, now its time to try this over iscsi, NVM-OF, RDMA, and others. I would like to get this running in some sort of productionish home lab level and see how well it improves performance of vms and containers. Should be fun

Chuntzu · September 2, 2021, 6:09pm

correct bdev config.json
spdk bdev config fio the correct one

Here is a new one, bdev per 100 percent 4k writes

wendell · September 2, 2021, 7:08pm

You rock, upped your trust level

Chuntzu · September 2, 2021, 7:40pm

One last set of images

512kb bdev perf reads - 61 million 512kb reads bdev perf

512kb bdev perf writes - 59 million

All of this can be taken with a grain of salt since this is not going onto disk but it shows the up limit performance of the e5 platform, which are pretty cheap to acquire. Now next would be to bump this all off of the NVMe disks I have should be close to 12 million read iops on some old intel 750 drives and a few NV1616 type cards. Would like to hit that 20 milllion mark as well. Could only imagine what epyc and scalable systems can hit doing the Malloc stuff. Would be cool to see.

Chuntzu · September 2, 2021, 7:40pm

Thank you sir!

Chuntzu · September 7, 2021, 5:18pm

@wendell Great job on the 15 Million Iops with 8380! I was eyeballing the phoronix article you mentioned. All of this is very exciting. SPDK for user space and io_uring and new hybrid polling for Kernal. It is leaps and bounds past Windows at this point. I was running into issues with windows 10 when trying this out at first. I had two devices that should have hit 1.1 million 4k iops each and maxed out windows 10 at 690,000 Iops no matter how I set it up! Bummed me out so much I didnt even try it out on the server platforms since I knew it would be more of the same. Interesting idea would be connecting with starwinds nvmeof to spdk nvmeof/iscsi server and actually testing it out like they did in there benchmarks. I agree that Microsoft should just buy them and roll it into there stack. Starwind actually has RDMA isci/nvmeof and TCP accelerated isci/nvmeof drivers and target/server software for Windows which is pretty great. I will try and post up some benchmarks if and when I get that up and running. Going to try and replicate the work/testing you just mentioned with kernel 5.15 with my older hardware and see where it lands in the iops per core vs the 8380, I would be happy if I can get the 1.2/1.3 million per core I was getting with spdk with thie new kernel.

wendell · September 7, 2021, 5:39pm

second video based on your post is coming

Marten · October 27, 2021, 7:05am

@wendell Did you see the phoronix news below. 10M I/Ops per core on AMD even.

https://www.phoronix.com/scan.php?page=news_item&px=Linux-IO_uring-10M-IOPS