Toward 20 Million I/Ops - Part 1: Threadripper Pro - Work in Progress DRAFT

Chuntzu · September 2, 2021, 7:40pm

One last set of images

512kb bdev perf reads - 61 million 512kb reads bdev perf

512kb bdev perf writes - 59 million

All of this should be taken with a grain of salt since none of it is going to disk, but it shows the upper-limit performance of the E5 platform, which is pretty cheap to acquire. Next would be to run all of this against the NVMe disks I have; that should be close to 12 million read IOPS on some old Intel 750 drives and a few NV1616-type cards. I would like to hit that 20 million mark as well. I can only imagine what EPYC and Scalable systems can hit doing the Malloc stuff. Would be cool to see.

Chuntzu · September 2, 2021, 7:40pm

Thank you sir!

Chuntzu · September 7, 2021, 5:18pm

@wendell Great job on the 15 million IOPS with the 8380! I was eyeballing the Phoronix article you mentioned. All of this is very exciting: SPDK for user space, and io_uring and the new hybrid polling for the kernel. It is leaps and bounds past Windows at this point. I was running into issues with Windows 10 when trying this out at first. I had two devices that should have hit 1.1 million 4k IOPS each, and Windows 10 maxed out at 690,000 IOPS no matter how I set it up! It bummed me out so much I didn't even try it on the server platforms, since I knew it would be more of the same. An interesting idea would be connecting with StarWind's NVMe-oF to an SPDK NVMe-oF/iSCSI server and actually testing it out like they did in their benchmarks. I agree that Microsoft should just buy them and roll it into their stack. StarWind actually has RDMA iSCSI/NVMe-oF and TCP-accelerated iSCSI/NVMe-oF drivers and target/server software for Windows, which is pretty great. I will try to post some benchmarks if and when I get that up and running. I'm going to try to replicate the work/testing you just mentioned on kernel 5.15 with my older hardware and see where it lands in IOPS per core vs the 8380. I would be happy if I can get the 1.2/1.3 million per core I was getting with SPDK with this new kernel.

wendell · September 7, 2021, 5:39pm

second video based on your post is coming

Marten · October 27, 2021, 7:05am

@wendell Did you see the Phoronix news below? 10M I/Ops per core, on AMD even.

https://www.phoronix.com/scan.php?page=news_item&px=Linux-IO_uring-10M-IOPS

wendell · October 27, 2021, 1:05pm

Yep follow up inbound

bryan_v · February 11, 2022, 8:08pm

Reviving this thread a little. I want to try to PoC how far consumer SSDs can be pushed with SPDK/hybrid polling, before switching to DC or Optane drives. I also want to baseline where everything is at so I don’t repeat something that has already been done before.

@wendell & @Chuntzu can you correct me if something is wrong.

A fully loaded Threadripper/Milan system using M.2 carrier cards and Samsung 980 Pros tops out at ~1.1M iops in aggregate when using kernel <5.15; adding more 980s doesn’t help.
Kernel 5.15+ with an R9 5950X + 2x Optane will yield ~8.9M to ~10M iops/core, but no idea on the aggregate upper bound for a system
Kernel 5.17-rc with a 12900K + 2x Optane will yield ~13.1M iops/core (using Jens’ latest patches)
Using SPDK on an E5, the upper limit of a bdev is ~25M iops w/ 4k writes

So I guess the question is: is it worth grabbing some 980 Pro drives and an M.2 carrier, paired with a 12th gen CPU, to baseline SPDK/io_uring vs libaio and then increase the number of drives to try to hit the upper limit? Or is the 980 Pro physically limited to the point where there will not be any difference between io_uring and libaio? (I suspect any such limit would be at the controller level rather than the NAND level.)
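
A baseline like this could be scripted so that every engine gets the same workload. The sketch below only builds fio command lines for a 4k random-read test; the device path, queue depth, job count, and runtime are placeholder assumptions, not values from this thread, and the helper name `fio_randread_cmd` is made up for illustration.

```python
import shlex


def fio_randread_cmd(device, ioengine, polled=False,
                     iodepth=128, numjobs=1, runtime_s=30):
    """Build a fio argv list for a 4k random-read run against one device.

    All defaults are illustrative placeholders; tune them per drive.
    """
    args = [
        "fio",
        f"--name=randread-{ioengine}",
        f"--filename={device}",
        f"--ioengine={ioengine}",
        "--rw=randread",
        "--bs=4k",
        "--direct=1",              # bypass the page cache
        f"--iodepth={iodepth}",
        f"--numjobs={numjobs}",
        f"--runtime={runtime_s}",
        "--time_based",
        "--group_reporting",
    ]
    if polled:
        if ioengine != "io_uring":
            raise ValueError("this sketch only enables polling for io_uring")
        args.append("--hipri")     # ask for polled completions
    return args


if __name__ == "__main__":
    # /dev/nvme0n1 is a placeholder device path.
    for engine, polled in [("libaio", False),
                           ("io_uring", False),
                           ("io_uring", True)]:
        print(shlex.join(fio_randread_cmd("/dev/nvme0n1", engine, polled)))
```

Running the three printed commands on one drive, then repeating with more drives, would show whether the engine or the drive is the limit. Polled mode usually also needs poll queues enabled in the NVMe driver, so treat that variant as something to verify on your own kernel.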

The eventual question I want to get to is a $/iops number, and how that number compares between consumer SLC/TLC vs DC SLC/TLC vs Optane. I have a sneaking suspicion that Optane actually has the best $/iops number, followed by consumer drives.
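
The $/iops comparison is simple division, shown here as a rough sketch. The helper name `dollars_per_kiops` is made up, and the price and IOPS figures in the usage lines are hypothetical placeholders, not real quotes or measurements; substitute the price paid and the IOPS actually measured.

```python
def dollars_per_kiops(price_usd, iops):
    """Return cost in dollars per 1,000 IOPS for one drive."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return price_usd / (iops / 1000)


if __name__ == "__main__":
    # Hypothetical placeholder inputs: (label, price in USD, 4k read IOPS).
    candidates = [
        ("consumer drive (placeholder)", 100.0, 1_000_000),
        ("DC drive (placeholder)", 500.0, 1_000_000),
    ]
    for label, price, iops in candidates:
        print(f"{label}: ${dollars_per_kiops(price, iops):.3f} per 1k IOPS")
```

Using measured IOPS rather than datasheet numbers matters here, since the per-drive figure depends on the I/O engine and block size discussed above.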

Log · August 4, 2022, 9:50pm

Almost a necro, but turns out this is actually a wiki, so here we go.

@wendell Looks like Linux kernel 6.0 is gonna have a load of performance improvements, including for io_uring

122M IOPS in 2U, with > 80% of the system idle. Easy.

[Image: Jens Axboe - FIO test showing new IOPS with Linux 6.0 kernel]

Also lmao

Jens Axboe: You know, sometimes a new OS from scratch sounds appealing. But then I remember how much work that would be…

nx2l · August 5, 2022, 1:29pm

RIP optane

F

jode · August 15, 2022, 12:26pm

Could be old news, but I just saw Samsung’s development announcement of their PCIe Gen5 SSDs, where they set expectations for 2.5M max iops and 13 GB/s max bandwidth.

Also, they boast about the expected efficiency of 608 MB/s per W. Unless I understand that incorrectly, this means that this line of SSDs will consume more than a whopping 21 W at peak (13000 MB/s ÷ 608 MB/s per W ≈ 21.4 W).

On the bright side: you’ll need only 8 of those for 20M iops
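
For anyone who wants to redo that arithmetic with other drives, here is a small sketch using only the figures quoted in this post (13,000 MB/s, 608 MB/s per W, 2.5M IOPS per drive, 20M IOPS target). The function names are made up for illustration.

```python
import math


def peak_power_watts(bandwidth_mb_s, efficiency_mb_s_per_w):
    """Peak power implied by a bandwidth figure and an MB/s-per-watt rating."""
    return bandwidth_mb_s / efficiency_mb_s_per_w


def drives_for_target(target_iops, iops_per_drive):
    """Minimum whole number of drives, assuming perfectly linear scaling."""
    return math.ceil(target_iops / iops_per_drive)


if __name__ == "__main__":
    watts = peak_power_watts(13_000, 608)
    drives = drives_for_target(20_000_000, 2_500_000)
    print(f"peak power per drive: {watts:.1f} W")
    print(f"drives needed for 20M IOPS: {drives}")
    print(f"total peak power: {watts * drives:.0f} W")
```

The drive count assumes linear scaling with no CPU or PCIe bottleneck, which is exactly what the rest of this thread is trying to establish.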

I know, pie in the sky, but still somewhat relevant.

Carlo · August 28, 2022, 2:07am

With the right setup and NVMe drive choices this should be doable right now. The GRAID SupremeRAID SR-1010 card is showing PCIe 4.0 specs of:

While the SR-1000 PCIe 3.0 is showing:

That’s 16 million IOPs on PCIe 3.0 and 19 million with PCIe 4.0. I would not be surprised if the SR-1010 can already be pushed over 20 million IOPs, as many newer NVMe drives have come out that are faster than the benchmarked Intel D7-P5510, which has the following performance characteristics:
6500 MB/s Sequential Bandwidth - 100% Read
3400 MB/s Sequential Bandwidth - 100% Write
700,000 IOPS (4K Blocks) Random Read
170,000 IOPS (4K Blocks) Random Write

For example, use the SK hynix Platinum P41, which already has double the read IOPs.
7,000 MB/s Sequential Bandwidth - 100% Read
6,500 MB/s Sequential Bandwidth - 100% Write
1,400,000 IOPS (4K Blocks) Random Read
1,300,000 IOPS (4K Blocks) Random Write

With PCIe 5.0 being introduced and preliminary results already showing 14,000 MB/s reads and 2.5+ million IOPs, the 20 million mark should be toast quite easily.

wendell · August 28, 2022, 2:13am

Jens is hitting 100 million iops these days, and Graid won’t really be able to touch that.

Graid doesn’t even use the extra nvme data integrity field, sadly.

Carlo · August 28, 2022, 5:25pm

That’s what I get for posting when I haven’t slept in 2 days.
I could have sworn you were trying to achieve 20M IOPs via RAID which is tough with NVMe. Went back and re-read the thread and see you’re talking about all out IOPs.

When you mention Jens I’m assuming this is Axboe you’re referring to correct? If so there are a couple things to keep in mind. This is not meant to take anything away from his accomplishments.

He’s using a special setup of 24 Optane P5800X drives. His version of an IOP isn’t what most of us would test against for real-world use. To me, an IOP is 4K unless otherwise specified, as that’s the standard.

The P5800X is rated at up to 1.55M Random 4K Read IOPs. The P5800X has a special mode that’s meant for meta-data that allows use of 512B and that provides 5M IOPs according to Intel literature.

Back on Aug 1st Jens posted that he’s using 24 drives, each doing 5080K IOPs, and said the 122M IOPs was just linear scaling. https://twitter.com/axboe/status/1554115250588471297
24 x 5080K = 121,920K, so that’s about as linear as you’ll get. The problem I have with the numbers is that each IOP is 1/8 the size of what is typically used (512B vs 4K), so there is a lot less bandwidth going on.

When he tested the same drives using 4K he got 31.8M IOPs, which is more like what we’re used to comparing. Don’t get me wrong, that’s a super number in its own right, but the test is essentially assigning different drives to different cores, using enough cores to be able to read the data using polling. It’s followed by a No-Op, so it’s only doing a small amount of IO compared to real usage such as loading the data into memory to be used, or sending it across the network to a client workstation or similar. It’s not representative of database IOPs, for example.
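
To put the two results on the same footing, the IOPS figures can be converted to bandwidth. This is a back-of-the-envelope sketch using only the numbers quoted above (24 drives, 5080K IOPs per drive at 512B, 31.8M IOPs at 4K); the helper name is made up, and GB here means 10^9 bytes.

```python
def bandwidth_gb_s(iops, block_bytes):
    """Aggregate bandwidth in GB/s (10**9 bytes) for a given IOPS and block size."""
    return iops * block_bytes / 1e9


if __name__ == "__main__":
    drives = 24
    iops_512 = drives * 5_080_000   # linear scaling of the per-drive figure
    iops_4k = 31_800_000

    print(f"512B: {iops_512 / 1e6:.2f}M IOPS -> "
          f"{bandwidth_gb_s(iops_512, 512):.1f} GB/s")
    print(f"4K:   {iops_4k / 1e6:.2f}M IOPS -> "
          f"{bandwidth_gb_s(iops_4k, 4096):.1f} GB/s")
    print(f"4K IOPS per drive: {iops_4k / drives / 1e6:.3f}M")
```

By this arithmetic the 512B run moves roughly 62 GB/s while the 4K run moves roughly 130 GB/s, so the headline IOPS number and the amount of data transferred pull in opposite directions.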

So I think that’s important to keep in mind if you’re trying to compare your own system’s IOPs from actual use to this testing. Again, I’m not wanting to downplay the work he’s doing, because it’s good stuff.

I mention this since I think you were doing 4K IOPs testing. If I’m correct on that, if you would have tested the same way using 512B sectors then you would have likely passed 20M IOPs pretty easily.

As touched on in my first message, these IOPs, from my understanding, are just individual drives and cores. The drives aren’t part of any pool or RAID group. This to me is a more fundamental issue. Right now there aren’t any decent solutions for high-output RAID of NVMe using software or common tri-mode adapters. The SupremeRAID board is the only device I’ve seen that does a decent job of giving us some good bandwidth and IOPs numbers when joining NVMe drives.

The hardware for the SR-1000 and SR-1010 appears to be normal Nvidia GPUs. The SR-1000 reports itself as a PNY Quadro T1000. The SR-1010 appears to be an NVIDIA RTX A2000. Neither is a very high-end GPU by today’s standards.

It’s interesting that you’re buying a solution for close to $2500 that comes with a $300-$400 GPU and software RAID. Of course they are offloading RAID calculations and other functions to the GPU. StorageReview mentioned they are also using AI functions of the GPU.

I think it’s just a matter of time before we see something like this done open source. That could equally benefit mdadm as well as ZFS.
I think they’ve shown GPU use to be worth doing.

wendell · August 28, 2022, 9:57pm

Some precocious college students and some company that wants to support the work on a DPU come to mind. I’d love to see ZFS off-loaded to a DPU and “direct storage” on the part of the dpu

Right on all counts, except that even with 31.x million 4k iops, CPU utilization was so low that it could sail past 100 million, assuming there’s no hidden bottleneck.

PhaseLockedLoop · August 28, 2022, 10:16pm

So I’ve got a stupid question. It’s probably been answered, I’m really not entirely sure, but… what would you use that actually utilizes the iops?

Log · August 29, 2022, 2:51am

Excel

PhaseLockedLoop · August 29, 2022, 2:52am

PowerQuery OP

Carlo · August 30, 2022, 1:20am

He clearly said that running 512B and hitting 122M IOPs he had > 80% of the system idle.
He then goes on to say this was running on a 5950X, which is 16 cores, 32 threads. He said he tested and was able to handle up to 5.5M IOPs at 512B per core. That’s basically one P5800X per core if correct.

What I couldn’t find and don’t understand is this. I’m going to assume with a 5950X he’ll be running a X570 or similar. In a review I found: “X570 PCH actually includes sixteen PCIe lanes, in addition to the 24 PCIe lanes on the CPU”

Is that correct, only 40 PCIe lanes total, many of which are reserved? In any event, 24 NVMe drives are going to require 96 PCIe lanes for full bandwidth, but that is already more than 2x the total PCIe lanes.

Carlo · August 30, 2022, 1:23am

Databases for sure.
If you can access data very quickly and not use much CPU to do it, you can get by with fewer cores/CPUs, which can save some real money on licensing. In some cases what you save in a single year on SQL licenses would pay for the NVMe hardware!

jode · August 30, 2022, 1:50am

Yes, but the X570 is connected to the CPU with just 4 lanes of PCIe Gen4. So, all the 16 PCIe lanes are not contributing more than 4 lanes, really. In my testing it’s even less due to overhead.
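
The lane math from the last few posts can be sketched as below. It uses only the figures quoted in this thread (24 drives at x4 each, 24 CPU lanes, 16 chipset lanes, x4 chipset uplink) and makes no claim about how many of the CPU lanes are actually free for NVMe on a given board; the function names are made up for illustration.

```python
def lanes_needed(drives, lanes_per_drive=4):
    """PCIe lanes required to give every drive its full link width."""
    return drives * lanes_per_drive


def oversubscription(downstream_lanes, uplink_lanes):
    """How many downstream lanes share each uplink lane."""
    return downstream_lanes / uplink_lanes


if __name__ == "__main__":
    needed = lanes_needed(24)            # 24 NVMe drives at x4
    nominal = 24 + 16                    # CPU lanes + X570 chipset lanes
    chipset_ratio = oversubscription(16, 4)

    print(f"lanes needed: {needed}")
    print(f"nominal platform lanes: {nominal}")
    print(f"shortfall factor: {needed / nominal:.1f}x")
    print(f"chipset oversubscription: {chipset_ratio:.0f}:1")
```

So even before the chipset uplink is considered, 24 drives at full width need 2.4x the nominal lane count of an X570 platform, and the 16 chipset lanes are oversubscribed 4:1 on top of that.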