Toward 20 Million I/Ops - Part 1: Threadripper Pro - Work in Progress DRAFT

Yep follow up inbound

2 Likes

Reviving this thread a little. I want to PoC how far consumer SSDs can be pushed with SPDK/hybrid polling before switching to DC or Optane drives. I also want to baseline where everything currently stands so I don't repeat work that has already been done.

@wendell & @Chuntzu, can you correct me if something below is wrong?

  1. A fully loaded Threadripper/Milan system using M.2 carrier cards and Samsung 980 Pros tops out at ~1.1M IOPS in aggregate on kernels older than 5.15; adding more 980s doesn't help.
  2. Kernel 5.15+ with an R9 5950X + 2x Optane yields ~8.9M to ~10M IOPS per core, but there's no known aggregate upper bound for a whole system.
  3. Kernel 5.17-rc with a 12900K + 2x Optane yields ~13.1M IOPS per core (using Jens' latest patches).
  4. Using SPDK on an E5, the upper limit of a bdev is ~25M IOPS with 4K writes.

So I guess the question is: is it worth grabbing some 980 Pro drives and an M.2 carrier paired with a 12th-gen platform to baseline SPDK/io_uring vs libaio, then adding drives to try to hit the upper limit? Or is the 980 Pro physically limited to the point where there won't be any difference between io_uring and libaio? (I suspect any such limit would be at the controller level rather than the NAND level.)
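If I do go down that road, here's roughly the baseline I have in mind. A minimal sketch, untested: the device path, queue depth, and job count are placeholders to tune per drive, and it just shells out to fio to compare libaio against io_uring with polled completions (--hipri), fixed buffers, and registered files.

```python
# Rough baseline sketch: same 4K randread job under libaio vs io_uring (polled).
# /dev/nvme0n1, iodepth, and numjobs are placeholders; needs root and a drive
# you don't mind hammering with reads.
import json
import subprocess

DEV = "/dev/nvme0n1"

def run_fio(engine, extra):
    cmd = [
        "fio",
        "--name=randread-baseline",
        f"--filename={DEV}",
        f"--ioengine={engine}",
        "--rw=randread",
        "--bs=4k",
        "--direct=1",
        "--iodepth=128",
        "--numjobs=4",
        "--runtime=30",
        "--time_based",
        "--norandommap",
        "--group_reporting",
        "--output-format=json",
    ] + extra
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)["jobs"][0]["read"]["iops"]

aio = run_fio("libaio", [])
uring = run_fio("io_uring", ["--hipri", "--fixedbufs", "--registerfiles"])
print(f"libaio:   {aio / 1e6:.2f}M IOPS")
print(f"io_uring: {uring / 1e6:.2f}M IOPS (polled completions, fixed buffers)")
```

If the two engines land within a few percent of each other on a single 980 Pro, that would point at the controller rather than the submission path being the limit.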

The eventual question I want to get to is a $/IOPS number, and how that number compares between consumer SLC/TLC vs DC SLC/TLC vs Optane. I have a sneaking suspicion that Optane actually has the best $/IOPS, followed by consumer drives.
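Something like this is the shape of the comparison I want. The prices below are placeholders I made up and the IOPS are datasheet-class numbers, so treat it purely as a template:

```python
# $/IOPS template: prices are placeholders I invented, IOPS are rated
# 4K random-read figures; substitute street prices and your own measured numbers.
drives = {
    "consumer TLC (980 Pro)": (180, 1_000_000),
    "DC TLC (D7-P5510)":      (450,   700_000),
    "Optane (P5800X)":        (900, 1_500_000),
}
for name, (price_usd, iops) in drives.items():
    print(f"{name:24s} ${price_usd / iops * 1_000_000:7.2f} per million IOPS")
```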

1 Like

Almost a necro, but turns out this is actually a wiki, so here we go.

@wendell Looks like Linux kernel 6.0 is gonna have a load of performance improvements, including to io_uring.

122M IOPS in 2U, with > 80% of the system idle. Easy.

Also lmao

Jens Axboe: You know, sometimes a new OS from scratch sounds appealing. But then I remember how much work that would be…

1 Like

RIP optane

F

1 Like

Could be old news, but I just saw Samsung's development announcement of their PCIe Gen5 SSDs, where they set expectations of up to 2.5M IOPS and 13 GB/s of bandwidth.

Also, they boast about an expected efficiency of 608 MB/s per watt. Unless I'm misreading that, this line of SSDs will draw a whopping 21+ W at peak (13,000 MB/s ÷ 608 MB/s per W ≈ 21.4 W).

On the bright side: you’ll need only 8 of those for 20M iops :smiley:

I know, pie in the sky, but still somewhat relevant.

1 Like

With the right setup and NVMe drive choices this should be doable right now. The GRAID SupremeRAID SR-1010 (PCIe 4.0) is spec'd at up to 19 million IOPS, while the PCIe 3.0 SR-1000 is spec'd at 16 million.

I would not be surprised if the SR-1010 can already be pushed over 20 million IOPS, since many newer NVMe drives have come out that are faster than the benchmarked Intel D7-P5510, which has the following performance characteristics:
6500 MB/s Sequential Bandwidth - 100% Read
3400 MB/s Sequential Bandwidth - 100% Write
700,000 IOPS (4K Blocks) Random Read
170,000 IOPS (4K Blocks) Random Write

For example, take the SK hynix Platinum P41, which already has double the IOPS:
7,000 MB/s Sequential Bandwidth - 100% Read
6,500 MB/s Sequential Bandwidth - 100% Write
1,400,000 IOPS (4K Blocks) Random Read
1,300,000 IOPS (4K Blocks) Random Write

With PCIe 5.0 being introduced, and preliminary results already showing 14,000 MB/s reads and 2.5+ million IOPS, the 20 million mark should be toast quite easily.
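Quick back-of-the-envelope on drive count, using the rated random-read numbers quoted here and assuming perfectly linear scaling (which no real RAID layer will give you):

```python
# Drives needed to reach 20M aggregate IOPS at each drive's rated 4K random read.
import math

target = 20_000_000
for name, iops in [("Intel D7-P5510", 700_000),
                   ("SK hynix Platinum P41", 1_400_000),
                   ("early PCIe 5.0 parts", 2_500_000)]:
    print(f"{name:22s} -> {math.ceil(target / iops):2d} drives")
```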

Jens is hitting 100 million IOPS these days, and GRAID won't really be able to touch that.

Graid doesn’t even use the extra nvme data integrity field, sadly.

That’s what I get for posting when I haven’t slept in 2 days. :slight_smile:
I could have sworn you were trying to achieve 20M IOPs via RAID which is tough with NVMe. Went back and re-read the thread and see you’re talking about all out IOPs.

When you mention Jens, I'm assuming you're referring to Axboe, correct? If so, there are a couple of things to keep in mind; none of this is meant to take anything away from his accomplishments.

He’s using 24 Optane 5800X setup special. His version of an IOP isn’t what most of us would test against for real world use. I know unless it’s specially mentioned an IOP to me is 4K as that’s the standard unless otherwise specified.

The P5800X is rated at up to 1.55M random 4K read IOPS. It also has a special mode meant for metadata that allows 512B access, and that provides 5M IOPS according to Intel literature.

Back on Aug 1st, Jens posted that he's using 24 drives at ~5.08M IOPS each and said the 122M IOPS figure was just linear scaling. https://twitter.com/axboe/status/1554115250588471297
24 x 5.08M = 121.92M, so that's about as linear as you'll get. The problem I have with the numbers is that each IOP is 1/8 the size typically used, so there's a lot less bandwidth going on.

When he tested the same drives using 4K he got 31.8M IOPS, which is more like what we're used to comparing. Don't get me wrong, that's a super number in its own right, but the test essentially assigns different drives to different cores, using enough cores to read the data with polling. Each read is followed by a no-op, so it does only a small amount of I/O work compared to real usage such as loading the data into memory to be used, or sending it across the network to a client workstation. It's not representative of database IOPS, for example.
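Putting rough numbers on that bandwidth point (my arithmetic, from the figures above):

```python
# Aggregate bandwidth implied by the two quoted results:
# 122M IOPS at 512B vs 31.8M IOPS at 4K.
for label, iops, block in [("512B run", 122_000_000, 512),
                           ("4K run  ", 31_800_000, 4096)]:
    print(f"{label}: {iops * block / 1e9:.0f} GB/s")
# ~62 GB/s vs ~130 GB/s: each 512B IOP moves 1/8 the data, so the headline
# 122M number actually carries less than half the bandwidth of the 4K test.
```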

So I think that’s important to keep in mind if your trying to compare your own system IOPs from actual use to this testing. Again, not wanting to downplay the work he’s doing because it’s good stuff.

I mention this since I think you were doing 4K IOPS testing. If I'm right about that, then had you tested the same way using 512B sectors, you would likely have passed 20M IOPS pretty easily.

As touched on in my first message, these IOPS are, as I understand it, just individual drives each pinned to a core; the drives aren't part of any pool or RAID group. That, to me, is the more fundamental issue. Right now there aren't any decent solutions for high-throughput RAID of NVMe using software or the common tri-mode adapters. The SupremeRAID board is the only device I've seen that does a decent job of delivering good bandwidth and IOPS numbers when joining NVMe drives.

The hardware for the SR-1000 and SR-1010 appears to be normal NVIDIA GPUs. The SR-1000 reports itself as a PNY Quadro T1000, and the SR-1010 appears to be an NVIDIA RTX A2000. Neither is a very high-end GPU by today's standards.

It’s interesting that you’re buying a solution for close to $2500 that comes with a $300-$400 GPU and software raid. Of course they are offloading RAID calculations and other functions to the GPU. Storage review mentioned they are also using AI functions of the GPU.

I think it’s just a matter of time before we see something like this done open source. that could benefit equally benefit mdadm as well as ZFS.
I think they’ve shown GPU use to be worth doing.

2 Likes

Some precocious college students, and some company that wants to support the work on a DPU, come to mind. I'd love to see ZFS offloaded to a DPU, with "direct storage" handled on the DPU's side.

Right on all counts, except that even at 31.x million 4K IOPS, CPU utilization was so low that it could sail past 100 million, assuming there's no hidden bottleneck.

So I've got a stupid question. It's probably been answered, I'm really not entirely sure, but… what would you use that actually utilizes the IOPS?

2 Likes

Excel :joy:

:joy: PowerQuery OP

1 Like

He clearly said that running 512B and hitting 122M IOPS, he had > 80% of the system idle.
He then goes on to say this was running on a 5950X, which is 16 cores / 32 threads, and that in his testing he was able to handle up to 5.5M IOPS at 512B per core. That's basically one P5800X per core, if correct.

What I couldn’t find and don’t understand is this. I’m going to assume with a 5950X he’ll be running a X570 or similar. In a review I found: “X570 PCH actually includes sixteen PCIe lanes, in addition to the 24 PCIe lanes on the CPU”

Is that correct, only 40 PCIe lanes total, many of which are reserved? In any event, 24 NVMe drives would need 96 PCIe lanes for full bandwidth, and that's already more than twice the platform's total lanes.
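My own back-of-the-envelope on this, assuming roughly 1.97 GB/s of usable bandwidth per PCIe 4.0 lane (my assumption, not something from Jens' posts):

```python
# Lane and bandwidth math for 24 NVMe drives on a desktop platform.
drives = 24
lanes_per_drive = 4
platform_lanes = 24 + 16          # CPU + X570 chipset lanes quoted above
print(f"lanes for full x4 links: {drives * lanes_per_drive} vs {platform_lanes} on the platform")

# At the quoted ~5.08M IOPS of 512B per drive, each drive only needs ~2.6 GB/s,
# well under a full Gen4 x4 link (~7.9 GB/s), so narrower links or switches
# behind the available lanes could still keep up at that block size.
gen4_lane_gbps = 1.97             # assumption: usable GB/s per Gen4 lane
per_drive_gbps = 5_080_000 * 512 / 1e9
print(f"per-drive bandwidth at 512B: {per_drive_gbps:.1f} GB/s "
      f"(Gen4 x4 ~ {4 * gen4_lane_gbps:.1f} GB/s)")
```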

Databases, for sure.
If you can access data very quickly without using much CPU to do it, you can get by with fewer cores/CPUs, which can save real money on licensing. In some cases, what you'd save in a single year on SQL licensing would pay for the NVMe hardware!

1 Like

Yes, but the X570 is connected to the CPU with just 4 lanes of PCIe Gen4, so those 16 chipset PCIe lanes can't contribute more than 4 lanes' worth of bandwidth. In my testing it's even less, due to overhead.