GRAID: 2025 Graphs

Graphs go here

todo

Background

Video goes here

System Configuration

Dual Intel Xeon 6980P (16- and 8-drive configurations)

8x/16x Kioxia CM7

Single-socket (1S) config:
Intel Xeon(R) 6781P
8x Kioxia CM7

TODO

Raw Device Benchmarks

16-drive configuration (two NUMA nodes, no other background jobs)

8-drive configuration (single NUMA node)
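For reference, this is the general shape of a raw-device fio run for numbers like these – a sketch, not the exact commands behind the graphs; device names, job counts, and runtimes are assumptions you would adjust for your own system:

# Sequential read on one NVMe namespace, pinned to NUMA node 0, bypassing the page cache
sudo numactl --cpunodebind=0 --membind=0 fio \
  --name=seqread --filename=/dev/nvme0n1 \
  --ioengine=io_uring --direct=1 \
  --rw=read --bs=1M --iodepth=32 --numjobs=4 \
  --time_based --runtime=60 --group_reporting

# Random 4K read for IOPS instead of throughput
sudo numactl --cpunodebind=0 --membind=0 fio \
  --name=randread --filename=/dev/nvme0n1 \
  --ioengine=io_uring --direct=1 \
  --rw=randread --bs=4k --iodepth=64 --numjobs=8 \
  --time_based --runtime=60 --group_reporting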

Peak Performance

Real-World

8-drive RAID 5 (single NUMA node)

BTRFS Filesystem (8-drive)

Okay, we’ll add BTRFS for extra integrity checking, right? Not so fast there – that will add significant overhead. Without extra tuning, this is what happens to the benchmarks:

And, the biggest of oofs:

With the btrfs filesystem we’re at roughly 40% of peak read performance and a tiny, tiny fraction of peak write. (This is mkfs.btrfs “here’s my storage volume, please make a filesystem” defaults, fwiw. Tuning is possible and the write performance can be recovered, but even with that tuning the overhead was still significant – greater than a 50% reduction, and CPU overhead more than 2x what I’d expect in this kind of scenario.)

(ps. the first write problem here is actually interesting.)
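For context, the “defaults” here really were just the one-liner, and the usual knobs people reach for afterward are mount options – a sketch of common candidates, not necessarily the tuning used for these graphs (mount point and device name are placeholders):

# Defaults, as benchmarked
sudo mkfs.btrfs -f /dev/device

# Common tuning attempts; note that nodatacow also disables the
# checksumming that made btrfs attractive in the first place
sudo mount -o noatime,ssd,space_cache=v2 /dev/device /mnt/test
sudo mount -o noatime,ssd,nodatacow /dev/device /mnt/test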

XFS Filesystem

XFS has metadata journaling. It also has much less overhead than BTRFS.

Still better – but the filesystem needs tuning that lets it spend more CPU in order to reach maximum throughput.
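The tuning in question is mostly stripe geometry at mkfs time (see the addendum below) plus mount options – an illustrative sketch of the usual candidates, not the exact settings behind the graphs:

# Mount options that trade a bit more memory for throughput
sudo mount -o noatime,largeio,inode64,logbsize=256k /dev/device /mnt/test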

ik_llama.cpp cross-socket workload

In this workload, we have previously found that models which exceed the local memory available to a single CPU socket often become bottlenecked by the socket-to-socket links. This is an atypical workload, and modern multi-socket systems were not really designed for the kind of traffic an “on-CPU” AI model generates.

If your workload does not depend on socket-to-socket bandwidth, then running this type of RAID workload, where half the disks hang off a different NUMA node, is unlikely to be affected.
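A quick way to see which socket each drive actually hangs off of, assuming a typical sysfs layout (adjust device names for your system):

# NUMA node for each NVMe controller (-1 means no affinity reported)
for d in /sys/class/nvme/nvme*; do
  echo "$(basename "$d"): $(cat "$d/device/numa_node")"
done

# Overall topology for reference
numactl --hardware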

This is something we may explore further in a future video, but…

How Busy is the GPU, really?

The GPU is apparently in a tight loop, so 100% utilization all the time.

Part of this may be to make it harder to suss out that the GPU really isn’t doing all that much except 1) managing the vast swarm of DMA requests/traffic and 2) calculating parity on writes.
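Since the utilization counter pegs at 100% regardless, watching power draw and SM clocks gives a better sense of how hard the card is actually working – a quick sketch, not a rigorous measurement:

# Sample GPU utilization, power draw, and SM clock once per second
nvidia-smi --query-gpu=utilization.gpu,power.draw,clocks.sm --format=csv -l 1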

It seems clear that, at least for my setup, there is an upper write limit of ~80 GB/sec. That’s nothing to sneeze at, to be sure, but I suspect on an A2000 or better it would be higher.

Okay, what about Linux MD?

More than 2 cores are needed per I/O thread to saturate Gen 5. Ouch!


Raw block device; no filesystem overhead:


Note: This one is a bit misleading; at smaller block sizes and thread counts it was really hard to eliminate the OS cache, given the amount of RAM in the system.
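For anyone repeating this: the usual ways to keep the page cache out of the measurement are O_DIRECT plus dropping caches between runs – standard knobs, and as noted above not always sufficient at every block size (device name assumed):

# Drop the page cache, dentries, and inodes before a run
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

# In fio, bypass and invalidate the cache
fio --name=test --filename=/dev/md0 --direct=1 --invalidate=1 \
    --ioengine=io_uring --rw=read --bs=1M --iodepth=32 \
    --time_based --runtime=60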

Very similar topline performance! But much higher CPU utilization.

What about RAID 5? That’s where things don’t look quite as good for Linux md:

… and with tuning I can get the Linux md write speed back up to par, but overall it still isn’t that great.

I put that in here mostly as a cautionary tale to actually test your stuff. We have a lot of threads here where people do an initial Linux md setup and don’t get the stripe right, or there’s some other underlying config issue, and then you just get abysmal write performance.
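If you are setting up Linux md, sanity-check the geometry before benchmarking – a couple of standard commands (device names assumed):

# Confirm level, chunk size, and that the array isn't resyncing in the background
cat /proc/mdstat
sudo mdadm --detail /dev/md0

# The stripe geometry the block layer advertises to filesystems
cat /sys/block/md0/queue/minimum_io_size /sys/block/md0/queue/optimal_io_size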

Pros of Linux MD:

  • Built in
  • RAID 0/1/10 almost as good, with tuning

Cons of Linux MD:

  • Much higher CPU utilization
  • Slower for RAID 5/6
  • Linux md does not handle parity errors well during a scrub
  • RAID 6 parity scrub needs to be better
  • RAID 5/6 write-hole journaling needs to be better
  • Write speed out of the box is slow because of sub-optimal defaults

Note the graph here shows absurdly slow writes. This is easy to overcome with a little config tuning, BUT these are also the out-of-the-box defaults with 1 MB chunks:

sudo mdadm --create /dev/md1 \
  --level=5 --raid-devices=8 \
  --metadata=1.2 --chunk=1024 --name=nvme8_raid5 \
  /dev/nvme2n1 /dev/nvme0n1 /dev/nvme7n1 /dev/nvme4n1 \
  /dev/nvme3n1 /dev/nvme6n1 /dev/nvme8n1 /dev/nvme1n1 \
  --assume-clean

Helpful Hints for Filesystem Tuning Addendum

This is useful in any context – Linux md, graid, … anything.


sudo mkfs.ext4 -F -E stride=256,stripe_width=2048 /dev/device


sudo mkfs.xfs -f -d su=1m,sw=8 /dev/device
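Roughly, the arithmetic behind those numbers (illustrative; adjust the data-disk count for your RAID level and drive count):

# ext4: stride = chunk size / filesystem block size; stripe_width = stride * data disks
#       e.g. a 1 MiB chunk with 4 KiB blocks gives stride=256; 8 data disks gives stripe_width=2048
# xfs:  su = chunk size, sw = number of data-bearing disks
#       (for RAID 5/6, parity disks don't count toward sw)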


# NVMe namespaces: use 'none' scheduler; md can use 'mq-deadline' (A/B test is fine)
for s in /sys/block/nvme*n1/queue/scheduler; do echo none | sudo tee "$s"; done
echo mq-deadline | sudo tee /sys/block/md0/queue/scheduler

# Make sure request queue isn't tiny
echo 2048 | sudo tee /sys/block/md0/queue/nr_requests



TODO maybe explain this a bit better in the xfs and ext4 comment. Maybe an audience member has some BTRFS hints that are similar?

Linux md-specific tuning hints

sudo blockdev --setra 65536 /dev/md0
echo 32768 | sudo tee /sys/block/md0/queue/read_ahead_kb
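For the slow RAID 5/6 write problem specifically, these are the md knobs that usually recover it – the values below are illustrative starting points, not necessarily the exact tuning used for the graphs above:

# Bigger stripe cache helps parity writes a lot (RAM cost is roughly entries x 4 KiB x member drives)
echo 8192 | sudo tee /sys/block/md0/md/stripe_cache_size

# Let RAID 5/6 stripe handling use more than one kernel thread
echo 4 | sudo tee /sys/block/md0/md/group_thread_cnt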


The newest PCIe 5.0 hardware RAID controllers compare very favorably to the newest version of Graid’s SupremeRAID.

StorageReview has done a sort-of apples-to-apples comparison between “real” hardware RAID and Graid using GDSIO and FIO, and the hardware RAID is consistently ~2-10 times faster than the SupremeRAID.

PS:
I’m curious to see how Microchip’s SmartRAID 4300 compares to Graid’s solution; they both seem to have taken a similar approach, but Microchip is the incumbent in RAID technology, so I wonder if they decided to do things significantly differently in the software stack of the whole unattached-parity-accelerator-for-NVMe concept.

I could maybe get behind a disaggregated solution on an NVMe fabric. That could be pretty awesome, and low latency if it’s an ASIC or some purpose-built something.

That kind of sounds like the RAID offload technology Kioxia talked about years ago; I don’t know what happened to it, though:

I think this technology could be very good if someone comes up with a standard so different SSDs could be interoperable.


Is the whole point of this to offload the parity calculation to the GPU?
That means the GPU has to have more bandwidth than all the NVMe drives combined; otherwise it won’t keep up.
8x NVMe running at Gen 5 is a crazy amount of bandwidth – about the same as 64 lanes of Gen 4. You’d need 4 GPUs running at x16 Gen 4 each to keep up. Am I right?

That would only be true if the GPU were ingesting all the NVMe data and then calculating parity… which would be the most straightforward and safest method, but apparently Graid has found a way for the CPU to ingest all the data and only send the offloaded parity-calculation traffic to the GPU.

Even with this method of handling parity calculations, the write performance doesn’t compare that favorably to existing hardware RAID numbers, which do ingest all the data and keep the parity calculations within a single domain – an approach I’m convinced is more reliable after seeing what a single misbehaving device on the PCIe bus can do:

From Storage Review’s Graid testing:

versus Storage Review’s “real” hardware raid testing:

The two tests aren’t a perfect apples-to-apples comparison because the hardware RAID has more SSDs than the Graid array (although something tells me the Graid solution wouldn’t scale well to more SSDs even if it had them); the hardware RAID is still restricted to the same total number of PCIe lanes to the SSDs as the Graid array, 64 lanes in total.
It’s probably somewhat fair to halve the hardware RAID numbers when comparing them to the Graid numbers; even then, the hardware RAID performs much better at small block sizes and lower thread counts (and presumably lower queue depth too).

What options did you use to run these benchmarks? I’m running some experiments with similar NVMe drives for a storage cluster using 400Gb Ethernet, and I’d like to get a baseline of ideal disk performance before clustering my storage.

Would you be willing to share the commands to generate the data?