Graphs go here
todo
Background
Video goes here
System Configuration
2S config:
- Dual Intel Xeon 6980P (16- and 8-drive tests)
- 16x/8x Kioxia CM7
1S config:
- Intel Xeon 6781P
- 8x Kioxia CM7
TODO
Raw Device Benchmarks
16-drive configuration (two NUMA nodes, no other background jobs)
8-drive configuration (single NUMA node)
Peak Performance
Real-World
8-drive RAID 5 (single NUMA node)
BTRFS Filesystem (8-drive)
Okay, we’ll add BTRFS for extra integrity checking, right? Not so fast there – that adds significant overhead. Without extra tuning, this is what happens to the benchmarks:
And, the biggest of oofs:
With the BTRFS filesystem we’re at 40% of peak read performance and a tiny, tiny fraction of peak write. (This is with mkfs.btrfs “here’s my storage volume, please make a filesystem” defaults, fwiw. Tuning here is possible, and the write performance can be recovered, but even with that tuning the overhead was still significant – greater than a 50% reduction, with CPU overhead more than 2x what I’d expect in this kind of scenario.)
(P.S. The first-write problem here is actually interesting.)
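For reference, a minimal sketch of what “defaults” means here, plus the usual escape hatch for write speed. The device name is hypothetical, and nodatacow is my assumption about the kind of tuning involved, not necessarily what was tested:
# “Defaults” is literally just handing mkfs.btrfs the volume:
sudo mkfs.btrfs /dev/gdg0n1    # hypothetical device name for the array volume
# A common way to claw back write speed is to disable CoW and checksumming,
# which gives up the very integrity checking we added BTRFS for:
sudo mount -o nodatacow,nodatasum /dev/gdg0n1 /mnt/array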
XFS Filesystem
XFS has metadata journaling, and it has much less overhead than BTRFS.
Still better – but the filesystem needs tuning before it will spend enough CPU to reach max throughput.
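As a sketch of the kind of knob I mean (an assumption on my part, not necessarily the exact tuning used here): raising agcount gives XFS more allocation groups to work on in parallel, at the cost of extra CPU.
# More allocation groups = more CPU spent, more parallelism in allocation
sudo mkfs.xfs -f -d su=1m,sw=8,agcount=64 /dev/device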
ik_llama.cpp cross-socket workload
In this workload, we have previously found that models exceeding the memory local to a single CPU socket often become bottlenecked by the socket-to-socket links. This is an atypical workload; modern multi-socket systems were not really designed for the kind of traffic an “on-CPU” AI model generates.
If your workload does not depend on socket-to-socket bandwidth, then running this type of RAID workload, where half the disks hang off a different NUMA node, is unlikely to be affected.
This is something we may explore further in a future video, but…
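If you want to check your own topology in the meantime, a quick sketch (paths are the standard sysfs ones; the pinning line is illustrative):
# Which NUMA node does each NVMe controller hang off of?
for c in /sys/class/nvme/nvme*; do
  echo "$(basename $c): node $(cat $c/device/numa_node)"
done
# Pin a bandwidth-sensitive workload to the matching node to stay off the UPI links:
numactl --cpunodebind=0 --membind=0 <your benchmark or inference binary>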
How Busy is the GPU, really?
The GPU is apparently in a tight loop, so it reports 100% utilization all the time.
Part of this may be to make it harder to suss out that the GPU really isn’t doing all that much except 1) managing the vast swarm of DMA requests/traffic and 2) write parity calculations.
It seems clear that, at least for my setup, there is an upper write limit of ~80 GB/s. That’s nothing to sneeze at, to be sure, but I suspect on an A2000 or better it would be higher.
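One way to sanity-check this yourself: nvidia-smi’s headline “utilization” only means a kernel was resident during the sample window; dmon’s sm% column is closer to real occupancy.
# Per-second utilization breakdown; compare sm% against the headline GPU util
nvidia-smi dmon -s u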
Okay, what about Linux MD?
More than 2 cores needed per I/O thread to saturate Gen 5. Ouch!

Raw block device; no filesystem overhead:
Note: this one is a bit misleading because, with the amount of RAM in this system, it was really hard to eliminate the OS cache at smaller block sizes and thread counts.
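For what it’s worth, a sketch of the shape of fio job I mean; direct=1 is the main defense against the page cache (parameters illustrative, not the exact job used here):
sudo fio --name=rawread --filename=/dev/md0 --rw=read --bs=1m --iodepth=64 \
  --ioengine=io_uring --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting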
Very similar topline performance! But much higher CPU utilization.
What about RAID 5? That’s where things don’t look quite as good for Linux md:
… and with tuning I can get the Linux md write speed back up to par, but overall it still isn’t that great.
I put that in here mostly as a cautionary tale: actually test your stuff. We have a lot of threads here where people do an initial Linux md setup, don’t get the stripe right, or have some other underlying config issue, and end up with abysmal write performance.
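A quick sanity check before trusting (or blaming) an array:
# Confirm the geometry that actually got applied
cat /proc/mdstat
sudo mdadm --detail /dev/md1 | grep -Ei 'level|chunk|state'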
Pros of Linux MD:
- Built-in
- RAID 0/1/10 almost as good, with tuning
Cons of Linux MD:
- Much higher CPU utilization
- Slower for RAID 5/6
- Linux MD does not handle parity errors during scrub well
- RAID 6 parity scrub needs to be better
- RAID 5/6 write-hole journaling needs to be better (see the sketch after this list)
- General write speed “out of the box” is slow because of sub-optimal defaults
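On the write hole: md’s existing answer is a dedicated journal device, which closes the hole at a performance cost. A sketch, with hypothetical device names:
# Dedicate a spare device (or partition) as the RAID 5/6 write journal at creation time
sudo mdadm --create /dev/md2 --level=5 --raid-devices=8 \
  --write-journal /dev/nvme8n1 /dev/nvme[0-7]n1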
Note the graph here shows absurdly slow writes. This is easy to overcome with a little config tuning, BUT these are also the out-of-the-box defaults, with 1 MiB chunks:
sudo mdadm --create /dev/md1 --level=5 --raid-devices=8 --metadata=1.2 --chunk=1024 --name=nvme8_raid5 /dev/nvme2n1 /dev/nvme0n1 /dev/nvme7n1 /dev/nvme4n1 /dev/nvme3n1 /dev/nvme6n1 /dev/nvme8n1 /dev/nvme1n1 --assume-clean
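The tuning that recovers most of the write speed looks something like this (values are a starting point I’d try, not gospel):
# md’s RAID 5/6 stripe cache defaults to 256 entries, which strangles writes
echo 8192 | sudo tee /sys/block/md1/md/stripe_cache_size
# Let more kernel threads work on stripe handling
echo 8 | sudo tee /sys/block/md1/md/group_thread_cnt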
Helpful Hints for Filesystem Tuning Addendum
This is useful in any context – Linux md, graid… anything.
# ext4: stride = chunk size in filesystem blocks (1 MiB / 4 KiB = 256); stripe_width = stride x data disks (256 x 8 = 2048)
sudo mkfs.ext4 -F -E stride=256,stripe_width=2048 /dev/device
# XFS: su = stripe unit (the chunk size), sw = number of data-bearing disks
sudo mkfs.xfs -f -d su=1m,sw=8 /dev/device
# NVMe namespaces: use 'none' scheduler; md can use 'mq-deadline' (A/B test is fine)
for s in /sys/block/nvme*n1/queue/scheduler; do echo none | sudo tee "$s"; done
echo mq-deadline | sudo tee /sys/block/md0/queue/scheduler
# Make sure request queue isn't tiny
echo 2048 | sudo tee /sys/block/md0/queue/nr_requests
Maybe an audience member has some similar BTRFS hints?
Linux md Specific Tuning Hints
# Readahead: blockdev --setra is in 512-byte sectors; read_ahead_kb is in KiB
# (65536 sectors = 32768 KiB, so these two lines set the same value)
sudo blockdev --setra 65536 /dev/md0
echo 32768 | sudo tee /sys/block/md0/queue/read_ahead_kb
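One more, assuming your array was created with an internal write-intent bitmap (mdadm adds one by default on large arrays, and it costs write throughput):
# Trade crash-recovery speed for write throughput by dropping the internal bitmap
sudo mdadm --grow --bitmap=none /dev/md0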