Extremely poor write performance in software RAID 5

I am trying to set up a software RAID 5 with 4x fast PCIe 4.0 NVMe SSDs on a Supermicro M12SWA-TF AMD Threadripper platform with a 32-core 5965 CPU. I am using all of the M.2 2280 slots on the board. There is no other option, since the PCIe slots are used by 4x 3080 GPUs.

Read speeds are close to 20 GB/s, while write speeds are just 2 GB/s.
In RAID 0, I see up to 24 GB/s read and up to 12 GB/s write.

Could this be a parity generation problem?

What software are you using for the RAID 5? Depending on the workload (queue depth, sequential vs. random), those numbers could be perfectly reasonable.
CPUs are pretty bad at parity calculations compared to dedicated accelerators, aka hardware RAID; even good hardware RAID controller cards only manage ~12 GB/s of RAID 5 writes.
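For reference, something like this is what I mean by pinning down the workload: fio lets you compare big sequential writes against small random writes at a given queue depth (the directory and sizes below are just placeholders for wherever the array is mounted):

```bash
# Big sequential writes at a deep queue -- the friendly case for RAID 5
fio --name=seq-write --directory=/mnt/md0 --size=16G \
    --rw=write --bs=1M --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=60 --time_based --group_reporting

# Small random writes -- the unfriendly case, lots of partial-stripe updates
fio --name=rand-write --directory=/mnt/md0 --size=16G \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=60 --time_based --group_reporting
```

If the gap between those two is huge, the problem is the write pattern rather than raw throughput.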

I am using out-of-the-box Linux mdadm.

I reckon I will switch from RAID 5 to RAID 10.

mdadm is about as good an implementation of proper software RAID as you’ll get; the only things faster are solutions that play fast and loose (IMO) with data, like Intel VROC or GRAID.

Yeah, you should get much higher numbers on RAID 10.

Yes, you need to use 3 drives or 5 drives in a RAID 5/parity array. Otherwise it doesn’t align right and the parity generation cuts your speed way down. This is what I have written in my notes from making arrays:

Interleave (stripe) must be matched to the format allocation unit (block) size. So with 5 disks in parity, that means 4 are getting data written to them. 256 KB interleave × 4 = 1024 KB, so format the volume with a 1 MB allocation unit. Or 512 KB × 4 = 2048 KB, so format at 2 MB. These get very good write performance.
If you have 6 disks in parity, you will never have good write performance, since you cannot match data stripes to allocation units in a direct ratio. Always use 3 drives or 5 drives in a parity array, or 10 drives in dual parity. Anything else has terrible write speed.
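On Linux with mdadm, the rough equivalent is picking the chunk size at array creation and then telling the filesystem about the geometry. A sketch for a 4-drive RAID 5 like the OP’s (device names, chunk size and filesystem choice are just placeholders; 4 drives means 3 data disks per stripe):

```bash
# Create the array with an explicit 512 KiB chunk (device names are placeholders)
mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=512K \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

# XFS: su = chunk size, sw = number of data disks (4 - 1 = 3)
mkfs.xfs -d su=512k,sw=3 /dev/md0

# ext4 equivalent: stride = chunk / 4 KiB block = 128, stripe-width = stride * data disks = 384
mkfs.ext4 -E stride=128,stripe-width=384 /dev/md0
```

That doesn’t change the fact that 3 data disks per stripe isn’t a power of two, but at least the filesystem knows where the stripe boundaries are.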

RAID 0 or RAID 10 will still have better write performance, though, even against the best-case parity array.

Modern CPUs can do parity calc just fine.

More likely, writing to RAID 5 requires a read-modify-write of a stripe to keep the data consistent unless the write is a "full stripe" write. That’s the cause of parity RAID write slowdown, not the CPU.

If your stripe sizes and block sizes don’t line up, then small writes will need a huge amount of reading (and writing). I.e. poor tuning will contribute to performance issues, but parity RAID has some inherent performance quirks.

Reads don’t have that problem.

E.g. with a stripe size of 64k, a 4k write requires a 64k read and a 64k write to the backing store. It could be much worse than that, e.g. if your RAID controller is using a 512b sector size on 4k-sector drives on top of that, or the SSD is doing wear levelling in the background too.

ZFS and some proprietary RAID types (e.g. NetApp RAID-DP) circumvent that somewhat by only doing full-stripe writes. Maybe look up RAID write amplification for more info.


I would expect much lower numbers if a lot of read-modify-write was hitting the disks.
Looking back at one of the old threads, Fixing Slow NVMe Raid Performance on Epyc, 2 GB/s was basically what mdadm maxed out at for write speed with anywhere from 8 to 24 NVMe drives. That was definitely a CPU bottleneck, although it wasn’t clear whether the bottleneck was the CPU parity calc itself or something architecturally specific to AMD. My understanding is that the way mdadm implements stripe thread locking is at least partly to blame if the computation of parity data is the bottleneck.
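If you want to poke at that on a live array, the md driver exposes a couple of knobs with pretty conservative defaults. This is only a sketch (md0 is a placeholder name, and the values are just starting points to experiment with):

```bash
# Current settings
cat /sys/block/md0/md/stripe_cache_size   # default 256; memory cost is roughly this x 4K x number of drives
cat /sys/block/md0/md/group_thread_cnt    # default 0 = stripe handling stays on the single raid5d thread

# A bigger stripe cache and a few worker threads often help RAID 5/6 write throughput
echo 8192 > /sys/block/md0/md/stripe_cache_size
echo 4 > /sys/block/md0/md/group_thread_cnt
```

That at least tells you whether the single-threaded stripe handling is the wall you’re hitting.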

I am now very curious what kind of peak RAID 5 write numbers mdadm can muster on the most modern CPUs in an optimal setup.

I still struggle to believe that, as even laptop CPUs can now do many GB/s of full-disk encryption (which is much more complex than a basic parity calc. Edit: or is it?).
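For a rough point of comparison on your own machine, `cryptsetup benchmark` does an in-memory test of the ciphers dm-crypt uses, with no disks involved:

```bash
# CPU crypto throughput, independent of storage
cryptsetup benchmark
# The aes-xts lines are the ones that matter for typical full-disk encryption setups
```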

The numbers may be better than you’d expect from read-modify-write thanks to drive caching, but doing a partial-stripe write in RAID 5 is fundamentally going to require reading back from the drives in order to re-calculate the parity for the stripe (unless the entire stripe is still in cache, perhaps?).

There’s no getting away from it with parity-based RAID, unless you do copy-on-write (in which case you don’t care about the old parity; you just write the data in a new stripe). That could have memory-bandwidth implications… but the CPU itself should surely be able to handle basic parity calculation.

Enterprise RAID boxes are just x86 CPUs in a box with a proprietary OS on them… there’s no real magic to enterprise SAN hardware these days.

Don’t most CPUs have crypto accelerators on them nowadays to help with encryption? The same way we got video codec accelerators on our GPUs.

100% agree with you that lots of partial-stripe writing will tank performance, but as long as the benchmark is one big sequential write, I’d imagine the disk’s buffer itself will smooth a lot of that out; the OS I/O scheduler likely helps out there too.

Yeah, they do; but parity calc has been using MMX instructions for a very long time… one would think that if it were a CPU bottleneck problem, Intel would be all over it with some new instruction-set extension to win some benchmarks with.
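You can actually see what the kernel thinks your CPU is capable of: the md/raid code benchmarks its xor and raid6 routines at boot (or module load), picks the fastest, and logs the results:

```bash
# Kernel-measured parity throughput for this machine
dmesg | grep -iE 'raid6:|xor:'
# Look for the "raid6: ... gen() ... MB/s" lines and the "xor: ..." summary
```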


Hahaha. I’m still trying to figure out exactly what the benefits of Intel DSA are *in the real world*.


I did find some info on parity calculation speed here:

Find in page for “How fast are the hash and parity computation?”.

If I’m reading it right, an i5 from some years ago can do up to 50 GB/s of parity calc using AVX2?

That was with “8 data buffers” though, so maybe the limit for a single thread is significantly less.

That’s how I read the chart too, but it doesn’t seem to match up with the performance I’ve seen in the wild.
Not exactly sure what the disconnect between the benchmark and writing to real drives is; even mdadm can take advantage of multiple threads, so I don’t think it’s threading.

Also… SSD RAID write performance could simply be bumping against the underlying bulk storage speed of the SSD.

I’ve seen some really shitty performance from consumer SSDs once you overflow their high-speed DRAM/NAND write buffer… basically a lot of SSDs I’ve got kicking around fall off a cliff once you run past that buffer, for writes specifically. Like under 150–200 MB/s per drive. I’ve seen some hard drives outperform that for sequential write.

But that wouldn’t explain the OP’s RAID 0 write performance being decent.
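If anyone wants to rule that out for their own drives, a long sequential write against a single drive will show where the cache runs out. A sketch (the device name and size are placeholders, and pointing fio at a raw device destroys whatever is on it, so use a scratch drive):

```bash
# Sustained sequential write against one drive, logging bandwidth over time
# WARNING: writing to the raw device wipes its contents
fio --name=cache-test --filename=/dev/nvme0n1 --rw=write --bs=1M \
    --ioengine=libaio --iodepth=32 --direct=1 --size=500G \
    --write_bw_log=cache-test --log_avg_msec=1000
```

If the bandwidth log shows a cliff partway through, that’s the drive’s write cache running out rather than anything RAID-related.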


True… unless it’s maybe causing the on-drive cache to get thrashed by reads as well as writes (and thus hitting an internal throughput limit on the cache).

But yes. Doubtful?

Usually SSDs are pretty good about handling a bunch of reads and writes… at least miles better than spinning disks, which can get into real trouble in that situation.

On an i5-7600K with parity arrays, I get 30 MB/s if I don’t have the array set up with the stripe and block size matched properly, and 280 MB/s when I do. The CPU may or may not be able to handle the speed just fine, IDK. I always assumed it was because you were tasking the drives and CPU with a bunch of reads and writes at the same time, and it just bogs down the storage with it all.