Strange RAID10 storage performance

Hi all,

I’m hoping someone with a bit more server experience can help me work out what’s going on here. I don’t get my hands on much proper kit so I’ve not much to compare to.

We’ve recently purchased a Supermicro server with a 32-core EPYC 7452 on an H11SSL-C motherboard, 256GB DDR4-3200, and a Broadcom MegaRAID 9580-8i8e with 8GB cache and a backup battery.

I have 9x Samsung 1.92TB PM1643 SAS SSDs, 8 in a RAID10 and the ninth as a dedicated hot spare. 2x Samsung 240GB PM883s are running from the onboard MegaRAID 3008 in a RAID1 for the OS.

Now, the RAID card is PCIe 4.0 and the motherboard isn’t. I’m aware of that and was told that SM wouldn’t have a PCIe 4.0 board out for a month or so.

I seem to be getting rather low random 4K performance. I’ve been testing with CrystalDiskMark 7.0.0, both on bare-metal Server 2019 DC Eval and in a VM on Hyper-V Server 2019, with roughly the same results.

I have the latest MR drivers from the Broadcom website installed and LSI Storage Authority to manage the card.

I’ve re-configured the RAID10 a few times: read ahead on/off, write through vs. write back, disk cache on/off, etc.

I’ve also tried most of the stripe sizes: 64K, 256K, 512K, 1MB, etc.

Here’s a screenshot of CDM:

[CrystalDiskMark screenshot]

Here’s the same run on my WD Black NVMe at home, just the one drive:

[CrystalDiskMark screenshot of the WD Black NVMe]

I’ve tested with ATTO too and the performance is the same.

I’m really hoping I’ve missed something massive that is causing the speed issues because this is some expensive kit being beaten by a sub £100 consumer drive.

Any input much appreciated.

Thanks,

If you’re doing one write at a time (random 4K Q1T1), waiting for it to finish before starting the next one, it doesn’t matter how many drives you have. At that point it’s all about the latency of a single small write.

The only things that might help are battery-backed caches, a ZIL on Optane, or some other “commit to RAM”/“write-back caching” approach.
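To put rough numbers on it (purely illustrative latencies, not measurements from your box): at QD1 only one I/O is ever in flight, so IOPS is just 1 / latency and the 4K throughput follows directly from that.

```python
# QD1 rule of thumb: with only one I/O in flight, IOPS = 1 / latency,
# so 4K throughput is set entirely by per-write latency.
# Latencies below are illustrative, not measurements from this server.
BLOCK = 4 * 1024  # bytes

for latency_us in (20, 100, 500):
    iops = 1_000_000 / latency_us
    mb_s = iops * BLOCK / 1_000_000
    print(f"{latency_us:>3} us/write -> {iops:>7.0f} IOPS -> {mb_s:6.1f} MB/s at 4K")
```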


It is a battery-backed write cache. So what you’re saying is that the latency is what matters, not the speed shown?

It looks like the Broadcom controller’s battery-backed cache is not doing its job.

If it were, your random write throughput for small writes would be much higher than your random read throughput for small reads. Writes shouldn’t have to hit any of the disks before being reported back to the OS as complete; not waiting on them is what the cache is for.
Considering you have SSDs, the cache shouldn’t be caching reads or issuing prefetches; reads already complete at low enough latency, and the OS/apps can cache them far more cheaply. The controller should only make sure the drive write queues stay fed, that there’s enough room in the command queue for reads (a balancing act that depends on the drives), and that data has actually reached the disks before it gets evicted from the controller’s 8GB of RAM.

Try some software that measures latency for single-threaded writes. I’d expect writes under 20 microseconds.
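If you want something quick and dirty, here’s a minimal Python sketch that times single-threaded 4 KiB write+flush pairs. The path is made up, so point it at the RAID10 volume. Note that os.fsync maps to FlushFileBuffers on Windows, so each sample includes a cache flush, which may come out slower than what CDM reports if the controller honours flushes.

```python
import os, statistics, time

# Quick single-thread 4 KiB write-latency probe -- a sketch, not a proper benchmark.
# PATH is hypothetical; point it at the RAID10 volume.
PATH = r"D:\latency_test.bin"
BLOCK = os.urandom(4096)
SAMPLES = 2000

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | getattr(os, "O_BINARY", 0))
lat_us = []
for _ in range(SAMPLES):
    t0 = time.perf_counter()
    os.write(fd, BLOCK)
    os.fsync(fd)  # write + flush per sample; maps to FlushFileBuffers on Windows
    lat_us.append((time.perf_counter() - t0) * 1_000_000)
os.close(fd)

lat_us.sort()
print(f"median {statistics.median(lat_us):.0f} us, "
      f"p99 {lat_us[int(len(lat_us) * 0.99)]:.0f} us")
```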

Your consumer drive is likely serializing those 4K random writes into something like a write-ahead log, writing them out to flash as a single batch regardless of block size, and eventually re-sorting them. Most databases do the same, and so does the ZFS ZIL. Your controller should be doing something similar with its 8GB of RAM, but it seems to suck.
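A toy sketch of the idea, nothing to do with how the 9580 firmware or the drive actually implements it: scattered writes are acknowledged as soon as they land in cache RAM, then destaged to the disks later in sorted batches.

```python
# Toy write-back / log-coalescing sketch -- illustrates the idea above,
# not how the 9580 firmware or the WD Black's FTL actually works.
class CoalescingCache:
    def __init__(self, backing, batch=256):
        self.backing = backing   # dict of LBA -> block, stands in for the disks
        self.log = {}            # "battery-backed RAM": newest data per LBA
        self.batch = batch

    def write(self, lba, data):
        self.log[lba] = data     # acknowledged here, before any disk I/O
        if len(self.log) >= self.batch:
            self.flush()

    def flush(self):
        # Destage in LBA order so the disks see large, mostly sequential writes.
        for lba in sorted(self.log):
            self.backing[lba] = self.log[lba]
        self.log.clear()

disks = {}
cache = CoalescingCache(disks)
for lba in (901, 7, 455, 12):    # scattered 4K writes, ack'd from RAM
    cache.write(lba, b"x" * 4096)
cache.flush()
print(sorted(disks))             # [7, 12, 455, 901]
```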

It says it’s Optimal in LSA, so I don’t understand. I’m seeing around 75 microseconds of latency for both reads and writes in CrystalDiskMark for Random 4K Q=1 T=1.

For Q=32 T=16 it’s up at 3,700 microseconds.
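If I plug those numbers into IOPS ≈ outstanding I/Os ÷ latency (assuming the CDM latency column is an average per I/O), I get roughly:

```python
# Back-of-envelope from the CDM figures above: IOPS ~= outstanding I/Os / latency.
def iops(outstanding, latency_us):
    return outstanding / (latency_us / 1_000_000)

q1  = iops(1, 75)          # Q=1  T=1:  75 us  -> ~13,300 IOPS (~55 MB/s at 4K)
q32 = iops(32 * 16, 3700)  # Q=32 T=16: 3.7 ms -> ~138,000 IOPS (~567 MB/s at 4K)
print(f"Q1T1:   {q1:,.0f} IOPS, {q1 * 4096 / 1e6:.0f} MB/s")
print(f"Q32T16: {q32:,.0f} IOPS, {q32 * 4096 / 1e6:.0f} MB/s")
```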

Yep, that result is farcical.

The virtual disk settings in MegaRAID for the 9580 should be: write back or always write back (if it’s currently write through, try always write back first), read ahead disabled, 64K stripe, direct I/O.

If that’s not how it’s currently configured, maybe reconfigure and try again.

I’ve done some more playing around and created a single-drive RAID0, and I’ve used ATTO to get some better graphs. Check this out: default settings but a QD of 64, and note the file size is 256MB:

Then I changed the file size to 16GB, which is double the cache size:

So the cache is doing something…

I’ve spoken to the vendor and they mentioned the second connector of the SFF cable might need connecting to the backplane, as we may be IOPS-limited with just the one.
