Broadcom MegaRAID 9271-8i very slow writes in RAID 5 with SSDs

I’ve got a MegaRAID 9271-8i 8-port SAS RAID controller, which I’ve always run with classic HDDs. Now I’ve upgraded those to:

  • 4x Intel D3-4510 1.92 TB in RAID 5
  • 4x Intel D3-4510 3.84 TB in RAID 5

Since I don’t have a battery backup unit, I’m currently running Write-Through caching instead of Write-Back, so I’m expecting to see just the raw write speeds the SSDs can handle.

However, I’m seeing only 80-90 MB/s sequential writes (reads are around 1500 MB/s).

If I set them up as RAID 0 with Write-Through, they write at 1500 MB/s.

Does this mean the controller can only generate RAID 5 parity at 80-90 MB/s, or is something else wrong?
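For reference, what I mean by generating parity: as far as I understand it, the parity chunk of a RAID 5 stripe is just the byte-wise XOR of the data chunks in that stripe. A toy Python sketch of the idea (my own illustration, nothing to do with how the controller firmware actually computes it):

```python
# Toy RAID 5 parity illustration: with 4 drives, each stripe holds
# 3 data chunks plus 1 parity chunk, where the parity chunk is the
# byte-wise XOR of the data chunks. Chunk size is arbitrary here.
CHUNK = 64 * 1024  # hypothetical 64 KiB stripe element

def parity(chunks: list[bytes]) -> bytes:
    """XOR all chunks together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

data = [bytes([d]) * CHUNK for d in (1, 2, 4)]  # three dummy data chunks
p = parity(data)

# Any one lost chunk can be rebuilt by XOR-ing the survivors.
assert parity([data[0], data[1], p]) == data[2]
```

So the per-stripe math itself is trivial; the question is why the card only manages 80-90 MB/s of it in write-through.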

Other settings:

  • No Read Ahead
  • Direct IO enabled
  • Disk Caching: toggling Enabled/Disabled doesn’t make a difference

With Write-Back enabled, the number is 6000 MB/s, but that’s just the controller cache filling up.

So: hardware limitation of the controller, or bad setup? I feel like enterprise-grade gear from back in the day should be able to do better than 90 MB/s; even conventional drives were well past those speeds back then.

This doesn’t really answer your question, but why on earth are you using hardware RAID? You will get so much better performance with software RAID.

Not too sure at this point; I guess I figured it would be less of a pain. I generally like the idea that I can just put the card in another PC if I need to. Plus, once I do have the backup battery, performance with the cache on that controller through the PCIe 3.0 x8 bus is a lot higher than what software RAID can do, no?

I have yet to orient myself on what software RAID solutions are out there. I’d just kinda hoped this would be a “drop the drives in, make a VD, and let’s go” affair.

> I generally like the idea that I can just put the card in another PC if I need to. Plus, once I do have the backup battery, performance with the cache on that controller through the PCIe 3.0 x8 bus is a lot higher than what software RAID can do, no?

Basically incorrect on all counts here. Your RAID controller will only work in systems it physically fits in; if you change device form factors or architectures, you’re at the mercy of your controller’s support.

Buying used enterprise equipment, which is common for homelabbers, means you might not get more support going forward.

By contrast, software RAID will keep working if you move systems, including moving from x86 to ARM, for instance. Portability is better with software RAID, as is the expected support lifetime.

Fortunately, Wendell has you covered in all the gritty details.


Something isn’t right with that setup; you should be getting a couple of gigabytes per second of writes with that card in RAID 5.
If you enable the write-back cache and write significantly more than the controller’s 1 GB of RAM, what speeds do you get?
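If you want a quick-and-dirty way to check that outside your usual benchmark tool, something along these lines would do. This is just a rough Python sketch of the idea; the path and sizes are placeholders, and the OS page cache will still buffer a bit, hence the fsync at the end:

```python
import os
import time

# Rough sequential-write timing sketch: write far more than the
# controller's 1 GB cache so the steady-state speed shows through.
PATH = "testfile.bin"            # placeholder: put this on the RAID volume
BLOCK = 1024 * 1024              # 1 MiB per write
TOTAL = 16 * 1024 * 1024 * 1024  # 16 GiB, well past the 1 GB cache

buf = os.urandom(BLOCK)
start = time.time()
with open(PATH, "wb") as f:
    written = 0
    while written < TOTAL:
        f.write(buf)
        written += BLOCK
    f.flush()
    os.fsync(f.fileno())         # make sure it actually reached the array
elapsed = time.time() - start
print(f"{TOTAL / elapsed / 1e6:.0f} MB/s sequential write")
```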


I’m going to disagree with the others posting; hardware RAID is more performant than software RAID in basically all scenarios, and there are benchmarks to prove it.

Regarding software RAID’s supposedly superior portability, I’d only believe that to be true for mdadm; every other software RAID scheme I’ve dealt with either breaks pool compatibility at some point across version rolls, or the company straight up goes out of business and everything breaks (or the open-source project gets deprecated).
Also, in my experience software RAID has a tendency to corrupt the array when it fails, whereas with hardware RAID this is far less likely. There is a reason the newest Sapphire Rapids/Genoa servers from Dell/HP/Lenovo still come with hardware RAID controllers.

I’m not sure I understand the hate for hardware RAID either. Software RAID has its charms, I’m sure, but “PCIe might become EOL at some point” doesn’t convince me.

Just did what @twin_savage suggested: with a 64 GiB test file I get 1500 MB/s read and 1500 MB/s write. A 128 MiB file gives 6341 and 6936 MB/s; there you see the massive advantage of the hardware controller’s cache.

So it’s more than capable. Still, why the enormous hit with write-through caching?

No you aren’t. Linux mdadm will read and use LSI MegaRAID signatures just fine, assembling the drives into a software RAID volume if they aren’t already on a hardware RAID controller.


I saw the same thing doing that on an old LSI-based PERC H700 many years ago. Enable write-back, enable read-ahead, and run a test long enough that it doesn’t all fit in the controller cache, and you should see much more reasonable numbers. Even better is to do RAID 10, not RAID 5.

I’d be willing to bet that the RAID controller is leveraging its volatile cache to accelerate parity calculations when write-back is enabled. The PowerPC CPU on the RAID controller likely only has several megabytes of SRAM in it, which limits parity throughput. I’m not a real embedded engineer, but I don’t think all hardware RAID controllers work like this, or at least they don’t anymore.

An interesting experiment would be to increase your RAID stripe size and see whether that improves performance in write-through mode.
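The usual back-of-the-envelope reasoning here (my simplification, not anything out of Broadcom’s documentation) is that a write smaller than a full stripe forces a read-modify-write cycle, while a full-stripe write can compute parity straight from the incoming data:

```python
# Back-of-the-envelope RAID 5 write cost in write-through mode.
# Assumptions (mine): 4 drives, so a full stripe is 3 data chunks plus
# 1 parity chunk, and partial-stripe writes take the classic
# read-modify-write path (real controllers may also use
# reconstruct-writes, which this ignores).
DATA_CHUNKS_PER_STRIPE = 3

def ios_per_write(chunks_written: int) -> int:
    """Disk I/Os needed to commit a write touching `chunks_written` chunks."""
    if chunks_written >= DATA_CHUNKS_PER_STRIPE:
        # Full-stripe write: parity comes straight from the new data,
        # so it's just the data writes plus one parity write.
        return DATA_CHUNKS_PER_STRIPE + 1
    # Read old data + old parity, then write new data + new parity.
    return 2 * chunks_written + 2

for n in (1, 2, 3):
    print(f"{n} chunk(s) written -> {ios_per_write(n)} disk I/Os")
```

If the controller handles each stripe’s parity update synchronously before acknowledging the write in write-through mode, every extra stripe boundary costs you, so stripe size could matter a lot here.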

Thanks for the info; guess I’m getting myself a new BBU/supercapacitor. I’m still puzzled why it tanks so much on write-through, though; there’s no physical reason it should, right?

RAID 10 is too big a capacity hit for my taste. RAID 5 is somewhat riskier, but I’ll take that. It’s non-essential data (my entire Steam library) and I back up regularly.

@twin_savage, great idea. Will test that right now and then I’m off to bed.

This is super useful information, I always forget about this.

Very interesting. Here are some quick stripe-size results, with settings identical to the OP and Write-Through in all cases:

Stripe size (read MB/s / write MB/s):

  • 8 KiB: 1266 / 94
  • 128 KiB: 1675 / 1402
  • 1 MiB: 1738 / 1139

Theories? I could test all stripe sizes tomorrow morning, for science.

Your card exposes the setting to force write-back, doesn’t it? You should be able to test it out before buying the batt/cap.

I was puzzled by that, too. The only insight I can offer is that old RAID controllers were designed for old, slow HDDs, so they made a lot of assumptions and did some optimizations to try to speed up HDDs. Those optimizations apparently weren’t very well suited to SSDs.

I did test forced write-back, yes. HD Tune Pro with a 10 GB file nicely shows 6 GB/s for the first 500 MB or so, then drops to 1000-something MB/s and stays there for the rest of the test. Regardless of stripe size, interestingly enough.

Only in write-through do we apparently see a massive drop in throughput with any stripe size below 128 KiB.

Notably, by the way, the controller does recognize the disks as SSDs and automatically labels them as such.

Thanks for all the insights so far 🙂

Also, with other (bigger) stripe sizes I still get low sequential write speeds at Q1T1, even though Q8T1 looks normal.

Forced Write-back means everything looks normal.

Could it have anything to do with having done only a fast initialization rather than a full one?

Edit: Did a full init just to be sure; no difference. Without WB enabled, there is no scenario where I get more than 150 MB/s sequential write at Q1T1.

I swear, the older I get, the more high-level/generic the things I say become; the corporate world is taking its toll on my mind.

When using a write-back cache, incoming writes are held in the cache first; the controller can batch up smaller writes (smaller stripes) into one big write and in effect spend less of its total discrete write “budget”.
Think SIMD instructions: with lots of small chunks that can’t be batched up, you’re starving the vector execution units.

This makes some sense at the low queue depths, since you can’t bunch writes together when few of them are in flight; with write-back enabled, however, they can be saved up in the cache and written out quickly in a “batch”.
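As a toy picture of that coalescing idea (purely illustrative, nothing to do with the actual firmware logic):

```python
# Toy write-back coalescing: small incoming writes accumulate in a cache
# and are flushed as one large write once a full "stripe" has built up.
STRIPE = 256 * 1024          # pretend full-stripe size
cache = bytearray()
flushes = 0

def submit(data: bytes) -> None:
    """Accept a small write into the cache; flush whole stripes at once."""
    global cache, flushes
    cache += data
    while len(cache) >= STRIPE:
        flushes += 1         # one big write to the array
        cache = cache[STRIPE:]

for _ in range(1024):
    submit(b"x" * 4096)      # 1024 small 4 KiB writes coming in

print(f"{flushes} full-stripe flushes instead of 1024 tiny writes")
```

In write-through mode there is no such accumulation step, so every small write has to be pushed through the parity path on its own.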

Running a full initialization (overwriting all data) is inadvisable on SSDs due to wear leveling and other SSD controller magic.

With hard disks, however, this logic is correct and could impact performance.

I’m learning plenty here, thanks for that! I’ve set it up with a 128 KiB stripe size for now, and am doing some runs with ATTO to see the effects of all the settings (Direct/Cached IO, Read Ahead on/off, Disk Caching, WT/WB).

Anyone have any idea why a Direct IO + WT setup has better write performance when Read Ahead is enabled? Significantly better, especially at all IO sizes below 2 MB, all the way down to 1 KB.

This is perfectly valid, but why buy a Ferrari to keep it in the garage?

I thrash my nine-drive SSD RAID 6 array daily and have had zero issues going on three years, with multiple rebuilds and initializations. If something does go wonky, I’d replace the SSD with a new one and continue, no different from regular spinning rust.

The only time I’ve had an unrecoverable RAID failure was with an 8 TB AMD software RAID 1 set. Lost a drive, and it was unable to rebuild onto an identical spare drive. Luckily, I had access to the good drive and copied everything off. I’ve had a dedicated RAID card since then with no problems, and it has survived three recoveries from dead/problematic drives.

YMMV, of course.

The redundancy in a RAID 5 setup comes from the parity information on the drives, and depending on the specific implementation in the controller’s hardware and firmware, that parity information might need to be written in full before it is considered valid. There are implementations, and with them an entire school of thought, arguing that on an uninitialized array the data blocks that have been written already have valid parity for themselves, while the rest of the array doesn’t need parity data yet. How a specific controller and its initialization process behave, though, depends on the implementation, i.e. on the manufacturer or the specific controller model. Some implementations need to be initialized, some don’t; some see performance benefits from a full initialization, some don’t.
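To make that concrete with a toy example (my own illustration, not tied to any particular controller): RAID 5 can only rebuild a lost chunk correctly if the parity on disk actually matches the data, which is exactly the guarantee initialization provides:

```python
# Toy illustration: RAID 5 rebuild only works if parity matches the data.
# 4-drive array: data chunks d0, d1, d2 plus parity p = d0 ^ d1 ^ d2.
def xor(*chunks: bytes) -> bytes:
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

d0, d1, d2 = b"\x01" * 8, b"\x02" * 8, b"\x04" * 8
good_parity = xor(d0, d1, d2)   # parity kept consistent with the data
stale_parity = b"\x00" * 8      # uninitialized / never-updated parity

# The drive holding d2 dies; rebuild it from the survivors plus parity.
print(xor(d0, d1, good_parity) == d2)   # True:  consistent parity rebuilds d2
print(xor(d0, d1, stale_parity) == d2)  # False: stale parity rebuilds garbage
```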

The advice against full initialization comes from the fact that SSDs wear as they are written to, but as explained in the previous paragraph, depending on the RAID implementation the attached controller uses, a full initialization might still be required.