I’m not convinced that stripe size is an issue.
See here for a discussion of stripe size:
A misunderstanding of this overhead has caused some people to recommend using “(2^n)+p” disks, where p is the number of parity “disks” (i.e. 2 for RAIDZ-2), and n is an integer. These people would claim that, for example, a 9-wide (2^3+1) RAIDZ1 is better than 8-wide or 10-wide. This is not generally true. The primary flaw with this recommendation is that it assumes that you are using small blocks whose size is a power of 2. While some workloads (e.g. databases) do use 4KB or 8KB logical block sizes (i.e. recordsize=4K or 8K), these workloads benefit greatly from compression. At Delphix, we store Oracle, MS SQL Server, and PostgreSQL databases with LZ4 compression and typically see a 2-3x compression ratio. This compression is more beneficial than any RAID-Z sizing. Due to compression, the physical (allocated) block sizes are not powers of two, they are odd sizes like 3.5KB or 6KB. This means that we cannot rely on any exact fit of (compressed) block size to the RAID-Z group width.
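To put some numbers on the space-efficiency side, here is a back-of-the-envelope sketch of the RAID-Z allocation rules as I understand them from that post (my own simplified model, not the actual ZFS allocator; it ignores gang blocks and other edge cases, and the pad-to-a-multiple-of-parity+1 rule is my reading of the same article):

```python
import math

def raidz_alloc_sectors(data_sectors, width, parity):
    """Sectors allocated for a block of `data_sectors` data sectors on a
    RAID-Z vdev of `width` disks with `parity` parity disks (simplified)."""
    data_disks = width - parity
    # Each block carries its own parity: `parity` sectors per row of data.
    parity_sectors = parity * math.ceil(data_sectors / data_disks)
    total = data_sectors + parity_sectors
    # Allocations are padded to a multiple of (parity + 1) sectors so that
    # freed space never leaves an unusably small hole.
    return math.ceil(total / (parity + 1)) * (parity + 1)

ashift = 12  # 4 KiB sectors
for block_kib in (128, 8, 3.5):        # 3.5 KiB ~ a compressed 8K DB block
    data_sectors = math.ceil(block_kib * 1024 / 2 ** ashift)
    for width in (8, 9, 10):           # the "is 9-wide special?" comparison
        alloc = raidz_alloc_sectors(data_sectors, width, parity=1)
        print(f"{block_kib:>5} KiB block, {width:>2}-wide RAIDZ1: "
              f"{alloc} sectors allocated for {data_sectors} data sectors")
```

For full 128 KiB blocks the widths do differ slightly (38 sectors on 8-wide versus 36 on 9- or 10-wide), but once compression shrinks blocks to a handful of sectors the width stops mattering at all, which is exactly the blog post’s point.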
Granted, this blog post is more concerned with space efficiency, while @sgtawesomesauce was talking about performance.
As for the performance claims, the most thorough discussion I can find on this is in a series of HardForum threads. Ultimately, the explanation given for the performance degradation suggests that properly setting ashift should avoid the issue:
So what happens? For sequential I/O the request sizes will be 128KiB (maximum) and thus 128KiB will be written to the vdev. The 128KiB then gets spread over all the disks. 128 / 3 for a 4-disk RAID-Z would produce an odd value; 42.5/43.0KiB. Both are misaligned at the end offset on 4K sector disks, requiring them to do a read whole sector + calc new ECC + write whole sector. Thus this behavior is devastating to performance on 4K sector drives with 512-byte emulation; each single write request issued to the vdev will cause it to perform 512-byte sector emulation.
Performance numbers? Well, I talked to a lot of people via email/PM about these issues. I don’t have any 4K sector drives myself yet. I did perform geom_nop testing with someone, which did increase sequential write performance considerably. geom_nop allows the HDD to be transformed into a 4096-byte sector HDD, as if the drive wasn’t lying about its true sector size. Now ZFS can detect this and it’s even theoretically impossible to write anything other than multiples of the 4096-byte sector size. So the EARS HDDs never have to do any emulation; only ZFS has to do some extra work, but it is much more efficient. The problem is that this geom_nop is gone after reboot; it’s not persistent. It’s basically a debugging GEOM module. But I hope to find a way of attaching these modules at every boot on 4K disks with my Mesa project. That might solve many or all of these issues.
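The arithmetic in that quote is easy to reproduce. A quick sketch (illustrative only; the even split of sectors across data disks is my simplification, not how ZFS actually lays out RAID-Z columns):

```python
RECORD = 128 * 1024   # 128 KiB record
DATA_DISKS = 3        # 4-disk RAIDZ1 -> 3 data disks
PHYS_SECTOR = 4096    # native sector size of a 512e ("Advanced Format") drive

for ashift in (9, 12):
    logical_sector = 2 ** ashift
    sectors = RECORD // logical_sector
    # Spread the logical sectors as evenly as possible across the data disks.
    per_disk = [sectors // DATA_DISKS + (1 if i < sectors % DATA_DISKS else 0)
                for i in range(DATA_DISKS)]
    chunks = [n * logical_sector for n in per_disk]
    aligned = all(chunk % PHYS_SECTOR == 0 for chunk in chunks)
    print(f"ashift={ashift}: per-disk chunks "
          f"{[chunk / 1024 for chunk in chunks]} KiB, 4K-aligned: {aligned}")
```

With ashift=9 the per-disk chunks come out as 43.0/42.5/42.5 KiB, ending on odd 512-byte boundaries, so a 512e drive has to read-modify-write its 4 KiB physical sectors. With ashift=12 every chunk is a whole number of 4 KiB sectors. That supports the reading that a correct ashift, rather than some magic vdev width, is what avoids the penalty.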
So perhaps this recommendation comes from a time when the ashift setting wasn’t implemented yet, or had problems on FreeBSD?
At work I have a server with a 15-bay JBOD of 10TB drives in RAIDZ3 and some people in another department use 12-bay JBODs of 10TB disks in RAIDZ2 as their capacity storage building blocks. The performance we get out of it has always been sufficient, so choosing to build in 4+2 or 8+2 vdev increments instead would have a very real impact on capacity per dollar in exchange for alleged performance enhancements we don’t need. Matching the vdev size to the JBOD size and adding JBODs as required is a very natural way of expanding if you can afford to buy in such large chunks.
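To put a rough number on that capacity-per-dollar point (back-of-the-envelope only, counting raw data disks and ignoring ZFS metadata, slop space, and padding overhead; the smaller-vdev layouts are just examples I picked):

```python
# Rough usable-capacity comparison for one 15-bay JBOD of 10 TB drives.
DRIVE_TB = 10

# (layout, drives purchased, data disks)
layouts = [
    ("15-wide RAIDZ3",                 15, 12),
    ("2 x (4+2) RAIDZ2, 3 bays idle",  12,  8),
    ("1 x (8+2) RAIDZ2, 5 bays idle",  10,  8),
]

for name, drives, data in layouts:
    usable = data * DRIVE_TB
    print(f"{name}: ~{usable} TB usable from {drives} drives "
          f"(~{usable / drives:.1f} TB per purchased drive)")
```

Filling the 15 bays with one RAIDZ3 vdev gives roughly 120 TB usable, versus about 80 TB from the same chassis built out of 4+2 or 8+2 increments, which is the kind of gap we’re not willing to pay for performance we don’t need.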