ZFS storage server advice

Hello!

I’m in the process of planning a Linux-based ZFS storage server. The server consists of a 4U chassis with 20x 3.5" front-facing HDD trays. Currently I have 10x 4TB drives, but I plan to grow to the full 20 drives eventually. My question is: what’s the recommended way to split the drives into vdevs? There is a lot of conflicting information on what the best practice is.

So far I have figured out that multiple vdevs mean more IOPS. This server will only be used for media files and generally large files. Performance doesn’t have to be top notch; redundancy and storage capacity are far more important.

I have come up with some potential configurations:

Configuration #1:
2 vdevs with 10 disks each (raidz2)
55TB usable, with decent redundancy.

However, vdevs this wide would drastically increase scrub time. It does provide a good balance between redundancy and storage capacity, but is having vdevs this large recommended?

Configuration #2:
4 vdevs with 5 disks each (raidz1)
57.7TB usable, with moderate redundancy.

This configuration has the most usable storage capacity, but the weakest redundancy. If a disk fails in a vdev, the rebuild might run into unrecoverable read errors (UREs). Some articles also argue strongly against using raidz1 for drives larger than 1TB. (Source: http://nex7.blogspot.no/2013/03/readme1st.html)

Configuration #3:
4 vdevs with 5 disks each (raidz2)
42.7TB usable, with decent redundancy.

More redundancy than config #2, at the cost of storage capacity.
Greater potential IOPS and easier expansion possibilities.
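
For reference, here is a quick Python sketch of where the raw numbers behind those figures come from (raw data capacity only; the usable figures above are lower because of TB-to-TiB conversion, ZFS slop space, and raidz allocation padding, which varies with the layout):

```python
# Rough capacity sketch for the three layouts above (4TB drives).
# Real usable space comes out a few TB lower: TiB conversion, ZFS slop space,
# and raidz allocation padding all take a cut, and the padding varies by layout.
DRIVE_TB = 4

configs = [
    ("Config 1: 2x 10-disk raidz2", 2, 10, 2),
    ("Config 2: 4x 5-disk raidz1",  4, 5, 1),
    ("Config 3: 4x 5-disk raidz2",  4, 5, 2),
]

for name, vdevs, disks, parity in configs:
    data_disks = vdevs * (disks - parity)
    raw_tb = data_disks * DRIVE_TB
    raw_tib = raw_tb * 1e12 / 2**40
    print(f"{name}: {data_disks} data disks, {raw_tb} TB raw (~{raw_tib:.1f} TiB)")
```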

What configuration would you choose in this scenario, and why? Feel free to suggest other options as well.

I would personally, definitely go with config 3.

  • Raidz1 is dead. If one drive dies, the rebuilding process puts extra stress on the remaining drives, which makes them more likely to fail, and if one of them fails during the rebuild then everything is lost. Especially with 4TB drives, the rebuild process can take a looooooooong time.

  • I would choose config 3 over config 1 because of the size of the vdevs. I have seen several articles listing the ‘sweet spot’ for vdevs as between 6 and 8 drives, although I can’t remember exactly why. I’ll do some more research and get back to you.

  • Also, the more vdevs, the better. You touched on how mirroring gives better IOPS, which is true, but it’s really that x vdevs give more IOPS than x-1 vdevs; mirroring just gives you the most vdevs possible while maintaining some level of redundancy.

5 Likes

I believe that is correct for drives in the 4-6TB range. That’s based on failure rate and time to rebuild.

1 Like

I recommend against 5-disk RAIDZ2. It’s a non-optimal stripe size. Allow me to explain:

ZFS has a default record (“cluster”) size of 128KiB. Divide this by the number of drives in your vdev (not including parity) and you arrive at the amount of data written to each disk per record. Keep in mind that drives of this capacity use 4KiB sectors, so you want the result to be a multiple of 4KiB or else you’ll have performance degradation.

Here are a few examples:

3-disk RAID-Z = 128KiB / 2 = 64KiB = good
4-disk RAID-Z = 128KiB / 3 = ~43KiB = BAD!
5-disk RAID-Z = 128KiB / 4 = 32KiB = good
9-disk RAID-Z = 128KiB / 8 = 16KiB = good
4-disk RAID-Z2 = 128KiB / 2 = 64KiB = good
5-disk RAID-Z2 = 128KiB / 3 = ~43KiB = BAD!
6-disk RAID-Z2 = 128KiB / 4 = 32KiB = good
10-disk RAID-Z2 = 128KiB / 8 = 16KiB = good
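
If you want to check other layouts yourself, the same arithmetic is easy to script. A rough sketch (using the same 128KiB record size and 4KiB sector assumptions as above):

```python
# For each layout: how much of a 128KiB record lands on each data disk,
# and is that amount a whole multiple of a 4KiB sector?
RECORDSIZE_KIB = 128
SECTOR_KIB = 4

layouts = [
    ("3-disk RAID-Z",   3, 1),
    ("4-disk RAID-Z",   4, 1),
    ("5-disk RAID-Z",   5, 1),
    ("9-disk RAID-Z",   9, 1),
    ("4-disk RAID-Z2",  4, 2),
    ("5-disk RAID-Z2",  5, 2),
    ("6-disk RAID-Z2",  6, 2),
    ("10-disk RAID-Z2", 10, 2),
]

for name, disks, parity in layouts:
    data_disks = disks - parity
    per_disk_kib = RECORDSIZE_KIB / data_disks
    ok = per_disk_kib % SECTOR_KIB == 0
    print(f"{name}: {per_disk_kib:.1f} KiB per data disk -> {'good' if ok else 'BAD'}")
```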

All this said, the 20-bay chassis does introduce some unfortunate constraints. Additionally, you did mention that performance isn’t extremely important, so if you’re not concerned about it, go ahead and do Configuration 3.

8 Likes

Sounds like that might be a Norco case. There have been some reports of bad backplanes across a few of their chassis. It’s possible that they’ve upped their game, but just be aware.

2 Likes

You’re right. When I set up my FreeNAS box I read about how its compression means you don’t necessarily have to stick to the ‘optimal’ number of drives per vdev. I always forget that not all ZFS is FreeNAS…

1 Like

I don’t have any performance metrics on how non-optimal stripes perform vs optimal stripes, but I can imagine we’re talking about a 15-20% peak performance hit.

1 Like

First of all. Thanks for all the replies so far!

Config 3 gives the least amount of usable storage. Two out of five disks in each vdev are dedicated to parity, which is a big loss per vdev. What if the server cabinet had 24 (or 36) bays? That would give each vdev an additional data drive, bringing the total up to 6 disks per vdev.

Can’t you change the cluster size so the data gets written out in nicer chunks? I think I read something about this. For example, changing the cluster size to 96KiB would give a power-of-two chunk written to each data disk: 5-disk RAID-Z2 = 96KiB / 3 = 32KiB.

Also, is it beneficial to have a larger cluster size for large files?

The server cabinet is not bought yet. I just found a good deal on a case with 20 bays. Not sure if the cabinet even has a brand. Here is a link if you’re curious. (It’s in Norwegian) https://www.finn.no/bap/forsale/ad.html?finnkode=106438409

Although looking at the non-optimal vdev configurations, I might want to look for a different one. Perhaps a 24 or 36 bay chassis? If so, does anyone have recommendations for a decent storage server chassis?

What kinds of problems are we talking about? Reliability issues? Incompatibility?

Yes, and a faulted pool has 0 storage capacity.

3 Likes

That’s what backups are for. But it’s a fair point.

Sorry, didn’t mean to sound like a dick, by the way. I’ve been in your position, and learned the hard way that raidz1 is not a good option.

I have a six-disk raidz2 of 5TB drives in my NAS here, running on a slow, soon (tomorrow actually, yay!) to be upgraded AMD E-350 motherboard. Scrubs take 24 hours. Resilvering a drive takes over 12 hours. Restoring all my data, about 10TB, from a backup over the 1Gbit network takes even longer, days, and that’s with SATA III drives and controllers. The case you linked is SATA I/II, so unless you massively upgrade the internals, it’ll take even longer for you.
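
For perspective, the back-of-envelope version of that restore time (a rough sketch; the ~110 MB/s effective gigabit throughput is an assumption, and real restores with protocol overhead and disk seeks are slower):

```python
# How long does moving ~10TB over a 1Gbit link take, assuming ~110 MB/s
# effective throughput on an otherwise idle network?
DATA_TB = 10
EFFECTIVE_MB_S = 110

seconds = DATA_TB * 1e12 / (EFFECTIVE_MB_S * 1e6)
print(f"~{seconds / 3600:.0f} hours just to move the bits")   # ~25 hours
```

So even under ideal assumptions you’re looking at more than a full day just moving the bits.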
If I were you I’d save myself future headaches and go for the raidz2, or even a mirrored stripe option.

I’m going to quote myself from another thread, even though I know you’ve specifically said redundancy and space are more important to you:

And here’s why:

You have to remember, RAID is NOT backup.

RAID’s purpose is to have the data be accessible even if drives fail. Accessible. Not safe.

The only reason anyone should ever use RAID is so that they don’t have to sweat if a drive is dying or dies on them. So that they don’t have to immediately drop everything and deal with the problem the moment it comes up.

Even then, they should still fix the problem the moment they become aware of it.

Again, say it with me, RAID is not backup.

Rebuilding an array is just as important as how you lay it out to begin with. Having good performance is necessary to rebuild arrays reliably and quickly.

I also want to point out that your end goal of 20 drives doesn’t mesh well with the vdev sweet spot of 6-8 drives each: 20 doesn’t divide evenly by 6, 7, or 8.

Edit: I now see the posts mentioning you looking at different chassis. /edit

People thought it was. Now, not so much.

Where RAID 5 is RAIDZ1.

TL;DR: Drive specs were upped to meet the necessary requirements for RAID5 to continue to be reliable.

You’re still right about it taking forever though.

2 Likes

No worries, I didn’t see it as a ‘snarky’ comment or anything. It’s a good point to bring up, especially since I won’t be having a backup server anytime soon…

Interesting that it takes that long on six 5TB drives. I can’t even imagine the time it would take on a 10-disk raidz2 vdev. Thanks for the time estimates, really valuable advice.

One drive tops out at around 150MB/s (WD Red 4TB). SATA II runs at 3Gb/s, which works out to roughly 300MB/s of usable bandwidth after encoding overhead, so SATA II shouldn’t be a bottleneck for a single drive. Unless I’m doing the math completely wrong, which might be likely.
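
Written out (a quick sketch; the ~150 MB/s drive figure and the 8b/10b encoding overhead on SATA II are the assumptions here):

```python
# Quick check: is SATA II a bottleneck for a single spinning drive?
# Assumes ~150 MB/s sustained from a WD Red 4TB, and SATA II's 3Gb/s line rate
# carrying 10 bits per byte on the wire (8b/10b encoding).
DRIVE_MB_S = 150
SATA2_USABLE_MB_S = 3_000 / 10   # ~300 MB/s usable

print(f"SATA II usable: ~{SATA2_USABLE_MB_S:.0f} MB/s, drive peak: {DRIVE_MB_S} MB/s")
print("Bottleneck?", DRIVE_MB_S > SATA2_USABLE_MB_S)
```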

I assume you mean 4 vdevs with 5 disks in raidz2?

No, avoid 5-disk raidz2 for the reason @SgtAwesomesauce mentioned earlier in the thread; performance will take a pretty big hit. I’d go for either 4 or (edit: preferably) 6 disks per vdev in that case.

2 Likes

You’d want a chassis with a number of drive bays that is divisible by the ideal vdev width of 6-8 disks.

So 24 is a multiple of both 6 and 8, and 36 is a multiple of 6, but 20 works for neither.

1 Like

Thanks for the detailed post! The article you linked was a good read. Could you also give more details on how you think the vdevs should be configured?

Yeah, I get that RAID is redundancy, not a backup. The server room is going to be located 100km (62 miles) away from where I normally live, so redundancy is important for that reason. I’m planning a backup server, but that’s a few years out.

Fair point. I agree with the points about the rebuild putting less stress on the other drives in the vdev. But doesn’t a mirrored vdev reduce the redundancy quite a bit? When a disk fails, the pool depends on the remaining disk in that mirror to complete the rebuild, and if that disk fails too, the entire pool is lost.

It might also run into unrecoverable read errors (UREs), since all the data has to come from that single remaining disk. Raidz2 helps here since it has double parity, so it can cross-check the data if something conflicts during the rebuild. However, once two drives have failed, a URE during the rebuild can no longer be corrected. Unrecoverable errors can mean corrupted data, which is not at all ideal.

The added speed and IOPS is a nice bonus though.

Sorry, this was a bit confusing. I meant 20 drives as in maxing out the server with as many disks as it holds. If I had a 24-bay server, the max would be 24; same goes for a 36-bay.

My gut tells me raidz1 is far too little redundancy for a server of this size. Plus I won’t have a backup solution for a while. (I know I should back it up, but no one has infinite money :smiley:)

1 Like

You probably meant striped.

In the thread I quoted myself from, I talk about the redundancy of RAID 10 VS RAIDZ1 VS RAIDZ2. Here you go:

Yes, 24 or 36 would be more ideal when using ZFS.

However, here’s something else to consider: Cold swap drives that wait for use.

What I mean is, pretend you still used the 20 drive chassis. Instead of using all 20 drives, use 16 and have 4 turned off, but in their bays.

Presuming you have some remote ability to manage the server, if you have a drive failure, you can remove that failed drive from the array, turn one of the unused drives on, then add it to the array and have it resilver.

This means you could repair RAID failures remotely, and just go replace the failed drive when you feel like it.

Four drives sitting there doing nothing is a significant chunk of the chassis though, so I can see why you may not want to do that. However, another thing to consider is that 4 drives is the minimum for RAID 10.

So you could have a second array for whatever reason. Performance or otherwise.

These are just ideas, but if I were you and I wanted max capacity, I would go for 24 or 36 to maximize capacity and efficiency.

What is the Mean Time Between Failure for the drives you intend to use? What are their URE specifications?

These two things along with their capacity should give you all the needed info to use the online calculators that exist specifically for calculating the odds of a RAID array failing.
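
For what it’s worth, the URE part of those calculators boils down to something like this (a pessimistic sketch: it assumes independent bit errors at the datasheet rate of 1 per 10^14 bits read, which is typical for consumer 4TB drives but well above what most drives show in practice):

```python
import math

# Back-of-envelope odds of hitting at least one unrecoverable read error (URE)
# during a rebuild. Assumed values: 4TB drives specced at <1 URE per 1e14 bits
# read. Treat these as worst-case upper bounds, not real-world expectations.
URE_RATE = 1e-14   # errors per bit read (assumed datasheet worst case)
DRIVE_TB = 4

scenarios = [
    ("Mirror rebuild (read 1 surviving drive)", 1),
    ("10-wide raidz rebuild (read 9 surviving drives)", 9),
]

for name, drives_read in scenarios:
    bits_read = drives_read * DRIVE_TB * 1e12 * 8
    p_ure = 1 - math.exp(-URE_RATE * bits_read)
    print(f"{name}: P(>=1 URE) ~ {p_ure:.0%}")
```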

IMO, even with 20 drives, 2 drives failing is very unlikely. Unlikely enough that I personally would trust that array to not lose my data.

HOWEVER, you should note in my above quote, when I mention “two drives failing”, I’m talking about a RAID 10 with 4 drives.

With RAIDZ1 and RAIDZ2, the more drives you use, the more capacity efficiency you get. With RAID 10, you’ll always get 50% capacity efficiency.

This means that yes, you sacrifice half your storage, but you also get a lot more potential redundancy.

What do I mean by potential redundancy?

You have 20 drives. A RAID 10 is a RAID 1 of a RAID 0 (or vice versa, but this way is simpler to set up). So you have two 10-drive RAID 0s, mirrored against each other in a RAID 1.

If you lose 2 drives, one from each RAID 0, you lose your data.
If you lose 10 drives from one of the RAID 0s, you lose no data.

You can see how that can be a blessing and a curse. It’s not exactly ideal: it gives you the possibility of huge redundancy, but the lower bound is 2 failed drives, which makes it a bit better than RAIDZ1 but a bit worse than RAIDZ2 (as my quote above talks about).
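
To put a rough number on it, here’s a quick check of those odds for the layout just described (two 10-drive halves mirrored against each other), assuming two drives fail at random:

```python
from itertools import combinations

# 20 drives split into two 10-drive RAID 0 halves that mirror each other.
# The pool is lost only if the two failed drives land in different halves.
DRIVES = 20
HALF = DRIVES // 2

fatal = 0
total = 0
for a, b in combinations(range(DRIVES), 2):
    total += 1
    if (a < HALF) != (b < HALF):   # one failure in each half
        fatal += 1

print(f"{fatal}/{total} two-drive failure pairs are fatal (~{fatal/total:.0%})")
```

That’s where the “about 50%” figure below comes from.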


@SgtAwesomesauce

I actually have a question for you. What are the options for increasing RAID 10 redundancy when you have this many disks? I’m trying to work out whether there’s a way to reasonably set up the RAID such that redundancy is >=2 disks without losing more space. I don’t think that’s possible, however.

Still, kad, it all depends on how likely you think 2 disk failures are. I think it’s low enough not to worry about, and even if it does happen, you have about a 50% chance of not losing your data anyway.

Up to you.

Sorry to take so long to get back to you.

So, changing the cluster size is definitely an option, but keep it a power of two (64KiB, 256KiB, 1MiB and so on) rather than merely a multiple of two like 96KiB; a power-of-two size performs significantly better and also reduces unnecessary wear on the disks.

On the large cluster size question: for big sequential files a larger cluster size generally helps, and the more spindles you have, the more IOPS and raw throughput you’ll get.

This looks just like an unbranded Norco case.

1 Like

You could argue that any improvements to a single drive also affect RAIDZ2, thereby canceling any benefits RAIDZ1 has over RAIDZ2.

1 Like

This may be true, but depending on your budget, Z1 may be far more affordable than Z2.

2 Likes