Request input on ZFS storage layout

Hi there,

I summon @Exard3k @jode @thro

I have a decision that has been plaguing me for the last couple of days and decided I needed some input on the topic. I will soon buy a 1U server with 12 LFF bays for the purpose of placing it in a datacenter of my choice. As I said, the server has 12 trays for 3.5-inch HDDs, so I will place twelve 16TB HDDs in there to get a good amount of space. I plan to dedicate 2 drives as hot spares because I do not want to need to touch the server anytime soon. The question is: do I go with a single 10-drive raidz2 vdev and live with a single drive’s worth of IOPS, or rather go with two vdevs, 5-drive raidz1 each, and hope that replacing a failed drive with a hot spare and the subsequent resilver will work?

Edit: I am primarily looking for stability; IOPS is only a secondary requirement.

Here is a poll, but I’d still like to hear your opinion!

  • 1 vdev, 10 drives, raidz2
  • 2 vdevs, 5 drives each, raidz1 each

0 voters

Thank you!

1 Like

My vote is on RAIDZ2, simply because it retains redundancy and self-healing after a failure. And a resilver can be tedious on 150TB of RAIDZ. 8+2 is a good size for Z2: 80% storage efficiency, an even number of “data disks”, and double redundancy. Best you can hope for if IOPS are of secondary concern.

And having a spare in the chassis when the server is located somewhere else and you don’t have access all the time is a wise move. Not sure if 2 are necessary. You could go 9+2 with a single spare. That’s very much a personal choice.
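For reference, that 10-wide Z2 plus two spares would look something like this at creation time (pool name and device paths below are just placeholders, use your real /dev/disk/by-id paths):

```
# 10-wide RAIDZ2 (8 data + 2 parity) with two hot spares
zpool create tank raidz2 /dev/disk/by-id/DISK{01..10} \
  spare /dev/disk/by-id/DISK11 /dev/disk/by-id/DISK12
```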

2 Likes

This is actually why 8+2 had been my favorite, too. It would not be the first time another disk dies during a rebuild/resilver. The server will be a substantial distance from my home, so I’d like to be able to shrug off the first disk failure. Having to mirror about 80% (quota) of 128TB, so about 100TB, over my measly home connection would not be an option.

1 Like

Question: why not raidz3?
If you are going to have spare disks and one large array anyway.
IOPS/speed look fine from back in the day when Calomel did a bunch of testing:

11x 4TB, raidz3 (raid7), 30.2 TB, w=552MB/s , rw=103MB/s , r=963MB/s
12x 4TB, raidz2 (raid6), 37.4 TB, w=317MB/s , rw=98MB/s , r=1065MB/s
12x 4TB, raidz3 (raid7), 33.6 TB, w=452MB/s , rw=105MB/s , r=840MB/s

(I know you already referenced the site elsewhere, and it is old, but…)

https://calomel.org/zfs_raid_speed_capacity.html

1 Like

I think this isn’t a bad idea. Just get one more drive working instead of having it sit spun down, waiting on demand.
The problem is that performance decreases in a degraded state, so running the pool with one failed drive (which still leaves double redundancy) is handicapped.

Another way to attack this would be dRAID. I haven’t used it myself so far and have little experience with it beyond reading wikis and listening to OpenZFS devs talk about it; it’s a rather new feature (1-2 years old). But it is made to incorporate the idle spares into the pool, put them to work, and deliver faster resilvers.

It’s a bit more abstract, but I think it fits the use case. A dRAID2:8d:2s:12c would be the equivalent of a 10-wide RAIDZ2 with 2 spares. dRAID2 = RAIDZ2-level parity, 8d = 8 disks’ worth of data, 2s = 2 spares, 12c = 12 children, aka disks in total.
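If I read the zpoolconcepts man page right, the create command would be roughly this (placeholder device names; I’ve written the spec fields in the order the man page lists them):

```
# double parity, 8 data disks per redundancy group, 12 children total, 2 distributed spares
zpool create tank draid2:8d:12c:2s /dev/disk/by-id/DISK{01..12}
```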

1 Like

I agree, this sounds like a good use case for dRAID.

Z2 otherwise. I have had a drive fail while rebuilding, so having double parity can be very nice.

2 Likes

I didn’t vote…

What is your requirement for uptime? What are your performance expectations (i.e. what are you planning to use this storage for? VMs? DBs? Live file store? Backup archive?)

All the VDEV types have their strong points…

Well, here’s the thing: you still need to monitor for and respond to drive failures. Hot spares don’t get you off the hook for that. Having 3 drives fail will eventually take you out otherwise.

If a drive fails you need to replace it to maintain your resiliency goal; hot spares just give you a little more time to get to it, as the hot-spare drive will cover you between the first failure and any further failures while you source a replacement. You’d still want to replace the failed drive as soon as you can - you just get to sleep easier while waiting for the new drive to arrive or be purchased.

So given that… if you can swap in a cold spare within 4-5 days with hot-swap drive bays, I’d be tempted to forego the hot spares and use them for more VDEVs. (I say 4-5 days as NetApp have been taking 1 week to get disks to me lately :grimacing: and they don’t seem too fussed vs. RAID-DP with a single hot spare per aggregate.)

I.e., I’d maybe consider 2x 6-drive RAIDZ2 with zero hot spares (maybe keep a spare drive on hand for responding to failures). With 2x raid6 you could sustain 4 failures (vs. 2 for the single raidz2) with no loss, provided they were evenly distributed between both VDEVs. Which is a bet… but if we consider that failure chance should be evenly distributed, it’s a bet I’d be willing to take… especially as I’d be planning to replace failed drives ASAP anyway. You could still have all the failures in one VDEV, but… stats/probability/etc… I’d risk it. Again, it depends on your risk profile.

Yeah, it’s a bit more of a risk, but I think the 2x IOPS is worth it, and you should have backups anyway. For the number of disks involved, as long as they’re CMR, I think cold spares and 2x raidz2 would be plenty of resiliency.
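A layout like that would be created along these lines (placeholder device names, obviously):

```
# two 6-wide RAIDZ2 vdevs in one pool - random IOPS scale with the number of vdevs
zpool create tank \
  raidz2 /dev/disk/by-id/DISK{01..06} \
  raidz2 /dev/disk/by-id/DISK{07..12}

# more capacity later is just another vdev:
#   zpool add tank raidz2 /dev/disk/by-id/DISK{13..18}
```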

I agree with avoiding RAIDZ1 (unless this is test/play with an acceptable window for restore, or not mission-critical) - 16 TB drives are way too big and take too long to rebuild for that. That’s why I didn’t vote - I’m always for getting as many VDEVs as you can within an acceptable fault-tolerance objective - but RAIDZ1 with huge modern drives is skirting the bounds of pointlessness.

Enterprise array vendors were already advising against single-parity RAID for drives > 1 TB 15 years ago, due to the rebuild time required and the perceived risk of a second disk failure during the rebuild. And that was with 1 TB disks. Enterprise array vendors are very risk averse, so take that one with a pinch of salt, but the drives you’re using are 16x that size :slight_smile:

All that said, I currently have an OLD, out-of-support NetApp doing non-critical temporary storage for file restores, and it’s had a single failed disk (I think 15k RPM 600GB SAS) in one of its RAID-DP (think raid6/raidz2) aggregates for 4+ years now…

It’s still ticking (if it dies I’ll re-create the aggregate with fewer drives, it’s just scratch space).

I do not suggest doing that, but as above it really depends how mission critical the storage is and how fast you can respond…

1 Like

Sequential throughput (MB/s) is not IOPS.

If you don’t care about random IO (i.e., a small number of users, no VMs, and mostly streaming data) that’s fine, but if not… you’re giving up the ability to respond to multiple concurrent requests.

It really does depend on what you’re planning to use the storage for - there are trade-offs. A single RAIDZ3 is fine for archive… but throw concurrent access/hot data at it and it’s a big trade-off…

Again, caching can help somewhat, but only with data that can be cached, and only once it is in the cache. A burst of requests for not-yet cached data (or data that aged out) will still hit the disks.
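If you want to put numbers on that once the pool exists, something like fio shows the difference clearly enough (paths and sizes are placeholders, and the ARC will flatter the results unless the working set is larger than RAM):

```
# lots of small random reads - this is where a single RAIDZ vdev is limited
# to roughly one disk's worth of IOPS
fio --name=randread --directory=/tank/fiotest --rw=randread --bs=4k \
    --numjobs=8 --size=8G --runtime=60 --time_based --group_reporting

# big sequential reads - roughly what the calomel.org figures above measure
fio --name=seqread --directory=/tank/fiotest --rw=read --bs=1M \
    --numjobs=1 --size=8G --runtime=60 --time_based --group_reporting
```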

1 Like

Hot spares are a waste of money. You’re spinning the disks up anyway, so why not make them part of the redundancy? I’m with @Trooper_ish on the raidz3 9+3 in a 12-bay. Ultimately, in a failure scenario the goal should be to replace a failing drive with a properly “burned in” cold spare, and in this proposed topology raidz3 buys you the most time to make sure that can happen.

2 Likes

I have read all the replies, but rather than quote everyone: I have decided to get a quote for a 2U server, since there are some 2U servers available with up to 24 LFF drive bays. This would give me a little more breathing room. That way I could use, say, 14 drives, go with 2 vdevs of 6-drive raidz2 each, and still have 2 hot spares. I will have to see what the additional space will cost me, though.

This would be a problem. The server will be about 500 miles away; I would need to book vacation and travel before I could exchange any failed drives. This is why I planned on hot spares. With hot spares I can trigger a replacement from anywhere and return to a healthy pool without having to touch the server, and once a spare has been consumed for the second time I can plan to install fresh hot spares within an acceptable time frame.
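As far as I understand it, that remote workflow is just a couple of commands over SSH (pool name and device paths are placeholders):

```
# see which disk faulted
zpool status -v tank

# kick a hot spare in manually (zed can also do this automatically on a fault)
zpool replace tank /dev/disk/by-id/FAILED-DISK /dev/disk/by-id/SPARE-DISK

# once the resilver is done, detach the dead disk; the spare is then
# promoted to a permanent member of the vdev
zpool detach tank /dev/disk/by-id/FAILED-DISK
```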

As good as it gets. I know that if a controller or, say, the server’s motherboard fails I would need to be on site, but the most failure-prone parts of the server are the drives. While the server might run fine for another 10 or so years in a climate-controlled datacenter, the drives will wear out and are bound to fail because I will use them daily.

I read about this yesterday as well. I will need to evaluate the option and try it out on my computer since I have no experience with it.

I agree with you, this is too risky with 16TB drives. It’s off the table.

1 Like

Yeah. Spin up a TrueNAS VM with something like 16 disks and test this out, including resilver and replace. I didn’t bother doing it yet for dRAID, but that’s what I would do. Keep track of zpool status to see the progress behind the scenes. If you use 64GB+ virtual disks, the time difference in resilvering will be well measurable.
Doing this in a test environment gives practice and confidence in using it. And you can also test other configs, like dRaid:8d:2s:12c for the RAIDZ1-with-two-vdevs option.
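If you only want to poke at the layouts before committing to big virtual disks, sparse files work as throwaway vdevs too (everything below is made up: sizes, paths, pool name):

```
# twelve sparse 4G files standing in for disks
for i in $(seq -w 1 12); do truncate -s 4G /var/tmp/disk$i; done

# the dRAID2 layout from above, just on file vdevs
zpool create testpool draid2:8d:12c:2s /var/tmp/disk*

# fault a "disk" and watch what happens in zpool status; if zed doesn't
# engage the distributed spare on its own, it can be kicked in by hand,
# e.g. zpool replace testpool /var/tmp/disk03 draid2-0-0
zpool offline -f testpool /var/tmp/disk03
zpool status testpool
```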

And with dRAID, all 12 disks are running and working. You can have 2 disk failures and the pool is still ONLINE; it only degrades after the 3rd disk fails. I really like the sequential resilver feature. It’s what is used for resilvering mirrors and is very fast compared to RAIDZ. And zed automatically kicks in the distributed spare, just like autoreplace=on.

Keep us updated :slight_smile:

2 Likes

I will certainly do that.

I just sent out a request for a quote on a 2U server. 24 LFF bays would give me the opportunity to use the two-vdev raidz2 setup plus hot spares, and even to add a third vdev later on if I need more space.

500 miles is only a weekend road trip :smiley: I’d still be tempted to run 2x raidz2 with no spares :smiley: but that’s me. Again though - what’s the expected workload? If it’s just a dumb file store (e.g., FTP site, archive location, etc.) then multiple VDEVs are less important. But if it’s expected to perform for many concurrent users or processes… a single vdev will hurt.

Which leads to another factor:
What are the consequences of pool loss? If it’s going to lose you money and the restore is either going to take a long time or (heaven forbid) not be possible (due to no backups), then definitely go for hot spares or, as above, raidz3.

Are you using 16 TB drives due to sweet spot cost, or do you need all that capacity?

If the array is going to be running half empty, ZFS resilvers only need to cover the consumed space (not the full disk like traditional RAID)… rebuild times will be a lot quicker.
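Rough back-of-envelope (made-up average rate, and real healing resilvers under load are often slower): writing out a full 16 TB disk at ~150 MB/s is 16,000,000 MB ÷ 150 MB/s ≈ 107,000 s, call it ~30 hours, so a pool at 50% occupancy is looking at roughly 15 hours of best-case resilver time instead.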

Just tested this: TrueNAS doesn’t support dRAID yet. You can’t do anything in the GUI; zpool create in the CLI works, but the GUI bugs out after that. So if you want a fancy GUI, you’re out of luck.
The pool didn’t auto-import on reboot and I had to import it manually.
Also, zed didn’t kick in the spare after I pulled a drive. Don’t use TrueNAS for this.

Update:
Proxmox does fine and works as intended. I did two successful disk failures and replacements. dRAID lists as DEGRADED as long as a spare is in use and the failed disk hasn’t been replaced yet. This is visually unappealing, but normal for spares.

The initial sequential resilver took something like 6 seconds; the replace and healing resilver took about 6 times as long. I was using 32GB disks (qcow2 on NVMe) with the pool at 70% capacity. So the absolute numbers aren’t really helpful, but the difference is clearly noticeable.

1 Like