I’m not convinced that stripe size is an issue.
See here for a discussion of stripe size:
A misunderstanding of this overhead has caused some people to recommend using “(2^n)+p” disks, where p is the number of parity “disks” (i.e. 2 for RAIDZ-2), and n is an integer. These people would claim that, for example, a 9-wide (2^3+1) RAIDZ1 is better than 8-wide or 10-wide. This is not generally true. The primary flaw with this recommendation is that it assumes that you are using small blocks whose size is a power of 2. While some workloads (e.g. databases) do use 4KB or 8KB logical block sizes (i.e. recordsize=4K or 8K), these workloads benefit greatly from compression. At Delphix, we store Oracle, MS SQL Server, and PostgreSQL databases with LZ4 compression and typically see a 2-3x compression ratio. This compression is more beneficial than any RAID-Z sizing. Due to compression, the physical (allocated) block sizes are not powers of two, they are odd sizes like 3.5KB or 6KB. This means that we cannot rely on any exact fit of (compressed) block size to the RAID-Z group width.
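To put some numbers on the space-efficiency side, here is a back-of-the-envelope sketch of the RAID-Z allocation rules as I understand them from that post (my own simplified model, not the actual ZFS allocator; it ignores gang blocks and other edge cases, and the pad-to-a-multiple-of-parity+1 rule is my reading of the same article):

```python
import math

def raidz_alloc_sectors(data_sectors, width, parity):
    """Sectors allocated for a block of `data_sectors` data sectors on a
    RAID-Z vdev of `width` disks with `parity` parity disks (simplified)."""
    data_disks = width - parity
    # Each block carries its own parity: `parity` sectors per row of data.
    parity_sectors = parity * math.ceil(data_sectors / data_disks)
    total = data_sectors + parity_sectors
    # Allocations are padded to a multiple of (parity + 1) sectors so that
    # freed space never leaves an unusably small hole.
    return math.ceil(total / (parity + 1)) * (parity + 1)

ashift = 12  # 4 KiB sectors
for block_kib in (128, 8, 3.5):        # 3.5 KiB ~ a compressed 8K DB block
    data_sectors = math.ceil(block_kib * 1024 / 2 ** ashift)
    for width in (8, 9, 10):           # the "is 9-wide special?" comparison
        alloc = raidz_alloc_sectors(data_sectors, width, parity=1)
        print(f"{block_kib:>5} KiB block, {width:>2}-wide RAIDZ1: "
              f"{alloc} sectors allocated for {data_sectors} data sectors")
```

For full 128 KiB blocks the widths do differ slightly (38 sectors on 8-wide versus 36 on 9- or 10-wide), but once compression shrinks blocks to a handful of sectors the width stops mattering at all, which is exactly the blog post’s point.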
Granted, this blog post is more concerned with space efficiency, while @sgtawesomesauce was talking about performance.
As for the performance claims, the most thorough discussion I can find on this is in a series of HardForum threads. Ultimately, the explanation given for the performance degradation suggests that properly setting ashift should avoid the issue:
So what happens? For sequential I/O the request sizes will be 128KiB (maximum) and thus 128KiB will be written to the vdev. The 128KiB then gets spread over all the disks. 128 / 3 for a 4-disk RAID-Z would produce an odd value; 42.5/43.0KiB. Both are misaligned at the end offset on 4K sector disks, requiring them to do a read whole sector + calc new ECC + write whole sector. Thus this behavior is devastating to performance on 4K sector drives with 512-byte emulation; each single write request issued to the vdev will cause it to perform 512-byte sector emulation.
Performance numbers? Well, I talked to a lot of people via email/PM about these issues. I don’t have any 4K sector drives myself yet. I did perform geom_nop testing with someone, which did increase sequential write performance considerably. geom_nop allows the HDD to be transformed into a 4096-byte sector HDD, as if the drive wasn’t lying about its true sector size. Now ZFS can detect this and it’s even theoretically impossible to write anything other than multiples of the 4096-byte sector size. So the EARS HDDs never have to do any emulation; only ZFS has to do some extra work, but it is much more efficient. The problem is that this geom_nop is gone after reboot; it’s not persistent. It’s basically a debugging GEOM module. But I hope to find a way of attaching these modules at every boot on 4K disks with my Mesa project. That might solve many or all of these issues.
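The arithmetic in that quote is easy to reproduce. A quick sketch (illustrative only; the even split of sectors across data disks is my simplification, not how ZFS actually lays out RAID-Z columns):

```python
RECORD = 128 * 1024   # 128 KiB record
DATA_DISKS = 3        # 4-disk RAIDZ1 -> 3 data disks
PHYS_SECTOR = 4096    # native sector size of a 512e ("Advanced Format") drive

for ashift in (9, 12):
    logical_sector = 2 ** ashift
    sectors = RECORD // logical_sector
    # Spread the logical sectors as evenly as possible across the data disks.
    per_disk = [sectors // DATA_DISKS + (1 if i < sectors % DATA_DISKS else 0)
                for i in range(DATA_DISKS)]
    chunks = [n * logical_sector for n in per_disk]
    aligned = all(chunk % PHYS_SECTOR == 0 for chunk in chunks)
    print(f"ashift={ashift}: per-disk chunks "
          f"{[chunk / 1024 for chunk in chunks]} KiB, 4K-aligned: {aligned}")
```

With ashift=9 the per-disk chunks come out as 43.0/42.5/42.5 KiB, ending on odd 512-byte boundaries, so a 512e drive has to read-modify-write its 4 KiB physical sectors. With ashift=12 every chunk is a whole number of 4 KiB sectors. That supports the reading that a correct ashift, rather than some magic vdev width, is what avoids the penalty.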
So perhaps this recommendation comes from a time when the ashift setting wasn’t implemented yet, or had problems on FreeBSD?
At work I have a server with a 15-bay JBOD of 10TB drives in RAIDZ3 and some people in another department use 12-bay JBODs of 10TB disks in RAIDZ2 as their capacity storage building blocks. The performance we get out of it has always been sufficient, so choosing to build in 4+2 or 8+2 vdev increments instead would have a very real impact on capacity per dollar in exchange for alleged performance enhancements we don’t need. Matching the vdev size to the JBOD size and adding JBODs as required is a very natural way of expanding if you can afford to buy in such large chunks.
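To put a rough number on that capacity-per-dollar point (back-of-the-envelope only, counting raw data disks and ignoring ZFS metadata, slop space, and padding overhead; the smaller-vdev layouts are just examples I picked):

```python
# Rough usable-capacity comparison for one 15-bay JBOD of 10 TB drives.
DRIVE_TB = 10

# (layout, drives purchased, data disks)
layouts = [
    ("15-wide RAIDZ3",                 15, 12),
    ("2 x (4+2) RAIDZ2, 3 bays idle",  12,  8),
    ("1 x (8+2) RAIDZ2, 5 bays idle",  10,  8),
]

for name, drives, data in layouts:
    usable = data * DRIVE_TB
    print(f"{name}: ~{usable} TB usable from {drives} drives "
          f"(~{usable / drives:.1f} TB per purchased drive)")
```

Filling the 15 bays with one RAIDZ3 vdev gives roughly 120 TB usable, versus about 80 TB from the same chassis built out of 4+2 or 8+2 increments, which is the kind of gap we’re not willing to pay for performance we don’t need.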