ZFS Advice - Drives per vdev

Hi All,
(Not sure if this topic belongs here or in the software/operating systems category)
I’m new here (be gentle 🙂) and about to convert an old Windows box into a FreeNAS box. It’s an Intel Xeon E3-1220v3 CPU on a Supermicro X10SL7-F board with an onboard LSI 2308 controller and 32GB of ECC RAM.

I’m new to FreeNAS and ZFS. I tried it about 6 or 7 years ago when I first set up this server, but decided to try Windows Storage Spaces and ReFS instead (I’m not very experienced with Linux/BSD in general).

After Windows decided to forget that the pool exists (again - this was the second time it happened), I decided to give FreeNAS another go.

I have 9 x 4TB drives that can be used.

My question is: how many drives per vdev would you use?
I can go:

  • 2 x vdevs, each with 4 drives, RAIDZ1 = 24TB Usable space
  • 3 x vdevs, each with 3 drives, RAIDZ1 = 24TB Usable space

Is there an advantage to one or the other?
Option 1 has 2-drive redundancy, option 2 has 3-drive redundancy, but option 2 requires an extra drive and controller. My case only has 8 spots for drives, and I have another 8-drive enclosure to use later as I expand the pool.

My second question is: when I eventually expand the pool, do I need to add drives in the same vdev configuration? Can the drives be larger (e.g. 3 x 6TB drives per vdev)?

Third question: should I do this now, or should I wait for TrueNAS/ZFS 2.0 (no idea when that will materialize)?

Once this is set up I also plan to build an Unraid box to back up the Freenas box.

Any advice would be appreciated.

Thanks,
Vu

  1. The more vdevs the better; this is where the bulk of the performance comes from.
  2. Don’t use RAIDZ1, or any RAIDZn for that matter. Set up mirrors instead. The reasons for this are threefold:
    • Rebuild times are way faster. It’s just a disk-to-disk copy instead of recomputing all the missing pieces.
    • Performance is much higher. Big time.
    • If you use RAIDZn, data will only be repaired when you run a scrub. With mirrors, say you read some data a month after it was written and the copy on one of the drives has become corrupted: it fails the checksum, is re-read from the other disk in the mirror, and is fixed automatically. In a RAIDZn scenario this is only corrected when a scrub is run, which makes mirrors dramatically better than waiting for a scrub. Keep in mind you should still do scrubs either way.
  3. Read this https://jrs-s.net/2018/04/11/primer-how-data-is-stored-on-disk-with-zfs/
  4. Refer to the cheat sheet for further tuning: https://jrs-s.net/2018/08/17/zfs-tuning-cheat-sheet/

For your system I would create 4 vdevs of 2-disk mirrors, with 1 hot spare, then tune via the cheat sheet above.
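As a rough sketch, that layout could be created from the shell along these lines (da0 through da8 are placeholder device names; in practice the FreeNAS GUI builds this for you with gptid labels and optional encryption, as in the output below):

zpool create tank \
    mirror da0 da1 \
    mirror da2 da3 \
    mirror da4 da5 \
    mirror da6 da7 \
    spare da8

# Scrubs still matter with mirrors; run one manually with:
zpool scrub tank
zpool status tank    # shows scrub/resilver progress and any repaired errors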

For reference here is my disk pool on FreeNAS:

root@freenas[~]# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:34 with 0 errors on Thu Jun 25 03:45:34 2020
config:

	NAME        STATE     READ WRITE CKSUM
	freenas-boot  ONLINE       0     0     0
	  ada0p2    ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 0 days 06:06:19 with 0 errors on Mon Jun  1 22:57:14 2020
config:

	NAME                                                STATE     READ WRITE CKSUM
	tank                                                ONLINE       0     0     0
	  mirror-0                                          ONLINE       0     0     0
	    gptid/4ee093f3-64a6-11ea-a18e-002590868ef8.eli  ONLINE       0     0     0
	    gptid/7f130aa7-6533-11ea-a18e-002590868ef8.eli  ONLINE       0     0     0
	  mirror-1                                          ONLINE       0     0     0
	    gptid/603c786c-6549-11ea-a18e-002590868ef8.eli  ONLINE       0     0     0
	    gptid/e58ff733-6578-11ea-9aca-002590868ef8.eli  ONLINE       0     0     0
	  mirror-2                                          ONLINE       0     0     0
	    gptid/54f5c2c1-3670-11ea-8fb3-002590868ef8.eli  ONLINE       0     0     0
	    gptid/5baedcb0-3670-11ea-8fb3-002590868ef8.eli  ONLINE       0     0     0
	  mirror-3                                          ONLINE       0     0     0
	    gptid/62aaf5d9-3670-11ea-8fb3-002590868ef8.eli  ONLINE       0     0     0
	    gptid/f5a6c59d-660f-11ea-8345-002590868ef8.eli  ONLINE       0     0     0
	  mirror-4                                          ONLINE       0     0     0
	    gptid/706f4cc3-3670-11ea-8fb3-002590868ef8.eli  ONLINE       0     0     0
	    gptid/723ae161-3670-11ea-8fb3-002590868ef8.eli  ONLINE       0     0     0
	spares
	  gptid/7929076c-3670-11ea-8fb3-002590868ef8.eli    AVAIL   

errors: No known data errors
root@freenas[~]# 

And my SSD pool on Proxmox:

root@pve:~# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:06:23 with 0 errors on Fri May 22 14:00:03 2020
config:

	NAME                        STATE     READ WRITE CKSUM
	tank                        ONLINE       0     0     0
	  mirror-0                  ONLINE       0     0     0
	    scsi-331402ec01005df70  ONLINE       0     0     0
	    scsi-331402ec01005df71  ONLINE       0     0     0
	spares
	  scsi-331402ec01005df72    AVAIL   

errors: No known data errors

Welcome to the community.

Nice board. Perfect for a NAS with some minimal containers. If you are looking at running some VMs you may want something with more cores (an E5). FreeNAS itself doesn’t need high-powered cores.

I agree with @Dynamic_Gravity: do not go for RAIDZ1. The risk of data loss with drives over 2TiB is not worth it, not for the price of a few more drives. However, I am not as averse to RAIDZ2 as he is; you just need at least 2 disks of redundancy.

This is not accurate. In both cases you have one disk of redundancy per vdev. Because the pool stripes across the vdevs, you lose all data on the whole pool if you lose 2 drives in a single vdev. This is why RAIDZ1 is bad.

So the options are really mirrored vdevs, RAIDZ2, or RAIDZ3.

In @Dynamic_Gravity’s suggested setup you get a high-performance 16TB array with high (50%) redundancy. This is an expensive use of disk space, and if you only have gigabit Ethernet it is “too fast” for the network, so you can afford to take a bit more risk.

With 9 drives you can create a 9-disk RAIDZ3 vdev. That puts a third of the disks into redundancy and gives high recoverability, but you lose some write performance. In my experience this is not material for most “NAS” workloads. Your capacity would be 24TB, matching your original schemes.

The one factor I’m not sure about for your use case is whether you need to keep any of the drives intact to migrate data onto the pool first. Both of the RAIDZ schemes above need you to build the array in one go, and you can’t (easily) expand these types of vdevs once built. That means @Dynamic_Gravity’s scheme of mirrored vdevs has a benefit: you can just add more pairs of drives as vdevs to the pool to expand it.

If you use mirrors you can just add more mirrored vdevs to the pool; there is no limit on size. With RAIDZ schemes you can’t expand without adding a whole extra RAIDZ vdev.
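For example, expanding a mirrored pool later is a single command (pool and device names here are placeholders):

zpool add tank mirror da9 da10    # stripes a new 2-disk mirror vdev into the existing pool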

Do it now. iXsystems allows in-place upgrades, and the TrueNAS Core feature set will be compatible with a pool built today.

Remember backups, and test everything before you commit sensitive data.


Thank you for the comment, good sir.

It is also worth mentioning that you can grow your pool by replacing all the drives in a vdev, one at a time, resilvering after each.

Forgot to mention that’s why I love mirror vdevs.
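A rough sketch of that replace-and-resilver approach, with placeholder device names (repeat for each disk in the vdev; the extra capacity shows up once the last, larger disk finishes resilvering):

zpool set autoexpand=on tank     # allow the vdev to grow once every member is larger
zpool replace tank da0 da9       # swap one old disk for a bigger one and resilver
zpool status tank                # wait for the resilver to finish before doing the next disk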


Hi All,
Thank you for your responses. Some extra info I forgot to include: the case I am using is a Fractal Node 804. It holds 8 x 3.5" disks comfortably and can squeeze in a 9th; adding a 10th would be a very tight fit (and I probably can’t do that due to power supply cable requirements).
The second enclosure is a SilverStone DS380 with hot-swap trays (technically an ITX case, but I can just use it with a PSU and an SFF-8088 to SFF-8087 adapter).
I have also installed an Intel X540-T2 10GbE network card with a MacGyver’d cooling solution (these things run hot).

@Airstripone - I don’t think the motherboard supports an E5 CPU, according to the manual.
However, I don’t think I have a need for VMs on this box (unless you can suggest something that would be handy to run).

I was hoping to avoid using mirrors as the space cost is pretty big (50%), and living in Australia means everything is just way more expensive. Looking at Newegg alone, their prices in AUD are about $50 to $70 more than the converted USD prices 🙁

I do understand the drive redundancy is per vdev, in that losing 2 drives in a single vdev is catastrophic failure.

However, using 4 x mirrored vdevs would allow for easy expansion either by adding mirrors or replacing disks in mirrors.
(interestingly according to the primer linked above, you can expand a RAIDZ vdev the same way)

So it looks like I’ll be using 4 x Mirrored vdevs to start with.

If you were to lose 2 drives in a single mirrored vdev, do you also lose the entire pool, or just what was on that vdev?

I had a read through the tuning page linked above by @Dynamic_Gravity, and note that I should set the ashift to suit the drives. However, I cannot get a definitive answer on whether my drives should be treated as 4K or 512B. I have 7 x Western Digital WD40EFRX and 1 x Seagate ST4000VN008.
Neither company lists the sector size in their PDF spec sheets, which kind of pisses me off, as I would assume the purpose of these PDFs is to provide all the relevant information to anyone who has gone to the effort of looking for it.

I’m pretty sure the Seagate drive uses 4K sectors (according to specifications on a retailer’s web site). The WD drives, on the other hand, I think use 4K sectors but report 512B to the OS - what do you do in this case?
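One way to check what the drives actually report is smartctl from the FreeNAS shell (smartmontools ships with FreeNAS; /dev/da0 below is just an example device name):

smartctl -i /dev/da0 | grep -i "sector size"
#   Sector Sizes:     512 bytes logical, 4096 bytes physical   (typical 512-emulation drive)

If a drive reports 512 bytes logical but 4096 bytes physical, the usual advice is to go by the physical size, i.e. treat it as 4K.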

For the rest of the tuning, it will be based around my main use case of storing and editing photos (raw .CR2 files, plus some video footage). I’ll be accessing the data from a Windows machine. Lightroom also writes small .xmp sidecar files, but I don’t think they will factor into the tuning discussion.
The parameters (a rough sketch of applying them follows the list):

  • xattr = sa
  • compression = lz4 (not sure if it will have a large impact on the file types I’ll be using)
  • atime = off (don’t care about access time, only modify time)
  • recordsize = 1M (large files mostly being used - about 30MB per photo)
  • SLOG = don’t think I’ll need one
  • L2ARC = don’t think I’ll need one
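Putting those together, and assuming the pool ends up named tank with a dataset called photos (placeholder names):

zfs create tank/photos
zfs set xattr=sa compression=lz4 atime=off recordsize=1M tank/photos

(ashift isn’t set here because it is a pool-wide setting chosen when the pool is created, not a dataset property.)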

Sorry for the long post - are there any other gotchas I need to look out for?

Thanks in advance,
Vu

I run my FreeNAS system with 1 vdev of 10 x 4TB in RAIDZ2.
When I built mine, the general recommendation on the FreeNAS forum was to not go over 12 disks per vdev.

I have 2-disk failure tolerance and 29.4TB of capacity after formatting and so on.

I’d do a single-vdev 9 x 4TB RAIDZ2 if I were you.

If you drop a vdev, you lose the entire pool. On the other hand, a resilver on a mirror is much quicker, so the risk window is smaller there. Performance on 4 mirror vdevs will be really good compared to RAIDZ.

Thanks @StrY.

@Donk - one of the main reasons I am looking at smaller vdevs is to be able to expand slowly. If I have an 8-drive vdev in RAIDZ2, will I be able to add incrementally (e.g. a 4-drive vdev, or a 2-drive mirror vdev)? If so, will it hurt performance?

Thanks,
Vu

Lots of comments on here about performance. For home use there is “enough” performance in RAIDZ2. I have a 6-drive RAIDZ2 array running my file share and it gets sustained 350MiB/s reads and writes on large file transfers, and I’ve not tuned it or the network much. For small files that sit in cache I don’t care how long they take to write, as the transfer is “instant”.

If you really need 900MiB/s+ then you need mirrors, but given the price penalty of 50% redundancy I’m OK with 2-disk redundancy, reasonable speed, and about 30% of the capacity given up to parity.

On your point about the E3 being the max the board will take: don’t worry about it, it will be fine for FreeNAS with modest workloads. It could probably run one or two small VMs, but FreeNAS is not really the best platform for that. Keep it as a data store.

The only way to grow a pool is to add more vdevs in most situations.

If your chassis and controller already have all their drive connections in use, you are stuck with what you have when using RAIDZ.

If you use mirrors you can slowly grow your pool without having to add additional vdevs. See my post above.

You can always add a vdev to a zpool. From what I understand, the speed of the pool is set by the slowest vdev.

You mention performance, what are you doing with it? Storing stuff or something more intensive?

I disagree that using mirrors is a requirement, unless you are storing virtual machines and need the maximum amount of I/O possible. And if that were the case, I would tell you to do it with an array of 120GB flash/SSD drives in a separate pool.
The recommendation that you shouldn’t ever use more than 12 drives in a single vdev is sound advice, as what that generally means is that you have one vdev per shelf. If you use RAIDZ2 you have two disks of redundancy. If you watch the Level1Techs storage server build from a few years ago, I believe that’s how Wendell built his pool.

I can say from experience I would feel more comfortable with a 9-drive RAIDZ2 than multiple RAIDZ1 vdevs. If a failure occurs you don’t have to worry about which drive failed; they are all equal partners, and if a second drive fails while resilvering a RAIDZ1 you are effed. The same is not the case with a Z2. The resilver time for a RAIDZ2 pool wouldn’t be horrendous - it would take about a day depending on how busy your pool is - but that is still a day of risk.

I have ~8 servers out there in the ether that have 12-drive RAIDZ2 configurations doing security recording, and I can confirm that those servers, with older hardware than yours and a single SAS connector on the backplane, have completed resilvering successfully in about 36 hours with 4TB drives.

If you are married to doing RAIDZ1, what you can consider is two 4-drive RAIDZ1 vdevs and a hot spare. The only reason I would consider that option is that upgrading the size of the pool would be easier and cheaper: you could in theory buy 4 x 8TB drives at a time to expand your pool, rather than having to buy all 8 or 9 at once.

My final words would be: whatever you do, make sure you have enough storage planned in your final configuration that you shouldn’t need any more for a while. Your next investment should be in building some sort of system that acts as a backup for your data. If your data is important to you in any way, it is better to have two copies of it on two separate systems than to build more redundancy into your main system. If your SAS controller goes crazy, or your motherboard shits the bed, or your power supply fries your system and kills drives, you will be out money but you will NOT be out data. No amount of drive parity or redundancy can save you from an epic failure. Only a backup system can.

A backup system will also give you a place to pivot off of if you want to upgrade your main system. As an example you could yank all of the drives out of your main system, sell them, buy new drives, install them and then copy your data back.


For ashift, correct: set it to 12 for HDDs and 13 for SSDs, unless you have carefully benchmarked a particular workload and have a compelling reason to choose something else. Setting ashift too low is much worse, performance-wise, than setting it too high. Ashift also can’t be changed without destroying the pool, and even if you truly had a 512-byte-sector device, you are unlikely to buy more of them in the future.
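For example, on ZFS on Linux / OpenZFS the ashift is passed when the pool (or a new vdev) is created; device names below are placeholders:

zpool create -o ashift=12 tank mirror da0 da1
zpool get ashift tank    # confirm what the pool was actually created with

On the FreeBSD/FreeNAS side the same thing is controlled by the vfs.zfs.min_auto_ashift sysctl, which as far as I know FreeNAS already defaults to 12.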

For reference, for most of my ZoL datasets I have the following set, with the intention of using them purely as storage with rare changes:

zfs set acltype=posixacl compression=lz4 xattr=sa atime=off relatime=off recordsize=4M aclinherit=passthrough dnodesize=auto exec=off your_pool/your_dataset

I address how to change the max recordsize here.

If you use large recordsizes, you ALWAYS want compression, even on incompressible data (like video). If you had no compression, set recordsize to 1M, and had a file slightly larger than 1M, it would take up 2M on disk. While a single block is variable in size, data that needs multiple blocks requires all the blocks to be the same size. Compression, however, allows the zeros in that last, partially filled block to be compressed away so it doesn’t take up a ridiculous amount of space.
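A quick way to sanity-check what that is saving on an existing dataset (the dataset name is a placeholder):

zfs get compressratio,recordsize,logicalused,used tank/media

Even with mostly incompressible media the ratio often sits a little above 1.00x, thanks to those zero-padded block tails compressing away.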

You can think of ashift as a minimum block size and recordsize as a maximum block size.


I’m just curious, why do you say that?

Thanks for your responses.
I think I’ll stick with mirrors for now.

@NicKF - in terms of a backup, I have my old PC (it was my main rig for about 7 years - a Core i7-3820, 4 cores/8 threads, 32GB RAM) which I plan to use as an Unraid box for backups. If I were going to run VMs, it would be on this box.

Where @Log says to always enable compression for large record sizes, I assume it’s because with a large recordsize (e.g. 1M), if you write a small (200kB) file it would take up 1MB on disk. If you have lots of small files, the wasted space adds up, and enabling compression reclaims that space. Is that correct?

@Donk - the FreeNAS box will be used primarily for storing photos and videos for editing. Speed is not a massive issue, although as mentioned I did put a 10GbE card in it.

I also have 2 x Samsung 840 Pro 256GB SSDs from old systems. Is there anything useful I can do with those in this box? Or should those be saved for the Unraid box?
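If you did ever want to try them as cache or log devices, they can be added to (and removed from) the pool without risk to the data; names below are examples, and as noted above neither is likely to be needed for this workload:

# Either as an L2ARC read cache (a single device is fine; losing it is harmless):
zpool add tank cache ada4
# ...or as a mirrored SLOG (only helps synchronous writes, e.g. NFS/iSCSI/VM storage):
zpool add tank log mirror ada4 ada5
# Cache and log devices can later be removed again:
zpool remove tank ada4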
