Torn Between ZFS and BTRFS for a new general purpose storage pool, need advice

Hello, I’m not terribly experienced posting to forums, so please let me know if I should post this in another section or if I should format this differently.

Allow me to set the scene.
When I'm not busy as a college student, I have been pouring my time into a server I like to call TR-Server. It hosts my own web servers and a Nextcloud instance via Docker, several Minecraft servers at a time for my friends and family, a super basic Plex media server, and other small related tasks like constantly running mcoverviewer to generate browser-viewable maps of my Minecraft servers, plus a self-written Discord bot or two for my friends. Very general purpose, doing a little bit of everything (and I'm sure I'll add more).

My current Hardware setup is as follows:
OS: Ubuntu 20.04LTS
CPU: AMD Threadripper 1920x
RAM: Corsair Vengeance 4x16GB DDR4 2400 MHz (I think CL14?)
MoBo: Asrock Fatal1ty X399 Professional Gaming (It’s a terrible name I know)
Boot Disk: SAMSUNG MZVLB512HAJQ (512GB)
Data Disk: INTEL SSDPEKNW010T8 (1TB)
GPU: AMD RX 460 2GB (Just for video out)

Recently my mother gave me a wonderful birthday gift of:
4X ST14000NM001G-2KJ103 (14TB)
And as a birthday treat for myself I got:
2X Samsung 970 EVO Plus (1TB)

I’m really excited for the things I could do with this hardware, but I don’t want to rush it as I don’t want to deal with the headache of migrating many TB of data when I realize a few weeks down the line I screwed up and should have gone with another setup.

To get to the point.
I want to add a general storage pool to my server. Essentially, I would like to store all my Docker containers, Minecraft servers, and CDN files on this storage pool, and if possible have it cache commonly accessed files on the SSDs. That way I wouldn't have to, for example, manually move Minecraft servers that are no longer in use between a fast SSD and a slower storage array. Considering the large amount of storage space, I'd also be interested in the possibility of setting up one or two SMB shares to store some large media (mainly Shadowplay recordings) without having to deal with the overhead of Nextcloud.

My main issue is that I can't seem to figure out if I should go with btrfs or if I should stick to ZFS. ZFS seems a lot more mature and would allow me to make the most of the storage I have with a raidz1 setup, while btrfs seems much more integrated with the Linux kernel and much more flexible in terms of long-term expansion, but not as stable in raid5/6 setups.

Also it’s not clear to me if I made the right choice purchasing two Samsung 970 EVO Plus drives (planned on doing mirrored for redundancy on the cache) since most of the utilities around btrfs or zfs seem to either not be able to take advantage of multiple “cache” drives or aren’t intended to work like a traditional cache at all.

All in all, I'm not sure what I should do.
Should I return the Samsung drives and invest in more ram instead?
Should I give up on the notion of trying to have my cake and eat it too with this array?
Should Valve release half life 3 already?

Thank you for any advice you’d be willing to give.

I don't see anything in your requirements that mandates the use of either ZFS or BTRFS. You could set up your system traditionally, using JFS or XFS, LVM and/or RAID (1, 5 or 6 is your call), with bcache for the SSD cache, but you'd give up on the slick (almost-fool-proof) GUI of solutions like TrueNAS or Unraid.

IMO your Mom's b'day gift would be better suited to a separate NAS, built with low(er)-spec hardware. Find yourself an old PC case (PSU included!) with at least 6 HDD bays, a decent AMD 500 series mainboard, a Ryzen 5 4600G (6c/12t) and 2x 8GB of RAM and you're off. Doesn't have to look pretty, just be functional :wink:

1 Like

Ah I forgot to communicate a few details that are pertinent.

I guess there isn't a hugely compelling reason I wanted to go with ZFS or btrfs; I was mostly just interested in the technology (though snapshotting seemed super useful for when I inevitably cock something up and want to roll back the changes I just made to my docker containers).

Also, I'm not so sure about a separate computer for the storage. This system actually used to run Unraid before I got frustrated with it and wanted to move back to Debian/Ubuntu. I found the GUI too limiting and the native environment too alien to feel comfortable, on top of abysmal CPU performance when running VMs.
I'm also a bit space-constrained, hence all my eggs in one basket, running everything on one system.

It is almost universally agreed upon at this point that there is no point in using BTRFS. Wendell has mentioned in many videos that ZFS simply has far more engineering hours poured into it, and this has resulted in a more efficient and reliable filesystem in every regard.

Also, it is possible to stripe NVMes for a cache drive; I have never heard of anyone mirroring them, and even if it is possible there is no point. RAID1 is usually employed for redundancy, as it only gives a minor read bonus and no write bonus. A stripe offers both, and if an L2ARC fails you will never lose data. It is not like a special metadata vdev that will kill the whole pool if it fails, or a regular striped vdev that will lose all data if one drive fails. A striped L2ARC will also provide double the capacity of a mirrored one.

With a 4-drive setup you could implement RAID-Z1 for a total capacity of 42TB, but I would actually recommend a striped mirror (the ZFS equivalent of a RAID10). This will let you survive a single drive failure, with a 67% chance of surviving a 2-drive failure. While it will cost you 14TB of usable capacity, this vdev arrangement will be much faster and slightly more resilient. Keep in mind that ZFS compression will let you get more out of the pool than even that 28TB theoretical limit, with basically no processing overhead. If you are using VMs and Docker containers with an L2ARC, the bottleneck will definitely be your rust drives. Of course, it's up to you what you prefer your fault tolerance and storage capacity to be. Those enterprise drives are known to be very reliable. ZFS also has great backup features, so you can zfs send to a separate location, either in the cloud or a different NAS, for offsite copies if you're that worried about failure.

Also, a SLOG NVMe will speed up a lot of your Docker, KVM, and NFS performance, since synchronous writes get written to your rust twice without it. Your motherboard has 64 PCIe lanes, so there is room for expansion for whatever you have the physical space for.

Kind of ramble-y, but those are my suggestions. Last note: if you're really feeling brave you could actually install root on ZFS, but this is very complicated and not for the faint of heart. Just something fun to maybe look into later.
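If it helps to see it concretely, that layout is only a couple of commands. This is just a sketch with placeholder device names (use /dev/disk/by-id paths for the real thing):

$ sudo zpool create -o ashift=12 tank \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd                            # striped mirror ("RAID10") of the 4x 14TB drives
$ sudo zpool add tank cache /dev/nvme1n1 /dev/nvme2n1   # two cache devices act as a striped L2ARC
$ zpool status tank                                     # verify the layout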

3 Likes

While I definitely like the idea of keeping more storage if possible, I'm willing to accept the trade-off if I get some better performance out of the pool. For a striped mirror setup (I'm guessing that's the same as a RAID 10), would that also be beneficial for read/write performance, or would I mainly be getting gains in resiliency?

As to the SLOG and L2ARC setup: could I use both NVMe drives striped together for both? Or would it be better to use one drive for L2ARC and one drive for SLOG?

Also, to touch on something @Dutch_Master said. Would you say ZFS would be applicable for this use case? I won't lie, I hadn't put much thought into whether I should or shouldn't use ZFS, just that I wanted to.

Yes, this will guarantee recovery from a 1-drive failure and give a 67% chance of recovery from a 2-drive failure, whereas RAIDZ1 only guarantees survival of 1 drive failure, period. It will cost you 14TB of capacity but will allow for much faster reads and writes. If you would rather have more space, that is a personal call and up to you. Here are some charts to help you make your decision. (RAIDZ1 is the same as RAID5, and RAID10 is the striped mirror.)

RAID5 vs Raid10 Performance Benchmark MDADM – Helm & Walter IT-Solutions (helmundwalter.de)

RAID 5 vs RAID 10 - Difference and Comparison | Diffen

If you lose a SLOG you will lose only the last 5 seconds or so of data. If you were running a high-transaction business off the pool this would be an issue and you would need power-loss-protection drives in mirrors, but for any other use case you don't need that. Either of those drives is also far bigger than a SLOG needs; you could get away with the smallest NVMe you can find, or even partition one of yours, although that is kind of complicated. SLOGs actually only speed up sync writes anyway, so maybe forget about that part for now altogether.

Yes, I think this sounds like a great use case. You will be getting redundancy, speed bonuses, compression, and most importantly a very valuable learning exercise. ZFS will be with us for a very long time and is absolutely worth learning. You will gain a lot from it both now and in the future. You have a nice system that can really take advantage of it.

The question should be put like this:

  • Q: Can you run ZFS?
  • A1: Yes
  • Then run ZFS
  • A2: No
  • Then run BTRFS*

*Unless you want to run anything above RAID mirror, then run md.
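For the md case in that footnote, it's basically the following (device names are placeholders, and the array needs a long initial resync):

$ sudo mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
$ sudo mkfs.btrfs /dev/md0          # or whatever filesystem you prefer on top
$ cat /proc/mdstat                  # watch the resync progress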

I've never heard a single person besides the not-Linux Linus being happy with Unraid. Most people use it for a while, then drop off from it. Aside from people like me, I've never heard people complain about Proxmox. Proxmox is a good choice and is used more in the enterprise than Unraid, and you can control it via CLI; you're not limited to the GUI. Ubuntu is fine too, but less preferable than Proxmox (the latter runs actual Debian).


For the storage, I don't know how you'd configure your pool, but I agree with Dutch_Master: getting a second PC that runs on peanuts and using that as a NAS would be preferable. I wouldn't go as far as getting a 6-bay one; one with 4 bays and 2 or 3 NVMe drives should do. You can run TrueNAS Core or Ubuntu on it (or Proxmox).

Anyway, I would put those 4 drives in striped mirrors and, if your motherboard supports it, up the RAM. Your motherboard has 8 slots, so I would buy another kit of 4 sticks. ZFS can handle read caching in ARC with that RAM.
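If you want to see whether the extra RAM is actually paying off, OpenZFS on Ubuntu ships a couple of tools for that. The ARC cap below is just an example value (48GiB), not a recommendation:

$ arc_summary | head -n 30          # current ARC size, target size, hit rates
$ arcstat 5                         # live ARC hit/miss stats every 5 seconds
# optional: cap the ARC so VMs and containers keep enough free RAM
$ echo "options zfs zfs_arc_max=51539607552" | sudo tee /etc/modprobe.d/zfs.conf
$ sudo update-initramfs -u          # reboot afterwards for the module option to apply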

I would not make an L2ARC, you are not really going to need that much capacity for caching reads. What I would do if I had the space to add the additional disks, would be to add a metadata special device:

This basically functions like an index for the entire pool: ZFS doesn't have to spin the rust to find stuff, because it knows each file's exact location, so it should improve access times.
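For reference, bolting one on looks roughly like this (pool and partition names are placeholders); the special vdev has to be mirrored, because losing it loses the pool:

$ sudo zpool add tank special mirror /dev/nvme1n1p2 /dev/nvme2n1p2
# optionally also store small blocks (not just metadata) on the SSDs:
$ sudo zfs set special_small_blocks=64K tank
$ zpool list -v tank                # confirm the special vdev shows up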

I would not go for a SLOG either. I ran a Samba server on a Core 2 Duo with 8 GB of RAM on 2x (md) RAID1 pools for 70 people. Later I moved to a dual-core Celeron HP ProLiant MicroServer gen8 (because that Core 2 Duo was running on borrowed time) with the same amount of RAM (only DDR3 this time), with newer drives in RAID10 (still md). Do you think Minecraft will really benefit from all the caching and stuff? I highly doubt it, unless you've got a few TB of data that are being accessed frequently. I don't think Minecraft can benefit from a large L2ARC, IMO.

Correct.

You get 4x the reads and 2x the writes. If you go with 2 disks in RAID1, you get 2x the reads, resiliency, and no write increase. If you go with 2 disks in RAID0, you get 2x reads and 2x writes. Combine them and you get 4x reads, 2x writes, and resiliency.

It’s just a very good file system, nothing else. How you use it matters more. But it does have nice features, like compression, which basically means free storage. Use zstd.
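Enabling it is a one-liner (pool name is a placeholder); note it only compresses data written after the property is set:

$ sudo zfs set compression=zstd tank
$ zfs get compression,compressratio tank      # check the setting and the achieved ratio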

Besides the L2ARC (which is just caching), I am not aware of any option for tiered storage, which would be more useful.

In your config, I would use the spinning rust for data, make a separate pool from the SSDs, and place the OS boot vdrives there. Well, I'd be using ZFS, so I'd just make a filesystem for each, like I already do:

$ zfs list
NAME                                 USED  AVAIL     REFER  MOUNTPOINT
sgiw                                 144G  8.83T       96K  none
sgiw/kvm                             144G  8.83T       96K  none
zroot                                772G   150G       96K  none
zroot/ROOT                          19.6G   150G       96K  none
zroot/kvm                            751G   150G       96K  none
zroot/lxd                            960K   150G       96K  legacy

zroot is the root mirror for my system, where the OS drives live, and sgiw is my spinning rust. I cut the cruft out of the output (it's like 30 lines); I wanted to keep it short. Under zroot/kvm I have my OS filesystems, and under sgiw/kvm I have their data drives. As you can see, I currently have more stuff on zroot than sgiw, but that's going to change.

^ This. Really no point in having a separate SLOG device.

1 Like

That is an excellent write-up by Wendell that I have read many times but I think the last thing this kid needs is a metadata drive. Those are really only useful on large pools with TONS of different files and folders. Usually company-wide NFS shares or something. Definitely not a workstation/homelab for a single user. That will make the entire pool much more fragile as well.

Sorry if we overwhelmed you with all this pedantic jargon. My suggestion: go without the SLOG and metadata device and all that, just do RAID-10 with a striped L2ARC, and start from there.

1 Like

The metadata special device is basically where things like the filesystem tree live, so you can run ls, find, or similar commands that involve browsing the pool as if you were on an SSD array. Anything can benefit from it. Sure, you get far more benefit if you have a lot of users that would normally slow the spinning rust down with tons of parallel file lookups, but anyone can benefit from it.

This is my suggestion. And do a second RAID1 pool with the 970 evos, use that for OS boot drives and the RAID-10 for data.

No need for L2ARC either, you are not going to benefit much from it.

Dare I say that the L2ARC is more dependent on having tons of users than the special device is; caching on storage with <20 users doesn't make a big difference, since people will just move on to browsing other stuff. And again, I have experience with far more underpowered hardware running a lot more crap on it; people nowadays are really over-speccing their systems. If the OP didn't already have the TR 1920X, I would have recommended he buy a 5600X.

Your mom is really cool :sunglasses:

Here’s a ZFS recipe:

Partition manually.

Partition the EVOs using GPT.

The first partition will be your md-mirrored /boot (give it 1G of space before starting the next partition, though you may be limited to 512M in size); the second partition, from 1G onward, should use LUKS.
When you're LUKS-formatting, make sure you pick LUKS version 2 and a 4K physical sector size, and round the GPT partition size to 1M by typing the size in manually.
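A rough sketch of those steps with sgdisk and cryptsetup (device names and sizes are placeholders, adjust to taste):

$ sudo sgdisk -n 1:1M:+1G -t 1:fd00 /dev/nvme1n1    # /boot half of the md mirror
$ sudo sgdisk -n 2:0:0    -t 2:8309 /dev/nvme1n1    # rest of the drive for LUKS
$ sudo cryptsetup luksFormat --type luks2 --sector-size 4096 /dev/nvme1n1p2
$ sudo cryptsetup open /dev/nvme1n1p2 evo1          # repeat all of this for the second EVO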

On the LUKS partitions on the EVOs, bring up LVM physical volumes - you'll use these for the mirrored ZIL, L2ARC and special metadata device.
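Assuming the opened LUKS mappers are called evo1 and evo2 (placeholder names), that would be roughly:

$ sudo pvcreate /dev/mapper/evo1 /dev/mapper/evo2
$ sudo vgcreate evo1vg /dev/mapper/evo1             # one volume group per EVO keeps the
$ sudo vgcreate evo2vg /dev/mapper/evo2             # mirroring explicit at the ZFS level
$ sudo vgs                                          # confirm both VGs are there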


On the spinning rust, add a single GPT partition to each drive, leaving the first 10G empty and rounding the partition size to 1G. Put LUKS on it with a 4K sector size, bring up a pool with a pair of mirror vdevs, and use large record sizes; try 16M.
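Roughly, once the single LUKS partition on each Seagate is opened (placeholder mapper names rust1 through rust4); note that a recordsize above 1M may also need the zfs_max_recordsize module parameter raised, depending on your ZFS version:

$ sudo zpool create -o ashift=12 tank \
    mirror /dev/mapper/rust1 /dev/mapper/rust2 \
    mirror /dev/mapper/rust3 /dev/mapper/rust4
$ sudo zfs set recordsize=1M tank            # bump higher only after raising zfs_max_recordsize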

Take some space from each of your EVO PVs for ZIL (16G is fine), make this mirrored.

Take some EVO space for special metadata device (basically for your OS and file directory stuff). e.g. 512G each, make this mirrored.

Take some space for L2ARC, e.g. 256G from each of the EVOs.
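Putting those last three steps together could look something like this, using the volume groups from earlier (names and sizes are placeholders):

$ for vg in evo1vg evo2vg; do
>   sudo lvcreate -L 16G  -n slog  $vg
>   sudo lvcreate -L 512G -n meta  $vg
>   sudo lvcreate -L 256G -n cache $vg
> done
$ sudo zpool add tank log     mirror /dev/evo1vg/slog /dev/evo2vg/slog
$ sudo zpool add tank special mirror /dev/evo1vg/meta /dev/evo2vg/meta
$ sudo zpool add tank cache   /dev/evo1vg/cache /dev/evo2vg/cache    # cache devices are never mirrored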


Because you used LVM on your EVOs you can grow/shrink L2ARC size and grow metadata vdevs as your needs grow.

Because you used LUKS, you can mail your drive to NSA without formatting it once it starts throwing bad blocks or once it starts making weird noises.


Before doing any of the formatting on the HDDs (not on the SSDs), run the badblocks tool; it'll take a few days.
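Something like the following per drive; note this is the destructive write test, so only run it before the drives hold any data:

$ sudo badblocks -wsv -b 4096 /dev/sdX       # -w write test, -s progress, -v verbose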


I wouldn’t mess with ZFS encryption … too new, and how do you unlock disks on boot?

Oh, and do back up the LUKS keys, on a piece of paper somewhere. You can have multiple keys on each volume and one of the keys can be a passphrase. You can unlock all 6 disks on boot using the same passphrase that you type in once… And you can e.g. have a USB flash drive for unlocking maybe if you want to reboot unattended.
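For the key and header housekeeping, per LUKS volume (paths are placeholders; store the header backups off the machine):

$ sudo cryptsetup luksHeaderBackup /dev/nvme1n1p2 --header-backup-file evo1-luks-header.img
$ sudo cryptsetup luksAddKey /dev/nvme1n1p2 /root/keys/usb-keyfile   # extra keyfile, e.g. kept on a USB stick
$ sudo cryptsetup luksDump /dev/nvme1n1p2    # shows which key slots are in use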


On your zpool, keep the dataset for bulk storage (movies) separate from your backups and from your OS dataset(s).
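i.e. a handful of datasets along these lines (names are just examples):

$ sudo zfs create tank/media        # Shadowplay recordings, SMB share
$ sudo zfs create tank/backups
$ sudo zfs create tank/minecraft
$ zfs list -r tank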

3 Likes

I’ll add my “me too” to picking ZFS vs BTRFS.

My only suggestion would be to read up on it properly before diving headfirst into it.

ZFS can do amazing stuff. You can also screw yourself properly with it if you do stupid things without reading the docs first.

2 Likes

From the little testing I’ve done with BTRFS I’ve been a little disappointed with its read performance, especially when more than one thing is doing IO. It’s worth mentioning that ZFS will get improved read performance with a mirror, but BTRFS doesn’t (I think technically it can but only when you have multiple processes doing IO).

With the above said, I do use a BTRFS “RAID1” in my main linux desktop, and it works well enough as a separate storage pool. I’ve also rebuilt it through multiple disks, so it is reliable enough.

@Hulkstern

For situations like this I think the best way forward is to simply try both options. Set up a ZFS mirror with two drives, and a BTRFS "RAID1" with the other two drives. See which option you prefer. I'd also deploy the new ZFS pool separately from your current root pool, so it's more portable between systems.
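If you go that route, the two test setups are only a couple of commands each; device names are placeholders, and this wipes the disks involved:

$ sudo zpool create testzfs mirror /dev/sda /dev/sdb          # ZFS mirror on two of the 14TB drives
$ sudo mkfs.btrfs -d raid1 -m raid1 /dev/sdc /dev/sdd         # BTRFS "RAID1" on the other two
$ sudo mkdir -p /mnt/testbtrfs && sudo mount /dev/sdc /mnt/testbtrfs
$ sudo btrfs filesystem usage /mnt/testbtrfs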

I'll also mention that using an SSD cache can be very hit and miss; it depends on what you're using it for. ZFS especially has 3 types of SSD "cache" devices that do very specific things. And in general, RAM will act as a read/write cache and is usually good enough. So, another vote for a separate SSD pool.

1 Like

You can also play with ZFS using files as vdevs, if you want to experiment.
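For example, a throwaway pool backed by sparse files, which you can destroy afterwards without touching real disks:

$ truncate -s 2G /tmp/disk1 /tmp/disk2 /tmp/disk3 /tmp/disk4
$ sudo zpool create playground mirror /tmp/disk1 /tmp/disk2 mirror /tmp/disk3 /tmp/disk4
$ zpool status playground
$ sudo zpool destroy playground && rm /tmp/disk{1..4}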

thank you so much, this really saved my bacon