Soliciting ZFS pool layout advice for a multi-purpose homelab server

Hi All,

1st post here, although I’ve been watching Wendell’s videos for a while. I’m not a professional sysadmin (I was in the late ’90s, but that’s just giving away my age)… these days I earn a living playing with lasers. My nascent love of homelabbing hasn’t been well-fed for the past couple of decades, but I’m getting frustrated with the organic growth path things have followed in my house, and I’ve decided to finally do something about it.

Step 1 is to tear down my Windows-based Hyper-V server and rebuild it into something better. I’ve watched a lot and read a bit here and there, and I’ve decided against TrueNAS (or any other all-in-one solution)… just not enough fun/challenge for my tastes. Way back when, I was a FreeBSD fan (it was a great option for DNS and web servers during the dot-com boom, and probably still is), so I’ll be going back to my roots.

Here’s what the re-imagined server will be:

  • FreeBSD 14.3-RELEASE on the hardware
  • Hyper-V VMs will be migrated to bhyve (2 of these, one is a Windows PDC and the other runs Exchange server)
  • Samba integration with Active Directory so I can re-create shares, etc, and have things like folder redirection look the same for my wife and kids.
  • another bhyve VM running jellyfin, serving media from my new big pool of HDDs

Here’s the hardware I’m working with:

  • Supermicro X11DPH-i (I wish I’d had the foresight to get the -T, because now I need to get a 10GbE NIC)
  • dual Xeon 8260s
  • 128GB RAM (only 1/2 the slots populated, so there’s room for more if I have to spend $)
  • 4x Samsung PM983 m.2 SSDs, 960GB. 2 are in the mainboard slots, and the other 2 are in an Asus HyperM.2 x16
  • 6x Samsung PM983 m.2 SSDs, 1.92TB. All are installed in Asus HyperM.2 cards
  • 6x Western Digital recertified SAS HDDs, 14TB, will be connected to a Broadcom 9400-8i

The plan is to re-rip all my physical media without compression (my current library has a lot of titles that were ripped before I owned a 4K display, so I optimized for small file size due to limited storage space).

  • Newly ripped titles will sit on a RAIDZ2 pool formed using all 6 of the 14TB drives.
  • Other storage needs are modest, < 3TB for all of our photos, music, and files. These are accessed regularly as part of my Active Directory folder redirection setup, so I’m thinking these should live on an SSD-based pool.

My question for the experts: what’s a good pool and dataset configuration to serve my needs for fast (personal files) and slow (media files) storage, as well as my 3 VMs? I’m getting a little lost with all the options: SLOGs, special vdevs, L2ARC, and all the options for dataset creation (e.g. block size).

I’d like some resilience / redundancy, and ZFS seems great for that. Getting it set up so that I’m not kicking myself for a stupid mistake later is my big worry. Thanks for the help!

1 Like

FreeBSD is an interesting choice. I don’t think it’s unfair to say it’s less well developed than Linux/KVM as a VM host; I don’t see why it couldn’t work for you, I only suggest understanding that going in.

If you’re going manual (great!), the main thing I’d recommend is to make a trial install on an old device and play with it as a host: create pools (on loopback files if you need to), replicate datasets from A to B. At least try each operation once (including replace).

This is more of a quickstart than an expert view, but for ZFS -

ashift=12 means 4K sectors (not to be confused with recordsize). The other parameters people typically tune can be changed later. I don’t think there’s a strong reason not to enable compression in most cases; you can toggle the property later, but it only applies to newly written data, so if you’ve already copied everything, you might have to copy it again.

There’s a common opinion that you shouldn’t put data in the top-level dataset, e.g. make a zpool/stuff even if you only think you’ll need one dataset.
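To make that concrete, a minimal sketch (pool and device names are placeholders, not checked against your hardware):

    # 4K sectors and compression from the start; device names are made up
    zpool create -o ashift=12 -O compression=lz4 mypool mirror /dev/nvd0 /dev/nvd1
    # keep actual data out of the top-level dataset
    zfs create mypool/stuff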

Into the more subjective: I’d be tempted to put all six of the 2TB SSDs directly on the host, as three mirrored pairs, like:
2TB = 2TB
2TB = 2TB
2TB = 2TB
= ~6TB pool, 1 drive failure tolerant per pair
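A sketch of that layout (again, placeholder device names, unverified):

    zpool create -o ashift=12 -O compression=lz4 fast \
        mirror /dev/nvd0 /dev/nvd1 \
        mirror /dev/nvd2 /dev/nvd3 \
        mirror /dev/nvd4 /dev/nvd5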

Additional resource for your consideration

Benefit:
a fast configuration, and you can later replace one 2TB drive, then the other, in one of the pairs, and then you have additional storage.

If you were to arrange for the VM OS disk images to live here, and serve your ‘fast’ files from a VM whose disk image is also here, that should be performant. Generally it’s not ‘clean’ to serve anything from the host directly (maybe you could run it from a jail), but you don’t have to start with that, and it’s always best to break a migration down into separate problems you can attack at different times.

Snapshot the pool, and replicate it on a schedule to a dataset on the HDD RAIDZ2.
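One replication cycle looks roughly like this (names are hypothetical; tools like zrepl or sanoid/syncoid will automate the schedule for you):

    # snapshot the fast pool
    zfs snapshot -r fast@2024-06-01
    # first time: full send into a dataset on the HDD pool
    zfs send -R fast@2024-06-01 | zfs receive -u tank/backup/fast
    # afterwards: incrementals between the last two snapshots
    zfs send -R -i fast@2024-06-01 fast@2024-06-02 | zfs receive -u tank/backup/fast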

To which:

  • an Optane/NVMe SLOG (mirror) can be attached, or ignore it and set sync=disabled
  • a special vdev for metadata is a point of failure, but elsewhere on this forum it has been demonstrated to offer a performance benefit. However, I think if you’re not hammering it with multithreaded writes 24/7 it’s probably not worth it, considering you will have a fast pool.
  • L2ARC - if you have a spare NVMe left over, then certainly attach it.

I think it’s fun though to consider yourself how to best deploy, and you’ll probably want to make some adjustments over time.

2 Likes

Ha! Yes, I’m looking forward to the challenge of getting bhyve to behave… thankfully I’m not looking to go crazy with it, and I did do my homework and picked up an Intel Arc to pass to the jellyfin VM for transcoding. Word on the street is that Nvidia pass-through doesn’t work well with bhyve, and the Arc was cheaper than the AMD options I could find.

Serving files from within a Windows VM isn’t something I’d considered. I can see some logic for arguing that it’s not ideal to have the host serving up files directly. I’ll take a look at putting the Samba setup in a jailed server… eventually. For an early start I’ll probably set up a bhyve VM to do the job, just to limit the amount of new learning I’m trying to cram in to my aging head.

This is pure gold. I’m already sort of setting up for that, with an old Asus P8Z68 motherboard and an eBay special 3rd-gen Core i7 with 32GB of cheap no-name RAM. Every old SSD in the house is getting tossed in, mostly Samsung 850 EVOs, along with 3 old Seagate HDDs. The plan is to spin up bhyve on this clunker to learn how to migrate my Hyper-V VMs… I guess I’ll have to add some ZFS tasks to the list of things this machine can teach me. Thanks!

1 Like

OpenZFS has added lots of features in the last few years, one of the more interesting is the Special VDEV which is intended to use fast (typically NVMe or at least SATA/SAS SSD’s) to store metadata and small files to speed things up.

You could set up multiple pools as you suggest, or you could center your pool around the hard drives and use your various NVMe drives in ways that speed up the hard drive pool.

In my experience, while I absolutely love ZFS, all-NVMe pools under ZFS never really perform as well as one would hope. Don’t get me wrong. They perform faster than hard drives, but for whatever reason they never quite reach their potential. This likely has something to do with how ZFS is optimized around hard drives from the core up. Maybe this will improve over time.

My main pool (which resides on my main proxmox box) has similar use cases to yours, and it is structured like this:

      raidz2
        16TB Seagate Exos x18 (7200rpm, SATA)
        16TB Seagate Exos x18 (7200rpm, SATA)
        16TB Seagate Exos x18 (7200rpm, SATA)
        16TB Seagate Exos x18 (7200rpm, SATA)
        16TB Seagate Exos x18 (7200rpm, SATA)
        16TB Seagate Exos x18 (7200rpm, SATA)
      raidz2
        16TB Seagate Exos x18 (7200rpm, SATA)
        16TB Seagate Exos x18 (7200rpm, SATA)
        16TB Seagate Exos x18 (7200rpm, SATA)
        16TB Seagate Exos x18 (7200rpm, SATA)
        16TB Seagate Exos x18 (7200rpm, SATA)
        16TB Seagate Exos x18 (7200rpm, SATA)
    special
      mirror
        2TB Gen3 MLC Inland Premium NVMe drive
        2TB Gen3 MLC Inland Premium NVMe drive
        2TB Gen3 MLC Inland Premium NVMe drive
    logs
      mirror
        375GB Optane DC p4800x
        375GB Optane DC p4800x
    cache
      4TB WD Black SN850x (gen4)
      4TB WD Black SN850x (gen4)

This pool stores all of my data, including personal files, house shared folders, media library, etc.

The main data pools are RAIDz2. I have two of them, in part because I started with a 6-drive pool like you are intending, and added a second VDEV as my needs grew. ZFS also uses the two RAIDz2 VDEV’s in parallel which speeds things up. I rather accidentally fell into this configuration, but as luck would have it it has worked very well for me.

The small files and metadata lookups are lightning fast because I have configured the special VDEV to handle these.

The special VDEV is a three-way mirror, since if you lose it you lose the entire pool, so I wanted to match the redundancy level of the main data VDEV’s (the two RAIDz2’s). Either of those can lose two drives without losing the pool, so I wanted to do the same with the special VDEV.
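For reference, adding a special vdev like that is a one-liner; a rough sketch with placeholder device names (zpool may want -f since the mirror doesn’t match the raidz2 layout, and on an existing pool it only affects newly written data):

    # three-way mirror to match the raidz2's two-drive fault tolerance
    zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1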

The log drives (or slog as some call them) are dedicated drives for the ZFS Intent Log (or ZIL). These drives are only ever written to in normal operation. They speed up disk writes by committing write intents to NVRAM and reporting this back to the writing process, so it can continue writing. The log drives are never read from unless something goes wrong; data is instead committed directly from RAM to the main pool. The only time they are ever read from is during boot/import of a pool that has suffered an unclean shutdown. ZFS will then read the ZFS intent log and re-assemble the final writes where they belong before finishing the import.

In this application, low latency is king, thus the Optanes. They only ever hold the last few seconds worth of writes, so they can be very small. Smaller, and older Gen3 optanes work very well in this application, and these can still be found relatively inexpensively.

A slog drive isn’t necessarily required; it just speeds up sync writes. Without one you may find that some writes are very slow, as they go to a ZIL inside the main data pool instead. You can speed this up by configuring your datasets to disable sync writes, which makes writes really fast, but now if something happens (server crash or power failure) you lose the in-flight data that is in RAM but not yet written out to the pool.

Depending on the workload this can be either important or not at all. If you are in the process of transferring a ripped Blu-ray to your storage, it does not matter at all; you are going to restart that transfer from the beginning anyway, as a partial Blu-ray isn’t of much use. For database writes or VM disk images - however - those last few seconds of in-flight data can mean the difference between corruption and no corruption. So you evaluate your needs, and set things up based on those needs.
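That evaluation can be made per dataset, since sync is just a property; for example (dataset names are hypothetical):

    # media rips: async-only is fine, a torn transfer just gets re-copied
    zfs set sync=disabled tank/media
    # VM disk images: honor (or force) sync to protect in-flight writes
    zfs set sync=always tank/vms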

The cache devices are essentially read cache. They also go by the name L2ARC. (ARC is the cache in main RAM; L2ARC is the cache on the cache devices.)

You can lose the cache devices without losing your pool, so they don’t need to be redundant. In my case the two drives are essentially spanned, providing a total of 8TB of NVMe read cache for the hard drives.

These can either help speed things up a lot, or not at all, and it really depends on your working data-set. ZFS really loves RAM, and each time you add features to it, it likes to have even more RAM. So if you add cache devices that are smaller than your active working set, they can actually in some cases slow things down, as now there is slightly less RAM for the main ARC, and the L2ARC is just constantly being thrashed because your working set is too large for it.
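The nice thing is that cache devices are cheap to experiment with, because they can be added and dropped at any time and you can watch whether they actually get hits; a rough sketch (device name is a placeholder; arc_summary ships with OpenZFS on Linux, on FreeBSD the same counters live under the kstat.zfs.misc.arcstats sysctls):

    # attach a cache device, and detach it again if it isn't earning its keep
    zpool add tank cache /dev/nvme3n1
    zpool remove tank /dev/nvme3n1
    # check ARC and L2ARC hit rates
    arc_summary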

Over the years (as SSD’s have become more affordable) I have used the following for cache (L2ARC) devices:

  • none at all
  • two 128GB sata SSD’s
  • two 512GB SATA SSD’s
  • two 1TB SATA SSD’s; and finally
  • two 4TB NVMe SSD’s.

In all but the last configuration, in my use case, the cache hit percentages were absolutely atrocious. I kept increasing the size hoping I would get to the point where a reasonable enough proportion of my random pool reads would be cached in the L2ARC, such that it sped up reads overall. This did not happen (for me, with my use) until the cache hit 8TB.

Just like you, my box is an all-in-one system (though I have added a second, smaller node since). It contains storage for my house, but it is also my VM box. I do this with Proxmox.

I found that sharing a pool for both my VM’s and my storage needs was suboptimal, as the VM’s would take a hit during heavy disk activity (like dumping a 2TB disk image to the NAS at high write speeds over 40Gbit networking).

Because of this, I have additional pools on my box:

My boot pool. (proxmox sets this up for you on install if you’d like, and it works very well)

	  mirror
	    64GB Optane M10
	    64GB Optane M10
	logs	
	  mirror
	    375GB Optane DC p4800x
	    375GB Optane DC p4800x

This was primarily to give myself some redundancy on the boot pool for the Proxmox machine. The Optane M-series are those small Optane drives Intel used to sell as dedicated cache devices for consumer machines.

They aren’t exactly “enterprise” but they feel a lot like it, and they seem way more robust than anything else consumer-grade I have ever used. Such is the glory of Optane, I suppose.

They are only Gen3 x2, so sequentials aren’t blisteringly fast, but they still have better random performance and IO latency than any non-Optane SSD.

The best part is, a couple of years ago I bought a 20-pack of the little 16GB variants on eBay, brand new, for less than $50 just to play with. The Proxmox install on my system is only 5GB, so the 16GB variants would be plenty for this and dirt cheap. (They have gone up a little as people have discovered how cheap they were and used them for mini-PC’s and as mirrored server boot drives, but they are still a very affordable option.)

And here is a real controversial part of my configuration.

You’ll notice I have the same Optane drives as log devices on this pool as on my main storage pool. I noticed that the Optanes can take pretty much anything you throw at them without slowing down, up to crazy high threaded and queue-depth loads.

So I decided to just partition these drives, and use them as SLOG’s across all of the pools in this box. I’d be lying if I said it was my idea though. I got it from the main reviewer over at ServeTheHome (blanking on his name right now) in one of his reviews of Optane drives from years ago, heaping praise on these drives.

Some will tell you this is a terrible idea, but it has worked very well for me, and I have had zero problems in my 5 years of using Optane drives this way in the SLOG role (first two 280GB 900P’s and now two 375GB p4800x’s). And since you can lose them without losing your pool, it is pretty low risk. The only way a mirrored pair of SLOG’s will lose you your data is if your box goes down and both of those SLOG drives fail at the same time.

I expected that doing this might result in slowdowns when more than one pool tries to sync-write at the same time, but this simply hasn’t happened. I can’t explain it. Optanes are just that amazing. Pure magic. It’s a shame they weren’t profitable enough for Intel to keep them around.

Next pool is my VM Data pool:

      mirror
        1TB Samsung 980 PRO
        1TB Samsung 980 PRO
    logs
      mirror
        375GB Optane DC p4800x
        375GB Optane DC p4800x

This pool is a dedicated mirror for VM drive images.

Mirrors work best for performance, which is why I used that. I also don’t have a lot of storage needs for my many hosts, many use only a few GB each, so I never needed more than this.

By default, when you assign a ZFS pool as the VM storage pool, Proxmox creates subvols (datasets) for containers and emulated block devices (instead of drive image files) on top of the ZFS pool for KVM drive images. This is actually really cool, as these virtual drives appear on the host system as /dev/zd0, /dev/zd1, etc. and behave like real block devices. They seem to perform well in VM’s as well. And you can snapshot them using zfs snapshot, which is great too.
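For example (the dataset name below is hypothetical, just in the style Proxmox uses), a zvol lists and snapshots like any other dataset:

    # list the zvols backing VM disks
    zfs list -t volume
    # snapshot a VM disk before a risky change, and roll back if needed
    zfs snapshot rpool/data/vm-100-disk-0@pre-upgrade
    zfs rollback rpool/data/vm-100-disk-0@pre-upgrade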

I will admit to making a mistake the last time I upgraded this pool, though. And that mistake was to use consumer “Samsung Pro” drives. I have since learned that Samsung consumer SSD’s (which their 980 and 990 Pro models are, despite the “Pro” name) do not work well in server applications, especially where there is heavy load or long uptimes. They have a tendency to suffer weird firmware lockups and become unavailable until you power cycle the server.

My next upgrade is going to be to replace these with some more serious enterprise NVMe drives.

Some consumer parts work very well in servers, and I have a long history of using consumer SSD’s from many brands in my servers (including old SATA Samsung PRO drives) and in the over a decade I have been doing this, the Samsung 980 Pro and 990 Pro drives have been the first to bite me in the ass. Even my Micro Center house brand (Inland Premium) Gen3 drives have been bulletproof.

Luckily I haven’t had any data loss result from it though. I guess that is one of the benefits from ZFS and how it handles redundancy.

Yeah, so these Samsung drives won’t be around much longer. You can read more about these issues with Samsung Pro drives in this discussion over on ServeTheHome.

And then there is my final pool in this box.

	  mirror
	    1TB Inland Premium Gen3 NVMe
	    1TB Inland Premium Gen3 NVMe

This is a dedicated pool for one of my VM’s, (or actually, it’s an LXC container, but that’s not relevant).

That container runs a dedicated MythTV backend. MythTV is an open-source project that works as a DVR/PVR system, recording your scheduled shows from your TV service. (Yeah I know, not too many people keep cable subscriptions anymore)

I used to record straight to the hard drive pool, but I found that occasionally my recordings had stutters in them, so I decided it would be better to record to a dedicated SSD pool. It is a TB in size, but a script runs every night at 4am, and moves the oldest recordings to the main hard drive pool in a kind of semi-manual storage tiering solution.

The MythTV database is great like that. It doesn’t care where a file is, as long as it can find it on one of the filesystems defined for recording use and currently mounted, so by just moving the files every night it still knows where everything is and just works.

Note the absence of the slog drives in this pool. They are unnecessary here. The entire pool is set to sync=disabled, and thus operates only with async writes. This is one of those situations where if the TV recording isn’t complete, I don’t want it anyway; incomplete recordings are purged and re-recorded during a different airing instead. So the last few seconds of written data are completely irrelevant.

Anyway.

I figured by sharing how I have things set up in a similar configuration, you could learn a little about it, and decide what might work for you. Maybe that is - as you originally proposed - a dedicated fast NVMe pool for your files, separate from your rips, or maybe that is using some of those NVMe drives in Special or log configurations to just speed up the hard drive pool and keep everything in one place.

Or maybe - like me - you want to have a dedicated VM mirror and boot pool.

The possibilities are endless :slight_smile:

I hope this was helpful.

5 Likes

This is how you zpool!

1 Like

Thank you mattlach, your description of your setup and the thinking behind it is tremendously helpful. Follow-up question: what’s the thinking behind attaching SLOG devices to NVMe pools? I suspect it has something to do with ZFS only dumping from ARC to disk every 5 seconds, but as a noob I’m really not quite putting the pieces together yet… this guess could be totally wrong.

For a homelab:

  1. You don’t need to mirror the SLOG. The SLOG is never read during normal operation; it is only read when the zpool was shut down unsafely. The SLOG is thus already the “backup” path, and it would be very unlikely for your SLOG to be dead at the same time.
  2. You don’t need a SLOG for an NVMe-only pool; use logbias=throughput instead (example below). The performance difference with or without a SLOG is very limited.
  3. You don’t need a big SLOG. The SLOG won’t exceed your memory size.
  4. You don’t need special devices in most cases. Special devices only benefit normal datasets, not zvols, and they create another failure point in your system.
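For point 2, logbias is a per-dataset property; a one-liner with a hypothetical dataset name:

    # prefer throughput over latency for sync writes on an all-NVMe pool
    zfs set logbias=throughput fastpool/vms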

If you have unlimited budget, please ignore this message.

1 Like

I agree, the risk is very small.

I started doing this a long time ago when mirroring the SLOG was the recommendation, but is it really necessary? No. You’d have to have the system go down uncleanly and have a SLOG device fail at the same time for this to result in data loss.

Is that impossible? No. But it is rather unlikely unless you have a catastrophic failure (server under water, things on fire, etc.), in which case restoring from backup is the more likely course of action.

If I were creating this setup today I would probably have used only the one SLOG device.

Not sure I agree here.

With logbias=throughput, the ZIL is bypassed entirely (even if a SLOG is installed) and data is instead written sequentially to the pool. On non-SSD’s this is not recommended as it results in horrific fragmentation, but to your point this is less of a concern on an SSD (traditional or NVMe) where seek times are not a factor to the same extent as on a hard drive with mechanical heads.

I’m still not a fan though.

The justification for using logbias=throughput is that NVMe devices write fast enough that you no longer need a ZIL for sync writes as they will complete fast anyway. In my testing this just hasn’t been the case.

I have tested normal NVMe devices and found that sync writes are still very disappointing on most of them. The fact that your data pool is on NVMe does not seem to negate the benefits of a proper SLOG device. I’d argue that this belief is mostly a myth, and if you want fast sync writes - regardless of what devices your main data pool is on, a good SLOG is the way to achieve that.
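If you want to check this on your own hardware, a quick-and-dirty sync write test is easy; a rough sketch using fio (the path is a placeholder, delete the test file afterwards):

    # 4K sync writes at queue depth 1, roughly what fsync-heavy VM/database traffic looks like
    fio --name=syncwrite --filename=/tank/fio-test --size=1G \
        --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --fsync=1
    rm /tank/fio-test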

Optanes are one awesome way to accomplish this, but they are no longer being manufactured, and the existing supply out there is quickly drying up. Another way to do it is with battery backed RAM drives (like ZeusRAM) which work very well because SLOG’s don’t need to be large.

That said, the question is “do you really have enough sync writes to warrant spending money on a SLOG device?”, and that is really going to be dependent on workload.

These days, now that I have a good SLOG device in the Optanes, I set sync=always on my VM data pool to err on the safe side and avoid drive image corruption. Though admittedly, I take this approach mostly because the Optane SLOG makes it essentially free from a performance-cost perspective. If I didn’t have the Optane SLOG, I’d probably be using sync=standard to try to minimize the performance cost.

Essentially, it is a bit of a luxury which I can afford because I pounced on my Optanes when I saw a good deal. It is far from something I would suggest to anyone is a hard requirement.

The positive here is that SLOG’s can be added and removed at will. If you set up a pool, and find that performance is disappointing due to sync writes, you can always add a good SLOG later. (provided you have the available PCIe lanes for the low latency NVMe device.)

I agree that SLOGS don’t need to be large. I disagree that they have anything to do with memory size.

It has more to do with transaction group (TXG) lengths. Typically these are 5 seconds long. So the maximum necessary size of a SLOG is going to be enough to sustain 5 seconds of writes.

On a system that uses the pool locally (like for VM’s) this can get a little larger, but for systems that primarily see sync writes over a network connection, it is very likely going to be limited by the NIC.

For my local VM’s on the server, they could potentially write at the max write speed of my Optane p4800x drives for 5 seconds. The sustained write speed spec for these Gen3 devices is 2000MB/s, so the largest I could ever possibly need here is 10GB, though this is going to be overkill, as something else is almost always going to be limiting the write speed other than the Optane device itself.

For cases where network devices are the limiting factor, think of it like this:
Gigabit - Max theoretical write speed=125MB/s, so SLOG only needs to be 625MB
10Gig - Max theoretical write speed=1250MB/s so SLOG only needs to be 6.25GB
Etc. etc.

The practical reality for SLOG’s - however - is that, with a few notable exceptions (like ZeusRAM devices), drives aren’t available in sizes as small as a SLOG actually needs. This is especially true when you limit yourself to only devices that make for good SLOG devices.

Most SSD’s (even enterprise NVMe drives) are utterly horrible as SLOG’s. They simply do not have low enough latency to be good in that role.

ServeTheHome did a good write-up on this a few years back:

The lowest latency possible is key here, and as you can see, Optane drives absolutely trounce everything else. This particular test is a bit old, but nothing much has changed here. Even the most fantastically expensive Solidigm NVMe drives still have much higher latency than Optane and will perform worse as a SLOG.

So you might be able to find smaller drives, but they are unlikely to be drives that are good for SLOG duty.

An exception there can be the 16, 32 and 64 GB Optane M series drives Intel used to sell for caching back in the day. They are significantly better in this regard than any non-Optane SSD I have seen, but the fact that they are only Gen3 x2 holds them back quite a lot compared to their bigger brothers.

I used to use dual 280GB Optane 900p’s, I chose the 280GB models because they were the smallest models. I now use dual 375GB Optane DC p4800x drives. Again, these are the smallest drives in that product family.

If you have sufficient redundancy they are no more of a risk than the data pool itself failing. But yeah, they need to be added with that level of redundancy, which is why mine are a three-way mirror. Two drives can fail, and it still survives, just like with my RAIDz2 data vdevs.

I do agree that they are not necessary for zvols. They don’t really do anything there, as their purpose is to handle metadata and (if you allow them) small files. This is why I use them on my main data pool, but not on my VM pool. They would be of no use on the VM pool.

(Of course, the special devices would also be of limited usefulness on an SSD pool (especially an NVMe pool), as you already have that performance in the main pool at that point.)

On the main data pool -however - they speed things up tremendously. It is particularly noticeable if you have folders with very large quantities of files, particularly small files, where a traditional hard drive would need to seek like crazy to find all of this data.

Where they can have a positive effect on zvols is if you use the same hard drive pool for both file storage and zvols. Having a special device vdev - in this case - can free up some load from file data access that would otherwise keep the hard drives busy, and allow them to spend more of their time available to work on zvol requests.

In a configuration like that, anything you can do to lighten the load of the pool can speed up slow zvol access due to file access keeping the hard drives busy.

3 Likes

So I couldn’t help myself, and blew the last bit of budget for this project on a pair of 280GB Optane 905P drives and a carrier card. I had a single unclaimed PCIe x8 slot, you see… Based on all the verbose (yet helpful, thank you all!) conversation above, I’m thinking the following (with a rough sketch of the matching commands after the list):

  • 3x 960GB NVMe SSDs in a 3-way mirror → zroot
  • 6x 14TB HDDs in a RAIDZ2 → store media AND user files here
    ** 3x 1.92TB NVMe SSDs in a 3-way mirror → special vdev: store files < 128kB on the special, along with metadata
    ** 2x 10GB partitions on the 905Ps in a mirror → SLOG, slightly oversized to handle 10Gb networking
  • 3x 1.92TB NVMe SSDs in a 3-way mirror → VM storage, sync=always
    ** 2x 25GB partitions on the 905Ps in a mirror → SLOG that’s slightly oversized to handle the sequential write speed of the NVMes for 5 seconds (each can write 1.4GB/s sustained → * 3 drives * 5 seconds = 21GB)
  • 1x 960GB NVMe SSD → L2ARC
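Here’s my rough, unverified sketch of what I think the commands will look like (device names are guesses, zroot comes from the installer, the special_small_blocks cutoff is a placeholder until I do that forensics, and zpool may ask for -f since the special mirror’s layout differs from the raidz2):

    # media + user files: 6x 14TB raidz2, 3-way special mirror, mirrored SLOG partitions
    zpool create -o ashift=12 -O compression=lz4 tank \
        raidz2 da0 da1 da2 da3 da4 da5 \
        special mirror nda0 nda1 nda2 \
        log mirror nda6p1 nda7p1
    zfs set special_small_blocks=64K tank

    # VM storage: 3-way mirror of the other 1.92TB drives, with its own SLOG partitions
    zpool create -o ashift=12 -O compression=lz4 vmpool \
        mirror nda3 nda4 nda5 \
        log mirror nda6p2 nda7p2
    zfs set sync=always vmpool

    # L2ARC on the leftover 960GB drive
    zpool add tank cache nda8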

I have a slight redundancy gap on the SLOGs, since they’re only single fault tolerant and the pools they’re backing are double fault tolerant. I’m hoping this won’t bite me later (I don’t think it will).

I’ll probably do some forensics to figure out the correct cutoff size for files that will be stored on the special - 128kB is a WAG. I’m not quite sure what to do with the remaining space on the 905Ps… I guess I could mirror this space and make another tiny vdev for something, but I don’t know what I’d use it for.

Thoughts/comments? Implementation starts next month…!

Edit: should I double the size of the SLOGs? There’s a post over on the TrueNAS forums that suggests the SLOG should have enough capacity to store two transaction groups…

1 Like

I would seriously question the need for L2ARC and SLOG unless you have 25+ users or will see commercial levels of use. Also, while I would stand behind using a special, and even recommend it over L2ARC and SLOG, I would also warn that they chew through SSD drive lifetime just as fast.

On the topic of L2ARC: you will only benefit from an L2ARC when you’re reading the same files multiple times in quick succession and your working set is larger than the ARC in memory. Considering that you have 50+TB of storage, I’m assuming you’re planning to allow the recommended 1GB of ARC for every 1TB of ZFS storage, meaning your ARC in RAM would be 50+GB. The point here is: how many 50+GB files need to be re-read at greater than 500MB/s (10GbE NIC) in quick succession (a time frame of less than a few minutes)? People who benefit from L2ARC are doing things like installing custom Windows 11 images with preloaded applications from an HDD pool simultaneously to 10+ devices over the network, all day long.

OK, on specials: after having run a special with a 10-disk raidz1 and chewing through 50% of the drive endurance/life while utilizing only 10% of the capacity, I would not put small blocks on the special. First, the specials would need to be enormous in my use case for the amount of small blocks I have, and second, any NAND cells that are storing data long-term can’t be rotated through for wear leveling of the huge amount of metadata writes the specials have to endure.

If I re-architected my servers as a whole, I would move the storage of active programs (Home Assistant, Pi-hole, Jellyfin… basically anything that has a DB and does a lot of small reads and writes) off ZFS and onto faster filesystems like ext4, then snapshot and replicate the data back onto the large ZFS pools as a backup in case of loss or corruption. This data should be treated like VMs and be assumed to be accessed often and to need speed.

I’m personally not a fan of VMs, as containers can usually provide any separation you may need. Using Home Assistant OS as an example: it’s just an OS to run all the containers that make up the Home Assistant experience, all of which you can run directly yourself on the container platform of your choice. Did they do a nice job of packaging it all up and giving you nice buttons to start and stop the containers related to Home Assistant? Yes. However, I don’t think it warrants the extra CPU and RAM overhead of running Home Assistant OS over the containers themselves.

Redundancy of the log is independent from the pool, as it doesn’t hold any data the pool needs during normal operation. You can run a log without redundancy and be totally fine. Nothing bad will happen to the pool if one dies, same with L2ARC: you lose their utility, but the pool doesn’t care.

A log on an NVMe pool is probably a waste. I haven’t seen any benchmark showing any improvements. 1.92TB enterprise flash has no problem with sync writes; they’ll do fine.

All the vdev classes were made with HDDs in mind. An enterprise SATA SSD as special and maybe a small 3DWPD NVMe for log is really all you need to boost those HDDs. ARC will take care of the rest (and that’s where your money should go).

I run sync=disabled btw :wink:

1 Like

I would hold off on all SLOG devices; you can always add a SLOG to a pool later if the pool is underperforming on sync writes due to double-writing the data to the pool, which is what SLOGs solve. Same with L2ARC: you can always add it later, and it will be fairly obvious if you need it.

Main ARC is king with ZFS: more ARC == more better, and ARC is in RAM, so more RAM = more better.
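Adding (and removing) these later is trivial, which is why waiting costs nothing; a rough sketch with placeholder device names:

    # add a mirrored SLOG to an existing pool
    zpool add tank log mirror /dev/nvme0n1p1 /dev/nvme1n1p1
    # change your mind later: remove it by the vdev name zpool status shows (e.g. mirror-1)
    zpool remove tank mirror-1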

Not sure what NVMe you are using that degrades so fast.
Here is my old Intel 1TB NVMe. Yes, the SLOG is on that one.

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        43 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    9%
Data Units Read:                    12,089,340 [6.18 TB]
Data Units Written:                 95,856,443 [49.0 TB]
Host Read Commands:                 335,943,585
Host Write Commands:                563,381,399
Controller Busy Time:               6,559
Power Cycles:                       1,142
Power On Hours:                     30,605
Unsafe Shutdowns:                   139
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
2 Likes

If you want to use Linux, Windows, and FreeBSD VMs, you’re much better off with Proxmox.
For the NAS part, you can virtualize TrueNAS. I recommend passing the HBA to TrueNAS with PCIe pass-through and using a network card with SR-IOV. This will give you very good performance for NAS and the best virtualization features with Proxmox.
If you just need NAS functionality, and it’s not about the app store in TrueNAS, then simply install Samba and Webmin directly on Proxmox, so you have a GUI and the other features that come with Webmin.

Proxmox has had Software-Defined Networking (SDN) for some time now, which makes networking super easy.

My guess is your Intel SSD is a way better, higher-quality SSD than mine.

Mine are 512GB SATA SSDs, now 2.24 years old going by the SMART data, with over 40TB written. They’re mid-tier consumer drives, so nothing special - the right specs at the right price. I picked them specifically for their IOPS performance, which is critical for a special holding metadata, not for write or read throughput. Honestly, I don’t know how much writing or reading the main pool would need to do to even come anywhere close to making SATA SSDs in the special saturate their link. For what it is, it performs flawlessly and has made the pool perform way above what my previous pool of just 8 HDDs did.

Why? I hear of people doing this all the time and it makes zero sense; ZFS runs on FreeBSD and Linux. Proxmox, being Linux, already has ZFS installed and can already manage all the drives. Why do people feel the need to incur all that overhead for TrueNAS’s GUI, when there are plenty of amazing alternatives that don’t require virtualizing an entire OS just to get a GUI?

If you have enough RAM and want the special TrueNAS features - like simply enabling SMB Multichannel with a checkbox, iSCSI/iSER via the GUI, the app store - there are many reasons.
For example, if you want the data protection features of TrueNAS but also the better virtualization functionality of Proxmox, then this is a valid configuration.
Everyone has their own idea, and whether efficiency is the priority is something everyone has to decide for themselves.
If you just want to share data via SMB and have a static configuration that you only need to change every few months, you can simply install Samba.

edit:
I’m not sure, but isn’t the FreeBSD version of TrueNAS being phased out anyway?
I’ve always used the Linux version for the last few TrueNAS installations.

It is, which is one of the reasons I won’t be using it. The biggest reason, though, is that I enjoy the learning process and I’m not looking for a quick fix. I’d rather take my time and gain the satisfaction of getting everything working from the ground up. As far as needing a huge variety of options, I really don’t… it’s nice that there are products out there that can install new apps with the click of a button, but that’s just not me… everything I need was described in the OP.

Core will live on as fully open source!

https://www.zvault.io/

If I ever run TrueNAS again, it’ll probably be that. iXsystems lost a lot of faith from me with SCALE and all the bugs and the cluster approach. If I just want a NAS, Core excels at it.

2 Likes

Don’t worry here. The SLOG’s can completely fail, and your pools will still survive.

The only data that is at risk is the in-flight data at the time things go down. The last 1-5 seconds of writes or so.

And this is only going to be a problem if your SLOG fails and your system goes down at the same time.

There is a small risk, but nowhere near the same level of risk as if a main data or special VDEV is lost, as that will result in the loss of the entire pool.

The Optanes are much larger than you really need for a SLOG.

What I did was just create 40GB partitions for my mirrored SLOGs. This is way overkill, and at the same time leaves some spare space on the Optanes for future considerations. Then, down the line, as I added additional pools I created additional partitions on the Optanes and used those as SLOGs for the other pools as well.

For instance, you could create a second set of partitions on the Optanes and use those as slogs for the boot partition.

I’m not sure if the automated installer will allow for that configuration, but you can always add them after the fact once the system is up and running.

Yeah, this is going to differ for everyone’s applications, and it is very difficult to predict.

This thread might help:

I set mine to 64k because I was being cautious, but that has resulted in only 101GB out of 2TB being used on my huge pool.

Worth noting: I added my special vdev after the pool had already been in existence for years, and the old small files stay where they are. Only new ones are written to the special VDEV. So this is going to result in less utilization than for a special vdev that is there from the get-go.

I might up it to 128K or higher at some point though.
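For reference, the knob in question, plus a way to see how full the special vdev actually is (pool name is hypothetical; the change only affects newly written blocks):

    # send blocks at or below the cutoff to the special vdev
    zfs set special_small_blocks=64K tank
    # per-vdev capacity/allocation, including the special mirror
    zpool list -v tank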

Given you have SSDs, I doubt you’ll benefit much, or at all, from going with anything fancier than simple mirrors. You’ll get more storage using RAID-Z at the expense of slower writes; only one drive can fail before you see data loss, and resilvering is slower, but it somewhat depends on how much storage you need. The most I/O-intensive VM will likely be the Exchange server; depending on workload, you might want to look into serving its (Exchange) storage to the VM over iSCSI instead of keeping it inside the VM’s disk image, depending on how write amplification behaves. In your case I’d just do “plain” “arrays” without any special options such as L2ARC, SLOGs, etc. Unless you have a heavy load, it’s not likely you’ll see any performance benefit from them.

2x PM983 960GB → zmirror (boot), mobo slots
2x PM983 960GB → zmirror (VMs), Hyper M.2 card
5x PM983 1.92TB → raidz(1) + hot spare, random storage
1x PM983 1.92TB → scratch disk (Poudriere?) / backup if one of the mirrors fails

Jellyfin is supported natively, so you can use that and remove the overhead of bhyve.

Out of curiosity, what are you going to rip?

You might want to setup your own instance of Poudriere to provide your own optimized packages (ports) for your box which may also cut down on “bloat”/unnecessary dependencies.

You probably also want to limit the ARC size in /boot/loader.conf to something sane, like 16G or so.
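On FreeBSD that’s one line in /boot/loader.conf; the value below is in bytes (16 GiB), since I’m not certain the loader accepts the 16G shorthand:

    # cap the ZFS ARC at 16 GiB
    vfs.zfs.arc_max="17179869184"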

2 Likes