Proxmox and ZFS layout recommendations

Hi everyone,
I am building a new Proxmox VE server, and I’m looking for some input on the ideal setup for my hardware and use-case. The system will host some Windows guests that will be used for remote workstations.

I have a Supermicro server with 2x Xeon Gold 6346 3.1GHz CPUs and 768GB of ECC RAM.
The server’s got 2x 480GB NVMe SSDs that I would like to use for the host OS, and 10x 2TB NVMe SSDs for data. There’s no RAID controller and the disks are presented directly to the OS (nvme0n1 through nvme11n1, with nvme0n1 and nvme1n1 being the two 480GB SSDs).

I would like redundancy on the two OS drives of course, ideally set up so that if one fails the host still boots without any fiddling, and I can get a replacement in to resilver or rebuild while the system keeps running. The OS doesn’t need to be on ZFS, but if that’s what works well I’m happy to use it.
The remaining 10x 2TB NVMes should form the “data” pool that will store my VM disks, ISOs, and the like. I am thinking of RAIDZ2 with 1 or 2 spares. From my reading, it’s not clear whether I will benefit from special, ZIL, or SLOG vdevs since everything will be on fast SSDs anyway.

I understand (I think) that spare devices are used in any(?) pool/vdev that experiences a device failure. So if the two 480GB OS devices are in a vdev, and the other 10x 2TB are split with 1 or 2 as spares, maybe some as special/ZIL/SLOG (as appropriate), and the rest for data, I’d hope a spare 2TB device would be used if one of the host/OS drives fails, right?

Are there any Proxmox and/or ZFS experts here that could recommend a good drive layout?

I really appreciate any input you can offer. :wink:

@R4mzy you poor poor person, they ignored you for 14 days. (and I just found out what the question mark button does)

So I like the overall system, I do need more info on what your workload is going to be like.

Go ahead and use the two 480GB SSDs for boot. The Proxmox installer can set them up as a ZFS mirror for you, so that would be my recommendation there.
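Something like this should confirm it after the install (going from memory, so treat it as a sketch; `rpool` is the pool name the installer creates by default):

```
# Check the ZFS mirror the installer built on the two 480GB disks:
zpool status rpool

# Proxmox keeps a synced boot/ESP partition on each mirror member so either
# disk can boot on its own; this lists the registered boot partitions:
proxmox-boot-tool status
```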

As far as the data pool for hosting VMs and everything else goes, it depends. Personally, for VMs I do striped mirrors (2x2x2) in ZFS; that is going to be your best case for raw performance. You could also do a 2x2 striped mirror for the stuff that needs to be fast, then a 5-disk RaidZ1 + hot spare for bulk data you don’t care about being super fast; that would consume all 10 NVMes. If you really just want space, you could do a set of striped 5-disk RaidZ1 vdevs, but then you have no hot/cold spares; it would still give you ~2x the performance of one big RaidZ2 vdev.
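Very rough sketches of what those layouts could look like (pool names `tank`, `fast` and `bulk` are just examples, and I’m using the nvme2n1–nvme11n1 names from your post; in practice use /dev/disk/by-id paths since nvmeX numbering can shuffle between boots):

```
# Option A: striped mirrors, 2x2x2 (three 2-way mirror vdevs), best raw performance:
zpool create tank \
  mirror nvme2n1 nvme3n1 \
  mirror nvme4n1 nvme5n1 \
  mirror nvme6n1 nvme7n1

# Option B: a fast striped-mirror pool plus a 5-disk RaidZ1 bulk pool with a
# hot spare, which uses all ten data drives:
zpool create fast \
  mirror nvme2n1 nvme3n1 \
  mirror nvme4n1 nvme5n1
zpool create bulk raidz1 nvme6n1 nvme7n1 nvme8n1 nvme9n1 nvme10n1
zpool add bulk spare nvme11n1
```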

OK, so spares are hard-assigned to a pool, they are not global, but a spare will cover any vdev in that pool. Honestly, if you are all flash there is almost no point to a special/ZIL/L2ARC device; those are really meant to augment spinning rust, and they can become bottlenecks on all flash. If you added a separate ZIL (SLOG) device it would need to be faster than your whole flash pool or there is little point. For all flash it is more important that your SSDs have PLP (Power Loss Protection) so they do not lose data if you lose power during a write; if that write was still in RAM when the power went, well, it’s gone, but your on-disk data is intact. Similar for an L2ARC: if it’s not faster than the pool, no point.
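For example, continuing the hypothetical `tank` pool from above:

```
# Spares belong to the pool they are added to, but cover any vdev inside it:
zpool add tank spare nvme10n1 nvme11n1
zpool status tank   # they show up under a "spares" section

# For completeness, a separate log (SLOG) or cache (L2ARC) device would be
# added like this, but as argued above it is rarely worth it on all-flash:
# zpool add tank log <some-faster-device>
# zpool add tank cache <some-device>
```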

If you want ZFS to be faster add more vdevs or more RAM.

Also, side note: you did not tell us what chassis you are in, so I don’t know if you have only 12 bays or 24, but I would personally go with striped mirrors so I only have to expand 2 drives at a time. If you go the 5-disk RaidZ1 route and want to expand, you either need to get 5 more SSDs for a new vdev, or replace 5 SSDs with bigger ones if you are short on drive slots.
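To illustrate, growing a striped-mirror pool only takes two free bays (nvme12n1/nvme13n1 are made-up names for new drives):

```
# Add one more 2-way mirror vdev and the pool gets wider (and faster):
zpool add tank mirror nvme12n1 nvme13n1

# With no free bays, the alternative is swapping members for bigger drives one
# at a time; once every disk in a vdev is larger, the extra space shows up if
# autoexpand is enabled:
zpool set autoexpand=on tank
zpool replace tank nvme2n1 <bigger-disk>
```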

I’d argue for 3x3 RaidZ1. Or 5 mirrors. Multiple pools are always bad. One very wide Z2 is certainly the slowest option.

Not needed in ZFS; its transactional nature ensures consistency.

I doubt more than 768GB memory (on a 10-16TB pool) will do anything. Bottleneck is ZFS itself and maybe memory bandwidth, certainly not cache.

A 3x3 Z1 is worse than a 2x2 striped mirror; at least with the striped mirror you get extra read speed and you don’t have to wait on parity math. You can also remove a striped mirror from the pool, which you cannot do with RaidZ#.
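Roughly (the vdev label depends on what `zpool status` shows for your pool):

```
# A top-level mirror vdev can be evacuated and removed again; raidz vdevs cannot:
zpool remove tank mirror-1
zpool status tank   # shows the evacuation progress under "remove:"
```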

Multiple pools are fine, it’s all about how you use it.

Yes, it will not corrupt data, but it can lose data.

Well, he will be splitting that RAM up with VMs and whatnot, so it’s not ideal if those VMs are going to have a lot of data chugging through RAM; but if it’s mostly static, meh. ARC should still beat generic flash on response times; there is a reason people use Optane persistent memory for super hot data even on all flash.

Thanks @gcs8 and @Exard3k! :slight_smile:
Unfortunately I needed to get this host up quickly so I couldn’t use your feedback, but it’s appreciated nonetheless.

I ended up settling on three pools - root (OS), data, and swap.

For root I have the two 480GB SSDs in a mirror, set up through the Proxmox installer, which is working nicely and the host boots without a problem from either disk.

The 10x 2TB SSDs are split up as follows: a 2x2x2 striped mirror for the data pool, a 2-disk mirror for the swap pool, and the last 2 SSDs set as spares and made available to all 3 pools.
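For reference, the rough zpool equivalent of that split (device assignments here are illustrative, and the OS pool the installer creates is called `rpool`):

```
# Data pool: three 2-way mirrors striped together
zpool create data \
  mirror nvme2n1 nvme3n1 \
  mirror nvme4n1 nvme5n1 \
  mirror nvme6n1 nvme7n1

# Swap pool: a single 2-way mirror
zpool create swap mirror nvme8n1 nvme9n1

# Last two drives as spares, added to all three pools (ZFS allows the same
# device to be listed as a spare in more than one pool):
zpool add rpool spare nvme10n1 nvme11n1
zpool add data  spare nvme10n1 nvme11n1
zpool add swap  spare nvme10n1 nvme11n1
```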

In my testing, the spares take over nicely regardless of which pool loses a drive. So not only does each pool have its own resiliency to losing a drive (up to 3 for the data pool, as long as each failure hits a different mirror), but when such a loss does occur the gap is filled by a spare while I get around to replacing the failed drive.
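For anyone curious, this is roughly what exercising it by hand looks like (device names are just examples):

```
# Pull a spare in for a data-pool member that has failed (or that you want to
# test against):
zpool replace data nvme4n1 nvme10n1
zpool status data   # the spare shows as INUSE while it resilvers

# Once the bad disk has been physically swapped, resilver onto the new one and
# the spare drops back to AVAIL:
zpool replace data nvme4n1 <new-disk>
# ...or make the spare a permanent member instead by detaching the old device:
# zpool detach data nvme4n1
```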

Performance has been good so far, though with the time constraints I couldn’t do as much comparison testing vs. other layouts as I’d have liked. Still, the general consensus seems to be that the best performance comes out of striped mirrors and that’s largely what I went with…

I don’t have the chassis model at hand right now, but it’s just a 1U machine and all bays are occupied, so the only expansion option here would be upgrading to bigger drives (or remote storage, I suppose). For what the host will be used for, though, this shouldn’t be a problem: it’s going to run a handful of VMs used as virtual desktops for training, reset to a template state roughly daily/weekly, and the current setup has the capacity to run as many VMs as we should need.

If the whole deal should fail spectacularly (or miserably…) at some point in the future I’ll be sure to leave an update here as a warning. :sweat_smile:
