ZFS & SSDs - Help designing VM hypervisor server storage layout

Hi everyone,
I’m about to replace all of my current Hyper-V/ESXi/etc. servers with a single new one which will
hold all the VMs I have in production right now, plus new ones as the need arises.
Due to budget restrictions I cannot afford an expensive server, and the best offer
I have right now comes with these hardware specs:

Size: 1U server
CPU: AMD Genoa 9654P
RAM: 256GB ECC
Disks: 6x 3.84TB SSD (storage), 2x 240GB NVMe (boot), no RAID/HBA card
NICs: 2x 10GbE

This system has more than three times the specs of all my current servers put together,
so performance-wise I’m more than happy. What worries me is the storage configuration,
because I only have 6 disks; I could go up to 10, but that’s it, since the server has only
10 bays.
Right now these are more or less the type of servers I have in production:

DBs:
- 1 MySQL - read/write intensive workload
- 1 MSSQL - read/write intensive workload
- 1 PostgreSQL - read/write intensive workload
- 1 PostgreSQL - read/write mildly intensive workload
- 1 Firebird - read/write mildly intensive workload

Others:
- 1 fileserver - read/write quite intensive workload
- 1 webserver - read mildly intensive workload
- various other smaller VMs - insignificant workload

With such a small number of disks I cannot create many pools. Can I create just one RAIDZ2
pool and put everything there, or will that kill the performance of the VMs, especially the
DBs with the read/write intensive workloads?
I also got a quote for a 2U system with twice the number of disks but half the capacity each
(12x 1.92TB). The price is higher, of course, but not by much; would this be a better solution?
If I can, I’ll avoid buying the more expensive system, but with it I’d be able to create more
pools dedicated to the different VM workloads, right?
Would the performance difference be enough to justify the increased cost?

One crucial note: I want to convert the fileserver to TrueNAS Core/SCALE or something else
with good native ZFS & SMB support, because I need to take snapshots and send them to
my backup server. However, I’ll have to create that VM inside an existing ZFS pool/dataset.
I’ve read conflicting opinions about nested ZFS and I need to know whether it’s really that bad;
otherwise I’ll have to dedicate at least two drives to be passed through to the VM.

@wendell I know you’re the master here for this kind of thing (among others); I hope you can
share your thoughts and knowledge here too.

I hope I’ve explained myself clearly; let me know if you need more info.

Thanks

P.S.: are there any ZFS courses I can take to learn these things without searching the web and
posting on forums?

Performance can be tricky with RAIDZ configurations, but using SSDs does help. If your drives can keep up, then you may end up hitting the CPU very hard for high-IOPS operations on that pool. You’ve specified read and write workloads but not whether they are sequential or random; with databases I assume you’re mostly talking random.

If you go with the proposed setup there are some things to pay close attention to:

  • You could consider a pair of mirrored, dedicated, fast SSD SLOG devices in the pool. These let sync writes be acknowledged without waiting on the RAIDZ2 vdev and smooth out the write workload to the pool, making it more sequential and less random. (The data itself is still committed to the pool vdevs in batches every few seconds via the normal transaction groups; the separate log is only read back after a crash.) You could carve out partitions from your NVMe boot drives for this (an SLOG does not typically need to be very large), but if those devices are lower-spec parts chosen just for boot, that may not be ideal. Note that an SLOG only helps writes that are actually sync. I’m assuming that’s the case for database workloads, but your hypervisor may be a factor there as well. (There’s a rough sketch of this, and of the ARC tuning below, after this list.)

  • The host has 256 GB of RAM, but you need to pay attention to how much of it ZFS is allowed to allocate for the ARC cache. This can have a big impact on both the pool’s read performance and your ability to run VMs, since they need that memory as well, so it’s a property you may want to tune on a VM host (the sketch after this list shows one way). I’m not sure how TrueNAS adjusts the ARC size (if at all), but ZFS by default typically caps it at 50% of total system RAM. This also raises the question of which hypervisor you’re planning to use for the host.

  • Are your pool drives enterprise-grade SSDs, and do they have a battery- or capacitor-backed write buffer (power-loss protection)? If so, that will greatly help sync write performance and make the SLOG question less of a concern. The general performance specs of your drives are a factor too; not every SSD is created equal.

  • ZFS does not like to run with a full pool. That has a big impact on performance. It sounds like you are concerned about having enough storage space, so if you think you will be maxing the pool out and you still need performance then you may want to explore other options.
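To make the first two points concrete, here’s a minimal sketch of both tweaks on a Proxmox host. The pool name `tank`, the partition names, and the 64 GiB ARC cap are all placeholders, not recommendations for your exact setup:

```bash
# Add a mirrored SLOG built from small spare partitions on the two NVMe
# boot drives (hypothetical partition names -- check yours with lsblk):
zpool add tank log mirror /dev/nvme0n1p4 /dev/nvme1n1p4

# Cap the ARC at 64 GiB (value in bytes) so the VMs keep most of the RAM.
# On Proxmox this goes in /etc/modprobe.d/zfs.conf, then rebuild the
# initramfs and reboot:
echo "options zfs zfs_arc_max=68719476736" > /etc/modprobe.d/zfs.conf
update-initramfs -u
```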

Ultimately I suspect an all-SSD pool with a Genoa EPYC will perform just fine, but it all depends on the workload you’ll be throwing at it.

This will depend on the specs of the drives, but in general, more disks = more performance. You also need to make sure your chassis and connection method provide sufficient PCIe lanes to each drive.

It may be appealing to customize a pool for each type of workload you want to run, but this comes with tradeoffs in manageability and flexibility: you can’t use the extra space on one pool to augment another. It’s generally easier to customize the individual datasets (or ZVOLs) for each VM than it is to make separate pools. Sure, you can’t mix different vdev levels that way, but you also don’t have to worry about duplicating support devices if you want an SLOG, L2ARC, or special vdev.
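As a rough illustration of what per-VM tuning on a single pool might look like (pool and zvol names are made up, and the block sizes are only examples; under Proxmox the zvols are normally created for you when you add a virtual disk, with the block size set per storage in `storage.cfg`):

```bash
# A zvol for a database VM: smaller blocks to better match DB page sizes.
zfs create -V 200G -o volblocksize=16k tank/vm-mysql-disk0

# A zvol for the file server VM: larger blocks and compression for bulk data.
zfs create -V 2T -o volblocksize=128k tank/vm-fileserver-disk0
zfs set compression=lz4 tank/vm-fileserver-disk0
```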

Again, very difficult to know for sure as there are too many unknowns. PCIe bus, drive model/specs, and more details of the proposed workloads would be good to know.

Since your intended use case here is for a file server I think it will be fine. I would likely just make sure the recordsize matches between the VM and the internal pool. You could also disable compression in TrueNAS if you have it enabled on the host pool.
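A minimal sketch of that idea, with made-up pool/zvol names (the host-side disk will typically be a zvol under Proxmox, where the block size is set at creation):

```bash
# On the Proxmox host: leave compression on for the zvol backing the
# TrueNAS virtual disk (hypothetical zvol name):
zfs set compression=lz4 tank/vm-truenas-disk1

# Inside the TrueNAS guest, on the pool it builds on top of that disk:
# avoid compressing the same data twice.
zfs set compression=off guestpool
```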

Yes, it will be random, given the variety of VMs I will deploy.

I’m planning on using Proxmox as a hypervisor

The SSDs are Samsung PM893s and the NVMe drives are Micron 7450 PROs (I specifically asked for
drives with PLP).

No, I won’t go over the common 80% ZFS space threshold; that’s the reason for all that storage space.

So is it safe to go with a single RAIDZ2 pool (be it 6 or 12 drives) and use ZVOLs for the VMs
that I intend to use ZFS with?

Thanks for your replies

All enterprise drives is a good sign, and I think you’ll be able to get a lot of performance out of that setup. Whether it will be sufficient for your database usage is still an open question.

A RAIDZ2 with 6 drives is probably fine but 12 might be pushing it. SSDs will resilver faster but you may want to consider Z3. This could also be a use case for dRAID:

https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html

Another option would be two striped RAIDZ2 vdevs. That would increase your performance, but as usual, at the cost of more space lost to redundancy.

If you go with the 6 x 4 TB SSDs now you can add 6 more later as another RAIDZ2 on the existing pool. If you start with 12 initially, then your only option will be to replace the drives one by one and then resize the pool when they are all upgraded to a larger size.
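In `zpool` terms, that upgrade path looks roughly like this (pool and device names are placeholders; in practice you’d use `/dev/disk/by-id/...` paths):

```bash
# Start with a single 6-wide RAIDZ2 vdev:
zpool create tank raidz2 sda sdb sdc sdd sde sdf

# Later, stripe in a second 6-wide RAIDZ2 vdev to grow the pool:
zpool add tank raidz2 sdg sdh sdi sdj sdk sdl

# The 12-disk quote as two striped RAIDZ2 vdevs from day one would be:
zpool create tank raidz2 sda sdb sdc sdd sde sdf raidz2 sdg sdh sdi sdj sdk sdl
```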

What should I be looking at, then, to know whether the system is sufficient for this task?
Is it a matter of pool/dataset configuration, or still something at the hardware level?

Thanks, I’ll definitely look into it; this whole dRAID thing (among others) is pretty new to me.

There are also disk shelves to take into consideration if and when I need more storage;
I’m not worried about that right now, as I have plenty of space to work with.

I shouldn’t worry about getting an HBA card then, if I don’t need to pass disks through to a VM, right?

Thanks again

This is the difficult question. What would be good to know is what kind of IOPS your workload demands. How large is the working set of your databases, and can that fit into memory? Does the application have other caching methods at work? After all the caching is done, how much actual I/O is hitting your disks on a regular basis, and what form does it take?
Above, you describe your databases as intensive or mildly intensive, but that still doesn’t give much of an idea of what we’re talking about. Is that 1,000 x 4K IOs per second, or 1,000,000 x 512 B IOs per second?
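If you want hard numbers, sampling the existing servers for a while is usually more telling than a synthetic benchmark. On the Linux boxes something like the following works (intervals are arbitrary; on the MSSQL/Windows box the PhysicalDisk counters in perfmon give the same picture):

```bash
# Per-device IOPS, throughput and average request size, every 10 seconds:
iostat -xm 10

# Per-process read/write rates, to see which service generates the I/O:
pidstat -d 10

# If any of the current machines already runs ZFS:
zpool iostat -v 10
```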

Another question that might be insightful is: Are these workloads being bottlenecked today by your current hardware, and what is that hardware? If so, then you at least have some measurement to go by for determining how much better the new hardware will be. And if not, then you know you’ll have nothing to worry about in terms of performance. I understand you have multiple servers that you’ll be consolidating down to a single server, so that makes things more difficult to estimate, but this is the kind of information that would give you a much more accurate prediction of what to expect.

I’ve honestly never used an HBA card with ZFS myself. As I understand it, for SATA SSDs an HBA would be nearly essential if you want to pass the whole controller through to a VM. For NVMe devices that are directly PCIe-attached, I think it depends on the IOMMU groups on the motherboard. Not an area I have any direct experience in.
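For what it’s worth, once you have the box you can at least see how the groups fall out. This is the usual sort of read-only loop people run on a Linux/Proxmox host to list IOMMU groups:

```bash
#!/bin/bash
# List every PCI device grouped by IOMMU group; a device you want to pass
# through should ideally sit in its own group.
for d in /sys/kernel/iommu_groups/*/devices/*; do
  n=${d#*/iommu_groups/*}; n=${n%%/*}
  printf 'IOMMU group %s: ' "$n"
  lspci -nns "${d##*/}"
done
```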