Storage configuration on virtualization workstation with ZFS

Hi all,

I am planning my first server build with a budget of around 10,000 USD. The server will act as a virtualization platform (Proxmox with ZFS) with the following main goals:

  • Media server using Jellyfin
  • Highly parallel scientific computing workloads using large (~5TB) datasets and lots (>500GB) of RAM on CPU (genomics)
  • Small/moderate sized ML workloads using small datasets (~1TB) and little (~64GB) RAM on GPU
  • Miscellaneous server classics: hosting a website, hosting a DNS server.

The server is expected to act as reliable storage for the scientific datasets. It is expected to be up and in use most of the time, with datasets rotating roughly every 6 months and never exceeding 15TB in total. The rest of the storage will most likely be dedicated to media for Jellyfin. I am planning on a RAIDZ1 setup, or RAIDZ2 if the budget allows for an extra HDD.

To that end, here are the specs I’ve gathered so far, with second-hand market prices in mind:

  • Motherboard: H13SSL-N
  • CPU: AMD EPYC 9654
  • CPU Heatsink: Aliexpress 6 pipe heatsink
  • RAM: 768GB (12x 64GB ECC DDR5; M321R8GA0BB0-CQK)
  • PSU: Corsair RM1000X
  • GPU: NVIDIA RTX A5000
  • Storage: 48TB HDD (4x12TB; HUH721212ALE601)
  • Fast storage: 2x 4TB M.2 Sabrent Rocket 4 Plus with heatsinks

I welcome any general feedback about this configuration, I have not yet settled on a case.

However, I am writing to specifically inquire about optimizing the storage for ZFS. My goals are good performance and data integrity. After reading the guide posted by Edward and watching Wendell’s video, I have come up with the following questions and ZFS setup.

Hardware questions:

  • What would be an ideal boot drive? The Sabrents seem overkill.
  • Should I be using the Sabrents as L2ARC? Given the amount of RAM I have, it seems like this wouldn’t be very useful. Perhaps I’d be better off storing the large datasets currently in use for computation on the Sabrents instead? Could ZFS somehow handle that for me?
  • I’m under the impression I should snag an M.2 expansion card and get 4x Intel Optane 16GB M.2 for a SLOG. Does this seem appropriate?
  • I’m under the impression I should replace one or both of the M.2s with a few Intel Optane SSDs totaling around 1.4TB to act as the special vdev. Is this correct, or can those drives be used?

Software questions:

  • What would be the best ZFS recordsize if the scientific dataset is made up of files ~400MB?
  • What would be the best ZFS recordsize if the media dataset is made up of files ~4GB?
  • What would be the best ZFS volblocksize?

I assumed I would set the following:

  • zfs_arc_max to 128GB
  • If L2ARC is used: l2arc_rebuild_enabled to True; l2arc_write_max to 64MB/s and l2arc_write_boost to 7GB/s
  • If L2ARC and a special vdev are used: l2arc_exclude_special set to True
  • zfs_txg_timeout set to 60 seconds
  • All else default.
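
For reference, here is a minimal sketch of how I would set these as ZFS module parameters on Proxmox (values converted to bytes; this just reflects my assumptions above, not tested advice):

# /etc/modprobe.d/zfs.conf -- sketch of the settings listed above
# cap ARC at 128GiB
options zfs zfs_arc_max=137438953472
# flush transaction groups every 60s instead of the 5s default
options zfs zfs_txg_timeout=60
# L2ARC settings, only if an L2ARC device is actually added
options zfs l2arc_rebuild_enabled=1
options zfs l2arc_write_max=67108864
options zfs l2arc_write_boost=7516192768
options zfs l2arc_exclude_special=1
# then run: update-initramfs -u (the options are read from the initramfs at boot)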

For Proxmox I would follow Wendell’s guide, but with a swappiness of 10:

# https://tweaked.io/guide/kernel/
# Don't migrate processes between CPU cores too often
kernel.sched_migration_cost_ns = 5000000
# Kernel >= 2.6.38 (ie Proxmox 4+)
kernel.sched_autogroup_enabled = 0

# Don't slow network - save congestion window after idle
# https://github.com/ton31337/tools/wiki/tcp_slow_start_after_idle---tcp_no_metrics_save-performance
net.ipv4.tcp_slow_start_after_idle = 0

# try not to swap 
vm.swappiness = 10 

# max # connections 
net.core.somaxconn = 512000

net.ipv6.conf.all.disable_ipv6 = 1

# https://www.serveradminblog.com/2011/02/neighbour-table-overflow-sysctl-conf-tunning/
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096

# close TIME_WAIT connections faster
net.ipv4.tcp_fin_timeout = 10
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 15

# more ephemeral ports
net.ipv4.ip_local_port_range = 10240    61000

# https://major.io/2008/12/03/reducing-inode-and-dentry-caches-to-keep-oom-killer-at-bay/
vm.vfs_cache_pressure = 10000

As for the ZFS organization, I guess it will depend on the final hardware selection, but I figured I’d put all the HDDs in one pool, with one ZFS dataset for media and one for science. One pool for the SSDs/boot drive? I’m not too sure about that.
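
To make that concrete, here is a rough sketch of the layout I have in mind (the pool name, dataset names and /dev/disk/by-id paths are placeholders I made up, not final choices):

# HDD pool: 4x 12TB in RAIDZ1 (placeholder device paths)
zpool create -o ashift=12 tank raidz1 \
    /dev/disk/by-id/ata-HUH721212ALE601_SERIAL1 \
    /dev/disk/by-id/ata-HUH721212ALE601_SERIAL2 \
    /dev/disk/by-id/ata-HUH721212ALE601_SERIAL3 \
    /dev/disk/by-id/ata-HUH721212ALE601_SERIAL4

# one dataset per use case
zfs create -o compression=lz4 tank/media
zfs create -o compression=lz4 tank/science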

The demands your workload places on storage performance are not clear to me. I have a hunch that storage is utterly under-spec’ed for the set of hardware you’re proposing.

Yes, the four hard drives can store up to 48TB of data, but you’d only be able to read the data at a very slow pace.
ZFS performance is largely determined by the performance of the underlying storage devices. Things like L2ARC, SLOG, Special devices are meant to overcome performance gaps in certain workloads.
In the very best case scenario, your proposed storage could be accessed at a maximum bandwidth of 4x 250MB/s (~1GB/s). Be prepared to never, ever see that performance if your workload accesses data randomly (most likely) or in parallel (even more likely with 96 cores).
Just reading 40TB of data sequentially at 1GB/s takes more than 40,000 seconds (>11 hours).

The HD storage is appropriate for storing and accessing movie media via Jellyfin (low parallelism, mostly sequential access).

I’d consider professional-grade NVMe storage for your scientific computing workload. Think multiple PCIe gen4 drives of the Micron 9400 type, which provide up to 30TB of storage at 7GB/s read/write speeds - even, or rather especially, when accessed in parallel.

Your thinking on the ZFS setup is already well developed. Some points:

  • Think about the performance envelope you want your zfs setup to hit. Design your storage with this in mind. I assume that this may require many dozens of HDDs for your scientific workload and as such a closer look at nvme-based storage would be more appropriate.
  • Many zfs features exist to overcome HDD technology weak spots. You may find that the design assumptions for zfs don’t work so well using primarily nvme-based storage (at least as of 2024).
  • Don’t consider Intel Optane 16GB M.2 devices. These are not the Optane technology you’re looking for. If anything, look at the Intel Optane 1600X, which is available in 58GB and 118GB sizes. These drives are quite old and may become a performance bottleneck for your setup unless you use a bunch in parallel.

Good point. Let me clarify the relationship between workload and storage performance and explain my rationale a little better.

The scientific workloads are doing a lot of I/O, typically sequential on decently large blocks, but highly parallel. The media workloads, as you rightly point out, are also sequential, and here read speed is what matters to me; it should not bottleneck the streaming experience.

I cannot back my setup purely with NAND, as that is beyond my budget for the capacities required. Thus I am looking to leverage ZFS to make optimal use of HDDs, and my questions are aimed at figuring out how to do that. My thinking was that the scientific datasets would be stored on the HDDs, but those frequently in use would live on an SSD, either managed manually or, ideally, by ZFS. The scientific datasets are roughly 5TB max, so I figured 8TB of NAND would be plenty.

If the Optane drives are too old/slow for a SLOG, what is the modern recommendation?

Gotcha.

A 4-HDD ZFS configuration is plenty fast for typical Jellyfin home use (1 or 2 parallel streams).

Well, hard drives don’t really like parallel workloads. As you may know, performance quickly drops into the single-digit MB/s range. This would seriously impact your scientific workloads and likely waste a lot of your large hardware investment.
Is there a way for you to validate this before committing to expenditures (i.e. do you have existing HDDs to validate this behavior)?
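
If you have a spare HDD lying around, a quick fio run makes the effect easy to see (just a sketch; the mount point and sizes are placeholders, adjust them to whatever disk you can test on):

# 16 jobs, each reading its own file sequentially from the same HDD.
# The drive head has to seek between the files, so aggregate throughput
# collapses compared to running a single job.
fio --name=parallel-read --directory=/mnt/testhdd \
    --rw=read --bs=1M --size=4G --numjobs=16 \
    --ioengine=libaio --direct=1 --group_reporting

Run it once with --numjobs=1 and once with 16 and compare the aggregate bandwidth.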

Assuming the above is correct I propose the following:

I imagine a configuration of two ZFS pools: the first consisting of a single enterprise-class NVMe drive to run the scientific workloads, and a second HDD-backed pool for archive and media storage.
To overcome the lack of redundancy on the first pool, I propose an automated schedule (hourly, daily, weekly?) that creates snapshots of the data on the first pool, plus a second automated task that transfers these snapshots to a dataset on the second pool for archival and safekeeping.
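
As a rough sketch of what that could look like (pool and dataset names are invented; tools like sanoid/syncoid or zfs-auto-snapshot automate the same idea):

# snapshot the fast pool, then replicate to the HDD pool for safekeeping
zfs snapshot fast/science@nightly-2024-01-01
# the first replication is a full send...
zfs send fast/science@nightly-2024-01-01 | zfs receive tank/backup/science
# ...later runs only send the delta between the last two snapshots
zfs send -i @nightly-2024-01-01 fast/science@nightly-2024-01-02 | \
    zfs receive tank/backup/science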

The cost for an enterprise-class NVMe should be in the range of ~$100/TB. A single 8TB NVMe drive would likely suffice to significantly speed up your workload. I’d use such a drive over the Sabrent Rocket 4 Plus because enterprise drives are spec’ed to run 24/7 and are expected to handle such workloads, which is typically not the case for consumer-class NVMe drives. I have no personal experience with the Sabrent - they may suffice for your use case.
If you prefer the 4TB Sabrents, I’d set them up as separate vdevs in their own pool (similar to a raid0 configuration) for speed and capacity.
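
Setting them up that way is just a matter of listing both drives as top-level vdevs in a new pool (sketch, placeholder device paths; note that losing either drive loses the whole pool):

# two-drive stripe (no redundancy) for speed and capacity
zpool create -o ashift=12 scratch \
    /dev/disk/by-id/nvme-Sabrent_Rocket_4_Plus_SERIAL1 \
    /dev/disk/by-id/nvme-Sabrent_Rocket_4_Plus_SERIAL2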

I am not convinced the HDD pool would benefit from L2ARC (it is only used when frequently read data exceeds the ARC capacity in memory) or SLOG devices (only used for sync writes, which your setup does not seem to have).
Since your datasets seem to consist primarily of large files, a relatively small special device could help with metadata access and take that access pattern off the HDDs, which are not well suited for it. A pair of inexpensive Intel Optane 1600X M.2 drives would be appropriate for this.
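
Adding that to the HDD pool would look roughly like this (sketch; device paths are placeholders, pool name follows the earlier example). The special vdev should be a mirror, because losing it means losing the pool:

# mirrored special vdev for metadata (placeholder device paths)
zpool add tank special mirror \
    /dev/disk/by-id/nvme-Optane_1600X_SERIAL1 \
    /dev/disk/by-id/nvme-Optane_1600X_SERIAL2
# optional: also push small blocks (<=64K) of a dataset onto the special vdev
zfs set special_small_blocks=64K tank/science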

Some thoughts on your remaining questions:

  • boot drive: I think any NAND-based drive is appropriate. Maybe 2 in a raid1 configuration if you consider the system critical. I don’t expect the boot drive to experience strenuous workloads. In 2024 I’d go for 1TB M.2 NVMe drives given that there are really no price benefits to SATA SSDs. Given the overall investment size I’d pick M.2s from reputable brands and not succumb to the lure of bargain-basement prices. I’d keep boot drives separate from data pools.
  • recordsize: the ZFS default recordsize is 128K. If a dataset is mostly or almost exclusively made up of significantly larger files (400MB, 4GB), increasing the recordsize can eliminate overhead and improve data access speed. Typically, increasing the recordsize beyond 1M will not yield significant improvements (see the sketch after this list).
  • volblocksize: volblocksize is only relevant if you use zvols. Nothing in this thread points to a requirement to use zvols and their use is discouraged unless indicated by a specific use case.
  • zfs_arc_max: By default ZFS will utilize up to 50% of system RAM as ARC. What most people don’t realize is that ZFS will give RAM back if the system needs it. I don’t think there is a need to limit the ARC size.
  • zfs_txg_timeout: This is clearly a tunable parameter and as such it exists for a reason. Generally, the ZFS defaults are pretty good for HDD-based pools. I’d observe performance first with the default value and only consider modifying this (or any other setting) once a steady-state performance envelope has been documented. Specific to this parameter, I have a hunch that setting it to 60s may have quite a negative effect.
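
For the recordsize point, the property is set per dataset and only applies to newly written data (dataset names follow the earlier sketch):

# larger records for datasets of large files; existing files keep their old recordsize
zfs set recordsize=1M tank/media
zfs set recordsize=1M tank/science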

Thank you for these tips. I’m curious: are there any disadvantages to having 2x 4TB drives instead of a single 8TB drive? Does the former enable more parallel workloads? I was looking at this drive as an 8TB enterprise replacement (Micron 9200 PRO 7.6TB PCIe NVMe 2.5" U.2 7680GB SSD MTFDHAL7T6TCT-1AR1ZABY | eBay)

The challenge (not sure if ‘disadvantage’ is the right word) is that these two drives need to work together (i.e. raid0), which right now comes with noticeable overhead compared to a conceptually simpler single drive.
At this point zfs has too much overhead for scaling nvme drives efficiently. HDDs are so slow that a CPU has plenty of cycles for compressing data, calculating checksums, even de-duplication, in the time it takes from submitting a command to a HDD to receiving the answer. Not so with NVMe drives. These are freaking fast in comparison.
I understand that improvements are coming to ZFS soon, but until then ZFS with NVMe-based vdevs is not an easy experience.

On the plus side - two (or more) smaller nvme drives have the physical benefit of offering more bandwidth. That is something you need to look for.

I want to stress that you’re investing in a 96-core, 192-thread monster that needs to be fed with data. If it turns out that your storage is only capable of keeping 2/3 of the cores occupied, you could save thousands of dollars by selecting a CPU with fewer cores.

The Micron 9200 Pro is a gen 3 nvme. Certainly good quality, but probably not sufficient. I’d look for gen 4 nvme drives. Yes, they’re more expensive, but I assume in your scenario they’re worth the money. Make sure to check out the drive specs on the vendor websites (often larger devices have better performance). Also look for hardware reviews. I like the ones from StorageReview and TomsHardware as they compare to similar products.

Also, I need to point out that U.2 enterprise drives generally consume more power than consumer M.2 drives and are designed for enterprise levels of airflow in server chassis. They tend to run too hot in a typical consumer case, with the result that they first throttle and may even shut down. Sufficient cooling fixes that.


From my video-editing testing, U.2 drives tend to do way better with sustained loads. M.2 drives (at least the ones I tested) do not keep their performance for more than 30-ish seconds, then they throttle to about 1 GB/s write. That may be fine depending on the use case; U.3 with cooling is just the 100% option.

TomsHardware tests specifically the sustained write performance and is a very valuable resource to judge that aspect.

The sustained write tests for the Sabrent Rocket 4 Plus show that it drops in two steps:

  • First it writes for a few seconds at full PCIe bandwidth (supposedly until onboard RAM is full)
  • Then it drops to a 4GB/s write plateau for about 850 seconds
  • Then it drops to a level of about 1.5GB/s

The graphs nicely show how a specific product compares to its competitors.

Thank you for these comments. I had not noticed that drive was Gen 3. I have chosen to forgo the Sabrent Rocket 4 Plus in favor of a single 8TB Gen 4 drive you recommended, and perhaps two if the budget allows.

I definitely do not want the CPU to be I/O limited, which was the motivation for this post. Your advice toward reaching that goal is most welcome.
