That is the suggestion for an enterprise type of build, with many users hammering the array at once. You can ignore that nonsense.
Until you get around the 100 TB mark, you don't need to worry about having more than 8-16 GB of RAM available to keep ZFS running. ZFS has to keep track of some things in memory, and the larger the array, the more it has to track, but it's not very much. Everything past that is just nice to have for a RAM cache (the ARC). ZFS will use half your RAM by default, so if you want it to use more, you have to change that.
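If you want to see what the ARC is actually doing on your box, the zfsutils package ships a reporting tool for it (a quick sketch, nothing here is specific to any one pool):
arc_summary | head -n 40 //overview of current ARC size, hit rates and the configured max
cat /sys/module/zfs/parameters/zfs_arc_max //0 means "default", i.e. half of RAM on Linux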
Never use deduplication; it requires ZFS to keep track of orders of magnitude more information. Then yes, you will absolutely NEED tons of RAM, and if you ever run short you're fucked.
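If you want to double-check nothing ever turned it on, it's a one-liner (pool name is a placeholder):
zfs get -r dedup yourpool //should say "off" for every dataset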
ECC is great and wonderful, and not that expensive. It's also not mandatory or "needed" for ZFS any more than for any other filesystem. All it does is mitigate one more potential vector for data corruption.
RAIDZ2 and HDDs are great for bulk storage, but it's not going to be fun if you want to stick VMs on that, due to shit IOPS and write latency.
You want a pool made of mirrored SSDs for VMs, which you can then back up to your RAIDZ2 pool.
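Roughly what that looks like; the pool names and device paths below are placeholders, adjust for your own hardware:
zpool create -o ashift=12 fastpool mirror /dev/disk/by-id/ssd-1 /dev/disk/by-id/ssd-2 //mirrored SSD pool for VMs
zfs create fastpool/vms //dataset to hold the VM images
zfs snapshot fastpool/vms@nightly //point-in-time snapshot
zfs send fastpool/vms@nightly | zfs receive bigpool/vm-backups //ship it over to the RAIDZ2 bulk pool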
ZFS tuning performance considerations for data storage
Here's what I do for my OpenMediaVault (Debian-based, just like Proxmox) pool meant for bulk storage. This is not intended for VMs, or for BSD-based systems. You've been warned.
- zfs set acltype=posixacl yourpool/yourstoragedataset //Mo' POSIX, mo' better; stores ACLs in a more efficient manner. Possible OS portability issues?
- zfs set compression=lz4 yourpool/yourstoragedataset //If this wasn't the default already, something is very wrong.
- zfs set xattr=sa yourpool/yourstoragedataset //This would be the default if FreeBSD and illumos supported it. Definite OS portability issues.
- zfs set atime=off yourpool/yourstoragedataset //Don't fucking write to the pool every time you even look at something.
- zfs set relatime=off yourpool/yourstoragedataset //Same idea; only matters if atime is on, but set it anyway.
- zfs set recordsize=1M yourpool/yourstoragedataset //Or even 4M. Fewer IOPS needed to read data, so it can feel snappier as long as you aren't constantly reading/writing only a small part of a file.
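To sanity-check that all of those stuck (names are placeholders again):
zfs get acltype,compression,xattr,atime,relatime,recordsize yourpool/yourstoragedataset //SOURCE column should read "local" for anything you set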
Ashift (set on pool creation only) should be either 12 or 13. Note that some SSDs have firmware that is optimized for 4K blocksize operations, and may perform better set to 12 even if they technically use 8K blocks. Only testing can tell you which it is.
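You set it at creation time and can peek at what the vdevs actually got with zdb afterwards; pool name and disks here are placeholders:
zpool create -o ashift=12 yourpool raidz2 /dev/disk/by-id/hd1 /dev/disk/by-id/hd2 /dev/disk/by-id/hd3 /dev/disk/by-id/hd4 //force 4K sectors
zdb -C yourpool | grep ashift //verify the value on the existing pool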
Create /etc/modprobe.d/zfs.conf and then add:
options zfs zfs_arc_max=17179869184 //the max ARC RAM usage; 16 GiB converted to bytes
options zfs zfs_prefetch_disable=0 //enables prefetch, good for spinning disks with sequential data
options zfs zfs_txg_timeout=10 //wait time in seconds before flushing data to disks (default is 5)
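Those options only apply when the zfs module loads, so after editing the file you generally rebuild the initramfs and reboot, or poke the live values through sysfs. A rough sketch on Debian/Proxmox-style systems:
update-initramfs -u //zfs is usually loaded from the initramfs on these distros
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max //apply the ARC cap immediately without rebooting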
ZFS tuning performance considerations for VM/database storage
For VMs or databases, I don't really know a whole lot. In practice I just use a barely functional VM on my main system that I turn on every few months for some basic compatibility needs.
What I do know is that, in regards to ZFS, you want to match the recordsize to the typical I/O size of the workload. You likely don't care, but I thought I'd mention that database tables and logfiles are very different workloads: tables want a recordsize close to the database's page size, while logfiles are fine anywhere from the default (128K) up to 1M.
A VM via KVM using the qcow2 format should sit on a ZFS dataset with a recordsize of 8K.
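As a rough sketch (dataset names are placeholders, adjust to taste), that could look like:
zfs create -o recordsize=8K yourpool/vm-images //for qcow2 disk images
zfs create -o recordsize=128K yourpool/db-logs //logfile-style sequential writes are fine at the default or larger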
Note that ZFS is not perfect. Out of the box it's "okay" at most things, but you really need to clarify your use case and have a proper hardware topology and tuning to support that use case. There are also some ongoing performance issues that are being worked on slowly, but those mostly show up only if you are playing around with a load of NVMe drives.
There are also fun ways to speed things up, such as:
- SLOG (often mistakenly called a ZIL, which is ALWAYS present) for speeding up those sync writes.
- L2ARC, for when your working set is larger than your maxed-out RAM, but smaller than some SSDs.
- Allocation classes: use faster storage to hold dedup tables, metadata and/or small I/O instead of sending it to the slow storage.
But in most cases you are better off just sticking your VM on your fast storage to begin with, rather than trying to accelerate slow storage.
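If you do want to experiment with those anyway, the commands are along these lines; device paths are placeholders, and mirror anything that holds metadata since losing a special vdev loses the pool:
zpool add yourpool log /dev/disk/by-id/nvme-slog //dedicated SLOG device for sync writes
zpool add yourpool cache /dev/disk/by-id/nvme-l2arc //L2ARC read cache
zpool add yourpool special mirror /dev/disk/by-id/nvme-a /dev/disk/by-id/nvme-b //allocation class vdev for metadata/small blocks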