With Wendell featuring ZFS and homeservers in quite a lot of videos on L1Techs, we as a community regularly have forum threads dealing with homeservers and storage in all kinds of ways.
The intention of this thread is to give an overview of what ZFS is, how to use it, why use it at all, and how to make the most out of your storage hardware, as well as to give advice on using dedicated devices like CACHE or LOG to improve performance.
I will also mention several easy-to-apply tuning parameters that tailor ZFS behaviour to your specific use case, along with the most important pool and dataset properties you need to know for proper administration.
1.1: What is ZFS?
ZFS is a filesystem like FAT32, NTFS or ext4 for your storage and a volume manager at the same time. No maximum file sizes, no practical limit on the size of volumes, grow/shrink partitions at will, copy-on-write principle = no chkdsk ever again, you can’t corrupt the filesystem, snapshots to restore previous states of a filesystem in milliseconds, self-healing of corrupted data blocks and a hundred other features that make ZFS a carefree experience and solve many problems we all hated when dealing with storage in the past.
The thing ZFS excels at is returning only the data you gave it. Integrity is paramount. If ZFS sees drives reporting errors, it will fix things with redundant data from other drives. If it can’t, because too many drives have failed, it will shut down and lock the pool to prevent further corruption. Other filesystems just don’t care at all, resulting in lots of corrupt data until you realize it, when it’s too late.
1.2: There are two kinds of ZFS out there? Which one is for me?
There are two kinds of ZFS: the proprietary Oracle ZFS used in their ZFS Storage Appliances, and the open source OpenZFS project, which was forked a decade ago after Sun Microsystems went to hell. Much of the core code is still the same, which is why you can even check Oracle’s website for administration guides. There are also features unique to each branch, as the two projects have had different priorities over the last decade.
When I talk about ZFS, I’m always referring to the OpenZFS project, as most people here in the audience probably aren’t engaged in the Oracle ecosystem or buying million-dollar storage appliances for their homelab.
What kind of stuff do I need to build a server?
Not much. An old quad core with some additional RAM and a proper network card will be fine.
More cores allow for higher compression levels. From my experience, 6 Zen 3 cores will handle 10Gbit streams with ZSTD compression just fine. 1G or 2.5G isn’t a problem for any CPU built in the last decade.
You want an extra SATA or NVMe drive as a read cache. When using HDDs, even an old 240GB SATA SSD will improve your reads quite a bit.
It’s rather easy to overbuild a server. So start small, but with room for expansion. Practical use will tell you where the deficits are. More RAM, more drives and more cache can be added later on.
Overkill doesn’t make things faster, it just makes things more expensive
1.3: How do I get it?
Being open source, you can check out the code on GitHub; otherwise, OpenZFS is available for illumos, FreeBSD and Linux in most repositories. macOS and Windows have experimental ports which are very much development builds.
Ubuntu 22.04 (and some derivatives like Zorin), Proxmox and TrueNAS ship with ZFS by default and Unraid has a plugin that adds ZFS capability.
Other than that, commercial NAS systems from QNAP (the more expensive $1000+ models) as well as iXsystems servers come preconfigured for ZFS.
But nothing is stopping you from installing ZFS on e.g. Ubuntu Server and building your own DIY storage server.
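On Ubuntu or Debian this is a single package; a minimal sketch (the package name is the current Ubuntu one, other distros differ):

```
# install the ZFS userland tools plus the kernel module
sudo apt install zfsutils-linux

# sanity check: module loaded, tools working
zfs version
zpool status
```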
Caveats in Linux, why ZFS has no native support and why it can break your system
ZFS isn’t licensed under the GPL (it uses the CDDL) and for this reason can’t join ext4 or BTRFS as an equally treated filesystem in the Linux kernel.
While FreeBSD uses ZFS as its default filesystem for the root partition, Linux users have to tinker quite a bit to get so-called ZFS-on-root. This is very much the stuff for experienced Linux users. So if your distro doesn’t offer ZFS at installation (most don’t), you’re out of luck. Or you follow the OpenZFS documentation or other guides on how to get ZFS onto your root partition.
But having an ext4 or BTRFS boot drive and using ZFS for other drives (e.g. /home) is a compromise with much less work involved. Installing the package and creating a pool is all you need then.
But not being part of the kernel means the OpenZFS team has to ship a new DKMS module for every new kernel out there. This doesn’t always happen in time, and leading-edge distros may find themselves in a situation with a new kernel and no compatible ZFS DKMS module. The data won’t be lost, but if you can’t load the ZFS module, you can’t access your stuff. That is particularly troublesome with ZFS as your root partition.
So I don’t recommend using ZFS on leading-edge distros. Debian, Ubuntu or generally LTS/server kernels are the safe bet when using ZFS.
We all wish it would be different, but licensing and lawyers prevent us from having the good stuff.
1.4: What hardware is required?
ZFS runs on commodity hardware and uses free memory to cache data. There are no special requirements for ZFS. It is often said that ECC RAM is required, but that isn’t true; it’s a recommendation. And using ECC RAM is always recommended, not just for ZFS. So ZFS isn’t different from using e.g. NTFS in that regard.
And ZFS doesn’t like RAID cards at all. Or BIOS RAID. It just wants standard unaltered drives, from your SATA/M.2 port or an HBA.
1.5: ZFS uses a lot of RAM?
ZFS will grab large portions of unused memory (and give it back to the OS if needed elsewhere) to cache data, because DRAM is fast while your storage is slow. Unused RAM is wasted RAM. It’s essentially like a dynamic and intelligent RAM disk where the most recently and frequently used data is stored, so that data is accessed at main memory speed. 30GB/s CrystalDiskMark results in your VM while using HDDs? Yeah, that’s proper ZFS caching at work.
In fact it is rather hard to measure actual disk performance, because ZFS optimization and caching is omnipresent.
Otherwise, ZFS runs fine on systems with just 8GB of memory. It just can’t use many of its features to speed things up when there is little memory to work with. In this case you can expect below-average filesystem performance and slow disks, pretty much how storage looks elsewhere.
2. Pool config and properties
With all things storage you have to decide on what the priorities are. The magic storage triangle consists of capacity, integrity and performance. It’s a classical pick 2 out of 3 dilemma.
               capacity
                  /\
                 /  \
                /    \
               /      \
  performance /________\ integrity
ZFS offers all the usual RAID levels: RAID0 is called a “stripe”, RAID1 is called a “mirror”, and the RAID5/6 equivalents are called RAIDZ. RAIDZ comes in 3 levels (RAIDZ1/2/3), each adding an additional disk worth of parity.
RAIDZ2 and RAIDZ3 still allow for self-healing and full data integrity even after a disk has failed. Replacing a disk can take days, with large HDDs being at full load all the time, so people like to have that extra safety during this critical process.
Usually, parity RAID like RAID5 has to deal with the infamous “RAID5 write hole”, which can kill your RAID during a power loss. RAIDZ in ZFS doesn’t have this problem, so we don’t need battery backups and RAID cards to deal with this flaw.
Otherwise, mirrors are probably the easiest, fastest and most flexible vdevs to work with in ZFS. You can add and remove mirrors from the pool, and expansion is really easy: just plug in two new drives. With only 50% storage efficiency, it’s always a hard sell. But the alternatives aren’t without compromises either.
No one can escape the magic triangle
You can use a pool configuration without any redundancy. You get a warning you can override, but this isn’t a recommended setup for obvious reasons.
I use ZFS on my laptop on a single disk, because I still get the snapshot feature, easy ZFS-to-ZFS backups and the ARC cache that way. So there are valid use cases for non-redundant ZFS pools.
But any proper storage server has some level of redundancy, because no one wants to lose all of their data. Greed gets your data killed eventually.
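To make the layouts above concrete, creating a pool looks roughly like this (the pool name tank and the /dev/sdX names are placeholders; for real pools you’d rather use /dev/disk/by-id paths):

```
# two-disk mirror (RAID1)
zpool create tank mirror /dev/sda /dev/sdb

# six-disk RAIDZ2 (two disks worth of parity)
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

# single disk, no redundancy (the laptop case)
zpool create tank /dev/sda
```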
What is a vdev and do I need one?
Each RAID group or single disk is a vdev (virtual device). A pool can have as many vdevs as you like. If I have a mirror of two disks, I can just add another mirror to the pool to double the storage capacity. The data automatically gets striped across vdevs, essentially making it a RAID10. Or a RAID50 if you add a new RAIDZ vdev to a pool with an existing RAIDZ.
This is how expansion works: adding another vdev to the pool. If you have a RAIDZ2 with 6 disks, you need another RAIDZ2 with 6 disks to expand. Mirrors just need 2 drives for an expansion. The lack of flexibility in RAIDZ expansion is something ZFS is often criticized for. But that’s what we’re dealing with (for now).
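As a sketch, that kind of expansion is a single zpool add (placeholder device names again):

```
# turn a single mirror into striped mirrors (RAID10-like)
zpool add tank mirror /dev/sdc /dev/sdd

# or add a second 6-disk RAIDZ2 vdev next to an existing one
zpool add tank raidz2 /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl
```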
BUT: If one vdev dies, the entire pool dies. So make sure each vdev has the same level of redundancy. You don’t want weak links in your chain.
----- ARC -----
ARC stands for adaptive replacement cache. It is the heart and nexus of ZFS and resides within main memory. Every bit of data goes through it and ARC keeps a copy for itself. As far as caching algorithms go, it’s fairly sophisticated.
Data that is used more often gets a higher ranking. If you run out of free memory, the ARC will evict the data with the lowest ranking to make space for new stuff. It will prioritize keeping the most recently used (MRU) and most frequently used (MFU) data you read or wrote, while also keeping a balance between those two.
So memory is fairly important. More memory means a larger ARC: less stuff has to be evicted and less data has to be fetched from slow storage. If the data is stored with compression, the ARC will store the compressed state. This can result in 64GB of memory actually holding 100GB+ of data. That’s free real estate.
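If you’re curious what your ARC is actually doing, OpenZFS ships a couple of reporting tools (names as shipped on Linux; output obviously varies per system):

```
# summary of ARC size, hit rate and MRU/MFU balance
arc_summary

# live statistics, one line per second
arcstat 1

# raw kernel counters on Linux
cat /proc/spl/kstat/zfs/arcstats
```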
In an ideal world, we just use DRAM for storage. But the stuff is bloody expensive, so we have to deal with slow storage like HDDs and NVMe for the time being and use some tricks to make the best out of this misery.
2.1: Other devices besides the disks storing user data
ZFS has several classes of devices that help and assist the pool to speed things up. Like adding an SSD to act as a read cache. We’ll talk about all of them in the following paragraphs and why you want them or why you don’t need them.
L2ARC aka cache
aka read-cache
I told you that ZFS keeps the most recently used (MRU) and most frequently used (MFU) data in memory so you can access it at main memory speed.
Well, RAM isn’t cheap and there are only so many slots on your board. Thus the L2ARC was invented, which is usually one or more fast devices (e.g. an NVMe SSD) where the data that spills over from the ARC (main memory) gets stored. It may have been evicted from the ARC, but it is likely that this data will be used again later; your RAM was just too little to store everything.
We don’t get main memory speed if we need that data, but it’s still way faster than getting it from an HDD. And we increase the amount of data that can be cached by the size of the L2ARC device. Essentially you’ve increased your RAM for caching purposes by a TB when using a 1TB NVMe drive, just a bit slower (DRAM vs. NVMe).
The L2ARC uses a (usually very small) amount of memory to keep track of what is stored on your L2ARC. For a pool with default settings this isn’t noticeable and the gains far outweigh the overhead it produces.
Here is a rough table to illustrate the often misinterpreted and exaggerated memory usage of L2ARC headers:
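(These are my own back-of-the-envelope numbers, assuming roughly 70 bytes of ARC per L2ARC record, the ballpark usually quoted for current OpenZFS; older releases needed considerably more.)

| L2ARC size | avg. record-/blocksize | records on L2ARC | RAM used for headers |
| --- | --- | --- | --- |
| 1 TB | 16k | ~64M | ~4.4 GB |
| 1 TB | 64k | ~16M | ~1.1 GB |
| 1 TB | 128k | ~8M | ~0.6 GB |
| 2 TB | 128k | ~16M | ~1.1 GB |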
Your typical homeserver ranges from 64k to 256k average record-/blocksize. Some datasets may have less, others will have vastly more.
It’s usually a no-brainer to use an L2ARC because you can cache so much more data. And you don’t need expensive enterprise drives or redundancy, because it just stores copies of data already present on the (redundant) data vdevs.
L2ARC is probably the highest priority when deciding on what vdev to add to the pool, because caching and accelerating reads is generally desired by most people.
You will find many old guides about ZFS and L2ARC on the internet. Most don’t like the L2ARC, but these were often written when an L2ARC was some fancy, expensive PCIe accelerator card or a very expensive first-gen 100GB SATA SSD. Just going for more memory was the better deal back then. This changed with cheap consumer NVMe drives being sold in supermarkets and gas stations.
You can add or remove a CACHE at any time.
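For reference, adding and removing a cache device is a one-liner each (device name is a placeholder):

```
# add a whole NVMe drive as L2ARC
zpool add tank cache /dev/nvme0n1

# remove it again later, no harm done
zpool remove tank /dev/nvme0n1
```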
SLOG aka separate log device aka log
aka “write-cache”
ZFS has no “real” write cache like one usually imagines a write cache. But first we have to talk about the two kinds of writes:
ASYNC writes:
That’s your standard way of doing things. Most stuff is ASYNC. When in doubt, it’s this.
ZFS writes the data into memory first and bundles everything up in so-called transaction groups (TXGs), which are written to disk every 5 seconds by default. In essence it converts a bunch of random writes into sequential writes. This obviously increases write performance by quite a lot; there is nothing worse than random I/O. So this makes our HDD write at 200MB/s instead of 3MB/s. Same goes for SSDs: GB/s instead of 60MB/s.
But your volatile memory gets wiped on power loss or a system crash. So having data only in memory during an incident makes you lose that data. The copy-on-write nature of ZFS prevents the filesystem from corruption and you revert back to the last TXG, essentially losing the last 5 seconds.
SYNC writes:
But there are some nasty applications that insist on their data being written to disk immediately, or they refuse to proceed with their code. This is because of the data integrity concerns mentioned above, which is a valid point; most storage isn’t as safe as ZFS. But it complicates things.
We’re now back to abysmal write performance, with random writes dragging things down.
And that’s where the SLOG comes into play. All SYNC writes go to that device, which is very fast, and ZFS gets to do its TXG thing as usual. After 5 seconds the SLOG gets wiped and stores the next 5 seconds of data.
We now have a physical copy on the SLOG and ready-to-flush data in memory. Power outages aren’t a concern anymore and we still get fast writes.
This needs a dedicated drive with a lot of write endurance and very good random I/O capabilities.
A mirrored pair of enterprise-grade NAND flash with high endurance ratings, Optane, or equivalent is the recommended setup. As you only need to store 5 seconds’ worth of writes, 8-16GB is all you need. You can overprovision your SSD down to that amount to increase endurance.
You can remove a LOG from the pool at any time.
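Adding one looks roughly like this (placeholder device names; the mirror is optional but common, and for removal you reference the log vdev name shown in zpool status):

```
# add a mirrored SLOG
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# remove it again, using the log vdev name from zpool status (e.g. mirror-1)
zpool remove tank mirror-1
```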
Here is a graph with colored arrows to illustrate SYNC:
So Exard…I trust my memory and I got UPS and shit. I want to treat all applications the same. That whole SYNC stuff is slow and annoying and I don’t want to spend hundreds of Dollars and use valuable M.2 slots.
Well, you can just set sync=disabled for each individual dataset (or set it pool-wide), making ZFS just lie to the application, pretending it’s all stored on disk already while in fact it’s still in memory.
This is the highest-performance option for dealing with SYNC writes. Faster than any SLOG can ever be. But the inherent complexity of SYNC operations still results in lower performance than ASYNC. The choice is yours, there are good arguments for both options.
You can even set sync=always, so every write will be treated as a SYNC write. This obviously results in higher load on your SLOG, as every bit that is written will be written to the SLOG first. Or to your storage if you don’t use a SLOG.
If you are unsure how much of a difference a SLOG will make and what utility the concept of TXGs provides, you can change the sync setting during operation at any point and test how performance and the pool react.
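A quick way to experiment, sketched with a hypothetical dataset name and fio generating the synchronous writes:

```
# check and flip the sync behaviour on the fly
zfs get sync tank/vms
zfs set sync=disabled tank/vms

# hammer the dataset with small synchronous writes, then compare settings
fio --name=synctest --directory=/tank/vms --rw=randwrite --bs=4k --size=1G --fsync=1

# back to the default
zfs set sync=standard tank/vms
```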
I tested transferring a million small files to my pool with and without a SLOG and with different sync settings. The HDDs told me immediately (judging from the screaming drive noises) what their preference was. It wasn’t a surprise, I was just impressed by how big the difference in performance was.
So we just dealt with bad random write performance. If anyone complains that HDDs are bad at this: it doesn’t matter to you when using ZFS.
But I want a proper write cache
Just write some 50GB of data at network speed and let ZFS figure out when and how to write it to disk? That’s what most people imagine a write cache does. There is no such thing in ZFS, for several reasons, and it’s not even on the list of planned features. It ain’t gonna happen.
But ZFS has some 200 cogs to tweak things:
If you have enough memory, you can raise the limits on the transaction group timeout and the amount of dirty data within the ARC (the stuff that’s not yet written to disk) to make a pseudo write cache. This obviously comes with a bunch of disadvantages regarding safety and general waste of memory.
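Just as a teaser, these are the module parameters involved on Linux; the values in the commented lines are purely illustrative, not a recommendation:

```
# current values
cat /sys/module/zfs/parameters/zfs_txg_timeout
cat /sys/module/zfs/parameters/zfs_dirty_data_max

# e.g. flush TXGs every 10 seconds and allow up to 16GiB of dirty data,
# set persistently via /etc/modprobe.d/zfs.conf:
#   options zfs zfs_txg_timeout=10
#   options zfs zfs_dirty_data_max=17179869184
```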
But we’ll talk about zfs tuning parameters later.
— Special allocation class aka special
aka special vdev —
Wouldn’t it be great to store all the small files on an SSD and the large files on HDDs? Each utilizing its own strengths and negating the weaknesses of that particular type of storage. Usually this doesn’t work. But with ZFS you can.
Even the man himself, Wendell, raised attention for this (relatively) new and niche feature of ZFS.
Special allocation classes have multiple uses: storing metadata, storing small files, or storing the table used for deduplication. This is all that nasty little stuff your HDD hates, where you get 2-5MB/s and HDD access noise all over the place if you have to read a lot of it. The very reason why we switched to SSDs asap.
What is metadata?
Every block has metadata associated with it. When was the block written? What’s the compression? The block checksum? The disk offset? Basically the whole administrative package leaflet for your data, so everything is correct, measured and noted for further reference. By default ZFS stores multiple copies of metadata (up to 3 for the most critical parts) just to make sure this doesn’t get lost.
And for each record or block, one “piece” of metadata is produced. So the amount of metadata scales with the number of records, i.e. inversely with the record-/blocksize you are using: e.g. an 8k volblocksize ZVOL produces 16x as much metadata as a 128k volblocksize ZVOL for the same amount of data.
So no one can actually tell you how many GB of metadata to expect unless you tell them what record-/blocksize you usually see and use on your pool. Typically it’s maybe 0.5% of your data, but it can vary greatly as mentioned above.
If you feel like small files and metadata don’t get cached by ARC+L2ARC enough, special vdev can improve things. But so does increasing cache. Or tuning parameters. Hard to tell what’s best and useful as a general recommendation.
special_small_blocks:
This is the dataset property that sets the limit on how big the blocks (and thus small files) stored on your special vdev can be; everything larger gets allocated to your standard data vdevs.
zfs set special_small_blocks=64k tank/dataset sets the dataset to send only blocks smaller than or equal to that value to the special vdev.
I’m personally using a 128k value, while my datasets usually go with a 256k or higher recordsize. The value has to be lower than your recordsize to make sense.
If the device is full, ZFS just allocates further data to standard data vdevs as normal. So things don’t suddenly break if you run out of space.
Hardware recommendation for special
You want a device that is faster than your standard storage and good at random reads. So any decent consumer SATA or NVMe SSD will be fine. It just needs equal or better redundancy than your data vdevs. This is because the stuff actually lives there and it isn’t some cache device storing copies. If the special vdev dies, your pool dies. Metadata dead = pool dead. So use a mirror or better.
You can remove a special vdev if everything else in your pool is also mirrors. If you’re using RAIDZ, you can’t remove the special vdev.
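Adding one is the same zpool add pattern as before (placeholder device names):

```
# add a mirrored special vdev for metadata and small blocks
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# check where it ended up
zpool status tank
```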
3. Datasets and Zvols
— Datasets and properties —
Each dataset is its own filesystem you can mount, export and tailor to your needs. You can imagine them as folders/directories within a pool.
Generally you want to have a lot of them and not run everything in a single dataset (the pool itself is a dataset too). If you want to roll back to a previous snapshot, you just restore the snapshot of that dataset; no need to roll back the entire pool. And you may not want the craziest compression for everything, but just for some purely archival datasets.
We create new filesystems with zfs create tank/movies
The upper limit on datasets is 2^48, so make use of them.
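You can also pass properties (more on those below) right at creation time; the dataset names here are just examples:

```
# archival media: big records, heavier compression
zfs create -o recordsize=1M -o compression=zstd tank/movies

# general file share: defaults are fine, just no atime
zfs create -o atime=off tank/share
```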
There are like 50 properties. You can usually ignore most of them, some provide useful statistics, but a couple of them are essential for building an efficient and well-managed pool.
zfs get all tank/dataset returns a list of all of them, so you can check that everything looks right.
zfs get compressratio tank/dataset only returns the compression ratio of the data within that dataset.
My Top 10 properties (that you’ll probably run into at some point) are listed below; a few example commands follow after the list:
recordsize
Sets the upper limit on how big the chunks are that ZFS allocates to disk. 128k is the default. Datasets with purely small stuff benefit from a lower recordsize, while large media datasets work just fine with 1M or higher. Can speed up reads and writes. You can split a 1MB file into 256 pieces or use one large 1M box. Your HDD loves the large ones. But reading and writing the whole MB just to change a bit somewhere in the middle is far from efficient. It’s always a trade-off.
compression
Set your favorite compression algorithm. LZ4 is the default and really fast for the compression it achieves. ZSTD is the new and fancy kid on the block and very efficient. GZIP is available too, but slow. I always recommend using compression. The LZ4 algorithm does gigabytes per second per core, just to give you an idea of how cheap compression is these days.
atime
You want to turn this off immediately unless you absolutely need atime (last-access-time updates written on every read).
casesensitivity
Useful when dealing with Windows clients and anywhere case sensitivity may become a problem.
sync
Consult the SLOG paragraph above on why you may want to change sync
copies
You can store multiple copies of data in a dataset. It’s like a RAID1 on a single disk but only within this dataset. Useful for very important data or when you don’t have redundancy in the first place but still want self-healing of corrupt data. Obviously doesn’t protect against drive failure.
mountpoint
Well, it’s where you want the dataset to be mounted in the system.
primarycache & secondarycache
You can exclude datasets from the cache. Or exclude everything by default and only allow some datasets to use the cache. Primary is the ARC, secondary the L2ARC.
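And as promised, setting any of these is a one-liner; dataset names are placeholders and children inherit what you set on the parent:

```
zfs set atime=off tank
zfs set compression=zstd tank/backups
zfs set recordsize=1M tank/movies
zfs set sync=disabled tank/vms
zfs set primarycache=metadata tank/scratch
```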
----- ZVOLS -----
If datasets are your filesystems you can mount and share, Zvols (ZFS volumes) are volumes that are exported as raw block devices, just like your HDD/SSD in e.g. /dev/sda or /dev/nvme0. With ZFS, you can create these devices at will. We can format them, mount them, share them via iSCSI. We’re not using files or folders anymore, but entire (virtual) disks.
We create Zvols with zfs create -V 2TB tank/zvolname
Your Windows VM just sees a hard disk, not some network share. And although Windows formats the disk with NTFS, it is ultimately stored on our ZFS pool and we get all the features like snapshots, compression, self-healing, resizing, integrity, etc. Things you normally won’t get with NTFS.
Hardware that can boot from iSCSI can just use your Zvol as its boot drive. There are a lot of cool things you can do.
For home usage, sharing a Zvol via iSCSI is also an alternative way of providing storage outside of SMB or NFS. If you don’t want network folders but an entire disk, Zvols are the way to go.
I have around 30 Zvols on my pool, most serve as VM disks and some are shared via iSCSI to other PCs. Some things just don’t work well via “network folder”, be it permissions, protocol overhead or OS/application not treating an SMB share like it would for a local drive.
We talked about dataset properties… Zvols have similar properties, but the most important difference is that recordsize is replaced with volblocksize.
Determining the optimal blocksize is a science by itself. Here are some basic guidelines you should consider:
You should ideally match the volblocksize to the cluster size of the filesystem that will be used on top of it.
Low values are better for random reads and writes, and larger values offer better throughput.
If you only have 128k boxes and want to store a 1kb file, that’s a lot of wasted space. ZFS compresses rather well, but it’s still far from optimal. On the other hand, reading 256 blocks (at 4k volblocksize) just to read a single 1MB file takes a lot of time. With a 128k volblocksize, you only need to read 8.
I know that TrueNAS and Proxmox have default settings when creating Zvols, and they aren’t bad as a general recommendation. Just keep in mind that you can’t change the volblocksize after creation. And digging into the low-level details of storage concepts and sector/cluster/block sizes may get you a more optimal blocksize for your use case.
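If you do want to pick the volblocksize yourself, it’s just another -o at creation time (size and name are examples):

```
# 100G zvol for a VM, 16k blocks as a middle ground
zfs create -V 100G -o volblocksize=16k tank/vm-win10

# check what an existing zvol uses
zfs get volblocksize tank/vm-win10
```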
We can also create Zvols with thin-provisioning. This only uses the space on the ZFS pool that is actually being used instead of allocating the maximum value at creation.
We use the -s sparse option to do this: zfs create -s -V 2TB tank/zvolname
Keep in mind that a Zvol can’t be used by multiple clients at the same time like a general SMB/NFS share can.
To be continued…