ZFS - best setup, very small box, 4 drives in 2022?

Am looking at rebuilding my NAS, currently running Ubuntu Server 20.04 with 4x8TB drives (plus an old 128GB SSD as cache, but I don’t think it’s doing anything useful). The system has 32GB of RAM, I don’t think I’ve seen it more than half used.

If I reformat this, what’s the best layout for ZFS these days? Record sizes? Compression? Apparently I created this pool back on January 9 2016, I don’t think I’ve done much maintenance since then.

I think I’ll forgo the cache this time and move to a 1TB SSD as boot drive (currently on SATA DOM).

I was concerned about https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1906476 but it looks like it’s been resolved

  pool: BoxODisks
 state: ONLINE
  scan: scrub repaired 0B in 0 days 11:47:22 with 0 errors on Sun Feb 13 12:11:23 2022
config:

        NAME                        STATE     READ WRITE CKSUM
        BoxODisks                   ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000cca260c43584  ONLINE       0     0     0
            wwn-0x5000cca260c4e52f  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            wwn-0x5000cca260c438c1  ONLINE       0     0     0
            wwn-0x5000cca260c2148c  ONLINE       0     0     0
        cache
          sdc                       ONLINE       0     0     0

The system has 32GB of RAM, I don’t think I’ve seen it more than half used.

By default ZFS only uses up to half the RAM available. This can be changed to whatever you like. Note that the system asking ZFS to free up RAM is very slow, and sometimes not fast enough to prevent an out-of-memory error, so give your system plenty of breathing room. Honestly, 16GB is more than enough for that pool size.

Double check these commands, but you can change the max ARC size the following ways:

# To get the current ARC size and various other ARC statistics, run this command:
cat /proc/spl/kstat/zfs/arcstats
# Changes the max ARC size immediately (34359738368 bytes = 32GiB here), though the ARC may take a while to shrink. Lasts only until reboot
echo "34359738368" > /sys/module/zfs/parameters/zfs_arc_max
# Permanently changes the max ARC size, but only takes effect after the next reboot.
echo "options zfs zfs_arc_max=34359738368" >> /etc/modprobe.d/zfs.conf

If I reformat this, what’s the best layout for ZFS these days?

This comes down to personal opinion, but for me the answer is: mirrors are best for 6 or fewer disks. Beyond that, use 6-wide RAIDZ2 vdevs. I don’t like RAIDZ1 (or RAID5).

A slightly more complicated answer is that RAIDZ sacrifices performance for what can sometimes be considerably less space efficiency than you might expect. Because of how RAIDZ allocates blocks, small blocks (4K and 8K, and to a lesser extent 16K) can actually take up 2x their space on RAIDZ1, the same as a mirror. Likewise, RAIDZ2 amplifies the space occupied by 3x, like a triple mirror. And in all of these cases the performance is going to be terrible compared to mirrors. Things are generally fine past 32K though, and honestly most of your storage is going to generate much larger blocks. Unless you are using a ZVOL on RAIDZ, in which case you’ve made a terrible mistake.
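For a 4-drive box like yours, here’s a hedged sketch of creating the striped-mirror layout. Pool and disk names are placeholders; use the wwn-* or /dev/disk/by-id paths like in your existing pool, and double check everything, since this destroys whatever is on those disks.

zpool create -o ashift=12 -O compression=lz4 -O atime=off your_pool \
    mirror wwn-0xAAAAAAAAAAAAAAAA wwn-0xBBBBBBBBBBBBBBBB \
    mirror wwn-0xCCCCCCCCCCCCCCCC wwn-0xDDDDDDDDDDDDDDDD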

The long complicated answer is here:


Record sizes?

For general NAS storage, set this to a 1M or 4M recordsize. You are most likely writing once and accessing rarely. Going higher generally doesn’t give much benefit unless you hoard video like me.

If you have an active dataset, like for torrents, logs, or databases, then consider leaving it at the default 128K, though 64K should be fine too. It’s normally suggested to match the recordsize to the write size (like the cluster_size of a database). However, because of how ZFS works, a slightly larger recordsize trades a little performance loss now for greatly reduced free space fragmentation (and thus saved performance) later. The performance degradation from free space fragmentation isn’t something that shows up in traditional quick and stupid benchmarks, and it doesn’t really become pathological until the pool is getting close to full, so it’s more of an enterprise concern. Also note that some free space fragmentation is not only expected but a good thing; it’s only a problem when it becomes high and consists mostly of small blocks.

Again, double check these commands:

# Immediately changes max recordsize to 16M until reboot.
echo "16777216" > /sys/module/zfs/parameters/zfs_max_recordsize
# Permanently changes max recordsize to 16M
echo "options zfs zfs_max_recordsize=16777216" >> /etc/modprobe.d/zfs.conf
# Actually sets the recordsize on your pool root or datasets
zfs set recordsize=4M your_pool/your_dataset

In order to change the block sizes of data already on a pool, you’ll have to copy things to a new dataset with rsync or something (and verify everything is there). Doing a send/recv doesn’t change block sizes.
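A hedged sketch of that copy, with hypothetical dataset names; verify the data before destroying anything:

# Create a new dataset with the recordsize you want, copy into it, then verify
zfs create -o recordsize=1M your_pool/media_new
rsync -aHAX --info=progress2 /your_pool/media/ /your_pool/media_new/
# Only after you've verified the copy:
zfs destroy -r your_pool/media
zfs rename your_pool/media_new your_pool/media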


Compression?

Leave it at LZ4, which is the default. The most space-consuming items on your NAS are going to be pictures or video, which are already compressed.
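If you want to double check what an existing pool is actually doing (hypothetical pool name):

# Show the compression setting on every dataset
zfs get -r compression your_pool
# Set lz4 explicitly at the pool root so new datasets inherit it
# Note: only data written after the change gets the new compression
zfs set compression=lz4 your_pool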


Apparently I created this pool back on January 9 2016, I don’t think I’ve done much maintenance since then.

I generally set the following on my root pool so every new dataset inherits these.

  • atime=off
  • relatime=off
  • recordsize=4M

The following may possibly have non-Linux ZFS compatibility issues. I’m not sure what the state of things is, but if you’re on Linux you generally want them (a hedged example of setting everything follows the list).

  • xattr=sa
  • dnodesize=auto
  • acltype=posixacl
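Example of setting all of the above on the pool root so children inherit (pool name is a placeholder; recordsize=4M needs zfs_max_recordsize raised first, as shown earlier):

zfs set atime=off your_pool
zfs set relatime=off your_pool
zfs set recordsize=4M your_pool
zfs set xattr=sa your_pool
zfs set dnodesize=auto your_pool
zfs set acltype=posixacl your_pool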

Also, you should ALWAYS manually create a new pool with ashift=12 unless you have proof otherwise. Some drives lie about their sector size, and ZFS will default to the reported physical size even when it’s fucking stupid, like ashift=9 (512-byte sectors).
So use:

zpool create -o ashift=12 ...

You can also set the following to make it automatic in case you forget.

# Immediate setting of a minimum value of 4096 byte sectors (ashift=12) for automatic ashift
echo "12" > /sys/module/zfs/parameters/zfs_vdev_min_auto_ashift
# Permanent setting of a minimum value of 4096 byte sectors (ashift=12) for automatic ashift
echo "options zfs zfs_vdev_min_auto_ashift=12" >> /etc/modprobe.d/zfs.conf

Let me know if there are any errors.

Is there any downside to creating lots of datasets?
I’m thinking I’m going to put /var and /etc on their own datasets so I can do zfs send and have snapshots for backups.

Not at all. I have over a dozen. I believe when you get to many thousands of datasets + snapshots, then certain zpool/zfs commands can become very slow, but most people aren’t going to run into that.

As for splitting off directories into datasets, I intend to do this at some point, but currently I don’t really have experience with it and so don’t have much in the way of recommendations.

One thing to watch out for is the datasets trying to mount when sent over to the backup machine.

Thought these posts might be helpful in addressing the issue. Definitely test sending over a dataset, and see if it mounts. If it doesn’t, try exporting and importing the pool to make sure it still won’t mount.

There’s also some stuff about possibly setting the org.openzfs.systemd:ignore property, mentioned here https://github.com/openzfs/zfs/issues/10530 and here https://github.com/openzfs/zfs/issues/10356#issuecomment-641237561, when using systemd.
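A hedged sketch of the receive side, with hypothetical pool/dataset names:

# -u receives the dataset without mounting it on the backup machine
zfs receive -u backup_pool/your_dataset
# Keep it from mounting later, and tell the systemd mount generator to ignore it
zfs set canmount=noauto backup_pool/your_dataset
zfs set org.openzfs.systemd:ignore=on backup_pool/your_dataset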

Just a mention that you probably don’t need or want a separate boot drive: you can have a boot partition on every data drive, and then your machine will still boot even if any of the drives fail. It depends whether you want a) simple, where you don’t care if there is downtime if (when) the SSD fails, or b) a little more complex, but it will keep trucking through any drive failure.

Sure, it might start up a few seconds faster on an SSD, but I’d bet that for NAS duties, once all the commonly used files of your services are in cache, you won’t notice a difference in performance. Then all of that SSD can be used as ZFS L2ARC, which you might actually notice a difference in performance with.

For setting it up, this is one of the only suitable (IMO :slight_smile: ) uses for hardware RAID-1.

Without hardware RAID it’s still possible; the only complication is that the ESPs need to be separate FAT32 file systems.

In the pre-UEFI days you could use software RAID-1 on your /boot partition, because the BIOS doesn’t understand filesystems and boot loaders would never write to it. But with UEFI (and UEFI loaders), they can modify the filesystem, which would cause a RAID inconsistency.

With UEFI:

  • Create an ESP partition on each drive.
  • Use efibootmgr to create an EFI boot entry for each device.
  • Create a package manager post-install script that rsyncs the filesystem mounted at /boot to the ESPs every time the kernel/initramfs is changed (a sketch follows below).
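A hedged sketch of those last two steps, with placeholder devices and mountpoints (the loader path assumes systemd-boot; adjust for whatever you actually boot):

# Register an EFI boot entry for the ESP on each drive (repeat per drive)
efibootmgr --create --disk /dev/sda --part 1 --label "Linux sda" --loader '\EFI\systemd\systemd-bootx64.efi'
# The post-install hook is basically just this, run whenever the kernel/initramfs changes
for esp in /efi0 /efi1 /efi2 /efi3; do rsync -a --delete /boot/ "$esp/"; done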

I have 4 NVMe drives in a pool with this setup, which works great.

I never quite understood how this works - isn’t there a RAM penalty as well?

What happens if these get out of sync?

It looks like Install Arch Linux on ZFS - ArchWiki has most of what I need to know for the actual setup if I go for ZFS on the root as well.

Is there any way I can test the structure on a VM and then copy it over on to the real drives?

If the initramfs contains every module you need, or you don’t clean up old modules from /lib/modules, then nothing - you’ll just be on an older kernel than you expected.

Some distros remove old modules from /lib/modules, so if you boot an older kernel there are no matching modules for e.g. your network card (if you don’t put it in your initramfs).

Yes.


You can put your /boot (incl. your EFI) onto FAT32 on an mdraid mirror (3-way/4-way mirror as needed) on a GPT partition typed EF00, and UEFI will happily boot systemd-boot, which will happily load the kernel and initrd from your FAT32-formatted ESP. This can mount a dataset from your RAID as “/” and start all the other services.
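A hedged sketch of that mdraid ESP, with placeholder partitions; metadata=1.0 puts the md superblock at the end of the partition so the firmware just sees a normal FAT32 filesystem:

mdadm --create /dev/md/esp --level=1 --raid-devices=4 --metadata=1.0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mkfs.fat -F32 /dev/md/esp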

L2ARC is your read cache. Frequently/recently used data won’t all fit into memory to be cached and has to be fetched from slow HDDs.
With L2ARC you can extend your max. 32GB ARC cache with a fast NVMe and read at NVMe speeds on a cache hit.

The ARC stores the information of whether and where that data resides in the L2ARC, so there is some bookkeeping that uses RAM. But unless most of your data is stored in datasets with 4K or 8K recordsizes, this is negligible, considering the huge increase in cache size.
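As a rough, hedged sketch of the math (assuming the commonly cited figure of roughly 70 bytes of ARC header per L2ARC record in current OpenZFS): 1TB of L2ARC holding 128K records is about 8 million records, or around 0.5GB of RAM; at a 1M recordsize that drops to roughly 70MB, while at 8K records it balloons to something like 9GB.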

Really, any NVMe will do; L2ARC is cheap to get these days. Memory is faster, sure, but you can’t get a TB of memory for $100, and that’s the essence of L2ARC :wink:
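Adding one is a single command; a hedged example with a hypothetical device path (double check which device you’re grabbing):

zpool add your_pool cache /dev/disk/by-id/nvme-YOUR_SSD_SERIAL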

I feel like the RAM penalty of L2ARC gets overstated. I saw a chart of the usage over different sized drives, and even at the default 128KiB record size, 1TB of L2ARC takes up a very small amount of RAM.

You have a lot of interesting suggestions. I have an already created pool (and no possible way to wipe and recreate it) and am curious what improvements I might see by changing some of my values to match your suggestions.

My NAS does general storage with a decent-sized media library, and I haven’t done much in the way of “optimizing” the block sizes, having just left it at the default 128KiB for almost everything. The only thing I adjusted was a downloads dataset that is set to 32KiB; I’m not super concerned about the fragmentation because a lot of that data is either temporary or won’t be used directly, e.g. I’ll download an updated copy of Debian, move it to my PC, and then flash my USB.

What kind of benefit would I see by splitting my videos into their own dataset with a significantly higher recordsize?

I also haven’t fiddled with the other dataset options and left them at the defaults FreeNAS (now TrueNAS) set at creation. Are there any trade-offs to disabling atime or relatime? (I don’t actually see an option for relatime in the menu; I’d have to switch to the command line for that.)

EDIT: Actually the downloads are set to 16KiB, and the UI now shows a warning against it. I might need to adjust that dataset too if it’d improve performance.
Also, it looks like atime is already disabled on my actual storage dataset but not the pool’s root dataset.

EDIT2: Researching on my own a little as I wait for a response. One site suggested that much of a torrent client’s download would actually sit in the ZIL (presumably in RAM) until hitting recordsize, and so wouldn’t actually suffer at all from larger recordsizes. So I might have good reason to increase that dataset to larger values, as well as a video dataset.

Videos tend to be massive, being multiple MB or sometimes even GB in size. By increasing the block size, you greatly reduce the IO needed to obtain all the blocks for a file; larger blocks essentially reduce seek times. Larger blocks also greatly reduce the amount of metadata that needs to be read or changed, which further reduces IO. My bulk storage is all on HDDs, which are awful at IO since they take a long time to change positions, but once in position are great at spitting out data.

That being said, it’s not always a case of larger is better, because going really large can actually increase latency. Larger blocks obviously take more time to fully read out, and ZFS needs to read them completely before it can send the data back, which matters for a few reasons:

  • I believe that some programs, like video players, can start doing things immediately with only partial data. I’m not entirely clear on what’s happening underneath here though.
  • If a program only wants to change 1 byte of a file, then obviously reading and rewriting a 16M block is inefficient.
  • ZFS does parallel read operations when it can. So a 4M file as one large block could potentially be read more slowly than 4x 1M block reads when the blocks are spread across different vdevs. SSDs are great at doing parallel operations even on a single drive, so large blocks can potentially have higher latency on those as well.

I personally prefer to just have a blanket default of 4M for most of my storage to keep metadata down.

Only thing I adjusted was a downloads dataset that is set to 32KiB 16KiB

Is this general downloads, or is it a torrent download folder?
If it’s general downloads, you definitely want it at least at the ZFS default of 128K, or larger.
For torrents:

  • Recordsize is most important for HDDs, because they are severely limited on random R/W IO. For SSDs it isn’t as big a deal.
  • Most torrents seem to deal with 16K-64K blocks of data. As such, for a download dataset, I’d probably set recordsize to 32K-64K. This will reduce free space fragmentation later.
  • Then when the torrent is completed and I want to do heavy seeding, I’d have the client move the file to a 16K to 256K recordsize dataset, which gets rid of the file fragmentation. If your files can fit in RAM (ARC) then there’s no issue with larger sizes, because the reads come from ARC. If you have very limited RAM, then smaller recordsizes are better as they reduce drive reads. You may need to experiment to find what’s optimal; 16K IO is not good for trying to fill bandwidth with.
  • If you’re done seeding, or only doing it lightly and have plenty of RAM, just move it to a normal 1-4M dataset.
  • Torrent seeding is actually the perfect use case for setting an SSD up as L2ARC.
  • If you have enough RAM, you can play around with things like increasing the zfs_dirty_data_* and zfs_txg_timeout settings to improve ZFS’s ability to coalesce pending writes before committing them (hedged examples below). Note that this affects all pools and can have potentially adverse effects. It’s easy to get too clever with ZFS tunables and worsen your performance.
  • Make sure to turn off file download pre-allocation, which does nothing but create unnecessary IO on COW filesystems like ZFS.
  • Please make sure you turn off atime. atime means every time a file is accessed (read), the metadata gets updated with a new last-read time. This is awful as it is, and even worse for small blocks. The only thing that depends on atime is some email servers, I think; nothing you run will normally need it.
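Again, double check these; hedged examples of the tunables mentioned above, with a hypothetical dataset name:

# Raise the dirty data ceiling to 4GiB until reboot (only if you have RAM to spare)
echo "4294967296" > /sys/module/zfs/parameters/zfs_dirty_data_max
# Let transaction groups accumulate for 10 seconds instead of the default 5
echo "10" > /sys/module/zfs/parameters/zfs_txg_timeout
# Turn off atime on the torrent dataset
zfs set atime=off your_pool/torrents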

Can’t do NVMe easily in the 5028A-TN4 | Mini-ITX | SuperServers | Products | Super Micro Computer, Inc., so I am probably going to try to get a half-height 2.5GbE card instead.

I thought ZFS on RAID was a no-no?

It is; the mdraid is only for the ESP, where you can’t use ZFS.

Proxmox has a nice thing going on with proxmox-boot-tool, which syncs the ESP partitions automagically so you can boot off either disk in a mirror with no work involved. I believe that if you add a new disk, you’ll have to manually add the EFI partition and point the tool at it. https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot
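A hedged sketch from the docs linked above for when you do swap a disk (the partition is a placeholder; double check the device first):

# Prepare and register the ESP on the replacement disk, then check the result
proxmox-boot-tool format /dev/sdX2
proxmox-boot-tool init /dev/sdX2
proxmox-boot-tool status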

Note that proxmox installed on top of ZFS uses systemd-boot, and will ignore the traditional method of adding kernel/grub boot options.

ZFS is a RAM penalty. Give it as much as you can.

I boot my TrueNAS core from a pair of USB 3 sticks. More SATA ports for storage SSDs.

Using an SSD for cache may make your server slower. When writing to an SSD it starts fast then slows down, and its slow speed may be slower than your array. This is very noticeable when imaging a hard drive to the server.

Synchronous write mode is what makes ZFS really slow at writing; turn that off and it caches writes to RAM, so it’s super fast. More RAM, more speed.
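If you want to try that, it’s a per-dataset property; a hedged example with a hypothetical dataset name (be aware that sync=disabled means a power loss can drop the last few seconds of acknowledged writes):

zfs get sync your_pool/your_dataset
zfs set sync=disabled your_pool/your_dataset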

Theoretically your array should have the speed of two drives. Couple that with RAM caching and it should be plenty responsive, with no need for the SSD. Create a pool out of the SSD and speed test it; that will show you it’s not the be-all-end-all.