Home NAS latency (SOLVED)

Recordsizes

My own NAS workload (using ZFS 0.8.3) is basically:

  • Bulk writes
  • Bulk reads
  • Accessing metadata

For the first two, I highly recommend either 1M or 4M:

zfs set recordsize=1M yourpool/yourdataset
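You can confirm it took with a plain property read (same placeholder pool/dataset names as above):

zfs get recordsize yourpool/yourdataset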

To set it higher than 1M, you have to edit /etc/modprobe.d/zfs.conf, add the following line, and reboot. Note that 4M is fairly commonly used and probably doesn’t have issues, but anything higher isn’t well tested. I’ve seen some people go to 16M, which might be cool, but I’m not touching that for a while, and 4M already seems to be the point of diminishing returns.

options zfs zfs_max_recordsize=4194304
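If you want to double-check that the module actually picked up the new limit after the reboot, the parameter is exposed under sysfs on Linux (this is the standard path for the zfs module; adjust if your distro does something unusual):

cat /sys/module/zfs/parameters/zfs_max_recordsize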

This won’t affect already-written data, but anything new will be stored in larger chunks (recordsize is a maximum; blocks are sized dynamically). This means fewer IO operations every time a whole file is accessed, and less metadata. If you have a workload where you are constantly reading and/or writing small parts of a file, you want that on a separate dataset with a smaller recordsize. Datasets holding databases might use 8K/16K/32K/64K, which is a big difference. VM storage also needs a small recordsize.
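As a rough sketch of what that separate dataset might look like (the dataset name and the 16K value are just illustrative; match the recordsize to your database’s page size):

zfs create -o recordsize=16K yourpool/db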

Here’s what I use for my YouTube archive dataset, for example:

zfs set acltype=posixacl kpool/archive-yt
zfs set compression=lz4 kpool/archive-yt
zfs set xattr=sa kpool/archive-yt
zfs set atime=off kpool/archive-yt
zfs set relatime=off kpool/archive-yt
zfs set recordsize=4M kpool/archive-yt
zfs set aclinherit=passthrough kpool/archive-yt
zfs set dnodesize=auto kpool/archive-yt
zfs set exec=off kpool/archive-yt

Please note that some of these settings permanently throw away any BSD compatibility. Not an issue for me, but something you should be aware of.
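If you’re creating the dataset from scratch rather than tweaking an existing one, the same properties can be passed in one go at creation time. This is just the settings above rolled into a single command (it assumes zfs_max_recordsize has already been raised as described earlier):

zfs create -o acltype=posixacl -o compression=lz4 -o xattr=sa -o atime=off -o relatime=off -o recordsize=4M -o aclinherit=passthrough -o dnodesize=auto -o exec=off kpool/archive-yt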

Special Vdevs for metadata

I also highly recommend looking at the new special vdev. Good information is hard to find right now (maybe I looked wrong, but the man pages were fucking useless for this). This thread introduces it nicely. As it happens, I had 3x 250GB 860 Evos laying around from putting old laptops out of commission, so I just grabbed another one. 500GB of space should be sufficient for most people. If you have billions of files or storage approaching triple-digit TB, spend more time figuring out how much metadata you have. If you run out of space it’s no big deal; new metadata just spills back onto the regular vdevs.

By default, only metadata is allocated to this special vdev, which allows you to do things like look at the metadata without having to touch your HDDs (great for indexing, searching, and opening a folder with tens of thousands of files).
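If you do want to estimate how much metadata you’re dealing with before buying SSDs, one approach I’ve seen suggested is zdb’s block statistics pass. It walks the whole pool, so it can take a long time and hammer the disks; treat the exact flags as a starting point rather than gospel:

zdb -Lbbbs yourpool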

Basically, to add four SSDs in two mirrored pairs, grab the disk IDs with this:

ls -l /dev/disk/by-id

And then fit them into this:

zpool add yourpool special mirror /dev/disk/by-id/wwn-0x1 /dev/disk/by-id/wwn-0x2 mirror /dev/disk/by-id/wwn-0x3 /dev/disk/by-id/wwn-0x4

Note the two uses of mirror: the first two disks form one mirror, the second two form another, and the two mirrors are combined into a single special vdev.
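If you’re nervous about getting the layout wrong, zpool add accepts -n for a dry run, which prints the configuration it would create without actually modifying the pool (same placeholder disk IDs as above):

zpool add -n yourpool special mirror /dev/disk/by-id/wwn-0x1 /dev/disk/by-id/wwn-0x2 mirror /dev/disk/by-id/wwn-0x3 /dev/disk/by-id/wwn-0x4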

Note that this special vdev cannot be removed, and if it dies your pool dies with it, just like any other vdev. Incremental backups continue to be a good idea at all times in all circumstances. SMART monitoring and email notifications are also best practice.
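For the SMART side, a minimal sketch of an /etc/smartd.conf line with smartmontools (the address is a placeholder, and -M test just sends a test mail on startup so you know the alerting path actually works):

DEVICESCAN -a -m admin@example.com -M test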

To show how full each vdev is, use:

zpool list -v yourpool

NFS and sync writes

By the way, my understanding is that by default, NFS requests only sync writes (and apparently iSCSI does not), which means an SLOG is a very good idea, as you’ve already seen. A good way to test this before adding an SLOG is to temporarily set sync=disabled on the dataset, which tells ZFS to lie. If things speed up, an SLOG will help. You could also mount the NFS share with the async option (see the sketch further down).

zfs set sync=disabled yourpool/yourdataset

When done make sure to set it back to:

zfs set sync=standard yourpool/yourdataset
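And if you’d rather test from the NFS side, the async mount option mentioned above looks roughly like this (server name, export path, and mountpoint are all placeholders; depending on your setup, the sync/async option in the server’s /etc/exports may be the one that actually matters):

mount -t nfs -o async nas.local:/yourpool/yourdataset /mnt/yourdataset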
