Home NAS latency (SOLVED)

@kdb424 found it

https://jrs-s.net/2018/08/17/zfs-tuning-cheat-sheet/


I normally do that on my datasets. I can't believe I missed it on that one.

I had that website open, but did not find that page somehow. Much appreciated.
It seems that a SLOG may be of some help after all, as NFS exports are my top usage of this machine, and that link seems to recommend one for that case. As for L2ARC, I'll look into that again in the next major ZFS version, as I believe persistent L2ARC is likely to be included. It's already in mainline OpenZFS.

That is a given, and I don't know how I didn't do that. Thank you for pointing out my stupid errors :slight_smile:


Update. I've tuned the scratch volume and I'm now able to copy over my .cache (massive, and many, many files) with cp at over 300 Mbps (not MB/s), which is already a massive improvement. I've also added the M.2 SLOG on the Optane drive, and while it's technically unsafe, I'll add another down the line in a mirror, as it seemed to help (all of my usage is NFS outside of zfs send/recv). I appreciate all of the help! If there's anything else that can be added, I'll keep tabs on this, but I'd call this "solved".
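For anyone following along, adding the log vdev was a one-liner; the device path below is a placeholder for whatever your drive shows up as in /dev/disk/by-id:

zpool add yourpool log /dev/disk/by-id/nvme-OPTANE_1

When the second module and adapter arrive, attaching it to the existing log device should turn it into a mirror:

zpool attach yourpool /dev/disk/by-id/nvme-OPTANE_1 /dev/disk/by-id/nvme-OPTANE_2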


Glad you got it working :+1:


Recordsizes

My own NAS workload (using ZFS 0.8.3) is basically:

  • Bulk writes
  • Bulk reads
  • Accessing metadata

For the first two, I highly recommend either 1M or 4M

zfs set recordsize=1M yourpool/yourdataset

To set it higher than 1M, you have to edit /etc/modprobe.d/zfs.conf, add the following, and reboot. Note that 4M is fairly commonly used and probably doesn't have issues, but higher is not well tested. I've seen some people go to 16M, which might be cool, but I'm not touching that for a while, and it seems 4M is already the point of diminishing returns.

options zfs zfs_max_recordsize=4194304

This won't affect already-written data, but anything new will be stored in larger chunks (recordsize is a MAX; blocks are sized dynamically). This results in fewer IO operations every time a whole file is accessed, and also less metadata. If you have a workload where you are constantly reading and/or writing parts of a file, then you want that on a separate dataset with a smaller recordsize. Datasets holding databases might be 8K/16K/32K/64K, which is a big difference. VM storage also needs a small recordsize.
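As a rough sketch (dataset names here are made up), you can also set it per dataset at creation time and verify afterwards:

zfs create -o recordsize=16K yourpool/db
zfs create -o recordsize=64K yourpool/vm
zfs get recordsize yourpool/db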

Here's what I use for my YouTube archive dataset, for example:

zfs set acltype=posixacl kpool/archive-yt
zfs set compression=lz4 kpool/archive-yt
zfs set xattr=sa kpool/archive-yt
zfs set atime=off kpool/archive-yt
zfs set relatime=off kpool/archive-yt
zfs set recordsize=4M kpool/archive-yt
zfs set aclinherit=passthrough kpool/archive-yt
zfs set dnodesize=auto kpool/archive-yt
zfs set exec=off kpool/archive-yt

Please note that some of these settings permanently throw away any BSD compatibility. Not an issue for me, but something you should be aware of.

Special Vdevs for metadata

I also highly recommend looking at the new special vdev. Good information is hard to find right now (maybe I looked wrong, but the man pages were fucking useless for this). This thread introduces it nicely. As it happens, I had 3x 250GB 860 EVOs laying around from putting old laptops out of commission, so I just grabbed another one. 500GB of space should be sufficient for most people. If you have billions of files or storage close to triple digits (in TB), spend more time figuring out how much metadata you have. If you run out of space, it's no big deal. By default, only metadata is allocated to this special vdev, which allows you to do things like look at the metadata without having to touch your HDDs (great for indexing, searching, and opening a folder with tens of thousands of files).
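If you'd rather measure than guess, one way I know of is zdb's block statistics, which break pool usage down by block type. Fair warning: it walks the whole pool, so it can take a long time on big pools:

zdb -bbb yourpool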

Basically, to add four SSDs in two mirrored pairs, grab the disk IDs with this:

ls -l /dev/disk/by-id

And then fit them into this:

zpool add yourpool special mirror /dev/disk/by-id/wwn-0x1 /dev/disk/by-id/wwn-0x2 mirror /dev/disk/by-id/wwn-0x3 /dev/disk/by-id/wwn-0x4

Note the two uses of mirror: the first two disks form one mirror, the second two form another, and the pool stripes across both.

Note that this special vdev cannot be removed, and if it dies, your pool dies too, just like any other vdev. Incremental backups continue to be a good idea at all times in all circumstances. SMART monitoring and email notifications are also best practice.

To show how full each vdev is, use:

zpool list -v yourpool

NFS and sync writes

By the way, my understanding is that by default, NFS requests only sync writes (and apparently iSCSI does not), which means an SLOG is a very good idea, as you've already seen. A good way to test this prior to adding an SLOG is temporarily setting sync=disabled on the dataset, which tells ZFS to lie. If it speeds up, then an SLOG will help. You could also mount the NFS share with the async option.

zfs set sync=disabled yourpool/yourdataset

When done, make sure to set it back to:

zfs set sync=standard yourpool/yourdataset
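A quick and dirty way to compare (file path is just an example): run a forced-sync write from the NFS client once with sync=standard and once with sync=disabled. A big gap between the two numbers means an SLOG will pay off.

dd if=/dev/zero of=/mnt/nfs/synctest bs=4K count=10000 oflag=dsync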


I've set my recordsize on most of my "general" pools to 1M at this point, but didn't know I could go higher. May be worth looking into. Some things I have at 16K, such as my Linux ISO directory of course, but that's on a scratch SSD pool.

Do you see any compression on videos out of curiosity? I mean, CPU cycles are wasted with it off, but mostly curious.

I opened that link, and my mind was blown. I'll have to get back to you on that, but that's amazing to know exists. If my glance is correct, this stores metadata and small files on the SSD for the IOPS, but large files go to the spinning rust behind it, or is this all to the SSD, then spillover to rust only? I saw that spillover is there regardless. I may have a set of 256GB SSDs somewhere to try this out on (mirrored, and the pool has proper backups as stated elsewhere), or I could try this on just a scratch dataset of course.

I've since added an Optane M.2 module (only a single one for now, yes I know I need to order another one and an adapter to mirror), and that did in fact make things MUCH more responsive, though I haven't checked the real numbers. I am more comfortable with sync=standard on a non-mirrored SLOG than disabled completely for the time being.

It's 1am, so I have to give that a re-read in the morning, but that's amazing information on this. Thank you so much for the book of a reply!

Not on the videos. lz4 will abort on those quickly enough that it's not an issue. What is compressed is all the side files (text descriptions, annotations, translations, etc.).

For the special vdevs (make sure your ZFS is a recent 0.8.3 and the pool has been upgraded), the default is that only metadata is sent to the SSDs, unless they are full, and then the metadata goes to the HDDs as normal. You can furthermore set each dataset individually to also send files up to a certain size to the SSDs (until full, then to HDDs as normal). I'm not exactly familiar with setting that, as I don't want to spend the time figuring out if I want to use it or not (vs just having a dedicated SSD pool, which I don't need right now anyways).
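From what I've read (not tested myself), the knob for that is the special_small_blocks property, set per dataset:

zfs set special_small_blocks=32K yourpool/yourdataset
zfs get special_small_blocks yourpool/yourdataset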

It's 1am

Dammit, not again.


I've looked into this and I may play with it on a scratch pool, as that's where my hot data should sit anyway, and the security of that pool won't matter as it's all scratch. I tend to tier my storage, with the vast majority of it being large media (by size, not file count), backups (zfs send/recv), and very minimal space used by services like web servers etc. I've been looking at downsizing my desktop and therefore internal storage, which is the main reason I was looking into the performance side, and for my really hot data, it would be cheap enough to even throw a completely unsafe pool at it, as it would get backed up on a cron job to a safe pool.

Think OS on local safe pool, but remote home dir on a hybrid SSD/HDD pool. I could skip the mirrors as long as I don't mind the downtime if something fails while safely testing this newer feature.
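Something like this on a cron job would probably cover it (names are hypothetical, and the very first run needs a full send rather than an incremental):

zfs snapshot scratch/home@backup-new
zfs send -i scratch/home@backup-old scratch/home@backup-new | zfs recv -F safe/home-backup

Then rotate the snapshot names afterwards so the next run has a common base.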