@kdb424 found it
I normally do that on my datasets. I can't believe I missed it on that one.
I had that website open, but did not find that page somehow. Much appreciated.
It seems that a SLOG may be of some help after all, as NFS exports are my top usage of this machine and that link seems to recommend one for that case. As for L2ARC, I'll look into that one again in the next major ZFS version, as I believe persistent L2ARC is likely to be included. It's already in mainline OpenZFS.
That is a given, and I don't know how I didn't do that. Thank you for pointing out my stupid errors.
Update: I've tuned the scratch volume and I'm now able to copy over my .cache (massive, and many, many files) with cp at over 300 Mbps (not MB/s), which is already a massive improvement. I've also added the M.2 SLOG on the Optane drive, and while a single unmirrored SLOG is technically unsafe, I'll add another down the line as a mirror, as it seemed to help (all of my usage is NFS outside of zfs send/recv). I appreciate all of the help! If there's anything else that can be added, I'll keep tabs on this, but I'd call this "solved".
Glad you got it working
Recordsizes
My own NAS workload (using ZFS 0.8.3) is basically:
- Bulk writes
- Bulk reads
- Accessing metadata
For the first two, I highly recommend a recordsize of either 1M or 4M:
zfs set recordsize=1M yourpool/yourdataset
To set it higher than 1M, you have to edit /etc/modprobe.d/zfs.conf, add the following line, and reboot. Note that 4M is fairly commonly used and probably doesn't have issues, but higher values are not well tested. I've seen some people go to 16M, which might be cool, but I'm not touching that for a while, and it seems 4M is already the point of diminishing returns.
options zfs zfs_max_recordsize=4194304
This won't affect already written data, but anything new will be stored in larger chunks (recordsize is a MAX; blocks are sized dynamically). This results in fewer I/O operations every time a whole file is accessed, and also less metadata. If you have a workload where you are constantly reading and/or writing parts of a file, then you want that on a separate dataset with a smaller recordsize. Datasets holding databases might be 8K/16K/32K/64K, which is a big difference. VM storage also needs a small recordsize.
Here's what I use for my YouTube archive dataset, for example:
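To put rough numbers on the "fewer I/O operations" point, here is a minimal sketch that counts how many records a file would span at a given recordsize. `records_for` is a made-up helper name, the 1 GiB file is just an example, and the arithmetic ignores compression and the fact that files smaller than the recordsize get a single smaller block.

```shell
# records_for FILE_BYTES RECORDSIZE_BYTES
# Prints how many records a file of that size spans (ceiling division).
# Hypothetical helper, pure arithmetic, does not touch any pool.
records_for() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

records_for 1073741824 131072    # 1 GiB file at the 128K default
records_for 1073741824 1048576   # same file at recordsize=1M
records_for 1073741824 4194304   # same file at recordsize=4M
```

Going from 128K to 1M cuts the record count by 8x, and 4M by another 4x, which lines up with the diminishing returns mentioned above.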
zfs set acltype=posixacl kpool/archive-yt
zfs set compression=lz4 kpool/archive-yt
zfs set xattr=sa kpool/archive-yt
zfs set atime=off kpool/archive-yt
zfs set relatime=off kpool/archive-yt
zfs set recordsize=4M kpool/archive-yt
zfs set aclinherit=passthrough kpool/archive-yt
zfs set dnodesize=auto kpool/archive-yt
zfs set exec=off kpool/archive-yt
Please note that some of these settings permanently throw away any BSD compatibility. Not an issue for me, but something you should be aware of.
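If you want to apply a list like that to another dataset without retyping it, something along these lines might work. This is only a sketch: `apply_props` and the `APPLY` variable are names I made up, and by default it only prints the commands so you can review them first.

```shell
# apply_props DATASET prop=value [prop=value ...]
# Hypothetical helper: prints the zfs set commands by default (dry run).
# Set APPLY=1 in the environment to actually run them.
apply_props() (
    dataset=$1
    shift
    for prop in "$@"; do
        if [ "${APPLY:-0}" = "1" ]; then
            zfs set "$prop" "$dataset" || exit 1
        else
            echo "zfs set $prop $dataset"
        fi
    done
)
```

For example, `apply_props kpool/archive-yt atime=off compression=lz4` prints the two commands, and `APPLY=1 apply_props ...` runs them. Afterwards, `zfs get atime,compression,recordsize kpool/archive-yt` shows what actually took effect.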
Special Vdevs for metadata
I also highly recommend looking at the new special vdev. Good information is hard to find right now (maybe I looked wrong, but the man pages were fucking useless for this). This thread introduces it nicely. As it happens, I had 3x 250GB 860 EVOs lying around from putting old laptops out of commission, so I just grabbed another one. 500GB of space should be sufficient for most people. If you have billions of files or storage close to the triple digits, spend more time figuring out how much metadata you have. If you run out of space, it's no big deal: new metadata just goes to the HDDs as before. By default, only metadata is allocated to this special vdev, which allows you to do things like look at the metadata without having to touch your HDDs (great for indexing, searching, and opening a folder with tens of thousands of files).
Basically, to add 4 SSDs in 2 mirrored pairs, grab the disk IDs with this:
ls -l /dev/disk/by-id
And then fit them into this:
zpool add yourpool special mirror /dev/disk/by-id/wwn-0x1 /dev/disk/by-id/wwn-0x2 mirror /dev/disk/by-id/wwn-0x3 /dev/disk/by-id/wwn-0x4
Note the two uses of mirror, which mirrors the first two disks, and then mirrors the second two disks and combines them.
Note that this special vdev cannot be removed, and if it dies your pool also dies, just like any other vdev. Incremental backups continue to be a good idea at all times in all circumstances. SMART monitoring and email notifications are also best practice.
To show how full each vdev is, use:
zpool list -v yourpool
NFS and sync writes
By the way, my understanding is that by default, NFS requests only sync writes (and apparently iSCSI does not), which means an SLOG is a very good idea, as you've already seen. A good way to test this prior to adding an SLOG is to temporarily set sync=disabled on the dataset, which tells ZFS to lie about writes being safely on disk. If it speeds up, then an SLOG will help. You could also mount the NFS share with the async option.
zfs set sync=disabled yourpool/yourdataset
When done make sure to set it back to:
zfs set sync=standard yourpool/yourdataset
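To get an actual number out of that before/after comparison, a rough sketch like the following could be run against the dataset's mountpoint or the NFS mount. `sync_write_test` is a made-up helper name, and it assumes GNU dd (for `oflag=sync`) and a `date` that supports `+%s`. Timing is whole seconds only, so use a count large enough to take a few seconds.

```shell
# sync_write_test DIR [COUNT]
# Hypothetical helper: writes COUNT 4K blocks with O_SYNC into DIR and
# prints the elapsed whole seconds. Assumes GNU dd for oflag=sync.
sync_write_test() (
    dir=$1
    count=${2:-1000}
    file="$dir/.sync_write_test.$$"
    start=$(date +%s)
    # oflag=sync opens the file O_SYNC, so every 4K write is a sync write.
    dd if=/dev/zero of="$file" bs=4k count="$count" oflag=sync 2>/dev/null
    status=$?
    end=$(date +%s)
    rm -f "$file"
    [ "$status" -eq 0 ] || exit 1
    echo "$count sync writes of 4K took $((end - start))s"
)
```

Run it once with sync=standard and once with sync=disabled, e.g. `sync_write_test /mnt/yourdataset 5000` (the path is a placeholder). A big gap between the two runs suggests the SLOG is worth it.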
I've set my recordsize on most of my "general" pools to 1M at this point, but didn't know I could go higher. May be worth looking into. Some things I have at 16k, such as my Linux ISO directory of course, but that's on a scratch SSD pool.
Do you see any compression on videos out of curiosity? I mean, CPU cycles are wasted with it off, but mostly curious.
I opened that link, and my mind was blown. I'll have to get back to you on that, but it's amazing to know that exists. If my glance is correct, this stores metadata and small files on the SSD for the IOPS, while large files go to the spinning rust behind it. Or does everything go to the SSD first, and only spill over to rust? I saw that spillover is there regardless. I may have a set of 256GB SSDs somewhere to try this out on (mirrored, and the pool has proper backups as stated elsewhere), or could try this on just a scratch dataset of course.
I've since added an Optane M.2 module (only a single one for now; yes, I know I need to order another one and an adapter to mirror it), and that did in fact make things MUCH more responsive, though I haven't checked the real numbers. I am more comfortable with sync=standard on a non-mirrored SLOG than with sync disabled completely for the time being.
It's 1am, so I have to give that a re-read in the morning, but that's amazing information on this. Thank you so much for the book of a reply!
Not on the videos. lz4 will abort on those quickly enough that it's not an issue. What is compressed is all the side files (text descriptions, annotations, translations, etc.).
For the special vdevs (make sure your ZFS is recent, e.g. 0.8.3, and the pool has been upgraded), the default is that only metadata is sent to the SSDs, unless they are full, in which case the metadata goes to the HDDs as normal. You can furthermore set each dataset individually to also send files up to a certain size to the SSDs (until full, then to the HDDs as normal). I'm not exactly familiar with setting that, as I don't want to spend the time on figuring out if I want to use it or not (vs. just having a dedicated SSD pool, which I don't need right now anyways).
It's 1am
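For reference, the per-dataset knob for that is, as far as I can tell, the `special_small_blocks` property. The sketch below is untested on my side: the 32K threshold and the dataset name are placeholders, `block_target` is a made-up helper that only mirrors the rule as I read it in the docs (blocks at or below the threshold go to the special vdev, and 0 turns it off), and the allowed maximum depends on your ZFS version.

```shell
# Commands to run on a real pool (yourpool/yourdataset is a placeholder):
#   zfs set special_small_blocks=32K yourpool/yourdataset
#   zfs get special_small_blocks yourpool/yourdataset

# block_target BLOCK_BYTES THRESHOLD_BYTES
# Hypothetical helper mirroring the documented rule: a data block goes to
# the special vdev when the threshold is non-zero and the block is no
# larger than it. It ignores the "special vdev is full" spillover case.
block_target() {
    if [ "$2" -gt 0 ] && [ "$1" -le "$2" ]; then
        echo special
    else
        echo normal
    fi
}

block_target 16384 32768     # small side file, 32K threshold
block_target 4194304 32768   # full 4M record of a video
block_target 16384 0         # default threshold of 0 (metadata only)
```

One thing to watch: if the threshold is set as high as the dataset's recordsize, every data block qualifies, so all of that dataset's data would land on the SSDs.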
Dammit, not again.
I've looked into this and I may play with it on a scratch pool, as that's where my hot data should sit anyway, and the security of that pool won't matter as it's all scratch. I tend to tier my storage, with the vast majority of it being large media (by size, not file count) and backups (zfs send/recv), and very little used by services like web servers etc. I've been looking at downsizing my desktop and therefore its internal storage, which is the main reason I was looking into the performance side, and for my really hot data, it would be cheap enough to even throw a completely unsafe pool at it, as it would get backed up by a cron job to a safe pool.
Think OS on a local safe pool, but a remote home dir on a hybrid SSD/HDD pool. I could skip the mirrors as long as I don't mind the downtime if something fails while safely testing this newer feature.