For fun I have set up a small home NAS/SAN and was looking for suggestions on how to reduce latency. General specs:
z170 Asrock Motherboard
16GB Ram 2133
3 10TB WD Red Raidz
3 3TB (CMR) WD Red Raidz (cold storage archive, removed when not backing up)
Gigabit onboard ethernet (Realtek I believe)
Gentoo Linux (I’m aware there are better distros, this is what I use everywhere)
I am using ZFS with NFS exports to share general media to my machines, plus a directory to back up phones etc. to. I'm fully saturating the gigabit link on large files/binary streams, but for large directories of small files I'm hitting latency issues that are pretty bad. From nfsiostat I get this result on my largest pool when transferring a ton of small photos.
Other things I've considered: going with 10G SFP+ and DAC cables, as I could use the bandwidth anyway and I know that's lower latency. Please let me know if I missed anything in this brain dump, or if you want further information on this network/these machines.
EDIT: I somehow missed that the system was scrubbing. I’ll edit with more accurate results on that pool.
EDIT2: Quick notes. I have 3 copies of my data: online, offline, and offsite. I couldn't care less about my live data as I have proper backups. I'm not going to ignore all risks for speed, since getting my data back is inconvenient, but it is securely backed up, so some risks are acceptable in a home setting with proper backups.
EDIT3: This has been solved thanks to everyone offering ZFS tuning and adding an optane slog.
Can you get tuned-adm on gentoo or is that just a rhel thing?
ZFS isn’t too latency friendly in general. You could add a zippy slog to help with write iops and latency. I don’t remember off the top of my head, but there are tunables to disable forced cache flushing or extend the time before it flushes. I believe the default is 5 seconds.
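If memory serves, the ones I mean are OpenZFS module parameters; a rough sketch in /etc/modprobe.d/zfs.conf would look something like this (values are just illustrative, and nocacheflush trades safety for latency):
# /etc/modprobe.d/zfs.conf -- illustrative values only
# zfs_txg_timeout: extend the transaction group flush interval from the 5 second default
# zfs_nocacheflush: skip cache flush commands to the disks (only sane with power-loss-protected devices)
options zfs zfs_txg_timeout=10 zfs_nocacheflush=1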
While I would like to agree with this route, when building my array of disks I could only afford 3 disks, hence the raidz. I couldn't afford to give up the space by going with 2, nor could I afford the 4th disk unfortunately. 95% of my storage needs are media/backups, so I'm playing with the performance aspect to see what I CAN do vs actually needing it. My main "server" that I run docker containers on is a lowly chromebox with 6GB ram and a celeron, so my true needs are quite low, and that probably puts the general cost into perspective.
It appears I may be able to, though it's designed for systemd, so I'd have to adjust it as I run OpenRC. Yeah, I'm gonna be THAT guy.
I actually purchased a small Optane drive (the M80 I believe, the M.2 kind) and it's physically in the machine, but it's not put to use yet as I'm not sure whether it writes direct to NAND or buffers writes in a volatile cache first. If it's a buffering drive, they aren't battery backed, so that's a bad idea.
A waste (IMO). A mirror with 1 hot spare would have 2-fold better IOPS and faster resilvering when a drive dies, which would be worth more in both the short and long term.
Okay but how full is it? The general rule of thumb is to have no more than 80% utilized space. If you are at 95% then ZFS is freaking the hell out because it was not designed to really go that full gracefully.
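Quick way to check how full the pool actually is (zpool reports raw occupancy, which is what matters for this rule of thumb):
zpool list -o name,size,allocated,free,capacity,fragmentation tank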
Something I’ve done on my NAS which is running Linux and btrfs for the filesystem, is to adjust the VFS cache and try to keep the file / inode information in RAM.
I’ve got these two lines in /etc/sysctl.d/91-swappiness.conf
vm.swappiness=10
vm.vfs_cache_pressure=25
And I periodically run a du /home >/dev/null to load the cache. I think it improves small file access by a lot, but your results may vary.
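Something like this in root's crontab does the periodic warm-up (schedule and path are just examples):
# warm the dentry/inode cache every 6 hours
0 */6 * * *  du /home >/dev/null 2>&1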
I’m also not sure how it will work with ZFS, because doesn’t that have its own ARC thing for caching file data instead of the Linux VFS cache?
Yes. ARC is an in-RAM cache of recently and frequently used data. L2ARC on an SSD is also a waste if resources are tight; it doesn't do much for a home NAS, and the RAM needed to index the L2ARC could be used for the regular ARC instead.
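If you want to sanity-check what the ARC is doing, the kstats are right there in /proc, no extra tools needed; a rough look at size and hit/miss counters:
awk '$1 ~ /^(size|c_max|hits|misses)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats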
10TB 3 disk mirror + hot spare = 10TB
Same disks raidZ = 20TB
Sorry, storage NEEDS. It’s a 20TB pool and I have 13TB free
I have set swappiness to 0 to prevent spillage into swap unless under extreme pressure, which almost never happens. I could try to force files into cache, though my access is sporadic enough that I can't really predict what I need in ARC, so I'm at the mercy of the physical disks.
This is also why I bought my Optane drive and ended up not using it. L2ARC is likely a waste of my RAM.
Low swappiness is a good idea, but yeah I’m not sure if any vfs values will affect zfs. There are zfs tunables that do roughly equivalent things though.
For general Linux tuning, I'd try to dig into what tuned-adm does under the hood with the latency-performance profile and replicate that on Gentoo (or install tuned-adm if possible).
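From memory, the bulk of that profile is a performance CPU governor plus a few dirty-page/swap sysctls; something in this spirit should get close on Gentoo (values approximate, worth checking against the profile's actual tuned.conf before copying):
# pin the CPU frequency governor to performance
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# /etc/sysctl.d/90-latency.conf (approximate values recalled from the latency-performance profile)
vm.swappiness=10
vm.dirty_ratio=10
vm.dirty_background_ratio=3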
Without a spare drive you are sunk anyway. The capacity means nothing if you lose it all when a drive dies and the array cannot resilver.
A 10 TiB drive will take about 2-3 days to resilver, and 4 years from now, when your drives are worn out (unless you have cycled them regularly), you are asking for trouble.
If you are using 7 TiB then yeah, you would need more drives in your pool, as the 50% storage hit does suck, but that's the only real downside to mirrors.
That's good! Optane is awesome. Only one, though? If that dies, your array could be trashed too, which is why you're supposed to mirror those twice or thrice over.
Yeah. He could still look into running a periodic job to pull in the file info though. Use cron or systemd or whatever. Once all of that is cached it doesn’t need to be read from disk again, and file metadata is the slowest part of small file access.
If you don’t mind taking on some risk, you could use the optane as slog and disable the cache flushing. Just be sure to have smartmontools configured with email alerts so you’ll get notified of any pre-fail symptoms.
Worst case there is you lose recent writes if the optane dies.
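Roughly what that looks like (the nvme device name is a placeholder, and the smartd line assumes outgoing mail is already working on the box):
# add the optane as a log device (slog)
zpool add tank log /dev/nvme0n1
# /etc/smartd.conf -- monitor all drives, mail on failures and pre-fail attributes
DEVICESCAN -a -m you@example.com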
No single pool is critical to me at all. I added the raidz for uptime, or I would have just gone with a stripe. I keep offsite backups and online backups on separate pools, so I have 3 copies. If I lose the data in the middle of backups, life happens. That is also where a lot of my budget goes, and why each pool has relatively little storage, as they are not in the same physical box or even location. Also note that I live alone and am the sole user of this data, so no one else's access (or my monetary loss) is a concern.
With the known risks to the live data explained above as not a real issue, I was intending to try it out as a single-drive ZIL. If the performance improves, and writes go direct to NAND (since these aren't battery backed), I would get a second one as a mirror, as you are correct that that is best practice.
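If the single-device test works out, mirroring it later shouldn't need a rebuild; attaching a second device to the existing log vdev should be enough (both device names are placeholders):
zpool attach tank /dev/nvme0n1 /dev/nvme1n1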
You might not see much improvement out of the box with slog. In my experience, it usually needs some tuning.
You also might see some latency improvement with compression off. LZ4 is cheap, but it's not free. There are cases where it speeds up I/O for highly compressible data, but I think those are really specific use cases (logs maybe).
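For what it's worth, compression is a per-dataset property, so it's cheap to flip and compare (dataset name taken from the output below):
zfs set compression=lz4 tank/scratch    # or compression=off to rule it out
zfs get compression,compressratio tank/scratch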
tank type filesystem -
tank creation Fri Dec 27 18:15 2019 -
tank used 4.27T -
tank available 13.3T -
tank referenced 234K -
tank compressratio 1.05x -
tank mounted yes -
tank quota none default
tank reservation none default
tank recordsize 128K default
tank mountpoint /mnt/data local
tank sharenfs off default
tank checksum on default
tank compression off default
tank atime on default
tank devices on default
tank exec on default
tank setuid on default
tank readonly off default
tank zoned off default
tank snapdir hidden default
tank aclinherit restricted default
tank createtxg 1 -
tank canmount on default
tank xattr on default
tank copies 1 default
tank version 5 -
tank utf8only off -
tank normalization none -
tank casesensitivity sensitive -
tank vscan off default
tank nbmand off default
tank sharesmb off default
tank refquota none default
tank refreservation none default
tank guid 16067106483522758241 -
tank primarycache all default
tank secondarycache all default
tank usedbysnapshots 0B -
tank usedbydataset 234K -
tank usedbychildren 4.27T -
tank usedbyrefreservation 0B -
tank logbias latency default
tank objsetid 54 -
tank dedup off default
tank mlslabel none default
tank sync standard default
tank dnodesize legacy default
tank refcompressratio 1.00x -
tank written 234K -
tank logicalused 4.49T -
tank logicalreferenced 78.5K -
tank volmode default default
tank filesystem_limit none default
tank snapshot_limit none default
tank filesystem_count none default
tank snapshot_count none default
tank snapdev hidden default
tank acltype off default
tank context none default
tank fscontext none default
tank defcontext none default
tank rootcontext none default
tank relatime off default
tank redundant_metadata all default
tank overlay off default
tank encryption off default
tank keylocation none default
tank keyformat none default
tank pbkdf2iters 0 default
tank special_small_blocks 0 default
Dataset that actually holds the data I'm trying to improve:
tank/scratch type filesystem -
tank/scratch creation Wed Apr 29 6:27 2020 -
tank/scratch used 113G -
tank/scratch available 13.3T -
tank/scratch referenced 113G -
tank/scratch compressratio 1.00x -
tank/scratch mounted yes -
tank/scratch quota none default
tank/scratch reservation none default
tank/scratch recordsize 128K default
tank/scratch mountpoint /mnt/data/scratch inherited from tank
tank/scratch sharenfs on local
tank/scratch checksum on default
tank/scratch compression off default
tank/scratch atime on default
tank/scratch devices on default
tank/scratch exec on default
tank/scratch setuid on default
tank/scratch readonly off default
tank/scratch zoned off default
tank/scratch snapdir hidden default
tank/scratch aclinherit restricted default
tank/scratch createtxg 1607865 -
tank/scratch canmount on default
tank/scratch xattr on default
tank/scratch copies 1 default
tank/scratch version 5 -
tank/scratch utf8only off -
tank/scratch normalization none -
tank/scratch casesensitivity sensitive -
tank/scratch vscan off default
tank/scratch nbmand off default
tank/scratch sharesmb off default
tank/scratch refquota none default
tank/scratch refreservation none default
tank/scratch guid 513176425129963498 -
tank/scratch primarycache all default
tank/scratch secondarycache all default
tank/scratch usedbysnapshots 0B -
tank/scratch usedbydataset 113G -
tank/scratch usedbychildren 0B -
tank/scratch usedbyrefreservation 0B -
tank/scratch logbias latency default
tank/scratch objsetid 840 -
tank/scratch dedup off default
tank/scratch mlslabel none default
tank/scratch sync standard default
tank/scratch dnodesize legacy default
tank/scratch refcompressratio 1.00x -
tank/scratch written 113G -
tank/scratch logicalused 112G -
tank/scratch logicalreferenced 112G -
tank/scratch volmode default default
tank/scratch filesystem_limit none default
tank/scratch snapshot_limit none default
tank/scratch filesystem_count none default
tank/scratch snapshot_count none default
tank/scratch snapdev hidden default
tank/scratch acltype off default
tank/scratch context none default
tank/scratch fscontext none default
tank/scratch defcontext none default
tank/scratch rootcontext none default
tank/scratch relatime off default
tank/scratch redundant_metadata all default
tank/scratch overlay off default
tank/scratch encryption off default
tank/scratch keylocation none default
tank/scratch keyformat none default
tank/scratch pbkdf2iters 0 default
tank/scratch special_small_blocks 0 default
I’ve also edited the first post with correct numbers while I am not scrubbing.