Home NAS latency (SOLVED)

10TB 3 disk mirror + hot spare = 10TB
Same disks raidZ = 20TB

Sorry, storage NEEDS. It’s a 20TB pool and I have 13TB free

I have set swappiness to 0 to prevent spilling into swap unless under extreme memory pressure, which almost never happens. I could try to force files into cache, but my access pattern is sporadic enough that I can’t really predict what I need in ARC, so I’m at the mercy of the physical disks.
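For reference, that’s just this (assuming the standard sysctl interface and drop-in path):

# runtime change
sysctl -w vm.swappiness=0

# persist it across reboots
echo "vm.swappiness = 0" > /etc/sysctl.d/99-swappiness.conf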

This is also why I bought my Optane drive and ended up not using it. L2ARC is likely a waste of my RAM.

Low swappiness is a good idea, but yeah, I’m not sure any vfs sysctls will affect ZFS. There are ZFS module tunables that do roughly equivalent things, though.
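For example, a sketch of the module-parameter route (the 16 GiB ARC cap is an arbitrary example value; pick yours based on RAM):

# /etc/modprobe.d/zfs.conf -- cap the ARC at 16 GiB (example value)
options zfs zfs_arc_max=17179869184

# or adjust at runtime through the module's parameter files
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max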

For general Linux tuning, I’d try to dig into what tuned-adm does under the hood with the latency-performance profile and replicate that on Gentoo (or install tuned itself if possible).
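If tuned does install cleanly, that’s roughly:

tuned-adm list                          # see available profiles
tuned-adm profile latency-performance   # apply the low-latency profile
tuned-adm active                        # confirm which profile is active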

Without a spare drive you are sunk anyway. The capacity means nothing if you lose it all when a drive dies and the array cannot resilver.

A 10 TiB drive will take about 2-3 days to resilver, and four years from now, when your drives are worn out (even if you have cycled them regularly), you are asking for trouble.

If you are using 7 TiB then yeah, you would need more drives in your pool. The 50% storage hit does suck, but that’s the only real downside to mirrors.

That’s good! Optane is awesome. Only one, though? If that dies, your array will also be trashed, which is why you’re supposed to mirror those twice or thrice over.

Yeah. He could still look into running a periodic job to pull in the file metadata, though. Use cron, a systemd timer, whatever. Once all of that is cached it doesn’t need to be read from disk again, and metadata is the slowest part of small-file access.
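A rough sketch of that idea as a cron.d entry (path and schedule are placeholders; adjust to taste):

# /etc/cron.d/warm-zfs-metadata -- hypothetical example
# walk and stat everything nightly so the metadata ends up cached in ARC
0 3 * * * root find /mnt/yourpool -exec stat {} + > /dev/null 2>&1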

If you don’t mind taking on some risk, you could use the Optane as a SLOG and disable cache flushing. Just be sure to have smartmontools configured with email alerts so you’ll get notified of any pre-fail symptoms.

Worst case there is that you lose recent writes if the Optane dies.
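For the smartd piece, a minimal sketch (the address is a placeholder; -M test sends a test mail at startup so you know delivery works):

# /etc/smartd.conf -- minimal example
# -a: monitor everything, -m: mail on failure, -s: short self-test daily at 02:00
DEVICESCAN -a -m you@example.com -M test -s (S/../.././02)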

2 Likes

No single pool is critical to me at all. I added the raidz for uptime, or I would have just gone with a stripe. I keep offsite backups, and online backups on separate pools, so I have three copies. If I lose the data in the middle of a backup, life happens. That is also where a lot of my budget goes, and why each pool has relatively little storage: they are not in the same physical box, or even the same location. Also note that I live alone and am the sole user of this data, so nobody else’s access is affected and any monetary loss is mine alone.

With the known risks to the live data explained as not a real issue, I was intending to try it out as a SLOG (ZIL device) on a single drive. If the performance improved, and since Optane writes go straight to the non-volatile media (they aren’t battery-backed caches), I would get a second one as a mirror; you are correct that that is best practice.

1 Like

I agree. Gotta be prepared for failure when it happens.

For example, we had a hypervisor die the other day, and had we not set up proper replication there would have been a service impact to our customers.

You might not see much improvement out of the box with a SLOG. In my experience, it usually needs some tuning.

You also might see some latency improvement with compression off. LZ4 is cheap, but it’s not free. There are cases where it speeds up I/O for highly compressible data, but I think those are pretty specific use cases (logs, maybe).

Actually, you get better I/O with LZ4 compression turned on, even for incompressible data, IIRC. That JRS guy is a genius.

@kdb424 could you give us some of your pool’s details?

zfs get all pool_name

tank  type                  filesystem             -
tank  creation              Fri Dec 27 18:15 2019  -
tank  used                  4.27T                  -
tank  available             13.3T                  -
tank  referenced            234K                   -
tank  compressratio         1.05x                  -
tank  mounted               yes                    -
tank  quota                 none                   default
tank  reservation           none                   default
tank  recordsize            128K                   default
tank  mountpoint            /mnt/data              local
tank  sharenfs              off                    default
tank  checksum              on                     default
tank  compression           off                    default
tank  atime                 on                     default
tank  devices               on                     default
tank  exec                  on                     default
tank  setuid                on                     default
tank  readonly              off                    default
tank  zoned                 off                    default
tank  snapdir               hidden                 default
tank  aclinherit            restricted             default
tank  createtxg             1                      -
tank  canmount              on                     default
tank  xattr                 on                     default
tank  copies                1                      default
tank  version               5                      -
tank  utf8only              off                    -
tank  normalization         none                   -
tank  casesensitivity       sensitive              -
tank  vscan                 off                    default
tank  nbmand                off                    default
tank  sharesmb              off                    default
tank  refquota              none                   default
tank  refreservation        none                   default
tank  guid                  16067106483522758241   -
tank  primarycache          all                    default
tank  secondarycache        all                    default
tank  usedbysnapshots       0B                     -
tank  usedbydataset         234K                   -
tank  usedbychildren        4.27T                  -
tank  usedbyrefreservation  0B                     -
tank  logbias               latency                default
tank  objsetid              54                     -
tank  dedup                 off                    default
tank  mlslabel              none                   default
tank  sync                  standard               default
tank  dnodesize             legacy                 default
tank  refcompressratio      1.00x                  -
tank  written               234K                   -
tank  logicalused           4.49T                  -
tank  logicalreferenced     78.5K                  -
tank  volmode               default                default
tank  filesystem_limit      none                   default
tank  snapshot_limit        none                   default
tank  filesystem_count      none                   default
tank  snapshot_count        none                   default
tank  snapdev               hidden                 default
tank  acltype               off                    default
tank  context               none                   default
tank  fscontext             none                   default
tank  defcontext            none                   default
tank  rootcontext           none                   default
tank  relatime              off                    default
tank  redundant_metadata    all                    default
tank  overlay               off                    default
tank  encryption            off                    default
tank  keylocation           none                   default
tank  keyformat             none                   default
tank  pbkdf2iters           0                      default
tank  special_small_blocks  0                      default

The dataset that actually holds the data I’m trying to improve:

tank/scratch  type                  filesystem             -
tank/scratch  creation              Wed Apr 29  6:27 2020  -
tank/scratch  used                  113G                   -
tank/scratch  available             13.3T                  -
tank/scratch  referenced            113G                   -
tank/scratch  compressratio         1.00x                  -
tank/scratch  mounted               yes                    -
tank/scratch  quota                 none                   default
tank/scratch  reservation           none                   default
tank/scratch  recordsize            128K                   default
tank/scratch  mountpoint            /mnt/data/scratch      inherited from tank
tank/scratch  sharenfs              on                     local
tank/scratch  checksum              on                     default
tank/scratch  compression           off                    default
tank/scratch  atime                 on                     default
tank/scratch  devices               on                     default
tank/scratch  exec                  on                     default
tank/scratch  setuid                on                     default
tank/scratch  readonly              off                    default
tank/scratch  zoned                 off                    default
tank/scratch  snapdir               hidden                 default
tank/scratch  aclinherit            restricted             default
tank/scratch  createtxg             1607865                -
tank/scratch  canmount              on                     default
tank/scratch  xattr                 on                     default
tank/scratch  copies                1                      default
tank/scratch  version               5                      -
tank/scratch  utf8only              off                    -
tank/scratch  normalization         none                   -
tank/scratch  casesensitivity       sensitive              -
tank/scratch  vscan                 off                    default
tank/scratch  nbmand                off                    default
tank/scratch  sharesmb              off                    default
tank/scratch  refquota              none                   default
tank/scratch  refreservation        none                   default
tank/scratch  guid                  513176425129963498     -
tank/scratch  primarycache          all                    default
tank/scratch  secondarycache        all                    default
tank/scratch  usedbysnapshots       0B                     -
tank/scratch  usedbydataset         113G                   -
tank/scratch  usedbychildren        0B                     -
tank/scratch  usedbyrefreservation  0B                     -
tank/scratch  logbias               latency                default
tank/scratch  objsetid              840                    -
tank/scratch  dedup                 off                    default
tank/scratch  mlslabel              none                   default
tank/scratch  sync                  standard               default
tank/scratch  dnodesize             legacy                 default
tank/scratch  refcompressratio      1.00x                  -
tank/scratch  written               113G                   -
tank/scratch  logicalused           112G                   -
tank/scratch  logicalreferenced     112G                   -
tank/scratch  volmode               default                default
tank/scratch  filesystem_limit      none                   default
tank/scratch  snapshot_limit        none                   default
tank/scratch  filesystem_count      none                   default
tank/scratch  snapshot_count        none                   default
tank/scratch  snapdev               hidden                 default
tank/scratch  acltype               off                    default
tank/scratch  context               none                   default
tank/scratch  fscontext             none                   default
tank/scratch  defcontext            none                   default
tank/scratch  rootcontext           none                   default
tank/scratch  relatime              off                    default
tank/scratch  redundant_metadata    all                    default
tank/scratch  overlay               off                    default
tank/scratch  encryption            off                    default
tank/scratch  keylocation           none                   default
tank/scratch  keyformat             none                   default
tank/scratch  pbkdf2iters           0                      default
tank/scratch  special_small_blocks  0                      default

I’ve also edited the first post with correct numbers while I am not scrubbing.

You should enable lz4 compression.

But you will need to rewrite all of your current data, as ZFS does not do this for you. The easiest way is to just pipe the dataset with send/receive.
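Roughly like this (snapshot and dataset names are placeholders; verify the copy before destroying anything):

# turn on lz4 at the pool root so datasets inherit it going forward
zfs set compression=lz4 tank

# rewrite the existing data by piping it into a fresh dataset
zfs snapshot tank/scratch@migrate
zfs send tank/scratch@migrate | zfs recv tank/scratch-new

# once you have verified the copy, swap it into place
zfs destroy -r tank/scratch
zfs rename tank/scratch-new tank/scratch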

There are also various other bits, like disabling atime and setting the ashift value.

There is a tuning article by the JRS guy as well.

Turn atime off

1 Like

@kdb424 found it

https://jrs-s.net/2018/08/17/zfs-tuning-cheat-sheet/

2 Likes

I normally do that on my datasets. I can’t believe I missed it on that one.

I had that website open, but somehow did not find that page. Much appreciated.
It seems that a SLOG may be of some help after all, as NFS exports are the top usage of this machine, and that link seems to recommend one for exactly that case. As for L2ARC, I’ll look into it again in the next major ZFS release, as I believe persistent L2ARC is likely to be included; it’s already in mainline OpenZFS.

That is a given, and I don’t know how I didn’t do that. Thank you for pointing out my stupid errors :slight_smile:

2 Likes

Update: I’ve tuned the scratch volume and I’m now able to copy over my .cache (massive, with many, many files) with cp at over 300 Mbps (not MB/s), which is already a huge improvement. I’ve also added the M.2 Optane as a SLOG, and while a single unmirrored log device is technically unsafe, it seemed to help (all of my usage is NFS outside of zfs send/recv), so I’ll add another one as a mirror down the line. I appreciate all of the help! If there’s anything else that can be added I’ll keep tabs on this, but I’d call this “solved”.
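For anyone following along, adding the log vdev (and later attaching a second device to mirror it) is roughly this; the device IDs are placeholders for the actual Optane paths:

# single log device for now (placeholder ID)
zpool add tank log /dev/disk/by-id/nvme-OPTANE_ONE

# later: attach a second device to turn the lone log into a mirror (placeholder IDs)
zpool attach tank /dev/disk/by-id/nvme-OPTANE_ONE /dev/disk/by-id/nvme-OPTANE_TWO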

2 Likes

Glad you got it working :+1:

1 Like

Recordsizes

My own NAS workload (using ZFS 0.8.3) is basically:

  • Bulk writes
  • Bulk reads
  • Accessing metadata

For the first two, I highly recommend either 1M or 4M

zfs set recordsize=1M yourpool/yourdataset

To set it higher than 1M, you have to edit /etc/modprobe.d/zfs.conf, add the following, and reboot. Note that 4M is fairly commonly used and probably doesn’t have issues, but higher values are not well tested. I’ve seen some people go to 16M, which might be cool, but I’m not touching that for a while, and 4M already seems to be the point of diminishing returns.

options zfs zfs_max_recordsize=4194304

This won’t affect already-written data, but anything new will be stored in larger chunks (recordsize is a maximum; blocks are sized dynamically). That means fewer I/O operations every time a whole file is accessed, and less metadata. If you have a workload where you are constantly reading and/or writing parts of a file, you want that on a separate dataset with a smaller recordsize. Datasets holding databases might use 8K/16K/32K/64K, which is a big difference, and VM storage also needs a small recordsize.
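A few illustrative per-dataset settings (the dataset names are made up; match the database recordsize to your DB’s page size, e.g. 8K for PostgreSQL or 16K for InnoDB):

zfs set recordsize=1M  yourpool/media      # big sequential files
zfs set recordsize=16K yourpool/db         # databases: match the page size
zfs set recordsize=64K yourpool/vmstore    # VM images: a common middle ground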

Here’s what I use for my youtube archive dataset for example

zfs set acltype=posixacl kpool/archive-yt
zfs set compression=lz4 kpool/archive-yt
zfs set xattr=sa kpool/archive-yt
zfs set atime=off kpool/archive-yt
zfs set relatime=off kpool/archive-yt
zfs set recordsize=4M kpool/archive-yt
zfs set aclinherit=passthrough kpool/archive-yt
zfs set dnodesize=auto kpool/archive-yt
zfs set exec=off kpool/archive-yt

Please note that some of these settings permanently throw away any BSD compatibility. Not an issue for me, but something you should be aware of.

Special Vdevs for metadata

I also highly recommend looking at the new special vdev. Good information is hard to find right now (maybe I looked wrong, but the man pages were fucking useless for this). This thread introduces it nicely. As it happens, I had 3x 250GB 860 Evos laying around from putting old laptops out of commission, so I just grabbed another one. 500GB of space should be sufficient for most people; if you have billions of files or storage approaching triple-digit TB, spend more time figuring out how much metadata you actually have. If you run out of space it’s no big deal, since allocations just spill over to the regular vdevs. By default, only metadata is allocated to this special vdev, which allows you to do things like look at metadata without having to touch your HDDs (great for indexing, searching, and opening a folder with tens of thousands of files).

Basically, to add 4 SSDs in 2 mirrored pairs, grab the disk IDs with this:

ls -l /dev/disk/by-id

And then fit them into this:

zpool add yourpool special mirror /dev/disk/by-id/wwn-0x1 /dev/disk/by-id/wwn-0x2 mirror /dev/disk/by-id/wwn-0x3 /dev/disk/by-id/wwn-0x4

Note the two uses of mirror: the first two disks form one mirror, the second two form another, and the two mirrored pairs are striped together.

Note that this special vdev cannot be removed, and if it dies your pool dies with it, just like any other vdev. Incremental backups continue to be a good idea at all times and in all circumstances. SMART monitoring and email notifications are also best practice.

To show how full each vdev is, use:

zpool list -v yourpool

NFS and sync writes

By the way, my understanding is that by default NFS requests only sync writes (and apparently iSCSI does not), which means a SLOG is a very good idea, as you’ve already seen. A good way to test this before adding a SLOG is to temporarily set sync=disabled on the dataset, which tells ZFS to lie about sync writes having hit stable storage. If that speeds things up, a SLOG will help. You could also mount the NFS share with the async option.

zfs set sync=disabled yourpool/yourdataset

When done make sure to set it back to:

zfs set sync=standard yourpool/yourdataset
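And if the share is exported outside of ZFS’s sharenfs, the server-side version of that knob is the async export option; a hedged example (the network range is a placeholder):

# /etc/exports -- example only; 'async' lets the server acknowledge writes
# before they reach stable storage, the same trade-off as sync=disabled
/mnt/data/scratch  192.168.1.0/24(rw,async,no_subtree_check)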

2 Likes

I’ve set the recordsize on most of my “general” pools to 1M at this point, but I didn’t know I could go higher. That may be worth looking into. Some things I have at 16K, such as my Linux ISO directory of course, but that’s on a scratch SSD pool.

Do you see any compression on videos out of curiosity? I mean, CPU cycles are wasted with it off, but mostly curious.

I opened that link, and my mind was blown. I’ll have to get back to you on that, but it’s amazing to know that exists. If my glance is correct, this stores metadata and small files on the SSD for the IOPS, while large files go to the spinning rust behind it? Or does everything go to the SSD first and only spill over to rust? I saw that spillover happens regardless. I may have a set of 256GB SSDs somewhere to try this out on (mirrored, and the pool has proper backups as stated elsewhere), or I could try this on just a scratch dataset of course.

I’ve since added an Optane M.2 module (only a single one for now; yes, I know I need to order another one and an adapter to mirror it), and that did in fact make things MUCH more responsive, though I haven’t checked the real numbers. I am more comfortable with sync=standard on a non-mirrored SLOG than with sync disabled completely for the time being.

It’s 1am, so I have to give that a re-read in the morning, but that’s amazing information on this. Thank you so much for the book of a reply!

Not on the videos; lz4 aborts on those quickly enough that it’s not an issue. What does get compressed is all the side files (text descriptions, annotations, translations, etc.).

For the special vdevs (make sure your ZFS is recent, 0.8.3, and the pool has been upgraded), the default is that only metadata is sent to the SSDs, unless they are full, in which case the metadata goes to the HDDs as normal. You can furthermore set each dataset individually to also send files up to a certain size to the SSDs (until they’re full, then to the HDDs as normal). I’m not exactly familiar with setting that, as I haven’t wanted to spend the time figuring out whether I want to use it (vs. just having a dedicated SSD pool, which I don’t need right now anyway).
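From what I’ve read it’s the special_small_blocks property (visible in the property dump above), something like this, untested by me and with an example threshold:

# route blocks up to 32K on this dataset to the special vdev (example value)
zfs set special_small_blocks=32K yourpool/yourdataset

# keep an eye on how full the special vdev is
zpool list -v yourpool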

It’s 1am

Dammit, not again.

2 Likes

I’ve looked into this and I may play with it on a scratch pool, as that’s where my hot data should sit anyway, and the safety of that pool doesn’t matter since it’s all scratch. I tend to tier my storage, with the vast majority of it being large media (by size, not file count), backups (zfs send/recv), and a very small amount used by services like web servers, etc. I’ve been looking at downsizing my desktop and therefore its internal storage, which is the main reason I was looking into the performance side; for my really hot data, it would be cheap enough to throw even a completely unsafe pool at it, since it would get backed up on a cron job to a safe pool.

Think OS on a local, safe pool, but remote home dir on a hybrid SSD/HDD pool. I could skip the mirrors as long as I don’t mind the downtime if something fails while safely testing this newer feature.