ZFS help needed

Hello all,

I have created the following ZFS pool using:

  1. zpool create nas-storage raidz1 /dev/disk/by-id/ata-ST16000NM001G-2KK103_ZL2BJ6Q3 /dev/disk/by-id/ata-ST16000NM001G-2KK103_ZL28634N /dev/disk/by-id/ata-ST16000NM001G-2KK103_ZL285QFD -f
    To create the main storage pool.
  2. Added a cache disk: zpool add nas-storage cache /dev/disk/by-id/nvme-eui.0025385991b21308
  3. zfs set recordsize=512K nas-storage
  4. Then I added 2 NVMe drives that I intend to use as metadata storage and also for small files:
    zpool add nas-storage special mirror /dev/disk/by-id/nvme-eui.002538bc1141edd7 /dev/disk/by-id/nvme-eui.002538bc21bb891c

The final pool
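For reference, the resulting layout can be verified at any time with:

  zpool status nas-storage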

Questions:

  1. In special mirror-1 the disks are mirrored, correct? If not, what would be the command to add the disks as a mirror?
  2. I am planning to add a third disk as a mirror of the first one; what would be the command to add this third disk?

I forgot to add the record size setup…
zfs set recordsize=1M nas-storage
zfs set special_small_blocks=512K nas-storage

Thank you for your support.
BR. Ion

  1. It’s a mirror. special → mirror → nvme-eui… all fine.

zpool status shows a 3-wide RAIDZ1, a mirrored special vdev and one drive for L2ARC.

  2. zpool attach pool disk newdisk. Adding another drive to a mirror is done via attach (and detach if you want to remove drives from a mirror). This will make the special a 3-way mirror: triple redundancy but the same ~1TB of capacity.
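For example, assuming the third NVMe also shows up under /dev/disk/by-id (the new device ID below is just a placeholder), attaching it to the existing special mirror would look like:

  zpool attach nas-storage /dev/disk/by-id/nvme-eui.002538bc1141edd7 /dev/disk/by-id/nvme-eui.NEW-DISK-ID

zpool status will then show the special mirror resilvering onto the new device.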

You can change recordsize on the fly and per dataset. No need to keep it the same for all datasets; some datasets will want lower values. You can set both recordsize and special_small_blocks on each dataset.
I have the default 128k for my pool and my datasets vary between 8k and 2M.
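For example (these dataset names are only illustrative, not datasets from the pool above):

  zfs set recordsize=1M nas-storage/media
  zfs set recordsize=16K nas-storage/databases
  zfs set special_small_blocks=64K nas-storage/documents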

With special_small_blocks=512K, basically 90% of the blocks in a mixed pool of data will count as "small". You probably want extra flash capacity in your case.
64k or 128k is the usual value; HDDs do fine at 128k and above. 512k blocks are not small and take up a lot of space on the special vdev.
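For example, lowering the pool-wide threshold so that only genuinely small blocks (plus metadata) land on the special vdev:

  zfs set special_small_blocks=64K nas-storage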


Hi,

After transferring data onto the pool, this is what I have on nas-storage.

I was expecting that with zfs set special_small_blocks=512K more data would be stored on the NVMe drives, considering the file size structure (around 15Gb of data).

1k: 128934
2k: 51018
4k: 41406
8k: 30902
16k: 21645
32k: 14579
64k: 13777
128k: 12690
256k: 10106
512k: 17154
1M: 14118
2M: 15163
4M: 72485
8M: 32683
16M: 7574
32M: 1468
64M: 742
128M: 1340
256M: 351
512M: 503
1G: 389
2G: 256
4G: 219
8G: 63
16G: 109
32G: 14
64G: 7

zpool list -v nas-storage
NAME                                    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
nas-storage                            44.6T  12.1T  32.5T        -         -     0%    27%  1.00x  ONLINE  -
  raidz1-0                             43.7T  12.1T  31.6T        -         -     0%  27.6%      -  ONLINE
    ata-ST16000NM001G-2KK103_ZL2BJ6Q3      -      -      -        -         -      -      -      -  ONLINE
    ata-ST16000NM001G-2KK103_ZL28634N      -      -      -        -         -      -      -      -  ONLINE
    ata-ST16000NM001G-2KK103_ZL285QFD      -      -      -        -         -      -      -      -  ONLINE
special                                    -      -      -        -         -      -      -      -  -
  mirror-1                              928G  7.17G   921G        -         -     0%  0.77%      -  ONLINE
    nvme-eui.002538bc1141edd7              -      -      -        -         -      -      -      -  ONLINE
    nvme-eui.002538bc21bb891c              -      -      -        -         -      -      -      -  ONLINE
cache                                      -      -      -        -         -      -      -      -  -
  nvme-eui.0025385991b21308              931G   931G  25.7M        -         -     0%  100.0%     -  ONLINE

Any ideas why, and how to fix this?

That’s not a lot of small stuff. When talking special vdev, we’re talking about 8 or 9-digit numbers. A special vdev is mainly for pools that have tens of millions of small blocks and TBs of metadata that can’t economically be cached by main memory anymore.

But 7G allocated on special seems low. Metadata on media isn’t much; it’s a couple of bytes for every megabyte with that recordsize. It should be more from what I see, but I mostly deal with 8k-128k datasets. Did you change the property after you wrote stuff to the pool? That would explain a lot.
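If you want to cross-check what is actually on disk, zdb can print a block-size histogram for the pool (it walks the pool metadata, so it may take a while on a full pool), something like:

  zdb -Lbbbs nas-storage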

Once written to the pool, there is no way to retroactively change the allocation other than deleting/rewriting/restoring from backup. ZFS never moves data by itself, and changes to dataset properties only apply to data written from that point forward.
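If you do want existing data to pick up new properties, the usual approach is to create a fresh dataset with the desired settings and copy the files over so every block gets rewritten (the dataset and path names below are only placeholders):

  zfs create -o recordsize=1M nas-storage/media-new
  rsync -a /nas-storage/media/ /nas-storage/media-new/
  # after verifying the copy: destroy the old dataset and rename the new one
  zfs destroy -r nas-storage/media
  zfs rename nas-storage/media-new nas-storage/media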

Pool doesn’t really need special. Little to no small stuff, large recordsize, mostly media. Memory should be able to handle this on its own.
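If you want to see how much of that metadata the ARC is already holding in RAM, arc_summary (if installed) reports it, e.g.:

  arc_summary | grep -i metadata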

And with RAIDZ you can’t remove the special vdev either. Doesn’t look good.

Hi,
The data was copied onto the pool after the pool creation and parameter setup.

That looks right to me. ZFS does write combining: incoming writes, including lots of small files, get batched into transaction groups and flushed out together, which effectively turns random writes into sequential writes and is why it works so well on hard drives. This is also why people will often use a ($33) 58GB Optane drive for their small blocks; they are cheap and durable. In your use case a 58GB Optane would be about 15% full.