ZFS Guide for beginners and advanced users: concepts, pool configuration, tuning, and troubleshooting

There are tools to read the internals. I use smartctl (from smartmontools) on Linux to see all the stats, including how many hours a drive has run, whether it reports errors, and what it reports as its emulated and native sector sizes. Unless you have really old hardware that genuinely uses 512-byte sectors, modern hard drives all use 4k sectors, so use ashift=12 for ZFS. Setting ashift too high costs you almost nothing; setting it too low is really, really bad. Some firmware just lies about sector size, which is especially common with SSDs. Just test your config with different ashift values to see what the drives are actually running.
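To make that concrete, here's a small sketch of the idea: read the physical sector size a drive reports (smartctl prints a "Sector Sizes" line) and take log2 of it to get the ashift. The device name and the sample smartctl output line below are placeholders, not output from any particular drive:

```shell
# Sketch: derive an ashift value from a drive's reported physical sector size.
# /dev/sda is a placeholder -- on a real system you'd run:
#   smartctl -i /dev/sda | grep 'Sector Sizes'
# which typically prints a line like this sample:
sample='Sector Sizes:     512 bytes logical, 4096 bytes physical'

# Pull out the physical sector size (the number before "bytes physical").
phys=$(echo "$sample" | grep -o '[0-9]* bytes physical' | cut -d' ' -f1)

# ashift is log2 of the sector size: 512 -> 9, 4096 -> 12.
ashift=0
n=$phys
while [ "$n" -gt 1 ]; do n=$((n / 2)); ashift=$((ashift + 1)); done

echo "physical=$phys ashift=$ashift"
# The pool would then be created with e.g.:
#   zpool create -o ashift=$ashift tank /dev/sda
```

Remember that if the firmware lies about its native sector size, the reported value is exactly what you can't trust, which is why benchmarking different ashift values is still worthwhile.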

This is the case for most storage. The first TB is the fastest, and it gets slower the more you fill the drive. The outer tracks pass under the head faster, so they deliver more data per revolution. SSDs have the same problem, but for different reasons. Marketing always quotes the performance of the first MB written, because that's the highest number :slight_smile:

This is generally good advice. I use three different batches (check the serial numbers on the drives) for my 6 HDDs and occasionally do a rotation with my backup drives.
But this is even more important for SSDs. Same endurance rating, getting exactly the same amount of writes, same model, and same batch? Yeah, don't do that if you can avoid it.


Hm, then the performance goes really downhill. I'm at ~190MB/s sequential writes of video files, reading from striped ext4-on-LVM to the striped ZFS pool. At the beginning the pool hit about ~420MB/s. I think the advertised numbers are sequential reads across the full disk? I mean, bunching a lot of smaller drives together gives more performance compared to these high-density disks. Never mind, until some good/cheap SSDs come along, this is going to take ages. I think a QLC SSD would be slower than this setup. Maybe ext4 on LVM would be the better option for this use case. :crazy_face:


I’m getting approx. the same for each disk of my pool (Toshiba MG08 16TB drives). Sequential writes range between 150MB/s and 250MB/s on modern drives.
I’m getting 600-700MB/sec with my three mirrors. Adding in ZFS compression, it nicely fits into the 10Gbit network bandwidth and I often see the full 1.16GiB/sec when transferring large amounts of stuff.
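That 1.16GiB/sec figure matches the 10Gbit line rate converted into GiB/s (ignoring protocol overhead); a quick integer-math sanity check:

```shell
# Sanity check: 10 Gbit/s line rate expressed in GiB/s.
bits_per_sec=10000000000                   # 10 Gbit/s
bytes_per_sec=$((bits_per_sec / 8))        # 1.25 GB/s in decimal bytes
# Multiply by 100 first to keep two decimal digits with integer math.
centi_gib=$((bytes_per_sec * 100 / 1073741824))
echo "10GbE ~= $((centi_gib / 100)).$((centi_gib % 100)) GiB/s"
```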


Maybe I'm reading from the slowest tracks? And I'm using rsync -avh --append-verify etc.; I don't know how much its speed readout differs from benchmarks. Writing single files from an SSD, or a single file from the LVM, is faster of course. I turned compression off on the pool; transcoding will save much more space than compression would. fio benchmarks and dd show much higher numbers than rsync-ing a whole LVM over. I used "advanced" cp/mv for a while to replace the old coreutils, for the progress bar. But meh, I'm still tinkering to get my C-states to work; looks like ASRock boards have bad BIOSes for that.


So I’ve got two all-SSD pools in my workstation. One is four 8TB Samsung SATA drives in RAIDZ1 that I’m using for bulk storage, and one is three 2TB WD Black NVMEs in a striped vdev I’m using for operating system, working data, and applications. The striped vdev duplicates its snapshots on the big, slow pool, which in turn duplicates its snapshots daily to spinning rust on my home server.

I'm kind of disappointed with the performance of my SSD pools, though. Compared to BTRFS or ext4 on LVM, the NVMe drives in particular are severely underperforming. Some of my favourite L1Techs content was Wendell's interviews with Allan Jude, and in turn Allan's presentation on optimising ZFS for SSDs at the BSD conference. I was hoping the changes Allan made would percolate into OpenZFS over time, but I'm getting impatient, so does anyone know which tunables I can poke to boost performance? I've tried adjusting obvious ones like zfs_vdev_async_write_max_active, but got no consistent improvement in benchmarks.
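For anyone following along: ZFS-on-Linux module tunables like that one are usually set persistently via modprobe options. A sketch of the mechanism (the value here is purely illustrative, not a recommendation for any workload):

```
# /etc/modprobe.d/zfs.conf -- example only; the value is illustrative.
# The current value can be read back from
# /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
options zfs zfs_vdev_async_write_max_active=10
```

Values written to /sys/module/zfs/parameters/ take effect immediately but don't survive a reboot, which makes that path handy for benchmarking before committing a value to modprobe.d.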


I've been on cutting-edge distros for nearly a decade with ZFS on root, and the DKMS nature of ZFS was never the issue. If it is, then you're not running your system with a proper ZFS bootloader like ZBM.

I'm keen to know now whether ZFS set the sector size correctly. lsblk -t says it doesn't need alignment, but those drives use 512e out of the box. I've read that it's possible to switch this with hdparm and vendor-specific tools. Do I get better performance from that? Like the last fraction? I couldn't find anything conclusive about this, so I don't know whether ZFS is using 4k and the HDDs are accepting it.

The e stands for emulated: the drives report 512-byte sectors to ensure compatibility with old systems. They lie to the OS; it doesn't mean the sectors are actually 512 bytes. Same with SSDs. In the spec sheet you usually see 4096n or 4kn, meaning 4k native.

You can test different ashift values and benchmark with fio to see the differences. If ashift=9 is faster than ashift=12, then your drive really does love those 512-byte sectors.
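A rough outline of that test loop, printed as a dry run rather than executed (the device and pool name are placeholders, and actually running the zpool create would wipe the disk):

```shell
# Dry run: print the commands for benchmarking a scratch disk at each ashift.
# DEV and POOL are placeholders; zpool create destroys any data on DEV.
DEV=/dev/sdX
POOL=testpool

cmds=$(for ashift in 9 12 13; do
  echo "zpool create -o ashift=$ashift $POOL $DEV"
  echo "fio --name=seqwrite --rw=write --bs=1M --size=4G --directory=/$POOL"
  echo "zpool destroy $POOL"
done)

printf '%s\n' "$cmds"
```

Comparing the fio results across runs (and adding a random-write job with a small block size) shows quickly whether the drive's native sector size matches what it reports.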


ashift=12 works best so far. At first I created just the zpool with no datasets, but now I've moved my stuff into datasets and tested send/recv, since I don't have that many drives in my systems.

But I went for just a stripe over 3 drives; raidz1 performance was really slow for them, dropping under 200MB/s. Is that the performance penalty for RAIDZ? As far as I've seen in the OpenZFS talks on YouTube, it writes parity at the same time across all drives.
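As a ballpark for the parity cost: in a 3-wide raidz1, roughly one drive's worth of bandwidth goes to parity, so streaming writes top out around (n-1)/n of the aggregate raw throughput. A tiny model with assumed per-drive numbers (real results vary with recordsize, fragmentation, and caching):

```shell
# Ballpark raidz1 streaming-write ceiling, ignoring all other overheads.
drives=3
per_drive_mbs=200                          # assumed per-drive sequential write speed
raw=$((drives * per_drive_mbs))            # aggregate raw bandwidth
data=$((raw * (drives - 1) / drives))      # one drive's worth is spent on parity
echo "raw=${raw}MB/s usable~=${data}MB/s"
```

So sub-200MB/s is well below even this rough ceiling, which suggests something besides parity (small records, sync writes, or the drives themselves) was dragging the raidz1 numbers down.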

Very good. Some notes:

Please take this out. L2ARC is just about the last thing you want to add to a zpool, and then only if your ARC stats support it. If we're talking about best practices, nudge folks toward a special vdev instead. Ah, looks like I brought a knife to a gun fight.

To be clear, the ZFS that ships with Ubuntu is not installed via DKMS by default, but you can opt into this behavior.

To be clear, the conflict is with a leading-edge kernel, not a leading-edge distro. For example you can break ZFS in Ubuntu 20.04 LTS by installing the latest mainline kernel (and zfs-dkms).

When I'm teaching ZFS, I really try to hammer home that these concepts are analogous to the RAID levels we all know and love. So it's useful for learning by coming from a familiar place, but the implementations are subtly different and have their own caveats and pitfalls. A stripe of vdevs doesn't quite work like a typical stripe, a RAIDZ1 vdev doesn't quite work like a typical RAID5 volume, etc.

Or by replacing two old drives with two new drives.

New data gets automatically write-balanced across the vdevs. Existing data stays where it is. The way vdev write-balancing is implemented can lead to some surprises here.

  • If your sync writes are important to you, SLOG should be a mirrored pair or better
  • If your sync writes are important to you, SLOG devices should have some level of power loss protection
  • If sync write performance is important to you, small block write latency at low queue depths is the most important spec to shop for

I recognize that there are valid use-cases for setting sync=disabled or sync=always on a dataset, but I think it’s much too nuanced and foot-gunny to include in a beginner’s guide. If you’re determined to leave it, please just add a note that these actions should only be performed if lab testing, performance telemetry, etc. prove it to be necessary.

And it’s the only compression option that will bail out early if it detects poorly-compressible data.

Please note that if you exclude something from ARC it will never make it to L2ARC.


Technically it's not striping. But the term is so common that I just call it striping, like the RAID levels that are different but share enough similarity that everyone knows what they're dealing with. Distribution or round-robin may be the better term, but to quote Jim Salter: "A zpool is not a funny-looking RAID0—it's a funny-looking JBOD, with a complex distribution mechanism subject to change".

A 3-disk RAID5, a 3-wide RAIDZ1, or EC 2+1: very different at a low level, but when it comes to storage efficiency, IOPS, and designing a server, they share enough similarities to throw them into one basket and get everyone aboard.
Interested users will dig into the details on their own.

I should really update some things, and autoexpand=on is definitely an important aspect I left out, mainly because I just add mirrors to my pool. But bringing an "outdated" vdev up to a more current state, or replacing drives because you don't really want to keep a spare 2TB ready and would rather get everything up to 16-18TB, comes up often among the L1T populace. Thanks for raising attention to this feature.

Notes taken. The paragraph on Linux and DKMS vs. non-DKMS will get an update, along with more readable formatting.

Thanks for the feedback!


openSUSE Tumbleweed uses really new kernels. I tried to use ZFS there, but there weren't modules for it yet, or I didn't look deep enough. Maybe that has changed, but I stay away from it now.