ZFS Guide for beginners and advanced users. Concepts, pool config, tuning, troubleshooting

There are tools to read the internals. I use smartctl (from smartmontools) on Linux to see all the stats, including how many hours the drives have run, whether they report errors, and what they report as emulated and native sector size. Unless you have really old stuff that genuinely uses 512-byte sectors, all hard drives run with 4k sectors, so use ashift=12 for ZFS. Setting ashift too high costs you very little; setting it too low is really, really bad. Some firmware just lies, which is especially common with SSDs. Just test your config with different ashift values to see what the drives are actually running.
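As an illustration of both checks, a minimal sketch (device paths and the pool name "tank" are placeholders):

# show logical/emulated vs. physical/native sector size, hours and errors
smartctl -a /dev/sdX | grep -iE 'sector size|power_on_hours|error'

# force 4k sectors at pool creation, no matter what the firmware claims
zpool create -o ashift=12 tank mirror /dev/sdX /dev/sdY

# verify which ashift the pool actually got
zpool get ashift tank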

This is the case for most storage. The first TB is the fastest, and it gets slower the more you fill it. Outer tracks simply pass more data per rotation, so they are faster to write to. SSDs have the same problem, but for different reasons. Marketing always quotes the performance of the first MB of writes, because that is the highest number :slight_smile:

This is generally good advice. I use three different batches (check the serial number on the drive) for my 6 HDDs and occasionally do a rotation with my backup drives.
But this is even more important for SSDs. Same endurance rating, getting exactly the same amount of writes, same model and same batch? Yeah, don't do this if you can avoid it.

1 Like

Hm, then the performance goes really downhill. I'm at ~190 MB/s sequentially writing videos, reading from striped ext4-on-LVM to the striped ZFS pool. At the beginning the pool hit about ~420 MB/s. I think the advertised numbers are sequential reads over the full disk? I mean, bunching a lot of smaller drives together will give more performance than this high-density setup. Nvm, until some good/cheap SSDs are out there, it's gonna take ages. I think a QLC SSD would be slower than this setup. Maybe ext4 on LVM would be the better option for this use case. :crazy_face:

1 Like

I'm getting approx. the same for each disk of my pool (Toshiba MG08 16TB drives). Sequential writes range between 150 MB/s and 250 MB/s on modern drives.
I'm getting 600-700 MB/s with my three mirrors. Adding in ZFS compression, it nicely fits into the 10Gbit network bandwidth and I often see the full 1.16 GiB/s when transferring large amounts of stuff.

1 Like

Maybe I'm reading from the slowest tracks? And I'm using rsync -avh --append-verify etc.; dunno how much its speed readout differs from proper benchmarks. Writing single files from an SSD to it, or a single file from the LVM volume, is faster of course. I turned compression off on the pool; transcoding will save much more space than compression would. fio benchmarks and dd show much higher values than rsyncing a whole LVM volume over. I've used "advanced" cp/mv for a while as a replacement for the old coreutils, just for the progress bar. But meh, I'm still tinkering to get my C-states to work; looks like ASRock boards have bad BIOSes for that.

1 Like

So I've got two all-SSD pools in my workstation. One is four 8TB Samsung SATA drives in RAIDZ1 that I'm using for bulk storage, and one is three 2TB WD Black NVMes in a striped vdev I'm using for operating system, working data, and applications. The striped vdev duplicates its snapshots on the big, slow pool, which in turn duplicates its snapshots daily to spinning rust on my home server.
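For reference, that kind of replication chain can be done with plain incremental send/receive; a rough sketch (the pool names "fast" and "bulk" and the dataset/snapshot names are all placeholders, and tools like syncoid wrap the same mechanism):

# initial full copy of the working dataset to the bulk pool
zfs snapshot -r fast/work@base
zfs send -R fast/work@base | zfs recv -u bulk/backup/work

# afterwards only the delta between snapshots gets shipped
zfs snapshot -r fast/work@daily1
zfs send -R -I @base fast/work@daily1 | zfs recv -u bulk/backup/work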

I'm kind of disappointed with the performance of my SSD pools though. Compared to BTRFS or ext4 on LVM, the NVMes in particular are severely underperforming. Some of my favourite L1Techs content were Wendell's interviews with Allan Jude, and in turn Allan's presentation on optimising ZFS for SSDs at the BSD conference. I was hoping the changes Allan made would percolate into OpenZFS over time, but I'm getting impatient, so does anyone know which tunables I can poke to boost performance? I've tried adjusting obvious ones like zfs_vdev_async_write_max_active, but got no consistent improvements in benchmarks.
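For anyone else poking at the same knobs: on Linux the OpenZFS module parameters live under /sys/module/zfs/parameters, so testing and persisting a change looks roughly like this (the value 10 is purely illustrative, not a recommendation):

# read the current value
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active

# change it at runtime (reverts on reboot)
echo 10 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active

# persist it across reboots
echo "options zfs zfs_vdev_async_write_max_active=10" >> /etc/modprobe.d/zfs.conf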

1 Like

I've been on cutting-edge distros for nearly a decade with ZFS on root, and the DKMS nature of ZFS was never the issue. If it is, then you're not running your system with a proper ZFS bootloader like ZBM.

I'm keen to know now whether ZFS set the sector size correctly. lsblk -t says it doesn't need alignment, but those drives use 512e out of the box. I've read that it's possible to switch this with hdparm and vendor-specific tools. Do I get better performance from that, like the last fraction? Couldn't find anything conclusive about this, so I dunno if ZFS is using 4k and the HDDs are accepting it.

The e stands for emulated, and they report as 512 byte sectors to ensure compatibility with old systems. They lie to the OS. Doesn't mean it's actually 512 byte sectors. Same with SSDs. Usually you see 4096n or 4kn in the spec sheet, meaning 4k native.

You can test different ashift values and benchmark with fio to see the difference. If ashift=9 is faster than ashift=12, then your drive really loves those 512 bytes.
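A rough, destructive way to run that comparison on a scratch drive (sketch; /dev/sdX and the pool name "testpool" are placeholders, and the drive gets wiped):

# build a throwaway pool at one ashift, benchmark, destroy, repeat with the other
zpool create -o ashift=12 testpool /dev/sdX
fio --name=ashift-test --directory=/testpool --rw=randwrite --bs=4k \
    --size=4G --ioengine=psync --fsync=1 --runtime=60 --time_based
zpool destroy testpool
# then recreate with -o ashift=9 and compare IOPS/bandwidth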

1 Like

ashift=12 works best so far. At first I created just the zpool with no datasets, but now I've moved my stuff into datasets and tested send/recv, since I don't have that many drives in my systems.

But I went for just a stripe over 3 drives; the RAIDZ1 performance was really slow for them and dropped under 200 MB/s. Is that the performance penalty for RAIDZ? As far as I understood from the OpenZFS talks on YouTube, it writes parity at the same time on all drives.

Very good. Some notes:

Please take this out. L2ARC is just about the last thing you want to add to a zpool, and then only if your ARC stats support it. If we're talking about best practices, nudge folks toward a special vdev instead. Ah, looks like I brought a knife to a gun fight.
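For anyone following along, the two things being referenced look roughly like this (sketch; pool, device and dataset names are placeholders):

# check ARC hit rates before even considering L2ARC
arcstat 1 10
grep -E 'hits|misses' /proc/spl/kstat/zfs/arcstats

# a special vdev holds metadata (and optionally small blocks); mirror it,
# because losing it means losing the pool
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
zfs set special_small_blocks=16K tank/dataset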

To be clear, the ZFS that ships with Ubuntu is not installed via DKMS by default, but you can opt into this behavior.

To be clear, the conflict is with a leading-edge kernel, not a leading-edge distro. For example you can break ZFS in Ubuntu 20.04 LTS by installing the latest mainline kernel (and zfs-dkms).

When I'm teaching ZFS, I really try to hammer home that these concepts are analogous to the RAID levels we all know and love. So it's useful for learning, because you come from a familiar place, but the implementations are subtly different and have their own caveats and pitfalls. A stripe of vdevs doesn't quite work like a typical stripe, a RAIDZ1 vdev doesn't quite work like a typical RAID5 volume, etc.
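To make the analogy concrete, the rough equivalents as pool layouts (sketch; each command is an alternative layout for a hypothetical pool, device names are placeholders):

# 'RAID10-like': a stripe of two mirror vdevs
zpool create tank mirror sda sdb mirror sdc sdd

# 'RAID5-like': a single 3-wide RAIDZ1 vdev
zpool create tank raidz1 sda sdb sdc

# 'RAID0-like': a plain stripe of single-disk vdevs
zpool create tank sda sdb sdc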

Or by replacing two old drives with two new drives.

New data gets automatically write-balanced across the vdevs. Existing data stays where it is. The way vdev write-balancing is implemented can lead to some surprises here.

  • If your sync writes are important to you, SLOG should be a mirrored pair or better
  • If your sync writes are important to you, SLOG devices should have some level of power loss protection
  • If sync write performance is important to you, small block write latency at low queue depths is the most important spec to shop for

I recognize that there are valid use-cases for setting sync=disabled or sync=always on a dataset, but I think it's much too nuanced and foot-gunny to include in a beginner's guide. If you're determined to leave it, please just add a note that these actions should only be performed if lab testing, performance telemetry, etc. prove it to be necessary.
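If someone does go down that road anyway, the safer shape of it looks roughly like this (sketch; device and dataset names are placeholders, and the sync change should only follow actual measurements):

# add a mirrored SLOG pair, ideally devices with power loss protection
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# inspect and, only if justified, override sync behaviour per dataset
zfs get sync tank/vmstore
zfs set sync=always tank/vmstore     # or sync=disabled, accepting the risk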

And it's the only compression option that will bail out early if it detects poorly-compressible data.
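Which is why it is cheap to leave on pool-wide; a sketch ("tank" is a placeholder pool):

# enable lz4 for everything below the pool root and check the result
zfs set compression=lz4 tank
zfs get compression,compressratio tank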

Please note that if you exclude something from ARC it will never make it to L2ARC.
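Concretely, those are the primarycache/secondarycache properties; a sketch (the dataset name is a placeholder):

# cache only metadata for this dataset in ARC...
zfs set primarycache=metadata tank/media
# ...which also means its file data can never be promoted into L2ARC
zfs get primarycache,secondarycache tank/media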

1 Like

Technically it's not striping. But the term is so common, I just call it striping, just like RAID levels that are different, but share enough similarity so everyone knows what they are dealing with. Distribution or round robin may be the better term, but to quote Jim Salter: "A zpool is not a funny-looking RAID0—it's a funny-looking JBOD, with a complex distribution mechanism subject to change".

A 3-disk RAID5, a 3-wide RAIDZ1 or EC 2+1: very different on a low level, but when it comes to storage efficiency, IOPS and designing a server, they share enough similarities to throw them into one basket to get everyone aboard.
Interested users will dig into the details on their own.

I should really update some things. And expand/autoexpand=on is definitely an important aspect I left out, mainly because I just add mirrors to my pool. But bringing an "outdated" vdev up to a more current state matters, e.g. when you don't really want to keep a spare 2TB drive ready and would rather get everything up to 16-18TB. Expansion is often discussed among the L1T populace. Thanks for raising attention to this feature.
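For completeness, the usual sequence for that kind of in-place capacity upgrade (sketch; pool and disk names are placeholders):

# let the pool grow once every disk in a vdev has been replaced
zpool set autoexpand=on tank

# swap drives one at a time and wait for each resilver to finish
zpool replace tank old-disk-1 new-disk-1
zpool status tank

# if autoexpand was off during the swaps, expand manually afterwards
zpool online -e tank new-disk-1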

Notes taken. The paragraph on Linux and DKMS vs. non-DKMS will get an update, along with more readable formatting.

Thanks for the feedback!

1 Like

openSUSE Tumbleweed uses really new kernels. I tried to use ZFS there, but there weren't modules for it yet, or I didn't dig deep enough. Maybe that has changed, but I stay away from it now.

Updated links in the second posting and added zfs-autosnapshot and Syncoid as tools with links. Found some typos.

The ZFS Administration guide from Aaron Toponce is offline... very sad. It has been down for over a week now. I linked an HTTP dump PDF in the guide.

Is this a suitable topic to discuss Wendell recommending using 2 to 4 Optane devices to speed up ZFS / TrueNAS systems?

I'm trying to find threads discussing the fire sale he mentioned over a year ago, but they've all closed due to idle time.

Does anyone know if this board is capable of me putting in a 2 or 4x NVMe card and doing what he's suggesting?

https://www.supermicro.com/en/products/motherboard/a2sdi-8c-hln4f

(It seems the odd Optane is still available cheap and I don't use the PCIe slot for anything else)

Also, I have 6x 16TB SATA drives AND 6x 2TB SSDs in the system; can 4 Optanes safely cache the whole lot?

OK, so I've spent the night upgrading my IPMI and firmware to the very latest.

Unfortunately, my googling turned up somewhat ambiguous results on whether this motherboard does PCIe bifurcation.

I still don't actually understand if I need PCIe bifurcation. I got the impression I do from one of the YouTube videos, but that's hardly concrete.

I found this product:

KONYEAD PCIe 4.0 x16 to M.2 M-Key NVMe x4 SSD Expansion Card with Heatsink, Supports 4 NVMe M.2 2280 up to 256Gbps, M2 Adapter Card 2280 Driver Free. Bifurcation Required

Are there cards with 4x M.2 slots that I could plug 4x Optane 118GB or 58GB drives into which do not require bifurcation, or am I just misunderstanding how to do this properly?

Here are my board options:
zSHv7W2.png (1248×1176) (imgur.com)

xTYz06q.png (1237×958) (imgur.com)

It's the new 1.8 BIOS. I'm not seeing the option there.

According to Anandtech though:

Intel Launches Atom C3000 SoCs: Up to 16 Cores for NAS, Servers, Vehicles (anandtech.com)

"the Atom C3000 features a PCIe 3.0 x16 controller (with x2, x4 and x8 bifurcation), 16 SATA 3.0 ports, four 10 GbE controllers, and four USB 3.0 ports."

Perhaps Supermicro didn't add this option, or perhaps my 12 SATA ports and an NVMe port are eating it all up.

Any advice would be of use! Curious if I can kick my server into high gear with some discounted 118GB Optanes (I have a little spare cash and 4 of these would hardly hurt me)

I don't think your motherboard or CPU supports PCIe bifurcation.
Based on the spec sheet of the motherboard you have one PCIe 3.0 x4 slot. I haven't seen an x2/x2 bifurcation yet, nor would I think an Intel Atom would have such a capability.
The only thing you can do is find a PCIe-to-M.2 card with a single M.2 slot. That will probably work.
Maybe, just maybe, you can use a cheat code from Wendell after you've turned PCIe into M.2:

https://www.youtube.com/watch?v=M9TcL9aY004

https://www.amazon.com.au/Ableconn-M2-U2-131-SFF-8643-connector-PCIe-NVMe/dp/B01D2PXUQA

https://www.cablecreation.com/products/internal-hd-mini-sas-cable

Also, for a server I recommend you enable Above 4G decoding and SR-IOV and disable fastboot.

The thing is, if you check that AnandTech article, it does imply it's capable.

I would prefer multiple drives; I take it the method you're suggesting can't do this?

I guess so. However, you could also open a new thread specific to your needs.

Yes. It is not.
The mobo only has a single PCIe slot hooked up to four Gen3 lanes. While it's thinkable to enable 2x2 lane bifurcation, your BIOS/UEFI screenshots don't show the option.

Yep. Have a look at the mobo manual. Figure 1-6 is the block diagram that explains how the 16 lanes are split up on the board.

  • 4 lanes go to the PCIe slot
  • 2 lanes are shared between the M.2 slot and SATA
  • the remaining lanes are used by other SATA, USB, and network options.

Realistically, you could add a maximum of 2 NVMe devices to your mobo: the first into the M.2 slot, the second via a simple PCIe-to-M.2 adapter card.
If you choose to install 2 NVMe devices, that leaves up to 12 SATA ports usable for OS and ZFS (this is the ZFS thread after all).

I've actually replied to you here (sorry, this is the wrong place for my idea).
Sorry about bouncing around, but it keeps things on the right topic.

@jode

Maybe it's wrong to revive this, but I read a bit more and stumbled over the I/O scheduler. OpenZFS has its own scheduler, so for SSDs and ZFS filesystems the Linux scheduler should be set to none; ages ago it was called noop.

I've looked into my setup, Debian 12, and found that mq-deadline was still set on the disks in my zpool.

root@debian-server:~# cat /sys/block/sdf/queue/scheduler
[mq-deadline] none
root@debian-server:~# cat /sys/block/sde/queue/scheduler
[mq-deadline] none
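
If you want to test switching it, a sketch of the runtime change plus one way to persist it (matching disks by kernel name like this is fragile if device letters ever change):

# runtime change, not persistent across reboots
echo none > /sys/block/sdf/queue/scheduler

# one way to persist it: a udev rule (file name/number is arbitrary)
# /etc/udev/rules.d/66-zfs-scheduler.rules
# ACTION=="add|change", KERNEL=="sd[ef]", ATTR{queue/scheduler}="none"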

I didn't find any specific statement on whether this could be a "problem", is desired, or can slow things down.
Anyone with some knowledge about this?

1 Like