There are tools to read the internals. I use smartctl (from smartmontools) on Linux to see all the stats, including how many hours a drive has run, whether it reports errors, and what it reports as its emulated and native sector size. Unless you have really old stuff that genuinely uses 512-byte sectors, all hard drives run with 4k sectors, so use ashift=12 for ZFS. Setting ashift too high is barely a problem; setting it too low is really, really bad. Some firmware just lies, which is especially common with SSDs, so test your config with differing ashift values to see what the drives are actually doing.
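For example, something along these lines (device and pool names are just placeholders):

```bash
# Reported sector sizes and power-on hours (device name is just an example)
smartctl -a /dev/sda | grep -i 'sector size\|power_on_hours'

# Create the pool with 4k sectors explicitly instead of trusting the firmware
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

# Check what ashift the pool actually ended up with
zdb -C tank | grep ashift
```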
This is the case for most storage. The first TB is fastest and it gets slower the more you fill it; the outer tracks simply move more data under the head per rotation. SSDs have the same problem, but for different reasons. Marketing always talks about performance for the first MB of writes, because that gives the highest number.
This is generally good advice. I use three different batches (check the serial number on the drive) for my 6 HDDs and occasionally do a rotation with my backup drives.
But this is even more important for SSDs. Same endurance rating, getting exactly the same amount of writes, same model and same batch? Yeah, don't do this if you can avoid it.
Hm, then the performance goes really downhill. I'm at ~190MB/s sequential writes of videos, reading from the striped ext4-on-LVM setup to the striped zpool. At the beginning the pool hit about ~420MB/s. I think the advertised numbers are for sequential reads of the whole disk? I mean, bunching a lot of smaller drives together gives more performance compared to these high-density ones. Nvm, until some good/cheap SSDs are around it's gonna take ages; I think a QLC SSD would be slower than this setup. Maybe ext4 on LVM would be the better option for this use case.
I'm getting approx. the same for each disk of my pool (Toshiba MG08 16TB drives). Sequential writes range between 150MB/s and 250MB/s on modern drives.
I'm getting 600-700MB/sec with my three mirrors. Adding in ZFS compression, it nicely fits into the 10Gbit network bandwidth and I often see the full 1.16GiB/sec when transferring large amounts of stuff.
Maybe I'm reading from the slowest tracks? And I'm using rsync -avh --append-verify etc., so I dunno how much the speed readout there differs from benchmarks. Writing single files from an SSD to it, or a single file from the LVM, is faster of course. I turned compression off in the pool; transcoding will save much more space compared to that. fio benchmarks and dd show much higher values than rsyncing the whole LVM over. I've used "advanced" copy/mv for a while to replace the old coreutils for the progress bar. But meh, I'm still tinkering to get my C-states to work; looks like ASRock boards have bad BIOSes for that.
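For what it's worth, a rough fio run like this (path and size are just examples) gives me a number that's easier to compare than watching rsync:

```bash
# Rough sequential write test on a scratch dataset (path and size are examples).
# end_fsync makes the result less flattering than a purely ARC-cached run.
fio --name=seqwrite --directory=/tank/scratch --rw=write --bs=1M \
    --size=8G --ioengine=psync --end_fsync=1 --group_reporting
```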
So I've got two all-SSD pools in my workstation. One is four 8TB Samsung SATA drives in RAIDZ1 that I'm using for bulk storage, and one is three 2TB WD Black NVMes in a striped vdev I'm using for the operating system, working data, and applications. The striped vdev duplicates its snapshots to the big, slow pool, which in turn duplicates its snapshots daily to spinning rust on my home server.
I'm kind of disappointed with the performance of my SSD pools, though. Compared to BTRFS or ext4 on LVM, the NVMes in particular are severely underperforming. Some of my favourite L1Techs content was Wendell's interviews with Allan Jude, and in turn Allan's presentation on optimising ZFS for SSDs at the BSD conference. I was hoping the changes Allan made would percolate into OpenZFS over time, but I'm getting impatient, so does anyone know which tunables I can poke to boost performance? I've tried adjusting obvious ones like zfs_vdev_async_write_max_active, but got no consistent improvements in benchmarks.
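For reference, I've been poking them at runtime like this (the value is just an example, not a recommendation):

```bash
# Read the current value of a vdev queue tunable
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active

# Temporary runtime change as root (reverts on reboot); value is just an example
echo 10 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active

# To persist it, use a module option instead, e.g. in /etc/modprobe.d/zfs.conf:
# options zfs zfs_vdev_async_write_max_active=10
```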
I've been on cutting-edge distros for nearly a decade with ZFS on root and the DKMS nature of ZFS was never the issue. If it is, then you're not running your system with a proper ZFS bootloader like ZBM.
I'm keen to know now if ZFS set the sector size correctly. lsblk -t says it doesn't need alignment, but those drives use 512e out of the box, and I've read that it's possible to change this with hdparm and vendor-specific tools. Do I get better performance from that, like the last fraction? I couldn't find anything conclusive about this, so I don't know whether ZFS is using 4k and the HDDs are accepting it.
The e stands for emulated, and they report as 512-byte sectors to ensure compatibility with old systems. They lie to the OS. That doesn't mean it's actually 512-byte sectors. Same with SSDs. Usually you see 4096n or 4kn in the spec sheet, meaning 4k native.
You can test different ashift values and benchmark with fio for differences. If ashift=9 is faster than ashift=12, then your drive really loves those 512 bytes.
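Roughly like this, on a spare or test drive (device and pool names are placeholders, and this destroys whatever is on the disk):

```bash
# What the drive reports to the OS (LOG-SEC) vs. what it uses internally (PHY-SEC)
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdX

# Throwaway single-disk pools with different ashift, benchmarked the same way
# (ZFS may refuse ashift=9 on drives that report 4k physical sectors)
zpool create -o ashift=9 testpool /dev/sdX
fio --name=ashift-test --directory=/testpool --rw=randwrite --bs=4k --size=2G --end_fsync=1
zpool destroy testpool

zpool create -o ashift=12 testpool /dev/sdX
fio --name=ashift-test --directory=/testpool --rw=randwrite --bs=4k --size=2G --end_fsync=1
zpool destroy testpool
```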
ashift=12 works best so far. At first I created just the zpool with no datasets, but now I've moved my stuff into datasets and tested send/recv, because I don't have that many drives in my systems.
But I went for just a stripe over the 3 drives; it seems like RAIDZ1 performance was really slow for them and dropped under 200MB/s. Is that the performance penalty for RAIDZ? As far as I gathered from the OpenZFS talks on YouTube, it writes parity at the same time on all drives.
Please take this out. L2ARC is just about the last thing you want to add to a zpool, and then only if your ARC stats support it. If we're talking about best practices, nudge folks toward a special vdev instead. Ah, looks like I brought a knife to a gun fight.
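If the guide wants a concrete example, adding a special vdev looks roughly like this (pool and device names are made up); keep in mind it should be mirrored, because losing the special vdev loses the pool:

```bash
# Add a mirrored special vdev for metadata (and optionally small blocks)
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# Optionally send small records to it on a per-dataset basis
zfs set special_small_blocks=16K tank/dataset
```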
To be clear, the ZFS that ships with Ubuntu is not installed via DKMS by default, but you can opt into this behavior.
To be clear, the conflict is with a leading-edge kernel, not a leading-edge distro. For example you can break ZFS in Ubuntu 20.04 LTS by installing the latest mainline kernel (and zfs-dkms).
When I'm teaching ZFS, I really try to hammer home that these concepts are analogous to the RAID levels we all know and love. That's useful for learning because it starts from a familiar place, but the implementations are subtly different and have their own caveats and pitfalls. A stripe of vdevs doesn't quite work like a typical stripe, a RAIDZ1 vdev doesn't quite work like a typical RAID5 volume, etc.
Or by replacing two old drives with two new drives.
New data gets automatically write-balanced across the vdevs. Existing data stays where it is. The way vdev write-balancing is implemented can lead to some surprises here.
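You can see this for yourself after adding a vdev (pool name is an example):

```bash
# Per-vdev capacity: a freshly added vdev shows up nearly empty while the
# old vdevs stay as full as they were; iostat shows where new writes land
zpool list -v tank
zpool iostat -v tank 5
```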
If your sync writes are important to you, SLOG should be a mirrored pair or better
If your sync writes are important to you, SLOG devices should have some level of power loss protection
If sync write performance is important to you, small block write latency at low queue depths is the most important spec to shop for
I recognize that there are valid use-cases for setting sync=disabled or sync=always on a dataset, but I think it's much too nuanced and foot-gunny to include in a beginner's guide. If you're determined to leave it, please just add a note that these actions should only be performed if lab testing, performance telemetry, etc. prove it to be necessary.
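To go with the SLOG points above, a sketch like this might be enough for the guide, with the warning attached (pool, dataset and device names are placeholders):

```bash
# See how a dataset currently handles sync writes
zfs get sync tank/dataset

# Add a mirrored SLOG (ideally low-latency devices with power loss protection)
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# Only after testing proves it necessary:
# zfs set sync=disabled tank/dataset   # trades safety for speed
# zfs set sync=always tank/dataset     # trades speed for safety
```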
And it's the only compression option that will bail out early if it detects poorly-compressible data.
Please note that if you exclude something from ARC it will never make it to L2ARC.
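Concretely (dataset name is an example):

```bash
# primarycache controls ARC, secondarycache controls L2ARC; data that never
# enters ARC can never be evicted into L2ARC either
zfs get primarycache,secondarycache tank/dataset
zfs set primarycache=metadata tank/dataset
```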
Technically it's not striping. But the term is so common that I just call it striping, just like RAID levels that are different but share enough similarity that everyone knows what they are dealing with. Distribution or round robin may be the better term, but to quote Jim Salter: "A zpool is not a funny-looking RAID0; it's a funny-looking JBOD, with a complex distribution mechanism subject to change."
A 3-disk RAID5, a 3-wide RAIDZ1, or EC 2+1: very different at a low level, but when it comes to storage efficiency, IOPS and designing a server, they share enough similarities to throw them into one basket and get everyone aboard.
Interested users will dig into the details on their own.
I should really update some things. And expansion/autoexpand=on is definitely an important aspect I left out, mainly because I just add mirrors to my pool. But bringing an "outdated" vdev up to a more current state, or expanding because you don't really want to keep a spare 2TB around and would rather get everything up to 16-18TB, are real use cases. Expansion is often discussed among the L1T populace. Thanks for raising attention to this feature.
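For reference, the relevant knobs are roughly (pool and device names are examples):

```bash
# Let the pool grow automatically once every disk in a vdev has been replaced
zpool set autoexpand=on tank

# Or expand an already-replaced disk manually
zpool online -e tank /dev/sdX
```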
Notes taken. Paragraph on Linux and DKMS vs. non-DKMS will get an update. Along with more readable formatting.
openSUSE Tumbleweed uses really new kernels. I tried to use ZFS there, but there weren't modules for it yet, or I didn't look deep enough. Maybe that has changed, but I stay away from it now.
OK, so I've spent the night upgrading my IPMI and firmware to the very latest.
Unfortunately my googling is a little ambiguous regarding whether this motherboard does PCIe bifurcation.
I still don't actually understand if I need PCIe bifurcation. I got the impression I do from one of the YouTube videos, but that's hardly concrete.
I found this product:
KONYEAD pcie4.0 x16 to m.2 m-Key nvme x 4 ssd Expansion Card with Heatsink (Supports 4 NVMe M.2 2280 up to 256Gbps) M2 Adapter Card 2280 Driver Free. Bifurcation Required
Are there cards with 4x M.2 slots that I could plug 4x Optane 118GB or 58GB drives into which don't require bifurcation, or am I just misunderstanding how to do this properly?
āthe Atom C3000 features a PCIe 3.0 x16 controller (with x2, x4 and x8 bifurcation), 16 SATA 3.0 ports, four 10 GbE controllers, and four USB 3.0 ports.ā
Perhaps SuperMicro didn't add this option, or perhaps my having 12 SATA ports and an NVMe port is eating it all up.
Any advice would be of use! Curious if I can kick my server into high gear with some discounted 118GB Optanes (I have a little spare cash and 4 of these would hardly hurt me)
I don't think your motherboard or CPU has PCI-e bifurcation.
Based on the spec sheet of the motherboard, you have one PCI-e 3.0 x4 slot. I haven't seen x2/x2 bifurcation yet, nor would I think an Intel Atom would have such a capability.
Only thing you can do is find a PCI-e to M.2 card with one M.2. That will probably work.
Maybe, just maybe, you can use a cheat code from Wendell after you've turned PCI-e into M.2.
I guess so. However, you could also open a new thread specific to your needs.
Yes. It is not.
The mobo only has a single PCIe slot hooked up to four Gen3 lanes. While it's conceivable to enable 2x2 lane bifurcation, your BIOS/UEFI screenshots don't show the option.
Yep. Have a look at the mobo manual. Figure 1-6 is the block diagram that explains how the 16 lanes are split up on the board.
4 lanes go to the PCIe slot
2 lanes are shared between m.2 slot and SATA
the remaining lanes are used by other SATA, USB, Network options.
Realistically, you could add a maximum of 2 nvme devices to your mobo. The first into the m.2 slot, the second via a simple pcie-to-m.2 adapter card.
If you choose to install 2 nvme devices, that leaves up to 12 SATA ports usable for OS and zfs (this is the zfs thread after all).
Maybe it's wrong to revive this, but I read a bit more and stumbled over the I/O scheduler; OpenZFS has its own scheduler. For SSDs and ZFS filesystems the scheduler should be set to none (ages ago it was called noop).
I've looked into my setup (Debian 12) and found that mq-deadline was still set on the disks in my zpool.
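In case someone else trips over this, checking and changing it looks something like the following (device names and the udev rule are just examples, adjust for your disks):

```bash
# The bracketed entry is the active scheduler for a pool member disk
cat /sys/block/sda/queue/scheduler

# Switch to none at runtime (as root)
echo none > /sys/block/sda/queue/scheduler

# Persist it with a udev rule, e.g. /etc/udev/rules.d/66-io-scheduler.rules:
# ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"
```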