ZFS vdev with two drives (no mirror), answered, ty

hello everyone, thank you for taking the time.

I have two options:
  • one NVMe drive of 4 TB
  • two NVMe drives of 2 TB each (cost: +50 €)

I need the 4 TB as space to write to.

In the context of the upcoming question:

  • Let us set aside the ARC in RAM (and the SLOG/ZIL) for a moment.
  • Let us also set aside the lack of redundancy, i.e. no failsafe if an NVMe drive fails.
  • From now on, when I say cache, I mean the cache that is physically on the NVMe SSD.

If I use one NVMe drive, I use 4 PCIe lanes, one NAND controller, and one hardware cache.
If I use two NVMe drives in the same pool, I use 8 lanes (4 per controller), two controllers, and two separate hardware caches.
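For concreteness, the two layouts I am comparing would be created roughly like this (the pool name and device names are only placeholders; in practice I'd use the /dev/disk/by-id paths):

```
# Option 1: one 4 TB drive = a single top-level vdev
zpool create tank nvme0n1

# Option 2: two 2 TB drives = two single-disk top-level vdevs (a stripe, no redundancy)
zpool create tank nvme0n1 nvme1n1
```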

The next paragraph is the question, stated in several different ways.

QUESTION:
How does ZFS decide which of the underlying NVMe SSDs to write to when they are both in the pool?
Does ZFS fill up one drive, then the second?
Or does ZFS blindly write odd bits to nvme0 and even bits to nvme1?
Or does ZFS balance the load as much as it can between the two drives/controllers?

any insight?

Bonus question: can I expect a speed increase in the two-NVMe configuration? (I am effectively doubling the lanes and cache, after all.)

thank you for reading, have a nice day :slight_smile:

EDIT: I have googled in search of the following information:
What is the size of the cache on the Corsair MP600 4 TB?
I could not find it (I do not know if the cache is bigger on the 4 TB compared to the 2 TB). This means I do not know whether going for two SSDs would double the cache size, but I know for a fact it would spread the load over 8 lanes instead of 4, and over two controllers instead of one.

EDIT2:
The 4 TB and 2 TB models have the same physical configuration; only the dynamic cache differs.
I'm going for two drives, so that the workload is divided between two NVMe drives, thus increasing speed. (As stated below, that increase is partly counterbalanced by the fact that the ARC is counterproductive when it comes to NVMe, but overall it is still an increase.)

You will see a roughly equal distribution of usage across all vdevs.
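You can check this yourself: `zpool list -v` breaks capacity and allocation down per vdev (the pool name here is just an example):

```
# Show per-vdev size, allocated and free space; with two single-disk vdevs,
# the ALLOC column should stay roughly equal on both drives over time.
zpool list -v tank
```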

No. ZFS writes blocks of data, not individual bits, along with the metadata, block pointers, etc. This is not dumb round-robin behaviour.

Yes. If ZFS notices that one drive performs slower than the rest, it gets less data, to improve overall performance.

Yes. You can consider two vdevs to be the equivalent of a RAID0 with two drives. ZFS automatically stripes data across all top-level vdevs. Even in a mirror configuration, read performance is doubled, like you would expect from a RAID1.
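If you want to watch the striping happen, something like `zpool iostat -v` while copying a large file shows both vdevs taking a similar share of the write load (pool name again a placeholder):

```
# Print per-vdev I/O statistics every second during a big write;
# both NVMe vdevs should show comparable write ops and bandwidth.
zpool iostat -v tank 1
```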

Keep in mind that NVMe performance with ZFS is not optimal. This is mainly caused by NVMe being so fast and ZFS relying on the ARC so much. There is an update in the works to bypass the ARC for NVMe pools, but at present you will have to deal with somewhat degraded performance compared to e.g. ext4 or XFS. It's still blazing-fast NVMe speed, but it won't win benchmarks.

edit:

If you write 100 GB to your pool, each drive will only have to deal with ~50 GB worth of data. So effectively, the DRAM cache on the drives is double the size; each drive only needs to cache half the data, so to speak.

3 Likes

Whoa, such a comprehensive and thorough answer!
All I can say is
THANK YOU.
:slight_smile:

Note: I'm only starting to read up on ZFS.

I'm guessing that at the moment the ARC is common to all pools, since all the ARC settings ZFS allows are agnostic to pools. A pity: being able to specify "don't cache for this pool" would do the trick.
I could just have those two drives be ext4 and thereby exclude them from the ARC, but then I'd have to RAID them, meaning having two volume managers, etc.

thanks again, have a nice day :slight_smile:

There is an option to exclude a pool or single datasets from the primarycache. But the ARC is much more than just a cache; it's basically the CPU, heart, and nexus of ZFS, where everything comes together.
With RAID0 you lose some ZFS features anyway because there is no redundancy for self-healing, but you keep convenient things like snapshots, compression, and encryption.
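For reference, the knob mentioned above is the `primarycache` property; it can be set per pool or per dataset (pool/dataset names here are placeholders):

```
# Cache only metadata in the ARC for this dataset...
zfs set primarycache=metadata tank/scratch

# ...or skip the ARC for it entirely (usually not recommended)
zfs set primarycache=none tank/scratch

# Verify the current setting
zfs get primarycache tank/scratch
```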

I run 2x NVMe in my laptop and I get ~95% of the manufacturer's rated values in sequential reads and writes. IOPS, on the other hand, are noticeably lower than the usual numbers.
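If anyone wants to reproduce that comparison, a rough fio sketch along these lines should do (the target directory, sizes and runtimes are arbitrary, and the ARC will serve part of the reads unless primarycache is restricted):

```
# Sequential 1M reads, queue depth 16 (throughput number)
fio --name=seqread --directory=/tank/bench --rw=read --bs=1M --size=8G \
    --ioengine=libaio --iodepth=16 --runtime=60 --time_based

# Random 4k reads, 4 jobs, queue depth 32 (rough IOPS number)
fio --name=randread --directory=/tank/bench --rw=randread --bs=4k --size=8G \
    --ioengine=libaio --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting
```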

1 Like

ARC is much more than cache

indeed

I run 2x NVMe […] numbers

Well, that's nice; I have some kind of numbers to go by :slight_smile:

Thank you, have a nice day.

1 Like

This is a good answer. Do you have any recommendation on where to read up on the particular part I commented on?

1 Like

Today I learned.

I had also misremembered ZFS as writing to the quickest vdev that responds, which I'm sure I heard Allan Jude mention on a podcast.

But JRS begs to differ…

https://jrs-s.net/2018/04/11/zfs-allocates-writes-according-to-free-space-per-vdev-not-latency-per-vdev/

1 Like

Thanks for the link! Matt Ahrens said similar things in his lectures. It was probably just a misunderstanding or a generalization on their part. I personally have homogeneous vdevs and don't face any of the problems that may arise, like the missing rebalance feature, but there are drawbacks for sure. I see roughly equal utilization of space allocation on all of my drives. Adding vdevs later on certainly causes imbalance; in that case I'd probably do the usual "re-create the pool from backup" thing to get balance again.
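That "re-create from backup" step is basically a recursive send/receive round trip, roughly like this (pool and snapshot names are placeholders):

```
# Replicate everything to a backup pool
zfs snapshot -r tank@rebalance
zfs send -R tank@rebalance | zfs recv -F backup/tank

# Destroy and re-create "tank" with the new vdev layout, then send the data
# back; the received data gets striped across all vdevs as it is written.
zfs send -R backup/tank@rebalance | zfs recv -F tank
```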

I think the confusion between "quickest vdev" and "vdev with more free space" comes down to the fact that HDDs get slower the more you fill them. A fresh, empty vdev of the same drive model is just faster on average.

1 Like

I'm pretty sure the behaviour Salter showed in his demos was to balance new additions to a pool, while not just writing to the newly added devices.

SSDs as data drives didn't really factor into the beginnings of ZFS, but there were 15k SAS and 5.4k SATA drives back then, so the developers knew pools might be imbalanced in speed; I guess they simply chose another way, if JRS is accurate and still applicable.

More tuning could have happened since SSDs became much more common, and now so much faster, with humongous queues/IOPS, the tuning of which is even mentioned in the Jude/Lucas books from back in the day, so the behaviour could have changed.

And I am sure I heard about ZFS preferring to write to the vdevs that were more performant, but I lack any links.

1 Like

Which also causes imbalance. At some point your NVMe vdev is at 100% and you're left with a 0% HDD vdev, and you're gambling on whether your data gets pulled from the NVMe or the HDD, making pool performance random and unreliable at best.
That's an extreme example, but it shows why preferring faster drives over averaging out writes might not be a good thing in the long term.

I agree that ZFS is somewhat behind the curve when it comes to the "flash revolution". DirectIO looks like an interesting step, but no one knows when it gets released. Probably before the RAIDZ expansion feature :slight_smile:

1 Like
