Which is negligible. With today’s prices per TB, I personally wouldn’t bother adding complexity, sacrificing performance, or buying a new SSD just to save 20 bucks, never mind dedicating memory worth a multiple of that just to make dedup work at all. This is why people don’t use dedup → the opportunity cost is higher than the gains.
If you want to do it “for science”, consider putting the dedup table on a special vdev and tuning the ZFS parameters so the dedup table doesn’t dominate your ARC.
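A minimal sketch of that “for science” setup, assuming OpenZFS 0.8+ and hypothetical pool/device/dataset names (the ARC tunables vary a lot between OpenZFS versions, so I’m leaving those out):

```sh
# Inspect how big the dedup table (DDT) actually is before committing:
zdb -DD tank

# Keep the DDT off the HDDs with a dedicated dedup allocation-class vdev.
# Mirror it -- it's a top-level vdev, so losing it means losing the pool:
zpool add tank dedup mirror /dev/nvme0n1 /dev/nvme1n1

# Enable dedup only on the dataset you want to experiment with:
zfs set dedup=on tank/science
```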
There is no compression on Storage Spaces? Compression ratios range from 1.0 to 5.0 depending on the kind of data you have. If it’s all highly compressed media files or zip archives, transparent compression is barely worth the CPU cycles.
Databases, text files, OS files… those are prime real estate for compression and usually land above 2.0.
I recommend using zstd. Level 3 is the default, but over the last two years I’ve realized that my Zen 3 CPU can compress zstd-5 and zstd-7 at multiple gigabytes per second, so I’ve moved most archive-ish datasets to higher compression levels.
Even with a shitty CPU, lz4 compression is a no-brainer. I can’t think of a machine outside of dual-core laptops that would want to disable it for power saving.
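Roughly how that looks in practice, assuming OpenZFS 2.0+ (for zstd) and hypothetical pool/dataset names:

```sh
# lz4 as the cheap pool-wide default, heavier zstd only where the data
# sits cold and compresses well:
zfs set compression=lz4 tank
zfs set compression=zstd-5 tank/archive

# Check what it actually buys you (ratio is reported per dataset):
zfs get compressratio tank tank/archive
```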
All of us would like more storage and less space dedicated to redundancy. But if you want fault tolerance and integrity, there is no other option: without redundancy, ZFS can detect damage but can’t repair it. So what’s the point of migrating then?
You may get a warning from TrueNAS or whatever software you’re using, but it will work. Data gets striped across all top-level vdevs, making it a de facto RAID0 config, with the disadvantages mentioned above.
No. There is metadata overhead, and CoW filesystems by their very nature fragment over time, and dramatically so as they approach capacity limits. The recommendation is to not let the pool run above 80% of capacity, because performance goes downhill very fast. And I’m not even talking about write amplification and wasted space yet, which will be a thing with fixed-blocksize ZVOLs.
7.5TB is a naive way of thinking about storage.
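If you want to keep an eye on where you actually stand, something like this (hypothetical pool name) shows used capacity and fragmentation; CAP creeping past ~80% is where CoW pools start to hurt:

```sh
zpool list -o name,size,alloc,free,cap,frag tank
```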
The keyword here is cache. HDDs won’t deliver performance no matter how many you have or in what config. ARC is what makes performance happen, along with L2ARC for reads and a SLOG for sync writes. More ARC is always better.
I’m running with a 30GB dirty_data_max and sync=disabled, and writes are, for all intents and purposes, wire speed (10GbE in my case). Memory is life.
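For reference, here’s roughly how that dirty_data_max tuning looks on Linux with OpenZFS (30 GiB shown because that’s what I run; depending on your version you may also have to raise zfs_dirty_data_max_max, which caps it):

```sh
# Runtime: allow up to 30 GiB of dirty data to be buffered in RAM
echo 32212254720 > /sys/module/zfs/parameters/zfs_dirty_data_max

# Make it stick across reboots:
echo "options zfs zfs_dirty_data_max=32212254720" >> /etc/modprobe.d/zfs.conf
```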
The memory impact of 500G of L2ARC is trivial for an ARC size of 40GB: every cached record needs a header in ARC (ballpark 100 bytes), so at the default 128K recordsize that’s roughly 4 million records, i.e. a few hundred MB at most.
If you need sync writes and performance matters to you, a SLOG is the most efficient use of storage ever: making sync writes behave basically like async ones is something like 10000% faster. Otherwise set sync=disabled and let your memory be the SLOG.
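The sync behaviour is a per-dataset property; a quick sketch with hypothetical dataset names:

```sh
# 'standard' honors the application's sync requests (this is where a SLOG pays off),
# 'disabled' treats everything as async -- fast, but you can lose the last
# few seconds of writes on power loss:
zfs set sync=standard tank/vmstore
zfs set sync=disabled tank/scratch
```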
To sum things up, my recommendation:
Don’t use dedup.
Create a pool with the two drives as two single-disk (striped) vdevs. Partition the SSD (better: use NVMe namespaces) with 16GB going to a SLOG and the rest to L2ARC, and enable persistent L2ARC (see the sketch below).
Works.
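A minimal sketch of that layout, assuming OpenZFS 2.0+ and hypothetical device names (use /dev/disk/by-id paths in practice):

```sh
# Two HDDs as single-disk (striped) top-level vdevs:
zpool create tank /dev/sda /dev/sdb

# SSD split into a small SLOG and a large L2ARC
# (partitions shown; NVMe namespaces look the same from ZFS's point of view):
zpool add tank log   /dev/nvme0n1p1    # ~16 GB
zpool add tank cache /dev/nvme0n1p2    # the rest

# Persistent L2ARC: rebuild the cache contents after reboot
# (this is the default in OpenZFS 2.0+, shown here for completeness):
echo "options zfs l2arc_rebuild_enabled=1" >> /etc/modprobe.d/zfs.conf
```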