Optane Safe for ZFS?

Is Optane safe or does it reduce the safety of ZFS? I read this comment and have not been able to verify it yet:

I often hear that Optanes can increase performance of a ZFS pool (even an SSD-based ZFS pool) by using one as a cache.

Define what you mean by the word “safe” in the ZFS context.

If you use it for L2ARC, then a single disk is perfectly fine, as it’s basically a read cache and its contents can be thrown out without any consequence.

An SLOG should be mirrored, because if the device is having problems and power goes out, then that small bit of data that everything assumed was safely written is in fact lost.

There are no special issues with Optane drives. In fact, the low latency makes them ideal for an SLOG. A 16 GB drive can be had for around $10 on eBay. An SLOG only has to be big enough to hold 5 seconds of sync writes (unless you adjust that parameter), so for many people this is plenty. You can add additional mirrored pairs for more space.
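For anyone who wants to try it, this is roughly what adding a mirrored SLOG looks like; the pool name and device paths are just placeholders, and the “5 seconds” figure is the transaction group timeout, which on Linux is the zfs_txg_timeout module parameter:

    # add a mirrored SLOG to an existing pool (pool/device names are examples)
    zpool add tank log mirror /dev/disk/by-id/nvme-optaneA /dev/disk/by-id/nvme-optaneB

    # the "5 seconds of sync writes" comes from the transaction group timeout
    cat /sys/module/zfs/parameters/zfs_txg_timeout

    # confirm the log mirror shows up alongside the data vdevs
    zpool status tank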

You can use larger Optane drives as data disks if you hate money.

I have a special vdev setup of two triple mirrors of those cheap 16 GB Optanes to hold my metadata. Makes my mostly read-only use case so much nicer to search through. If a special vdev is lost, the pool is lost, so plan redundancy accordingly.


(Two triple mirrors? Or a triple mirror? Like, six or three of the 16 GB M.2 sticks?)

I really need to figure out a way to properly convey ZFS disk pool structure.

Basically two vdevs of three drives mirroring each other, six drives total. So triple redundancy (because it’s my entire pool’s metadata), and two vdevs so I have 32 GB of space (less if you convert it to GiB).
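If it helps to picture it, that layout is roughly what you’d get from something like this (pool and device names are made up):

    # first special vdev: a 3-way mirror (device names are placeholders)
    zpool add tank special mirror /dev/disk/by-id/optane1 /dev/disk/by-id/optane2 /dev/disk/by-id/optane3
    # second special vdev: another 3-way mirror, which stripes with the first for ~32 GB total
    zpool add tank special mirror /dev/disk/by-id/optane4 /dev/disk/by-id/optane5 /dev/disk/by-id/optane6

    # both special mirrors should now show up under the pool
    zpool status tank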


Crikey, that must have some read IOPS! Did you use an x4 add-in card and two slots on the motherboard, or does your mobo have six M.2 slots?

Also, I don’t know much about these new special vdevs, but if it’s mostly the metadata, rather than an L2ARC-style cache, is it best for things like 4K random, low-thread, low-queue-depth reads?

Exactly, using a 4x NVMe card. The motherboards I’m using have plenty of lanes and slots to spare.

The two remaining slots are for an SLOG, but I’m in the middle of properly laying out my disks and configuring my datasets this time. No more “everything in a single dataset left at the default record size of 128K” nonsense.

It’s just a pleasure to deal with now.
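For what it’s worth, the “proper layout” mostly just means separate datasets with record sizes tuned to what lives in them, something along these lines (dataset names and values are examples, not a recommendation):

    # separate datasets instead of one big 128K catch-all
    zfs create -o recordsize=1M tank/media        # large sequential files
    zfs create -o recordsize=16K tank/vm-images   # VM disk images stored as files
    zfs create tank/general                       # leave the 128K default where it fits

    # check what you ended up with
    zfs get -r recordsize tank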

1 Like

I don’t really have performance numbers, and honestly they aren’t going to be that big of a deal as long as you are getting the metadata off the slow-as-hell hard disks. I initially tested this on a pair of 250 GB Samsung SSDs (because I had no idea how much space the metadata would take up), which I’m now going to use as a VM pool. Even on just two SSDs it was a massive improvement.
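If anyone else is wondering how much space their metadata would take before buying drives, the usual trick (as far as I know) is to let zdb walk the pool and add up the block statistics; it can take a long time on a big pool:

    # rough metadata size estimate before adding a special vdev
    # (-L skips leak checking so it finishes sooner; pool name is an example)
    zdb -Lbbbs tank
    # look at the per-type block statistics near the end of the output;
    # everything that isn't plain file data is a candidate for the special vdev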

Rather than performance, I went with those obsolete little Optane drives (these are B+M-key drives that only use two PCIe lanes and have “ok” read and “meh” write performance) because I could get a lot of them for very little cost and pack them in using very little space.

I did a brief write up on my setup here: Home NAS latency (SOLVED)


I’m still holding out for persistent L2ARC so I can combine the pools of flash and rust, but for the moment I just have fast and wide. These special vdevs sound like a good stopgap.

One thing to note is that with L2ARC, if you modify the metadata, you still need to write it back to the disks, which is where the special vdev has the upper hand: it splits the worst part of the workload away from the rust so the rust can focus on the important stuff.

If the metadata barely changes, then having enough RAM to keep it all in ARC is going to be ideal for reading. You can also adjust how much of the ARC is reserved just for holding metadata (by default it’s 25%, I think).
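On Linux that knob lives in the ZFS module parameters; the exact name and default depend on the OpenZFS release, so treat this as a pointer rather than gospel:

    # portion of the ARC allowed to hold metadata (older OpenZFS releases;
    # newer versions reworked/renamed this, so check your release's docs)
    cat /sys/module/zfs/parameters/zfs_arc_meta_limit_percent

    # current ARC metadata usage is also reported in the kstats
    grep -E 'arc_meta_used|arc_meta_limit' /proc/spl/kstat/zfs/arcstats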

Record size is going to make a big difference in how much metadata you have. Bigger record sizes mean a file is split into fewer blocks, which means less metadata. If you’re doing 4K database writes, then I have no idea how much that’ll need, but it’ll be a LOT more for the same amount of data versus a workload that a 4M record size plays nice with.
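As a rough sanity check on why record size matters: every block needs a block pointer (128 bytes, if I remember right), so a quick back-of-the-envelope shows the scale. This ignores indirect-block overhead, compression and extra copies, so it’s only an illustration:

    # block pointers needed for 1 TiB of data (bash arithmetic)
    # at 1M records:   1 TiB / 1 MiB   = ~1,048,576 blocks   -> ~128 MiB of pointers
    # at 128K records: 1 TiB / 128 KiB = ~8,388,608 blocks   -> ~1 GiB of pointers
    # at 4K records:   1 TiB / 4 KiB   = ~268,435,456 blocks -> ~32 GiB of pointers
    echo $(( (1024**4 / (128*1024)) * 128 / 1024**2 ))   # MiB of pointers at 128K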

@Log
I left out that I’ll have native ZFS encryption enabled.
I have two different pools for different purposes:

  1. workstation / performance:
    – 2 M.2s mirrored
    – Maybe 1 Optane, but maybe more RAM is better… I’d prefer ZFS to use the memory, as it’s still way faster than Optane?

  2. storage / archival:
    – 6-8 × 6 TB drives in raidz2

The first question would be: if you want the mirrored M.2 drives to be an SLOG, do you have any sync writes? If there are no sync writes, then an SLOG cannot help at all. This is affected both by the applications writing the data and by file-sharing intermediaries like SMB, NFS (NFS defaults to sync=always?) or even iSCSI. Generally, virtual machines and databases make sync writes and thus benefit from an SLOG, as well as from being in a dataset with higher IOPS (mirrors and SSDs) and small, tuned record sizes that match the writes.
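A quick way to check whether sync writes are even in play (dataset names are examples):

    # see how the dataset currently handles sync requests
    zfs get sync tank/vms            # standard (default) / always / disabled

    # watch the pool while the workload runs; sync writes show up as log activity
    zpool iostat -v tank 5

    # for testing only: force everything sync to see what an SLOG would absorb
    # zfs set sync=always tank/vms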

For L2ARC, the question would be “does my workload fit within ARC? If not, would it fit within my L2ARC?” If you read the same bit of data over and over, ARC and L2ARC are helpful. If you cycle through a lot of data, then they won’t do much. The general recommendation is to use your money to max out RAM first.
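To answer that “does it fit” question, the ARC hit/miss counters are the thing to watch, and adding (or removing) an L2ARC later is a one-liner; the device path is a placeholder:

    # ARC size and hit/miss ratios (arc_summary ships with OpenZFS)
    arc_summary | less

    # raw counters if you prefer
    grep -E '^(hits|misses|size|c_max)' /proc/spl/kstat/zfs/arcstats

    # if the working set spills out of RAM, a cache device can be added or removed any time
    zpool add tank cache /dev/disk/by-id/nvme-optaneC
    # zpool remove tank /dev/disk/by-id/nvme-optaneC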

Another benefit of Optane I forgot to bring up is write endurance that’s even better than SLC flash. If you try to use cheapo QLC drives, not only will performance suck, but they’ll wear out from the constant writes.

Just thought I’d jump in and ask whether anyone here is sure that the 16 GB Optane modules write directly to the media rather than through a volatile cache, as they don’t have power loss protection. An SLOG that loses its cache contents isn’t safe in the event of a lockup or power outage unless writes go direct to the media or the drive has power loss protection. I couldn’t find a clear answer anywhere.

Good question, I looked around and I’m not sure either. I think it’s direct writes, as I can find mention of larger devices having the option to use a built-in cache or not, suggesting these small ones lack it.
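One thing you can check directly is whether the drive even reports a volatile write cache to the OS; nvme-cli exposes that (device path is an example, and a 0 there isn’t an absolute guarantee, just what the controller advertises):

    # does the controller report a volatile write cache? (nvme-cli)
    nvme id-ctrl /dev/nvme0 | grep -i vwc

    # what the kernel thinks the cache mode is
    cat /sys/block/nvme0n1/queue/write_cache   # "write back" vs "write through"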

But this actually led me to re-evaluate using these as an SLOG for a different reason: I took another look at the write speed and realized that many of the people recommending them were on 1G networks back then (before they realized that SFP+ parts could be had for cheap) and were generally using the 32 GB drive, which is twice as fast as the 16 GB. And while Optane has good endurance, a 16 GB drive isn’t going to beat an SLC or MLC drive that’s likely 250 GB at minimum, if the controller is doing its job.

This review also detracts from these as an SLOG: basically not much of an improvement (or even quite a bit worse) over recent NVMe drives. https://www.storagereview.com/review/intel-optane-memory-review

And goddamn, if you want any faster Optane device, you are basically looking at over $1000 for a pair. And a 32 GB device costs what I paid to get 8x 16 GB devices.

While these 16 GB drives are still a perfect fit for my primary use case of special vdevs (lots of device redundancy for a bottom-barrel price in a setup that will see mostly reads), I’m now hesitant to suggest them for anything else without testing.

I can say that I have added a 16 GB one as an SLOG in my hot-data pool. I keep really good backups, so I take some risks, but I am on a 10G network and have noticed a perf boost by adding it. The 32 GB ones are definitely a better bet if you can afford them, though.

