This is just some follow-up data. I was working on this project with 45Drives, so I used the spare hard drives to see how much difference running with and without a metadata special device makes when just scanning and accessing files.
Raw CrystalDiskMark numbers for these 118GB Optane devices:
zpool status
  pool: tank
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            sdc        ONLINE       0     0     0
            sdd        ONLINE       0     0     0
            sde        ONLINE       0     0     0
        special
          mirror-1     ONLINE       0     0     0
            nvme0n1    ONLINE       0     0     0
            nvme1n1    ONLINE       0     0     0
          mirror-2     ONLINE       0     0     0
            nvme2n1    ONLINE       0     0     0
            nvme3n1    ONLINE       0     0     0

errors: No known data errors

  pool: wendell2
 state: ONLINE
config:

        NAME          STATE     READ WRITE CKSUM
        wendell2      ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            sdg       ONLINE       0     0     0
            sdf       ONLINE       0     0     0
            sdh       ONLINE       0     0     0
time to scan wendell2 zpool: 2m36s
(only spinning rust)
time to scan tank zpool: 36s
(with optane-based metadata)
That’s a 2 minute savings just scanning the system.
Each zpool held just over 177k files, so the pools were identical. The only difference was where the metadata was stored (spinning rust vs. the striped-mirror Optane config).
118GB × 2 (236GB total) isn’t a lot of space for metadata, but it’s plenty for most home server/homelab use cases up to around 150TB, I’d say.
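If anyone wants to reproduce the scan comparison, a minimal version of the test looks something like this (the mountpoint is an example; exporting and re-importing the pool is a simple way to start from a cold ARC):
# cold scan: export/import drops the pool's cached metadata from ARC
zpool export tank && zpool import tank
time find /tank -type f | wc -l
# run it again immediately for a warm (ARC-cached) scan
time find /tank -type f | wc -l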
Hmmm.
I just watched Wendell’s new video on this topic regarding the cheap P1600X and 905Ps.
Let’s do some maths. Following Wendell’s script I got the following output. The workload for this pool is primarily Plex and bulk data storage. It’s made up of three 5-disk RAIDZ1 vdevs of 10TB disks and is about half full.
I created a little Excel cheatsheet to help me figure out what I would theoretically need. Is my math wrong, or would I really only need 35ish gigs for a special VDEV on this pool?
What’s making me second-guess this is that only 19% of my files are using 99+% of my space. lol
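For anyone else trying to size this, numbers like the ones Wendell’s script reports can be approximated with zdb (read-only, but it can take a while on a big pool); treat this as a sketch rather than the exact script:
zdb -Lbbbs tank | less
# in the block-type table, total ASIZE minus the "L0 ZFS plain file" (and zvol) rows
# is roughly what a metadata-only special vdev would need to hold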
Thanks for sharing your data. Facts are always welcome.
As I am looking at the data I am wondering what to make of it.
I mean - it’s no surprise that 4 Optane devices are faster than 3 spinning hard drives. Metadata, data, or anything else. You had already established that special devices work in principle in your previous post.
As I am looking at this data I cannot help but wonder what the impact of metadata caching on such a configuration would be. ARC caching is sophisticated and never quite works as I expect it to.
E.g. I would expect that the metadata for the 177k files could potentially fit very nicely into ARC (depending on RAM config).
Do you know anecdotally - or would it be hard to test - how a cold scan followed by a warm scan compares on both configs?
I would hope and expect the warm scan for the HD-only config to speed up quite nicely over the cold scan. I would expect the same test on the Optane config to show cold and warm times that are already quite similar.
Asking because I’m in hw transition land and won’t be able to try this out for a while.
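For whoever ends up testing this, a quick way to see how much of the ARC is actually holding metadata (counter names vary a bit between OpenZFS versions, so this is a sketch):
arc_summary | grep -i -A 2 metadata
# or read the raw counters directly
grep -i meta /proc/spl/kstat/zfs/arcstats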
I wonder if Primocache still provides a good performance benefit for C++ compiles and gaming if my boot drive is an SK Hynix P41 Platinum? I would be using this 64GB Optane module: Optane Memory M10 64Gb M.2 80 - Newegg.com
I’m in the planning stages for a 64TB or 72TB TrueNAS server I’ll be building in the next 6 months. I was under the impression from the other thread that I would need closer to 3% of the pool size in storage for the special vdev (I believe you had stated 5TB for a 172TB pool). Was this because of the special_small_blocks setting?
I was planning on setting up a three-way mirror with 3x 960GB Samsung PM963 for about $60 each, but maybe I should get 3x 118GB P1600X @ $76 each? It would reduce the size of my special vdev substantially though. I could also spend double that and get 6x P1600X for a 236GB special vdev. I’m struggling to pick between these options.
I would love some more insight into how much space the metadata would require. I anticipate about 3/4 of the usable space will be large media files, and the rest will be documents/photos/music (Nextcloud).
Is it better to use a smaller optane metadata-only special vdev or a larger NAND metadata+small files special vdev?
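If it helps with the decision, the metadata-only vs. metadata+small-files question is set per dataset rather than per pool, so one special vdev can serve both. A sketch with hypothetical dataset names (the property only affects newly written blocks):
# large media: only metadata goes to the special vdev (0 is the default)
zfs set special_small_blocks=0 tank/media
# Nextcloud docs/photos/music: blocks up to 64K also land on the special vdev
zfs set special_small_blocks=64K tank/nextcloud
zfs get -r special_small_blocks tank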
Just watched the corresponding video and read the posts here.
Metadata Special Device… but RAM instead of Optane?
I’m considering adding a metadata special device to my TrueNAS system, but I’m curious whether the same effect (or better) could just be accomplished with RAM.
Here is my thought process:
236GB of usable space for 150TB. Let’s scale this down a little since I don’t have a Level1-sized data center at home. I think it’s fair to assume ~24GB for a 15TB server.
This got me thinking… 24GB? Why bother with Optane, just use MORE RAM! RAM would be both faster and have a higher write endurance. (I think. Please correct me if I’m wrong.)
Yes, this means the metadata would not survive reboot. Instead of moving the metadata from the rust to the Special Device, can ZFS use a portion of the memory to auto-cache all metadata? By “auto-cache”, I mean automatically cache into RAM 100% of the metadata after booting. The remaining RAM could be used as the typical L1ARC.
I think I am basically in “that’s a nice idea” land and that this is not an existing feature in ZFS. I thought others may be interested regardless. And maybe this is possible in ZFS and someone more knowledgeable than I could share some search terms to help me find more on this topic.
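In the meantime, if anyone goes the “just add more RAM” route, the main knob is the ARC size cap; a sketch assuming OpenZFS on Linux (tunable names and paths can differ by version):
# current cap in bytes (0 = auto, roughly half of RAM on Linux)
cat /sys/module/zfs/parameters/zfs_arc_max
# raise it to 32 GiB for this boot (make it persistent via /etc/modprobe.d)
echo $((32 * 1024**3)) > /sys/module/zfs/parameters/zfs_arc_max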
Anyway, thanks to Wendell for another deep dive into ZFS and cool tech!
If your metadata was all stored on non-persistent storage, your pool would not survive a reboot. Now, the ARC/L2ARC can/does store metadata, and you can even make the ARC/L2ARC store metadata only.
# ARC: cache data and metadata (default) vs. metadata only
zfs set primarycache=all pool1
zfs set primarycache=metadata pool2
# L2ARC: the same choice for the secondary cache
zfs set secondarycache=all pool1
zfs set secondarycache=metadata pool2
What the SPECIAL Device does is store it persistently in your pool inside of a VDEV. Your pool is only as fragile as its weakest link, and if your Special VDEV fails, your entire pool fails. But it gives you the added benefit of having your metadata always accelerated.
Actually, coupled with the SPECIAL VDEV, I wish there were a ZFS property to tell the ARC/L2ARC never to cache metadata at all, and instead focus on storing data. In the current version of this type of deployment, if you’re using a ZFS Special VDEV, your files can be in both the Special VDEV and RAM/L2ARC at the same time, which uses space that could theoretically be holding data instead.
I agree. My suggestion is not to store all metadata in RAM, but rather to cache all metadata in RAM and to load all of it into the cache at boot. I agree that this is a critical distinction.
I also agree here. While this is good, the downside is that the user must wait for the ARC/L2ARC to “build up” by accessing the same file several times. The ARC/L2ARC provides no benefit the first time a file is accessed; the benefit builds as a given block is accessed multiple times.
I guess, for the sake of clarity, the best way to describe my suggestion is as an expansion of the ARC/L2ARC feature set: offer the benefits of a metadata special device by having the ARC/L2ARC cache all metadata, rather than putting a special vdev in the pool.
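As a crude approximation of that idea today, some people just walk the pool once after boot so the metadata gets pulled into ARC; a sketch (mountpoint is an example):
# stat everything once so dnodes and directory entries land in ARC (reads no file data)
find /mnt/tank -printf '%s %p\n' >/dev/null 2>&1
# e.g. as a root cron entry: @reboot find /mnt/tank -printf '%s %p\n' >/dev/null 2>&1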
Great timing, Wendell, I’ve got 4 of the 58GB drives arriving today. I’d intended for them to be a SLOG mirror for each of my pools and to wait for the higher-capacity U.2 Optanes to drop in price before upgrading my SSD special vdev. In the video, though, you mentioned using the higher capacities (hundreds of GBs) for SLOG and these little sticks for metadata, saying even 118GB was too small? Is that for running VM block storage on? I’ve only once seen the 16GB SLOG on my large/archive pool use more than 1GB.
Are these Optane drives sensitive to filling up and slowing down like normal SSDs? Similarly, does ZFS care about free space and fragmentation like a typical vdev, ie “>80% full is bad”?
Are these Optane drives sensitive to filling up and slowing down like normal SSDs?
Optane is not prone to the same behavior: there is no garbage collection running, and the media can be written in place, so none of the usual flash-management gymnastics are needed. Optane SSD performance does not change under higher load or when mixing reads and writes, nor does it depend on how full the drive is.
This was an editing and/or lack-of-coffee snafu on my part. I meant that I’d probably use something way larger for L2ARC for “more benefit,” and that for SLOG you really only need about 30 seconds × your write speed into the pool, at most.
I am comfortable with 30 seconds of in-flight data in my pool, but that is not right for production workloads. Most home stuff, yep, is fine.
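To put example numbers on that rule of thumb (the throughput figures here are my own assumptions, not from the video): a pool taking sync writes at 1 GB/s needs roughly 30 s × 1 GB/s = 30 GB of SLOG, so a 58GB Optane is already oversized for most home pools, and even 16GB covers anything up to about 0.5 GB/s of sync writes.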
I have also, on devices that support it, split up NVMe drives with namespaces and had them do double duty. I haven’t tried namespace reprogramming on these though.
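For anyone curious what that split looks like on drives that do support it, the nvme-cli flow is roughly this (controller ID and sizes are example values; most consumer drives report only one supported namespace):
nvme id-ctrl /dev/nvme0 | grep -w nn      # number of namespaces supported; 1 means no splitting
nvme delete-ns /dev/nvme0 -n 1
nvme create-ns /dev/nvme0 --nsze=29297664 --ncap=29297664 --flbas=0   # ~15 GB at 512-byte LBAs
nvme attach-ns /dev/nvme0 -n 1 -c 0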
So I learned a thing after ordering a bunch of these, now y’all can learn from my mistakes:
tl;dr you can’t shrink a metadata special device
My pool is 4x 8-wide RAIDZ2 with a 3-way mirror of 400GB Intel DC S3700 SSDs for metadata/small blocks running on TN SCALE. My special vdev has ~124GB on it, so the 118GB Optanes wouldn’t be enough. I ordered 6 of them thinking that I could add another 3-disk special vdev to add capacity, then replace the 3 SSDs with the Optanes to make a ~220GB special vdev and pick up some IOPS at the same time.
I’ve shrunk a pool of mirrors before with no problem, but just in case I spun up a VM to test, and I got an error that the new device was too small!? On a brand new pool with no data on it! Specifically, the error is [EZFS_BADDEV] device is too small. If I try to add a second special vdev and remove the original, I get [EZFS_INVALCONFIG] invalid config; all top-level vdevs must have the same sector size and not be raidz. I get the same error about being too small when trying to replace one of the disks from the command line.
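For anyone who wants to see this for themselves without building a VM, file-backed vdevs reproduce it; a throwaway sketch (sizes shrunk, paths arbitrary):
truncate -s 1G /tmp/d{1..4} /tmp/big{1..3}
truncate -s 512M /tmp/small1
zpool create -f testpool raidz2 /tmp/d1 /tmp/d2 /tmp/d3 /tmp/d4 \
    special mirror /tmp/big1 /tmp/big2 /tmp/big3
zpool replace testpool /tmp/big1 /tmp/small1    # fails with "device is too small"
zpool destroy testpool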
Sadly that puts me in a sticky situation. This pool is only ~2% critical data that’s backed up elsewhere, and while I could recreate most of the rest, I’d rather not if I can avoid it. I don’t have anywhere near enough free capacity to offload the data and rebuild the pool. Getting 3x 980GB 905Ps is likely the only viable option right now, but the cost and lack of official power-loss protection sway me towards doing nothing. Additionally, going up to 980GB would cause the same issue again: if/when the 5800X ever goes on fire sale, I’d have to get even larger drives!
Hopefully this saves someone the same headache. I’m going to cancel the backorder unless anyone knows of a way to make it work. Happy hoarding!