ZFS: Metadata Special Device - Real World Perf Demo

Reference: ZFS Metadata Special Device: Z

This is just some follow-up data. I was working on this project with 45Drives, so I used the spare hard drives to see how much difference a metadata special device makes when simply scanning and accessing files.

Raw CrystalDiskMark numbers for these 118GB Optane devices:

Sequential Read: 1699MB/s IOPS=1
Sequential Write: 612MB/s IOPS=0

512KB Read: 1579MB/s IOPS=3159
512KB Write: 583MB/s IOPS=1166

Sequential Q32T1 Read: 1711MB/s IOPS=53
Sequential Q32T1 Write: 618MB/s IOPS=19

4KB Read: 352MB/s IOPS=90189
4KB Write: 252MB/s IOPS=64605

4KB Q32T1 Read: 1678MB/s IOPS=429603
4KB Q32T1 Write: 641MB/s IOPS=164312

4KB Q8T8 Read: 2047MB/s IOPS=524246
4KB Q8T8 Write: 718MB/s IOPS=183897


Our two minimal zfs pool configs:

 zpool status
  pool: tank
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            sdc      ONLINE       0     0     0
            sdd      ONLINE       0     0     0
            sde      ONLINE       0     0     0
        special
          mirror-1   ONLINE       0     0     0
            nvme0n1  ONLINE       0     0     0
            nvme1n1  ONLINE       0     0     0
          mirror-2   ONLINE       0     0     0
            nvme2n1  ONLINE       0     0     0
            nvme3n1  ONLINE       0     0     0

errors: No known data errors

  pool: wendell2
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        wendell2    ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
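For reference, a layout like tank's could be built with something along these lines (a sketch using the device names from the status output above, not necessarily the exact commands used for this demo):

```shell
# Sketch of how a layout like tank's could be created. Device names
# match the zpool status output above; adjust for your hardware.
zpool create tank raidz1 sdc sdd sde

# -f may be needed because the mirror special vdev's redundancy level
# doesn't match the raidz1 data vdev.
zpool add -f tank special mirror nvme0n1 nvme1n1
zpool add -f tank special mirror nvme2n1 nvme3n1

# Optional: also send blocks <= 64K to the special vdev, not just metadata.
zfs set special_small_blocks=64K tank
```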

time to scan wendell2 zpool: 2m36s

(only spinning rust)

time to scan tank zpool: 36s

(with optane-based metadata)

That’s a 2 minute savings just scanning the system.

Each zpool held just over 177k files, so the pools were identical. The only difference was where the metadata was stored (spinning rust vs. the striped-mirror Optane config).

118GB × 2 isn’t a lot of space for metadata, but it’s plenty big enough for most home server/homelab use cases up to around 150TB, I’d say.
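As a rough sanity check, the ratio implied above works out to about 0.16% of pool size. This is just arithmetic on the figures in this post; actual metadata usage depends on recordsize, dedup, and file mix:

```python
# Rough metadata capacity estimate. The 236 GB : 150 TB ratio is taken
# from the claim above (two striped 118 GB mirrors for ~150 TB of pool);
# treat it as a loose rule of thumb, not a guarantee.
RATIO = 236 / 150_000  # GB of special vdev per GB of pool (~0.16%)

def metadata_estimate_gb(pool_tb):
    """Estimated special vdev capacity (GB) for a pool of `pool_tb` TB."""
    return pool_tb * 1_000 * RATIO

print(f"{metadata_estimate_gb(150):.0f} GB")  # 150 TB pool -> ~236 GB
print(f"{metadata_estimate_gb(15):.0f} GB")   # 15 TB pool  -> ~24 GB
```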


Hmmm.
I just watched Wendell’s new video on this topic regarding the cheap P1600X and 905P drives.

Let’s do some maths. Following Wendell’s script I got the following output. The workload for this pool is primarily Plex and bulk data storage. It comprises three 5-disk RAIDZ1 vdevs of 10TB disks and is about half full.

I created a little Excel cheat sheet to help me figure out what I would theoretically need. Is my math wrong, or would I really only need ~35GB for a special vdev on this pool?

What’s making me second-guess this is that only 19% of my files are using 99+% of my space. lol

This is the Excel sheet if anyone wants it for their own use:
Special Metadata.xlsx

 Block Size   Number of Files   Size in Bytes    Size in GB    Size in TB
       1024             54205        55505920    0.05550592    5.55059E-05
       2048             20376        41730048    0.041730048   4.173E-05
       4096             25528       104562688    0.104562688   0.000104563
       8192             31334       256688128    0.256688128   0.000256688
      16384             35429       580468736    0.580468736   0.000580469
      32768             41370      1355612160    1.35561216    0.001355612
      65536            110681      7253590016    7.253590016   0.00725359
     131072             31879      4178444288    4.178444288   0.004178444
     262144             32462      8509718528    8.509718528   0.008509719
     524288              8846      4637851648    4.637851648   0.004637852
    1048576              7566      7933526016    7.933526016   0.007933526
    2097152             25330     53120860160    53.12086016   0.05312086
    4194304              3058     12826181632    12.82618163   0.012826182
    8388608              6254     52462354432    52.46235443   0.052462354
   16777216              6166     1.03448E+11    103.4483139   0.103448314
   33554432             11238     3.77085E+11    377.0847068   0.377084707
   67108864              5774     3.87487E+11    387.4865807   0.387486581
  134217728              8245     1.10663E+12    1106.625167   1.106625167
  268435456              7312     1.9628E+12     1962.800054   1.962800054
  536870912              5580     2.99574E+12    2995.739689   2.995739689
 1073741824              6913     7.42278E+12    7422.777229   7.422777229
 2147483648              6454     1.38599E+13    13859.85946   13.85985946
 4294967296              2637     1.13258E+13    11325.82876   11.32582876
 8589934592               160     1.37439E+12    1374.389535   1.374389535
17179869184                75     1.28849E+12    1288.490189   1.288490189
34359738368                 9     3.09238E+11    309.2376453   0.309237645

Total pool size: 42667.08443 GB (42.66708443 TB)

Cumulative size of files at or below each block size (candidate special_small_blocks cutoffs):

Small Block Cutoff   Special VDEV Size in GB   Special VDEV Size in TB
              1024   0.05550592                5.55059E-05
              2048   0.097235968               9.7236E-05
              4096   0.201798656               0.000201799
              8192   0.458486784               0.000458487
             16384   1.03895552                0.001038956
             32768   2.39456768                0.002394568
             65536   9.648157696               0.009648158
            131072   13.82660198               0.013826602
            262144   22.33632051               0.022336321
            524288   26.97417216               0.026974172
           1048576   34.90769818               0.034907698

Percentage of files stored in the special device, by block size:

      1024   11%
      2048    4%
      4096    5%
      8192    6%
     16384    7%
     32768    8%
     65536   22%
    131072    6%
    262144    7%
    524288    2%
   1048576    2%

Total percentage of files stored in special devices, 1M and below: 81%
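The cumulative column above can be sanity-checked in a few lines. The byte totals are copied from the histogram in the post; note this only counts small-block file data, not the pool's metadata itself:

```python
# Sanity-check the cumulative small-block sizing from the spreadsheet above.
# (block_size_bytes, total_bytes_at_that_size) pairs copied from the table.
histogram = [
    (1024, 55505920),
    (2048, 41730048),
    (4096, 104562688),
    (8192, 256688128),
    (16384, 580468736),
    (32768, 1355612160),
    (65536, 7253590016),
    (131072, 4178444288),
    (262144, 8509718528),
    (524288, 4637851648),
    (1048576, 7933526016),
]

def small_block_gb(hist, cutoff):
    """Total data (in GB) stored in blocks at or below `cutoff` bytes."""
    return sum(b for size, b in hist if size <= cutoff) / 1e9

print(f"{small_block_gb(histogram, 65536):.1f} GB")    # 64K and below -> ~9.6 GB
print(f"{small_block_gb(histogram, 1048576):.1f} GB")  # 1M and below  -> ~34.9 GB
```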

Thanks for sharing your data. Facts are always welcome.

As I am looking at the data I am wondering what to make of it.

I mean - it’s no surprise that 4 Optane devices are faster than 3 spinning hard drives. Metadata, data, or anything else. You had already established that special devices work in principle in your previous post.

As I am looking at this data I cannot help but wonder what the impact of metadata caching would be on such a configuration. ARC caching is sophisticated and never quite works as I expect it to.
E.g. I would expect that the metadata for the 177k files could fit very nicely into ARC (depending on RAM config).
Do you know anecdotally - or would it be hard to test - how a cold scan followed by a warm scan compares on both configs?

I would hope and expect the warm scan for the hd-only config to speed up quite nicely over the cold config. I would expect the same test for the other config to yield quite similar results.

Asking because I’m in hw transition land and won’t be able to try this out for a while.

I wonder if Primocache is still a good performance benefit for C++ compile and gaming if my boot drive is a SK Hynix P41 Platinum? Would be using this 64GB optane module: Optane Memory M10 64Gb M.2 80 - Newegg.com

I’m in the planning stages for a 64TB or 72TB TrueNAS server I’ll be building in the next 6 months. I was under the impression from the other thread that I would need closer to 3% of the pool size in storage for the special vdev (I believe you had stated 5TB for a 172TB pool). Was this because of the special_small_blocks setting?

I was planning on setting up a three-way mirror with 3x 960GB Samsung PM963 for about $60 each, but maybe I should get 3x 118GB P1600X @ $76 each? It would reduce the size of my special vdev substantially though. I could also spend double that and get 6x P1600X for a 236GB special vdev. I’m struggling to pick between these options.

I would love some more insight into how much space the metadata would require. I anticipate about 3/4 of the usable space will be large media files, and the rest will be documents/photos/music (Nextcloud).

Is it better to use a smaller optane metadata-only special vdev or a larger NAND metadata+small files special vdev?

Just watched the corresponding video and read the posts here.

Metadata Special Device… but RAM instead of Optane?

I’m considering adding a metadata special device to my TrueNAS system, but curious if the same effect (or better) could just be accomplished with RAM?

Here is my thought process:

236GB usable space for 150TB. Let’s scale this down a little, since I don’t have a Level1-sized data center at home. I think it’s fair to assume ~24GB for a 15TB server.

This got me thinking… 24GB? Why bother with Optane, just use MORE RAM! RAM would be both faster and have a higher write endurance. (I think. Please correct me if I’m wrong.)

Yes, this means the metadata would not survive reboot. Instead of moving the metadata from the rust to the Special Device, can ZFS use a portion of the memory to auto-cache all metadata? By “auto-cache”, I mean automatically cache into RAM 100% of the metadata after booting. The remaining RAM could be used as the typical L1ARC.

I think I am basically in “that’s a nice idea” land and that this is not an existing feature in ZFS. I thought others may be interested regardless. And maybe this is possible in ZFS, and someone more knowledgeable than I could share some search terms to help me find more on this topic.
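The closest thing I know of today (an assumption on my part that it fits this use case, not a built-in preload feature): warm the ARC yourself after boot by walking the pool's metadata once, e.g. from a cron job or startup script:

```shell
# Crude metadata "preload": stat every file once after boot so the
# metadata gets pulled into ARC. /mnt/tank is a placeholder path.
find /mnt/tank -ls > /dev/null 2>&1
```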

Anyway, thanks to Wendell for another deep dive into ZFS and cool tech!

If your metadata was all stored on non-persistent storage, your pool would not survive a reboot. Now, the ARC/L2ARC can/does store metadata, and you can even make the ARC/L2ARC store metadata only.

zfs set primarycache=all pool1
zfs set primarycache=metadata pool2

zfs set secondarycache=all pool1
zfs set secondarycache=metadata pool2

What the Special VDEV does is store the metadata persistently in your pool inside of a VDEV. Your pool is only as strong as its weakest link, and if your Special VDEV fails, your entire pool fails. But it gives you the added benefit of having your metadata always accelerated.

Actually, coupled with the SPECIAL VDEV, I wish there was a ZFS property to tell the ARC/L2ARC never to cache metadata at all, and instead focus on storing data. In the current version of this type of deployment, if you’re using a ZFS Special VDEV, your files can be in both the Special VDEV and RAM/L2ARC at the same time, which is using space that can theoretically be populated by data.


I agree. My suggestion is not to store all metadata in RAM, but rather to cache all metadata in RAM, and to load all metadata into the cache when booting up. I agree that this is a critical distinction.

I also agree here. While this is good, the downside here is that the user must wait for the ARC/L2ARC to “build up” by accessing the same file several times. The ARC/L2ARC provides no benefit the first time a file is accessed. The benefit builds as a given block is accessed multiple times.

I guess for the sake of clarity, the best way to describe my suggestion is an expansion of the feature set in ARC/L2ARC to offer the benefits of a metadata special device by using the ARC/L2ARC to cache metadata instead of a special vdev in the pool.

Damn you Wendell, now I’m $350 poorer!

What’s worse is that my plan was to transition away from ZFS back to Ceph soon!


Does anyone know if there are similar sales for people in Europe? Newegg doesn’t ship to Europe.


Great timing, Wendell. I’ve got 4 of the 58GB drives arriving today. I’d intended them to be a SLOG mirror for each of my pools, then wait for the higher-capacity U.2 Optanes to drop in price to upgrade my SSD special vdev. In the video though you mentioned using the higher capacities (hundreds of GBs) for SLOG and these little sticks for metadata, saying even 118GB was too small? Is that for running VM block storage? I’ve only once seen the 16GB SLOG on my large/archive pool exceed 1GB.

Are these Optane drives sensitive to filling up and slowing down like normal SSDs? Similarly, does ZFS care about free space and fragmentation like a typical vdev, ie “>80% full is bad”?

Are these Optane drives sensitive to filling up and slowing down like normal SSDs?

Optane is not prone to the same behavior: there is no garbage collector running, and the media can be written in place, so none of the usual flash-management gymnastics are needed. Optane SSD performance does not change under higher load or when mixing reads and writes, and it does not depend on how full the drive is.


I’m also using Optanes under Ceph for the OSDs’ BlueStore block DB and WAL.

Not sure I will be able to do that, since I got 4 of the M.2 drives and most of my cluster is rust-backed, so I can’t just plug in U.2 drives.

I will look into it though, you know, whenever I find the time…

This was an editing and/or lack-of-coffee snafu on my part. I meant that I’d probably use something way larger for L2ARC for “more benefit”, and that for SLOG you really only need about 30 seconds times your write speed into the pool, at most.

I am comfortable with 30 seconds of in-flight data in my pool, but that is not right for production workloads. Most home stuff, yep, is fine.
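The "30 seconds times write speed" rule above is easy to turn into numbers. A quick sketch (the link speeds here are illustrative assumptions, not from the post):

```python
# SLOG sizing sketch based on the "30 seconds of in-flight writes" rule
# mentioned above. Link speeds below are illustrative assumptions.
def slog_size_gb(write_speed_gbps, seconds=30):
    """write_speed_gbps: ingest speed in gigabits per second."""
    bytes_per_sec = write_speed_gbps * 1e9 / 8
    return bytes_per_sec * seconds / 1e9

print(f"{slog_size_gb(10):.1f} GB")  # 10 GbE -> ~37.5 GB
print(f"{slog_size_gb(1):.1f} GB")   # 1 GbE  -> ~3.8 GB
```

Which is why even a small Optane stick is comfortably oversized for SLOG on a gigabit home network.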

I have also, on devices that support it, split up nvme with namespaces and had them do double duty. I haven’t tried namespace reprogramming on these though.


Was thinking about getting an Optane for a while, but a 900P or 905P 480GB is like 800€ here :frowning: not exactly a sale.

@wendell Do you know if you can modify namespaces on the 905p?

If you mean having more than one, no. Even the P4800X can’t. I believe the only Optane that can do that is the P5800X.

Kinda figured, nothin is free, but had to ask :slight_smile:

So I learned a thing after ordering a bunch of these, now y’all can learn from my mistakes:

tl;dr you can’t shrink a metadata special device :facepalm:

My pool is 4x 8-wide RAIDZ2 with a 3-way mirror of 400GB Intel DC S3700 SSDs for metadata/small blocks, running on TrueNAS SCALE. My special vdev has ~124GB on it, so the 118GB Optanes wouldn’t be enough. I ordered 6 of them, thinking that I could add another 3-disk special vdev to add capacity, then replace the 3 SSDs with the Optanes to make a ~220GB special vdev and pick up some IOPS at the same time.

I’ve shrunk a pool of mirrors before with no problem, but just in case I spun up a VM to test, and I got an error that the new device was too small!? On a brand-new pool with no data on it! Specifically, the error is [EZFS_BADDEV] device is too small. If I try to add a second special vdev and remove the original, I get [EZFS_INVALCONFIG] invalid config; all top-level vdevs must have the same sector size and not be raidz. I get the same “too small” error when trying to replace one of the disks from the command line. :poop:

Sadly that puts me in a sticky situation. Only ~2% of this pool is critical data that’s backed up elsewhere, and while I could recreate most of the rest, I’d rather not if I can avoid it. I don’t have anywhere near enough free capacity to offload the data and rebuild the pool. Getting 3x 960GB 905Ps is likely the only viable option right now, but the cost and lack of official power-loss protection sway me towards doing nothing. Besides, going to 960GB would cause the same issue again: if/when the P5800Xs ever go on fire sale, I’d have to get even larger drives!

Hopefully this saves someone the same headache. I’m going to cancel the backorder unless anyone knows of a way to make it work. Happy hoarding!