ZFS Metadata Special Device: Z


Hello,

I currently run TrueNAS Scale (for many months now) with this pool design:
ASRock Rack X470D4U with a Ryzen 7 2700X and 4x 16GB ECC RAM

4x 16TB Toshiba as striped mirror
2x Micron 7300 Pro 960GB mirrored as special vDev (metadata)
no L2ARC
no SLOG

With "sudo zpool list -v" I get the following picture.
At first glance it looks OK, but I observe that "only" 20GB of the 868GB are used on the special metadata device (which should be fine), AND it's only showing 6 files at 1K???

I feel that I did something wrong with the ZFS recordsize and the "zfs set special_small_blocks" size…

How can I read out the currently set special_small_blocks size?
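
I assume something like this should read it back (my pool is called storage), though I'm not sure it tells me anything about data that was already written:

# show the currently set values for the pool and all child datasets
zfs get special_small_blocks storage
zfs get -r recordsize,special_small_blocks storage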

Any suggestions?

sudo zpool list -v

find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1], substr("kMGTEPYZ", a[2]+1, 1), $2) }'

Output only shows: 6 for 1k???

sudo zdb -Lbbbs -U /data/zfs/zpool.cache storage

[screenshot of the zdb output: 1710682395264.png]

Update: I did not change the small block size when I created the pool.
If I understand correctly, the default is zero, which means the vdev is not used for small data blocks…

Based on my readout above, I changed the small block size to 32k…
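
Roughly what I ran (pool name is storage); as far as I understand, this only affects blocks written after the change, so existing data is not migrated:

# set the small-block cutoff for the pool and verify it
zfs set special_small_blocks=32K storage
zfs get special_small_blocks storage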

What would you recommend?

I’m trying to figure out the right redundancy for my “special” vdev that’s going to be part of DATA-LONG.

I have some 750GB enterprise SSDs (25 of them). I don't believe I'll need more than 1.5TB for metadata, so I considered running 2x 2-way mirrors, but I'm getting the following:

zpool add DATA-LONG -o ashift=12 special mirror /dev/sdaa /dev/sdab mirror /dev/sdac /dev/sdad
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool and new vdev with different redundancy, raidz and mirror vdevs, 3 vs. 1 (2-way)

My understanding is that once metadata starts being stored on this, if anything happens to the metadata vdev, I will be losing data… Ideally I'd like to keep the same level of redundancy for the metadata vdev, but it might be overkill (given that I'm running enterprise SSDs). Any insight or advice would be greatly appreciated :)

The reason I'm considering a metadata vdev is that I have a large number of small files… and transferring them out of my NAS via SMB dips my transfer rates to <100KB/s on a 10GbE link between client and NAS.


Before proceeding, make sure your full backup is complete.

Are you sure the command is not adding two mirrors, the first mirror as a special vdev and the second as a separate data vdev in the pool?

Command checks out. Perhaps the error is that you would be left with one drive of redundancy in the special vdev and three in the main pool.

Perhaps 4-wide, or just 3-wide, mirrors may make it complain less.

(Even a single mirror is easily more space than you'll probably need, depending on your small files… but no kill like overkill.)

The error indicates that you're trying to add a special vdev to the pool with a different redundancy level than the main pool (raidz3). The existing pool allows losing three drives before any data loss.

That is not unexpected, because you intend to add two mirrored vdevs, each of which allows losing a single drive before any data loss.

Using a raidz vdev as a special device is not recommended.

However, it is possible to add mirrored vdevs containing more drives and thereby add redundancy. Please note that an attempt to add a four-way mirror (three-drive redundancy) will not make the error message disappear. The message is meant to be a warning to stop and think about what you're trying to do.

Now, here we are: what do you think is the correct redundancy level for your zpool? A mirror vdev with 4 drives would match the redundancy level of the raidz3 data vdev.

In some other thread I have argued that some (e.g. enterprise level) SSDs are less prone to failure compared to HDDs and maybe for that reason, a smaller redundancy level is ok for the special device.
However, that is something you need to decide for yourself. The decision depends on how comfortable you are with the risk of drive failures.
Maybe you consider your type of SSDs to be more prone to failure than the HDDs in your data vdev, and as a result you want to protect your special vdev with an even higher level of redundancy. Unlikely, but possible; ZFS allows configuring this.

Personally, I would ask myself: if I don’t trust I can replace a failed drive in time before more than 3 other drives fail (three drive redundancy) - maybe I am using the wrong drive for the job?

The resilver time for a failed small SSD would also be a lot quicker than for a failed, larger HDD, so running with less redundancy is slightly less risky than it would otherwise be.
But it does seem silly to have raidz3 tied to a 2-deep mirror.

I didn't consider there would be a lot of risk involved in adding a metadata vdev… and I really don't have anywhere to store 30TB for a backup. I decided to move forward with a 4-way mirror of the SSDs for metadata on the entire pool to match the redundancy level of the RAIDZ3, and that got me past the error.
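
For reference, the CLI equivalent of what I ended up adding would be something like this (same devices as in my earlier attempt, just arranged as a single 4-way mirror):

# one 4-way mirror = three-drive redundancy, matching the raidz3 data vdevs
zpool add DATA-LONG -o ashift=12 special mirror /dev/sdaa /dev/sdab /dev/sdac /dev/sdad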

I'm glad I got confirmation from you on this; I made the change before seeing your response, and it did in fact go through without any error popups.

Now I have 21x 750GB SSDs left… I initially went down the metadata rabbit hole because of the poor small-file I/O experience via SMB; even though I was reading from SSDs, my transfer rates were in the tens of KB/s. Would adding an SSD metadata vdev to an SSD vdev have any impact on performance?

I’m building a truenas server and I’ve been thinking about adding a special device to the pool.

I'll probably start with a 6x 20TB RaidZ2 vdev. The case is 24-bay, so it is possible to add more 6-wide vdevs later if needed. I've been thinking about a 3- or 4-wide mirror for the special vdev; is that the correct way to go?

However, I don't understand anything about enterprise SSDs and their model numbers. All the other hardware is server grade, so I would also like to use server hardware in the special vdev.
Or are consumer-grade SSDs or NVMe drives just as good an option? The Samsung Pro series, perhaps?

Which models do you recommend or use yourself? What size SSDs should I buy?
I think cheap, used SSDs or NVMe drives from eBay are the way to go? (They should be available in the EU.)

The server will be for home use mainly for Plex and photo use. The idea would be to speed up, for example, browsing image folders with thumbnails.

Nobody can really tell you what level of redundancy to go for, as it's a choice very specific to you… If this is a home-use-only server and you have frequent offline backups (which you should), maybe you don't even need redundancy. If you can't use your server for a day or two while you replace a drive and restore backups, does that matter much to you?

That said, it’s pretty easy to see the convenience of having at least 1-drive (N+1) redundancy. That’s probably what I’d recommend.

Consumer-grade NVMe drives aren't necessarily "just as good", but the features they lack may not matter to you. DWPD/endurance tends to be substantially higher on enterprise drives (but not always), they're more likely to have PLP (power-loss protection), and they might be physically bigger or more expensive. Don't get 22110-sized NVMe drives if you don't have 22110 slots. Used enterprise stuff off eBay tends to be fairly cheap, but be aware it might be firmware-wiped and actually have far more usage than it appears to.

For your use-case, I’d probably just get a pair of the cheapest 1TB NVMes I could find and hope for the best, with the knowledge that I might have to buy replacements and deal with some downtime and maybe even restores-from-backup if things go wrong. Without a lot more info about your files, it’s hard to get any more specific.

Mirrored SSD should be fine (haven’t had a dead SSD yet).
Best SSDs for special vDev?
Optane SSDs!!!

Very low latency and high durability.
Just bought 4x Intel Optane P1600X 118GB from Amazon Germany (Shipped from Amazon US) for 80€ each for this purpose

Metadata does not need a lot of space, and 0.3-0.5% should be fine… In your case 250-400GB for each drive.
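
Back-of-the-envelope, assuming roughly 80TB usable from the 6-wide RaidZ2 of 20TB drives mentioned above:

# 0.3-0.5% of ~80 TB usable, in GB
echo "80000 * 0.003" | bc    # ~240 GB
echo "80000 * 0.005" | bc    # ~400 GB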

Sure, all that's true. A 400GB Optane would be unnecessarily expensive, and the 1TB leaves room for expansion with future additional disks or with small files.

If you have space for 22110 drives, used Samsung PM983a drives go for pretty cheap on ebay, and they’ve got very good endurance and sustained performance vs consumer stuff. Also PLP, for whatever that’s worth. Nothing is going to match optane for pure latency performance, but if you want more capacity to have more special_small_blocks on NVME, they’re a solid option. I have nothing more than my biases to go on, but I trust enterprise NVME (even used) over consumer grade stuff for this purpose.

How important is latency for a special device? I have compared the specs of the 118GB P1600X Intel Optane and the Seagate FireCuda 530 1TB (3D TLC NAND).

Optane's latency is probably significantly lower, but the FireCuda's write and read speeds are many times higher than the Optane's. The 118GB P1600X and the FireCuda 530 1TB have very similar TBW ratings (1.3 PBW).

I did some calculations with these values:

At 1.3 PBW and 4 % metadata, almost 32 PB of data could be written to the pool before the metadata TBW value is reached.

As another example, a 6-wide RaidZ2 with 20 TB hard drives gives a usable capacity of 80 TB (and even less at recommended utilization). That pool could be rewritten nearly 400 times with 4 percent metadata before TBW is reached.
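
The arithmetic behind those two figures, for anyone checking my math (assumptions: 1.3 PBW endurance, 4% of written data going to the special device, 80 TB usable):

# total pool writes before the special device reaches its TBW rating
echo "scale=0; 1300 / 0.04" | bc    # 32500 TB, i.e. ~32 PB
# full rewrites of an 80 TB pool within that budget
echo "scale=0; 32500 / 80" | bc     # ~406 rewrites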

The 1TB Samsung 990 PRO (V-NAND 3-bit MLC) is only guaranteed for 600 TBW, but should be more than enough according to the same calculations. Opinions or other recommendations?

If these calculations are correct, is there any reason to use enterprise grade SSDs in a media server as a special device? At least write endurance shouldn’t be an issue if most of the data on the NAS is static media and photos.

As far as I understand, PLP doesn’t bring any advantage either, because the actual data pool doesn’t have this?

Is there something seriously wrong with my understanding?

As usual the answer is “it depends”.

If you're using the special device in its default configuration (only storing and accelerating metadata), latency is king and the old, small Optane is the right tool. Its superior latency will have a noticeable performance impact, while the relatively small size and low bandwidth don't hurt in this scenario. Optane's extensive TBW specs also fit the use case.

If you intend to also store data records with small record sizes (to overcome the performance hole of HDDs), the FireCuda may be more appropriate. In this scenario, device size and bandwidth are typically more important. You need to do the math.

Please note that you need to consider performance at the expected queue depth for any SSD. The spec'ed bandwidth numbers are only achieved at super high queue depths that basically do not occur in normal use.
Some basic fio runs or CrystalDiskMark/KDiskMark will give you some indication for the device at hand. For low-queue-depth workloads, as are common in home lab setups, consider a special device configuration that uses partitioned devices (to have ZFS create artificially increased queue depths).
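
A minimal fio sketch for that kind of low-queue-depth test (random 4K reads against a scratch file that fio lays out first; adjust filename, size and runtime to your setup):

# QD1 4K random reads, roughly the access pattern a metadata special device sees
fio --name=qd1-randread --filename=/tmp/fio-testfile --size=4G \
    --rw=randread --bs=4k --iodepth=1 --numjobs=1 --direct=1 \
    --ioengine=libaio --runtime=30 --time_based --group_reporting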

I don't think that max sequential read is important in this use case.
Latency and random read are the key indicators for speed here, and Optane is in a very different league…

Optane does not offer a lot of capacity in the cheaper price range, but for my use case (SMB performance to Windows PCs is the priority) with loads of large data, I wouldn't be able to store a lot of small files on the SSDs anyway (unless I went for >8TB SSDs).
Optane as a special vdev will allow me to search more quickly within SMB shares, and 128GB RAM (ARC) plus the striped mirror should bring me into the 500-1000MB/s speed range - so, close to my 10Gbit bandwidth limit to my workstation and higher than my max WiFi bandwidth.

What is the preferred method of specifying the devices for the ZFS special vdev? As /dev/nvmeXnY or as /dev/disk/by-id/nvme-blabla, or some other method?

And another question: is it possible, and if so, what are the pros and cons of going with 4096-byte LBAs instead of 512, for the NVMe namespaces used by the ZFS special vdev?

The device should be uniquely identified by the system. /dev/nvmeXnY can change between boots/restarts.
Most modern Linux versions use udev rules that will populate unique device ids in /dev/disk/by-id. Look for the correct device name/link in there.
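
For example, to find the stable names:

# list persistent names for NVMe devices; use these paths when adding the vdev
ls -l /dev/disk/by-id/ | grep nvme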

There are multiple tools that will tell you about the supported LBA sizes for your nvme device. I find smartctl the easiest/most accessible.

root@xyz:~# smartctl -a /dev/nvme0n1
[...]
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         2
 1 +    4096       0         0
[...]

In my example above, the + sign indicates the currently used LBA size. The number in the Rel_Perf column indicates relative performance (duh) - lower numbers indicate higher performance.

Generally, you want to configure your NVMe devices to use the LBA size with the highest performance. The only reason not to do so is lack of compatibility with your OS, which in 2024 should not be an issue.
Specifically for ZFS special devices, you should also use the LBA size with the highest performance.
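
If a namespace is still on the 512-byte format, nvme-cli can switch it over. Note that this erases the namespace, so do it before the device goes into the pool, and pick the --lbaf index that matches the entry in the supported LBA list (1 = 4096 bytes in my example above):

# DESTRUCTIVE: reformats the namespace and wipes all data on it
nvme format /dev/nvme0n1 --lbaf=1
# confirm the new LBA size afterwards
smartctl -a /dev/nvme0n1 | grep -A4 "Supported LBA Sizes"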

Based on the responses, despite its small size, Optane would be the right way to go.

However, I am really concerned about the small size; larger Optane drives are significantly more expensive and their availability in the EU is generally poor. To be honest, the benefit/price ratio starts to get out of hand with the bigger Optanes in my use case.

Using the 0.3%-of-pool-size rule of thumb, this leaves uncomfortably little headroom IF the designed pool ever fills up.
Does anyone have experience with whether this Optane is sufficient to hold media-server metadata where most of the files are large (1GB - 80GB)?

The server is mainly intended for media files organized in folders. Jellyfin, on the other hand, takes care of the metadata it needs, but I would like speed, especially for thumbnails when browsing photo folders.

32TB raw storage pool with mainly large files.
With a 16K small block size, I would need 80GB of space, which is in the 0.3% area.
I might go with an 8K small block size, though, so that the special vdev faces a smaller risk of getting full.