How do I reduce/minimise CPU load average on a storage server?

The TL;DR question:
How do I reduce my system/CPU load averages when I am copying files using rsync?

Background:

Server specs:
2x Intel Xeon E5-2697A v4 (16-core/32-thread)
8X 32 GB Crucial DDR4-2400 ECC Reg RAM
SuperMicro X10DRi-T4+
Nvidia RTX A2000 6 GB
Mellanox ConnectX-4 dual 100 Gbps IB NIC
Avago/Broadcom/LSI MegaRAID SAS 9361-8i SAS 12 Gbps HW RAID HBA
ZFS zpool1 (8x HGST 10 TB SAS 12 Gbps HDDs (raidz2 vdev), 8x HGST 10 TB SATA 6 Gbps HDDs (raidz2 vdev), 8x HGST 6 TB SATA 6 Gbps HDDs (raidz2 vdev))
ZFS zpool2 (8x HGST 6 TB SATA 6 Gbps HDDs (raidz2 vdev))
OS drive: 4x HGST 3 TB SATA 3 Gbps HDDs (RAID6 HW RAID array)

OS:
Proxmox 7.4-3

As shown in the specs above, I have two ZFS pools - one with three vdevs, each an 8-wide raidz2, and a second ZFS pool which is only one 8-wide raidz2 vdev.

I am noticing that when I am using rsync to transfer files between the ZFS pools, my system load average shoots up to anywhere between 30 and 120. (I wish I were only kidding about the 120, but sadly, that’s what the Proxmox summary shows me, as does top.)

(rsync command that I use:
rsync -avrsh --info=progress2 /path/to/source /path/to/destination)

(These are large file transfers between the ZFS pools (~12 TB).)

My questions:

  1. a) How do I figure out how much load is caused by the ZFS raidz2 parity calculations?

b) Is there a way to test and/or benchmark how much CPU ZFS needs to perform the raidz2 parity calculations? And also to benchmark the speed of it (to find out which processors are better at this parity calculation)?

  2. Is there a way to deploy DMA (a la RDMA) on ZFS such that when I am copying files using rsync from one ZFS pool to the other, it can “piggyback” off of DMA and see the benefits that DMA has to offer for storage management operations like this?

  3. Could the CPU load averages be improved if I were to use the RAID6 from the Avago/Broadcom/LSI MegaRAID SAS 9361-8i HW RAID HBA (rather than using ZFS)? I understand that I would be trading off some of the protections that ZFS offers, but ideally, I would like to keep using ZFS and just optimise ZFS (or the transfers themselves) so that it doesn’t cause my CPU load averages to jump up like this.

Any thoughts and suggestions would be greatly appreciated.

Load, load… I would focus not on dry numbers, but on the actual responsiveness of the machine. :wink:

You can start by limiting the cores that are used by a process. But heavy load is not only about CPU but also I/O.
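For example (a rough sketch; the core list and limit values are placeholders, not recommendations), you can pin an rsync invocation to a subset of cores with taskset, or cap an already-running PID with cpulimit:

 taskset -c 0-7 rsync -a /path/to/source /path/to/destination   # restrict rsync to cores 0-7
 cpulimit -l 400 -p <rsync PID>   # cap an existing process at ~4 cores' worth of CPU (needs the cpulimit package)

But if the load is coming from threads waiting on disk rather than burning CPU, this won't change much.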


What’s the concern? Power consumption? Thermal output? Fan noise? Contention with other, more important processes?

You say the load average is high, but that doesn’t really tell us much. I like using htop with the pressure stall indicator (PSI) meters enabled. That will quickly show you whether the cores are actually getting slammed with work—and from which kernel or userland processes—or if processes are simply waiting for I/O (or somewhere in between, or something else entirely).
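If you'd rather not set up the htop meters, the same PSI counters can be read directly from the kernel (any reasonably recent kernel, including Proxmox 7's, exposes them):

 cat /proc/pressure/cpu
 cat /proc/pressure/io
 cat /proc/pressure/memory

High "some"/"full" percentages under io combined with low numbers under cpu point at I/O starvation rather than CPU saturation.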

Were your pools built with xattr=sa? You can check with zfs get xattr <pool dataset>. If not, that could be causing a lot of pain.
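Something along these lines (using your pool names from above); note that xattr=sa only applies to files written after the change:

 zfs get xattr,dnodesize zpool1 zpool2
 zfs set xattr=sa zpool1
 zfs set xattr=sa zpool2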

Do you have special vdevs attached to your pools? rsync is very metadata-intensive; moving all the metadata to special vdevs should reduce I/O load considerably.

Is there a reason why you’re using rsync instead of zfs send/recv? The latter should be much faster and more efficient.

For a quick fix, I would suggest trying rsync -ashxAHWX for an overall faster and less resource intensive rsync experience. -W makes it operate in whole-file mode—which will paradoxically incur more raw I/O but might require less work overall. Add -A and -X to bring over ACLs and xattrs, respectively. You might be dropping these with your current process.

We should also talk about NUMA, nice, ionice, cpusets, and other fun stuff, but let’s start with the above…

The problem is rsync and the solution is zfs send.

Is there any special feature you need from rsync that can’t be done with send | receive?

send | receive is a million times faster and doesn’t need CPU. It’s a raw (incremental) block-level serialized stream from A to B.

For me, rsync is the oldschool method when you don’t have access to BTRFS or ZFS. Which is why I have those two filesystems everywhere, because I hate backups and rsync.


removed incorrect information

Load average is not a percentage. It’s a count of threads that are running, waiting to run, or (on Linux) stuck in uninterruptible sleep, averaged over an interval.

If there’s a thread pool scaled to nproc on the sending end and a thread pool scaled to nproc on the receiving end, it would be trivial to hit a load average of 2P * 16 cores/P * 2 threads/core * 2 thread pools = 128. And if they’re all waiting for I/O, they still count toward the load average.
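A quick way to see this for yourself while a transfer is running (nothing ZFS-specific here) is to count the threads in the R (running) and D (uninterruptible sleep, usually disk wait) states and compare against the load average:

 ps -eLo stat,comm | awk '$1 ~ /^[RD]/' | sort | uniq -c | sort -rn | head
 cat /proc/loadavg

If most of the entries are D-state ZFS or rsync threads, the load is I/O wait, not CPU work.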

Two things:

  1. Under heavy load (as a result of the data transfers), VMs that are running on the Proxmox server have crashed. (My Ubuntu 23.04 VM has crashed overnight twice in the last 24 hours as a result of this.)

  2. Storage/data management tasks are the known culprits, because once the rsync task(s) finish, the load average goes back down to ~3 (with 11 VMs of various flavours running).

a) To do this, I would need to know which processes are causing the CPU load average to increase, because top doesn’t tell you that. (It only shows 100% under the CPU column for the two rsync processes (one for read, one for write), but the load average right now is 28.31, 31.19, 31.78.)

Thus, that isn’t really a good starting point for ascertaining which process I would need to limit. Furthermore, if ZFS and/or rsync is by default allowed to use any and all cores, then limiting the number of cores the responsible processes can use may have an adverse impact on performance.

I was researching this earlier today (granted, the material is quite old) and found that giving rsync more cores is actually more beneficial to its performance than limiting the number of cores the “offending” processes are allowed to use.

(Again, that still doesn’t tell me a) whether rsync is the culprit and/or b) whether it’s really ZFS (and, perhaps more importantly, the parity calculations underneath the raidz2 vdevs) that’s the culprit.)

Well, I am trying to figure out what’s causing the load average to be reported as being as high as it is.

The two rsync processes, per rsync task, show 100% under the CPU column in top.

Just installed htop on the Proxmox server.

PSI some CPU: 0% 0% 0%
PSI some IO: 40.06% 42.24% 44.00%
PSI full IO: 41.52% 41.14% 42.48%
Disk IO: 1815.7% read: 1003 M write: 268M

xattr is set to “on” for both pools by default.

No, I don’t have a separate intent log (SLOG) device. I’ve stopped using SSDs altogether after SSD deaths 6 and 7, which happened earlier this year; Solidigm said that I was one month past my warranty period, even though it was a READ failure and not a WRITE failure (i.e. I was far from exhausting the write endurance limit of said SSDs).

I didn’t know that you can use zfs send/zfs receive between pools on the same system.

Does it still do the same kind of consistency/checksumming that rsync does?

Probably, but I didn’t think that this should have an impact on the performance of the system and/or the transfer.

(Again, what I am trying to find out is whether it’s rsync that’s the culprit or ZFS (more specifically, the double-parity calculation involved with the raidz2 vdevs) that’s causing the issue, and how to ascertain the root cause / split out what’s causing what, given how these systems interact.)

I believe NUMA is enabled in the BIOS. I don’t remember, but I think it is.

I didn’t know that you can use zfs send and zfs receive between pools on the same system.

Does it do the same kind of consistency/checksumming that rsync does?

Can you only send certain directories?

What happens if you send a directory and the pool that’s receiving it, already has existing data? Will it check to see if it is a duplicate? Or will it just overwrite the existing data? Or will it not send it at all, because the filenames are present, but the data may be different (i.e. one copy might be corrupted whilst the other is the “good” copy)?

If you have a good resource which shows examples of how I can send one folder (at a time) using zfs send/zfs receive – if you wouldn’t mind passing that on to me so that I can study it, that would be greatly appreciated.

I know that rsync does a LOT of checking when it is sending data.

I will have to read a LOT more about zfs send/zfs receive to see if it does those same kind of checks (or more) when compared to rsync.

Any resources/reading material that you can forward would be very useful.

So my VMs live on the second ZFS pool (for the sake of this discussion, let’s say the first ZFS pool is pool A and the second is pool B).

If I am sending from pool A and receiving on pool B, and as mentioned, it would have more raw I/O – will this increase in raw I/O render my VMs completely unresponsive and functionally unavailable? (Because the disks in the vdev for pool B would be too busy servicing the raw IO requests from the ZFS send/receive operation that it can’t serve the VMs simultaneously?)

I use rsync a LOT, especially when I am backing up the data from said Proxmox/ZFS server to LTO-8 tape.

No. A load average of 120 means that, on average, 120 threads are running or waiting on a 64-thread system.

The current load average is 33.15, 32.34, 31.18, which means that’s roughly the number of threads running or waiting on my 32-core/64-thread system.

Maybe this is why it has been long recommended that you have as many threads/cores as you have HDDs that you are using/running with ZFS.

(Because I am DEFINITELY seeing that in the load average data.)

What I can’t figure out is if this is caused by ZFS or rsync.

I appreciate all of the help that the team has provided here.

Thanks.

Hopefully, we’ll be able to ascertain whether the high load averages are being caused by rsync itself, or if it is caused by ZFS which lies underneath the rsync transfer between the two ZFS pools.

Take a look at iotop / iostat… iotop --only will only show you active ones.


It’s ZFS. It always checks the checksum on a read.

You can send everything that you can make a snapshot from. So datasets and zvols. If you only want specific directories, make a snapshot from that directory (you created lots of datasets, right?) and send it.

It sends snapshots of the dataset on a block-level. There is no concept of files. Any further “sync” only transfers the delta between the two snapshots. Old files and stuff are still referenced in the older snapshot and can be restored as usual if needed.

Why send one folder at a time? I’m sending snapshots of my entire 25TB pool in minutes. Sequential write is what you deal with via zfs send. And when sending 1PB via incremental snapshot and only 200G have changed, you’re done in 3 minutes at 1GB/s sequential write speed without compression. With compression, divide it by your compressratio.

Syntax and stuff:

Basic setup goes something like this (local source and destination, not different machines):

 zfs send tank/dataset@snap1 | zfs receive tank2/dataset

I like to pipe it through pv for some progress and numbers. otherwise you don’t have any interactive feedback.

 zfs send tank/dataset@snap1 | pv -tba | zfs receive tank2/dataset

Incremental send to only send the stuff that has changed since last snapshot:

 zfs send -i tank/dataset@snap1 tank/dataset@snap2 | zfs receive tank2/dataset

TrueNAS has some fancy GUI and you can do all this with a couple of clicks and make it a periodic task. Otherwise CLI does the job just fine. Crontab makes the world go round.
Official OpenZFS documentation for everything send | receive:
https://openzfs.github.io/openzfs-docs/man/8/zfs-send.8.html

Depends. If ARC can service the reads, then there isn’t a noticeable difference. VMs writing stuff just ends up sharing IOPS on the disk. Same as any other two applications competing for the same resources. Nothing special here. Rsync from my anecdotal experience is far heavier on the drives because it has to read and write a lot of small stuff like files and changes in parts of (small) files. HDDs hate random I/O.

I’m not sure how tapes work (25 years since I last used one) and rsync might be better there. But if you can store a 20TB file on tapes, you can redirect the send stream to a file and make it work. I use zfs send and btrfs send for everything backup. Super fast and I can store 500 snapshots on the target and it only uses capacity of one copy + the delta from the snapshots (minor).
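As a sketch (the snapshot name is made up and /dev/nst0 is the usual non-rewinding tape device on Linux), the stream can go straight to tape or into a file:

 zfs send tank/dataset@snap1 | mbuffer -m 2G -s 256k -o /dev/nst0
 zfs send tank/dataset@snap1 > /backup/dataset-snap1.zfs   # restore later with: zfs receive pool/dataset < file

One caveat: a send stream stored as a flat file (or on tape) is all-or-nothing; a single corrupted block makes the whole stream unreceivable, which is why receiving into a live pool is generally preferred.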

On a side note, high CPU usage can be caused by compression. Reading, decompressing and then writing and compressing probably isn’t the easiest of workloads. 32 cores are great, but heavy compressing/decompressing of multiple GB/s on a non-lz4 algorithm isn’t trivial.
ZFS runs kernel threads for compression and you should be able to find them in e.g. htop and pin down what’s causing that high CPU utilization. Might be rsync, might be compression. Some PID will have unusually high CPU time. My guess.
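A rough way to check with plain procps: list the busiest threads system-wide and see whether the top entries are rsync or ZFS kernel threads (z_wr_iss, z_rd_int and friends do the compression, checksumming and parity work):

 ps -eLo pid,tid,comm,pcpu --sort=-pcpu | head -20

top -H (or htop with kernel threads shown) gives you the same picture interactively.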

edit:
some nice YT lecture covering zfs send vs rsync for replication. Jim Salter does a great job as a presenter:


iostat -x while doing the copy will show you some stats; atop is also good, as is perf top.

Out of curiosity… what does the -s rsync flag do for you?

Your VMs are crashing because the disks are busy and can’t push enough IOPS to keep the virtual block devices from timing out. The high load average is a red herring. Your cores are basically idle while the rsync is running.

Would recommend the following:

  1. Rebuild your pools with SSD-backed special vdevs (see the sketch after this list). Please note that this is different from SLOG and L2ARC. If you can’t stomach putting consumer NAND flash in your array, look into enterprise offerings and/or Optane.
  2. When you rebuild your pools, set xattr=sa. There are possibly some other tunables you’ll want to set as well like dnodesize=auto. This will require some research.
  3. Use zfs send/recv instead of rsync.
  4. If you still can’t get enough IOPS you might need to add more vdevs or switch to mirrored pairs instead of RAIDZ.
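A minimal sketch of points 1 and 2 (pool name from your specs, device paths are placeholders), with the caveat that an added special vdev only absorbs metadata for data written after it’s added:

 zfs set xattr=sa zpool1
 zfs set dnodesize=auto zpool1
 zpool add zpool1 special mirror /dev/disk/by-id/nvme-ssd-A /dev/disk/by-id/nvme-ssd-B
 zfs set special_small_blocks=64K zpool1   # optional: push small blocks onto the special vdev too

A true rebuild (destroy, recreate with these properties, restore from backup) is still the cleaner path, since properties like dnodesize=auto only apply to newly created files.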

One other note, we usually discourage the use of NUMA platforms for ZFS installations, especially on older Xeon platforms. You might consider pulling one processor and half the RAM just to see how it affects storage performance. Alternatively you can look into pinning the ZFS processes and RAM allocations into the single NUMA node where your HBAs (and other PCIe devices critical to the pool) reside. If you have two HBAs you can put each pool in its own NUMA node. Or you can put all your storage in one NUMA node and all your VMs in the other. If none of those are options, look into disabling NUMA/enabling memory interleaving as a slightly less-bad configuration.
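As a rough illustration of the pinning option (node numbers are guesses; check where the HBA actually lives first):

 numactl --hardware                                     # list CPUs and memory per node
 cat /sys/bus/pci/devices/<HBA PCI address>/numa_node   # which node the HBA hangs off
 numactl --cpunodebind=0 --membind=0 rsync -a /path/to/source /path/to/destination

Whether you can meaningfully pin the ZFS kernel threads themselves is another matter, but keeping the userland side of the copy on the HBA’s node is an easy first step.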


Okay, now we have something to actually troubleshoot!

Try prefixing your rsync command with: nice -n 19 ionice -c2 -n7
It’ll run slower, at a lower priority that is less likely to affect VMs.
rsync also has a --bwlimit= option you can tune to set a speed limit low enough that it can’t max out your disk or CPU, but that’ll take some trial and error.
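Putting it together, something like this (the bandwidth number is just a starting point to experiment with):

 nice -n 19 ionice -c2 -n7 rsync -avrsh --info=progress2 --bwlimit=200M /path/to/source /path/to/destination

Note that ionice only has a real effect with I/O schedulers that honour it (bfq/CFQ), so on some setups --bwlimit ends up doing most of the work.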

There are two sets of z_rd_int_n “processes”/tasks that are running where n=0…6 (for both sets).

I’ll have to wait until the ZFS scrubs are finished and then I can take a screenshot of what iotop says.

If you have a 155 TB (usable) pool, and you want to take a snapshot of it, don’t you need at least 155 TB to be able to put that snapshot somewhere?

(i.e. snapshotting doubles the total volume requirement)

Or will I be able to take a snapshot of said 155 TB pool without it taking any additional space? (I don’t have another store to put said 155 TB of data, unless I can use ZFS to send it to LTO-8 tape directly. (Sidebar: I never did get an answer to one of my questions that I posted here earlier about how to back up block devices using LTO tape. Maybe zfs send/zfs receive would be one means to achieve this goal. But this would also mean that I won’t have a plain-text index file listing all of the files stored from the block device. Which also means that if I am looking for a specific file, I would have to restore the entire snapshot to be able to read/find the file(s) that I need.))

Nope. (Never really grasped the concept of datasets (and why they’re important) - especially in a single-user environment like a homelab.)

(i.e. if you have movies, and you had one pool - what advantage would you really have with creating a dataset where you aren’t going to use any of the properties and/or advantages of a dataset?)

Yeah, then this wouldn’t really work for me, because I just have the two datasets which correspond to the two pools directly (if you can even call those datasets vs. zvols).

Because LTO-8 tape can only store up to 12 TB raw. So, I am having to pick and choose which files/folders I am backing up with each tape, because I am manually controlling that with rsync and ltfs. (i.e. I am not using a tape backup software like Yotta or something like that.)

I would LOVE for 32 spinning rust drives to be able to hit 1 GB/s sequential write!

TrueNAS doesn’t support virtio-fs, so it was ruled out when I held an “open competition” to see which solution would work for my needs/what I want the system to do. xcp-ng doesn’t support virtio-fs either. And when I asked the respective teams about it, the TrueNAS guys were like “why would you want that?”, and people (besides myself) told them (in existing threads on their forums) why they want it, and they were like “use a virtual switch/bridge instead”, whereas a part of the whole point of virtio-fs is that you don’t need a virtual switch/bridge. (It bypasses the networking stack, which is AWESOME! SOOO much faster when I am loading the contents of a directory, in Windows, that has like 2 million files in it.)

And after I wrote my piece, they just didn’t say anything. At all.

And then for xcp-ng, the creator decided to get into a totally unrelated semantics argument with me, even though their own documentation says what I was saying (because that’s where I picked up the terminology from). So, really, the creator was, in the end, arguing with himself, since he was one of the authors of said documentation.

Thus, between the lack of the feature support for virtio-fs, plus also per apalrd (tech YouTuber), paraphrased quote “TrueNAS devs are assholes”, and xcp-ng creator arguing against his own documentation, Proxmox was the solution that was able to do what I want and need it to do.

But ZFS management is very manual (their GUI can do very limited things with ZFS).

Thank you for the link.

I’ll have to read up about that to see whether I would be able to send only a folder at a time (vs. using it on datasets (which I don’t have)).

From the zpool iostat command – so far, it appears that rsync, at peak, reads and writes at about 600 MB/s. For large files, it can hit somewhere between 300-400 MB/s. For a lot of small files though, it suffers. (I make the analogy that it’s almost like a Windows Explorer copy/move, where it opens and closes a file (handle) per transfer, rather than sending the data as a “stream”.)

No, not quite yet.

LTO-8 raw capacity is 12 TB/30 TB compressed. (But I have found that the compression is not reliable – i.e. I am not sure which files will be compressed vs. which won’t. The tape doesn’t tell me that information when I am writing to it. Or maybe rsync just writes everything as raw, even though LTFS is supposed to handle the decision-making process as to what gets compressed and what doesn’t.) As a result, I compress my files OUTSIDE of LTFS, and then write the resulting compressed .tpxz file to tape. (I compress it using pixz because it supports parallel compression and parallel decompression.)

LTO-9 is I think 18 TB raw/45 TB compressed. (But again, compression isn’t guaranteed.)

…but that’s one copy for the snapshot PLUS the “live” copy.

i.e. if you have a 155 TB pool, you’d need AT MINIMUM, 310 TB in storage capacity, and then some for the incremental backups.

I don’t have 310 TB available.

The idea with sending the files from pool A to pool B (where pool B is smaller) is that pool B is the “scratch” space that I am using to prep the files to be written to LTO-8 tape. Once “cold” data has been written to tape, I can delete it off the “live” system, which means that I might be able to “shrink” the pool (which actually consists of backing up everything, destroying the pool, and recreating a smaller pool).

That’s where this is going.

Most people are expanding their hard drive and/or live data storage needs (or backups using spinning rust/RAID (which isn’t a backup).)

And most people are also expanding their networking infrastructure as well, as the number and size of systems grows.

Again, I am going in the opposite direction. This server was actually the result of consolidating 4 of my other NAS servers down to this single system. With it, I was able to take out almost all of my networking equipment save for a 16-port GbE Netgear switch (which consumes 7 W of power). My system that runs the LTO-8 drive, is connected to the Proxmox server over 100 Gbps IB, directly, bypassing my 36-port 100 Gbps IB switch.

For the VMs that support it, the VM <-> host communication is accomplished via virtio-fs.

For those that don’t, I use NFS for *nix VMs and SMB for Windows VMs.

I appreciate the YouTube link. I’ll have to watch that later.

According to the man page: “no space-splitting”

I have NO idea what that means.

I just saw that command somewhere and I just started using it. I don’t even remember the source anymore.

Yes and no.

I mean, SOMETHING is making the system think that the system is very busy.

I would like to ascertain what that “something” is.

  1. NAND flash dies. (They’re the brake pads of the computer world.)
  2. Enterprise solutions cost too much (still).
  3. SuperMicro X10DRi-T4+ doesn’t have NVMe slots, which means that I would need to buy an adapter card to be able to use Optane. (I think that I do have a 16 GB Optane module floating around here somewhere that came with an Intel NUC that I bought years ago.)

Noted.

Per above, it might not work on individual files/folders (TBD - I have to read the reference documentation that was just provided.)

I would prefer NOT to “replace” folders with datasets.

(Datasets assume that you know what you’re going to be putting on the pool, whereas folders can be created on the fly. If datasets were to replace folders, whenever I want to create a “folder”, I would have to create a ZFS dataset instead, which I can’t do directly from Windows Explorer, as an example. (Nor from File Manager in Ubuntu.) Thus, creating a dataset in lieu of a folder means that I would have to be remotely logged into the Proxmox server all the time, to execute this extra step.)

Capacity cost too high.

Can you expand a little bit further on the rationale of this please?

This is interesting.

I’d like to learn more about it.

Yeah – the problem is that getting the files prepped to be written to tape is already a slow-enough process as it is.

It would suck if it were even slower yet/still.

(If there are a lot of really small files when the data is backed up to tape, it can take ~3 days to write 12 TB. (~50 MB/s net, effective speed).)

So the goal is to prep the files faster, then run the tape backup job and let it run for however long it takes to successfully write the data to tape without corrupting it during the write operation. If it takes 3 days, I’ll leave it alone for 3 days.

This is going to sound dickish, and I apologize in advance. You would be well served by reading a quick introduction to ZFS and Copy-on-Write (CoW) filesystems. Because you do not seem to understand how any of this works.

Snapshots are “free” until you start changing existing data, and then you only pay for what you use. The only way you would need an extra 155TB of capacity is if you were to 1) take a snapshot of your current zpool, 2) delete everything, and then 3) write 155TB of brand new data.

If you only have one kind of data, then no, you don’t need to create separate datasets. But consider my homelab, where I have home directories; system images; installers (ISOs); different kinds of backups for different systems; scratch space; a junk drawer; videos, pictures, and music; storage for various projects; etc. So those all get (hierarchies of) datasets.

For example, I have a “home” dataset to hold all the home directories, and then one nested dataset for each user. Each home directory is owned by its user and gets its own NFS export. Home directories have completely different snapshot, quota, and retention policies cf. scratch space—and I can and do tune recordsize, special_small_blocks, ACLs, etc. for home directories completely differently cf. videos.

Each pool gets a top-level dataset by default. Nested datasets inherit properties from their parents, but those can be overridden and passed down to their own descendants. And yes, you can send the whole pool dataset, just as you’re rsyncing the entire directory structure.
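To make that concrete, a hierarchy like the one described might be laid out roughly like this (names and property values are illustrative, not recommendations):

 zfs create tank/home
 zfs create -o quota=500G tank/home/alice
 zfs create -o recordsize=1M -o compression=lz4 tank/media
 zfs create tank/media/movies          # inherits recordsize and compression from tank/media
 zfs set sharenfs=on tank/home/alice

Children inherit from their parents unless a property is set locally, which is what makes per-dataset tuning cheap.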

Zvols are distinct from (filesystem) datasets. A pool can hold datasets and/or zvols. A zvol is an opaque block allocation that can be used as e.g. an iSCSI target or directly by a local hypervisor as a virtual block device. Are you using zvols?

Easy. What’s the problem? If a single vdev can do ~180MB/s sequential write (which a mirror can do and RAIDZ can beat) then you just need six vdevs of five drives each, leaving two hot spares.

Most VDI implementations on ZFS use image files (qcow2 or raw) in datasets, or zvols (with or without iSCSI), or NFS exports.

Unfortunately it’s severely hamstrung by how you’ve built your pools, because it is constantly seeking and using up all the available IOPS for metadata operations.

All your threads are waiting for I/O and are therefore counted toward the load average, which is why it spikes. It’s that simple. Fix your I/O, fix your load average.

So do hard drives and everything else. That’s why we use ZFS and have redundant vdevs, NICs, HBAs, power supplies, etc. If putting a single physical block device in the critical path scares you, good! Deploy a three- or four-way mirror for your special vdev with staggered PoH and rest easy at night.

If you’re calling your Kioxia rep and paying sticker price for bleeding-edge PCIe Gen5 EDSFF drives… sure. Or you can spend five minutes looking and find great deals on SAS/SATA and PCIe Gen3 enterprise SSDs with 0 PoH and 20 DWPD.

Yes, you’ll need M.2 or U.2 carrier cards, or you’ll need to source SSDs that come in a PCIe add-in card (AIC) form factor. I use two dual M.2 carrier cards in my X10SRH-CF in two x8 slots bifurcated to x4/x4 each, and a U.2 carrier card in a x4 slot. They were $20 each.

You’ll need more than 16GB for your special vdev. How much you need depends on how much metadata you have and whether you want to enable special_small_blocks. I’m pretty sure Wendell has a script somewhere on here to calculate it.

Just zfs send the top-level dataset, or create one dataset per virtiofs and network share and send those.

Applications that aren’t NUMA-aware, like ZFS, can get absolutely hosed by NUMA. Imagine that your ARC gets allocated in node 0 and everything else is in node 1. Now it doesn’t matter that your HBA can do 64 GT/s. It doesn’t matter that eight-channel DDR4-2400 can do 120 GB/s. The performance of your array will be limited by the pitiful speed of the QPI between the two sockets. Not to mention the latency penalty. And if things have to travel between sockets more than once, you’ll really get killed.

Here are some mlc results on my dual-socket Xeon E5-2680 v4 system:

Measuring idle latencies for sequential access (in ns)...
		Numa node
Numa node	     0	     1
       0	  85.2	 126.6
       1	 125.1	  84.8

Nearly 50% higher sequential RAM latency between sockets.

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
		Numa node
Numa node	     0	     1
       0	61891.4	16506.9
       1	16508.4	62357.8

Nearly 75% lower sequential RAM throughput between sockets.

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency	35.9
Local Socket L2->L2 HITM latency	39.8
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
			Reader Numa Node
Writer Numa Node     0	     1
            0	     -	  90.9
            1	  90.8	     -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
			Reader Numa Node
Writer Numa Node     0	     1
            0	     -	  90.4
            1	  90.3	     -

125% higher L2 latency between sockets.

If you can keep everything local to a single socket, you’ll be able to realize the performance of your storage array much more readily.

Unfortunately we all have to choose between capacity, resiliency, and performance. If you need more IOPS for your use-case then there might be no getting around it; either your use-case needs to change, or your performance expectations, or your topology, or your investment. Fortunately ZFS gives us all kinds of useful mechanisms for leveraging faster storage tiers to accelerate various parts of normal operation… but you’ll need to get over your SSD trauma.


That shouldn’t be the case at all. If you’d like to detail how you’re doing backups, I’m sure we’d sort out a better way. I’ve worked with tape for quite some time.

tar + pigz + mbuffer would be a good place to start. Tar can do differential / incremental backups, pigz is fast and multi-threaded, and mbuffer will both buffer and can dynamically split across tapes.
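A rough sketch of that pipeline (tape device path, directory and tuning numbers are placeholders):

 tar -cf - /zpool2/tape-batch-01 | pigz -p 16 | mbuffer -m 4G -s 256k -o /dev/nst0

and the reverse for restore:

 mbuffer -i /dev/nst0 -m 4G -s 256k | pigz -d | tar -xf -

mbuffer’s buffering keeps the drive streaming instead of shoe-shining when the source stalls; if parallel decompression matters more to you, pixz slots into the same spot in the pipe.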

Taking a snapshot doesn’t use space. It starts using space if blocks are referenced by the snapshot that aren’t present in the dataset anymore (because you deleted stuff or changed things).

e.g. if you use the Ubuntu installer to set up ZFS, Ubuntu makes a dataset for each directory. And it’s really great. Datasets are similar to directories. And if you store everything in one directory, things often get complicated, which is why we use folders in the first place. I have 28 datasets and ~40 zvols on my 48TB raw pool for reference. That’s 68 automatic snapshots every day. And I also keep weekly and monthly snapshots.

If all your VM disks live in one dataset and you have to roll back a VM, you have to roll back the entire dataset with 80 VMDK files. That’s bad. With 80 datasets, you can roll back the single VM via its snapshot. I use zvols for my VM disks so I don’t need to create datasets for them. There is virtually no limit on datasets and nesting.

Hundreds or thousands of snapshots on a system isn’t unusual. I know that 45Drives snapshots every 5 minutes for ransomware protection. This sounds ridiculous, but it works just fine.

And more snapshots don’t use more space. The only criterion is how much data has changed on the relevant datasets. If nothing changed, a snapshot is 96 KB in size = nothing.
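Easy to verify on your own pool (pool name from your specs, snapshot name made up):

 zfs snapshot zpool1@before-send
 zfs list -t snapshot -o name,used,referenced zpool1

USED starts out at (near) zero and only grows as blocks that the snapshot still references get overwritten or deleted in the live dataset.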

Modern HDDs do ~200 MB/s each. So what you get depends on your pool config. I get 700-750 MB/s from my 3 mirrors with compression. With RAIDZ, you only get the write speed of a single disk. Write speed is about choices, not about HDDs.

With zfs send, we don’t deal with files and network overhead is irrelevant. My TrueNAS VM has a passthrough of a 10Gbit port and I run everything over the wire. I like to keep things simple.

You create a (child) dataset for the folder, move the data to that dataset, snapshot it and send it to the target. That’s what I would do and how this is intended. It may require some changes to shares and adjustments to mountpoints, but that’s just the consequence of not using datasets in the first place. Repeat that for every other “logical” fraction of your storage. If you’re lazy and keep everything in a single dataset, you’re just limiting yourself in using what CoW and ZFS have to offer.
If you like to dump everything into one place, you should look into object storage, where dumping things into buckets and let metadata deal with structure and organization is the entire design.
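The folder-to-dataset conversion itself is mundane; a sketch, assuming the pool is mounted at /tank and the folder is /tank/movies (one full rewrite of the data is unavoidable):

 zfs create tank/movies-new
 rsync -a /tank/movies/ /tank/movies-new/   # or mv; verify before deleting the original
 rm -rf /tank/movies
 zfs rename tank/movies-new tank/movies     # the dataset now mounts at /tank/movies

After that, snapshots and sends can target tank/movies on its own.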

Really bad numbers. Probably lots of random I/O. Poor rsync has to dig through large files a few KB at a time to find and checksum differences. And this on a slow RAIDZ pool. I’m so glad I don’t have to deal with this anymore.

You always have two copies of the data. Same with rsync. Benefit of snapshots is you can have a thousand snapshots while you can only have one copy with rsync. If you don’t want to send the entire pool, just send the dataset in question that is important.

Sending a snapshot to another pool of HDDs is a backup. It’s a replicated copy. I have 2 cold disks I plug in every week and run replication task (minutes). The medium doesn’t make it a backup or not a backup. It’s determined by where the copy is stored.

“RAID is not a backup” just refers to a single copy with redundancy not being a backup. Two or three copies located elsewhere is what is needed to call it backup/disaster recovery.

You should learn about CoW, snapshots and ZFS in general. You leave a lot on the table in how you do things. Results in more work, more disks running syncs for hours locking up VMs, bad pool performance despite 32 disks, looking for workarounds…when you got the tools right in front of you. Don’t use ZFS like ext4 with checksums and ZFS will amaze you.

Why a text file?
You can browse the contents of all snapshots if you want. /mnt/tank/.zfs/snapshot/ (or wherever your pool is mounted) has all the snapshots ready for browsing. Just search for stuff via ls | grep or whatever you prefer. No telephone book or rollback needed. Need that spreadsheet state from Feb 6th 4:30 pm? Just copy it over to wherever you need it.
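For example (snapshot and file names invented):

 ls /mnt/tank/.zfs/snapshot/
 cp /mnt/tank/.zfs/snapshot/auto-20240206-1630/docs/budget.ods ~/restored-budget.ods

No rollback, no full restore, just a copy out of a read-only view of the dataset as it was at that moment.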

This works for throughput, but not for IOPS. Setting limits to e.g. 50MB/s will still be bad if the disks are doing random I/O, which is what rsync does and why it takes so long. It can help, but if you can’t rate limit IOPS, I doubt it will do much unless you set it to very low limits, choking sequential parts of the process as well.

There is some work in progress on rate limits as a dataset property in OpenZFS development. But they want to get this right and useful while still having performant SATA/NVMe queues, so we probably won’t see it short-term. You can limit IOPS per block device on some SANs and Ceph already. And we all hate noisy neighbors, especially in the age of cloud computing.


Preach!


No worries.

I’ve read the Solaris ZFS Administration Guide that was originally published, I think, back in 2006 as part of the Solaris 10 6/06 release, when ZFS was first introduced into the “mainline” Solaris installation (rather than from OpenSolaris), where it states: “A snapshot is a read-only copy of a file system or volume.” (emphasis mine)

Thus, my initial understanding with it stating that it is a copy, is that, it is, a literal “copy” - (i.e. select all, CTRL+C, CTRL+V).

Admittedly, that might be my incorrect understanding about it, but that is also informed by a lack of experience with it (and there weren’t any YouTube videos/demonstrations of the feature/functionality back in 2006, when I started in ZFS administration.)

Also admittedly, there is also the risk that a lot of the information that I probably have in my head about ZFS are a) outdated and b) Solaris specific (and neither of which may apply to ZFS on Linux).

(sidebar: I’ve LONGGG avoided deploying ZFS on a production server because of the issues that I had with the Solaris ZFS implementation/my deployment back then, where I still have an email thread with Gwen Nicodemus (Sun Microsystems support) from when my ZFS deployment died. So it’s only VERY recently (within the last few years) that I’ve started to deploy ZFS on one of my production servers (first via TrueNAS Core, because it has a nice GUI to help with the most common ZFS admin tasks), and now on my Proxmox server (which I consolidated my TrueNAS Core system into) - and Proxmox requires more manual ZFS administration. As I mentioned before, the Proxmox GUI is very limited in terms of what you can do in regards to ZFS admin.)

Gotcha.

But again, my understanding right now (as I am working through the replies here) is that snapshots act at the pool/dataset level, rather than at a specific file and/or folder level, and therefore, because I didn’t create datasets (that I can take a snapshot of) in lieu of folders, the entire “architecture”/“plan” of my ZFS layout would need to be redesigned/re-architected to support this.

(The initial layout was quickly set up with the purpose of ingesting the data from the servers that were being consolidated, so that the drives can be pulled from the old servers, and added into the Proxmox server (as a vdev and added to the pool), so that it would be able to ingest more data, and rinse/repeat.)

“Kind” - no. Like yourself, I have different “kinds” of data. But my thought process is that I look at it less as being differentiated by the “kind” of data and more as it all being “my” data (i.e. looking at ownership rather than the type of data).

Yeah - you definitely thought a LOT more about your ZFS layout, and architected your ZFS system according to that plan than I did (where I didn’t put ANY thought about the ZFS layout plan).

My thought process was “I need to pull the drives from my own NAS systems, so move the data over quickly, yank the drives, add it to the Proxmox system, expand the ZFS pool, and ‘suck up’ (ingest) the next round of data from the next NAS server. Rinse. Repeat.” (until all data has been migrated over).

Now, I am in the process of manually de-duping some of the data (which was previously spread across multiple NAS servers), and once it has been checked, send it to tape for backup.

I am not sure.

I created the pool using the zpool create command.

So when I type zpool list, it just shows the two pools.

I am not sure whether they’re zvols or not. (The ‘reporting’ tools of ZFS don’t really make it explicitly transparent what’s a pool vs. what’s a zvol vs. what’s a dataset. You have to know/remember that, based on how you created them.)

Yeah, I’ve never seen that when I am rsync-ing JPEGs.

Again, I would have to re-architect the entire ZFS layout to be able realise this benefit.

(But even then, the problem would still be that a dataset can easily exceed the capacity of an LTO-8 tape (12 TB), which means I would need to ‘break up’ a dataset to prep it to be backed up onto LTO-8 tape.)

i.e. If I created a ‘movies’ dataset and the entire dataset is 30 TB - how would I use zfs send/zfs receive where I can only send 12 TB of the dataset at a time?

This is where I run into a bit of a problem.

If you’re sending the snapshot/dataset over to another ZFS system where you don’t have a 12 TB per volume limit, then this issue doesn’t apply to you.

But if you’re prepping your dataset to be backed up to tape, where each tape has a finite space/capacity – I can’t just “blindly” send a 30 TB ‘movies’ dataset to a 12 TB tape. It will fail.

virtio-fs is more for VM <-> host communication without going through the network stack.

It is agnostic to the VDI format.

What command would I use to be able to “study” this (to confirm and ascertain that this is the root cause of the issue)?

With all due respect, that’s a “whataboutism”.

The manner in which HDDs die is NOT a function of the volume of writes.

SSDs, on the other hand – their death is LITERALLY measured in terms of TBW and if you hit that TBW value, the death of the SSD is GUARANTEED.

This guarantee of death is wholly absent on HDDs.

Yes, all hardware can fail and die. But where it differs is that the death of SSDs is GUARANTEED, by design, whereas the death of HDDs is NOT a guarantee.

ZFS (or RAID in general) is to protect against unplanned hardware failures. SSD should be a planned failure (given that their deaths are guaranteed, once the finite number of erase/program cycles have been exhausted).

If the brake pads on your car wear out, you expect it and plan for their replacement.

But if your engine dies, that’s NOT expected, and that’s unplanned. You don’t keep a spare engine in your home/garage in the event that the engine of your car dies.

I’ve looked for Optane 905p 960 GB drives on eBay.

They’re still around $400 USD a pop.

The cheapest is $350 a pop. The second cheapest listing is $396 a pop.

But more importantly, the next consideration (knowing that SSDs have a finite number of erase/write cycles, which means that, again, their deaths are GUARANTEED - brake pads) is: when it finally dies, are you going to be able to source a replacement?

If you have eBay and/or Amazon and/or Newegg links – that would be greatly appreciated.

(It is always good to know what works (vs. what doesn’t).)

I’m confused by this.

Why would I create a dataset for a, for the lack of a better term - a VM-to-host communication protocol?

Can’t. Not without re-architecting the layout such that it is purposely ‘subdivided’ into 12 TB chunks. (Or where the quota for each dataset is 12 TB.)

I mean, can you imagine setting the quota for all of your datasets to be only 12 TB?

As a network share, the pool is shared at the top level.

(So that I don’t end up with 22 drive letters in my Windows VMs for those that don’t support virtio-fs, with having to mount each of the datasets separately.)

re: NUMA and ZFS
My understanding is that the host OS (which ZFS is running on) is responsible for scheduling and allocating resources.

Therefore; my thought process is that if there are multiple ZFS processes running, they will get distributed to the different processors/sockets/RAM.

I bought this dual Xeon server off of eBay because it was relatively cheap. (The RAID card alone, if purchased separately, from eBay, would be more than 50% of what I paid for, for the entire server.)

(I don’t remember if I have NUMA enabled in the BIOS or not - I forget - because I don’t remember if that was required for the GPU passthrough or not.)

And yes, I fully understand what you are saying re: latency and the adverse impact it has on bandwidth.

But right now, given that rsync can’t do more than ~400 MB/s, even the admittedly pitiful QPI link would still have a LOT more bandwidth available than my system is able to leverage at the moment.

Until I start maxing out 8*150 MB/s writes = 1.2 GB/s (writing to pool B), I’m not even really going to worry about that.

(My preference would actually be to see if there is a way to enable DMA (a la RDMA) and bypass the kernel stack altogether. But from the discussion here, it sounds like, even going through the kernel stack, sending the data over as a block stream rather than as files/folders would peg the write STR, so I wouldn’t need DMA for that. But that would also mean that the datasets would need to be set up and prepared specifically for zfs send/zfs receive, vs. me having to “compose” what I am going to be backing up to tape, which is what I am doing right now.)

(i.e. all of the datasets will have a quota of 12 TB. (It’ll actually be more like 11 TB since I read somewhere that it’s not great to write to 100% of the tape. I think that it is recommended to only write to about 90% of the tape.) And then my data will have to be split up into multiple datasets in accordance with this quota, which will pretty much automatically generate the “tapes” for me, so that whenever I am ready to run the tape backup job, I can do so with the zfs send/zfs receive commands to pool B, which is where the tape system will pull the data from, to write to said tape.)

This is the high level summary that I am pulling together, in my head, as I am reading through the replies.

For an enterprise, the maintenance cost of constantly having to buy more SSDs (and the mound of dead SSDs/e-waste) that this creates is just a part of the cost of doing business.

For me, I’m on the opposite end of the spectrum, where the objective function is to minimise cost on a FINITE budget.

Money is only allocated to the initial capex of the system itself. No additional funds are allocated for maintenance.

This is why I am using LTO-8 tape for cold storage. Once the data is backed up, the cost of keeping the data is $0/month. Yes, there was the initial capex for the tape drive and the tapes, but that’s it. It’s a one-time capex, and afterwards, once the data is backed up, it’s a done deal.

$0 maintenance fee.

It’s just that the process of migrating from hot/live storage to cold/zero-watt storage - that’s the part where this problem manifests.

pigz is parallel compression only (IIRC). If my information is outdated, I welcome updated information.

It is my understanding that pigz does NOT support parallel decompression (which would be absolutely critical, for the total volume of data that I am backing up).

This is why, if I do run the compression manually, I use pixz instead, because it supports both parallel compression and parallel decompression (which many compression algorithms/tools do not support).

(It’s fine to parallelise compressing a 5 TB project folder. What sucks is when you have to decompress said 5 TB compressed file and the decompression process isn’t parallelised. Then the decompression task takes forever.)

pixz can decompress that with threads/parallelisation, and then once that is done, the tar -x isn’t parallel anymore, but at that point, it will unpack the tarball as fast as your storage medium can handle it.

It is when it is rsync-ing a bunch of JPEG files. Writing lots of small files to tape is a very slow process.

Here are the details for how I am backing up my stuff to LTO-8 tape.

Hardware:
MagStor LTO-8 external HH tape drive
ATTO Express 980 SAS2 HBA with two external ports.
Core i7-3930K (I have killed at least one of its 6 cores from previous overclocking, so in the BIOS it is set to use only 4 cores)
Asus X79 Sabertooth
Maybe an Nvidia Quadro 600. I forget. Some low-end video card, for basic video out.
Mellanox ConnectX-4 100 Gbps IB NIC

OS:
CAELinux2018 (based on Xubuntu 16.04 LTS)
IBM LTFS single drive edition

LTFS was compiled according to instructions that I found, and I was able to get it up and running to run the tape drive.

  1. A temp directory is created when the system boots up. mkdir -p /tmp/lto8

  2. mkltfs formats the tapes

  3. Tapes are mounted with ltfs to the /tmp/lto8 folder.

  4. The Proxmox shares are mounted over NFS.

  5. Backup task is started with the command:
    path/of/source/data$ rsync -avrsh --info=progress2 * /tmp/lto8

And it will start writing the data to the tape.

I’ve never used mbuffer before. Will have to read more about that.

Thank you. I appreciate you and the rest of the team here, educating me in regards to this.

That’s actually REALLY smart of Ubuntu to do that.

It’s a shame that Proxmox (Debian) doesn’t do that automatically for you.

So…stupid question then – if you have say like a thousand folders – does this mean that Ubuntu is going to have like a THOUSAND datasets then???

(i.e. if you have pictures, and you sort the pictures by date and/or place(s), and each date and place has a folder – based on what you wrote, Ubuntu would create a dataset for each time and place for your pictures?)

That can be a LOT of datasets. Yikes!

You’re going to need a ZFS dataset explorer just to manage that, if that’s the case.

Yeah, right now, in Proxmox, they just live as I think qcow2 files in the folder where the disk files are stored.

I don’t have datasets nor zvols lying underneath that the disk image files sits on top of.

(Once I am done backing up my “user” data, this round, the system VMs will be next. I think.)

That’s awesome.

With it being that frequent, I am guessing that, as mentioned, it doesn’t really take up that much more space to do that then?

What about raidz2?

I was under the impression that, because the writes are striped across the disks and the parity is calculated in the background, I should still be getting decent write performance, no?

I have skipped 10 G networking altogether by using virtio-fs.

No 10GbE switches (of the RJ45 nor SFP+ variety). No cables. No fiber/SFP+ modules. None of that stuff. (16-port 10 GbE switches still aren’t cheap. And on a $/Gbps basis, it’s actually more expensive than my 100 Gbps IB switch.)

By taking the networking stack completely out of the picture, I don’t have to worry about networking as an extra headache layer. :wink:

My current level of understanding is that if I have an existing folder (e.g. /path/movies) and I were to create the dataset – would I need to move the movies outta there only to put it back in?

(I am not sure how ZFS will handle the fact that /path/movies already exists when I create the dataset ‘movies’ under ‘path’ (so that the location that ‘movies’ points to is also /path/movies). I hope that I am able to properly and adequately communicate my question.)

I was under the impression that the CoW will happen regardless of the number of datasets that I have, no?

For mostly dumb storage, there isn’t much that ZFS offers (other than data resiliency) that mdraid or hardware RAID doesn’t offer.

It has been my observation that one of the biggest benefits to ZFS is the fact that I can create a pool that spans across two HBAs, whereas with a HW RAID HBA, I can’t do that.

And with ZFS on Linux, the other thing that I have also noticed that’s a benefit (vs. the original Solaris 10 6/06 implementation) is that you can add the drives by /dev/disk/by-id instead of like the /dev/sd naming convention (or in the case of Solaris 10 6/06, the cxtydz notation.)

That way, if you take out a drive and/or move the drives around, ZFS on Linux doesn’t care.

These are the two most significant advantages that I have found, thus far, with ZFS on Linux.

Yeah, I’ve thought about that.

But where I run into issues/concerns is “how do I back that up to tape?”

Two questions:

  1. Can you send a ZFS dataset to tape directly?

  2. If (1) is true, then doesn’t that mean that I would have to put an 11 or 12 TB quota on all of the datasets so that I won’t exceed the capacity of the tapes?

If there is a way to get ZFS in Proxmox to automatically create a dataset for every folder, I’m all for it.

I don’t remember what my total folder count was the last time I ran a check on the filesystem, but I wouldn’t be surprised if it was hundreds of thousands of folders, and I think something like 15 million files. Something like that.

I’m not going to manually create a dataset for every folder that I have. That’s just not practical/feasible.

It’s nice that Ubuntu does it for you automatically.

Unfortunately, Proxmox doesn’t.

(I’d probably use that feature more if the creation of the datasets was automated at the folder level.)

Some of my text-based index files can be 1 GB in size.

I am not sure how fast ls | grep is going to be when it has to comb through the filesystem (even with the use of snapshots) vs. indexing the filesystem, and then I can just cat index.txt | grep to find the file.

It takes a little while to build the index (4 minutes?), but basically no time at all to look for something in the index file vs. using ls | grep.

Yeah - I’m not so sure about that.

If this were true, then it would not be possible to hit 400 MB/s on the larger, more contiguous files.

I’ll have to run Xinorbis again on the file system so that I can get a count as to what the current statistics are. It’ll probably take that application a few days to scan through the system.

Yes, snapshots act at the dataset level. But you can still zfs snap and send the top-level dataset which would effectively be snapping and sending the entire pool.
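Roughly like this (snapshot name invented; -R carries all child datasets and their properties, -F lets the destination be overwritten, -u avoids auto-mounting the received copies on top of the source paths):

 zfs snapshot -r zpool1@migrate-1
 zfs send -R zpool1@migrate-1 | pv | zfs receive -uF zpool2/zpool1-copy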

Why not both?

I literally just explained what a zvol is and what it’s used for. As an administrator you should know whether you’re using them or not. Or at the very least don’t misuse the terminology when you mean zpool and don’t know what a zvol is.

Here’s a zvol:

$ sudo zfs create -V 1G tank/alpha-zvol
$ zfs list -t volume -r tank
NAME              USED  AVAIL     REFER  MOUNTPOINT
tank/alpha-zvol  1.03G  3.76T       56K  -
$ ls -l /dev/zvol/tank
total 0
lrwxrwxrwx 1 root root 9 Jun  7 15:13 alpha-zvol -> ../../zd0

I can treat it like any other block device in my system:

$ sudo fdisk /dev/zd0

Welcome to fdisk (util-linux 2.37.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0x20b2e2cf.

Command (m for help): g
Created a new GPT disklabel (GUID: 6A2DEFBC-4540-164B-A82E-86E83E9C6C59).

Command (m for help): n
Partition number (1-128, default 1):
First sector (2048-2097118, default 2048):
Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-2097118, default 2097118):

Created a new partition 1 of type 'Linux filesystem' and of size 1023 MiB.

Command (m for help): p
Disk /dev/zd0: 1 GiB, 1073741824 bytes, 2097152 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 8192 bytes
I/O size (minimum/optimal): 8192 bytes / 8192 bytes
Disklabel type: gpt
Disk identifier: 6A2DEFBC-4540-164B-A82E-86E83E9C6C59

Device     Start     End Sectors  Size Type
/dev/zd0p1  2048 2097118 2095071 1023M Linux filesystem

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

$ sudo mkfs.ext4 /dev/zd0p1
mke2fs 1.46.5 (30-Dec-2021)
Discarding device blocks: done
Creating filesystem with 261883 4k blocks and 65536 inodes
Filesystem UUID: 6686c2f3-940d-4756-b59b-6a22557cf2b6
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

I know. Image files (qcow2 or raw) in datasets and zvols (without iSCSI) accomplish the same.

Sorry, I meant virtual desktop infrastructure, not virtual disk image.

zpool iostat?
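For example, watching per-vdev activity at five-second intervals (pool names from your specs):

 zpool iostat -v zpool1 zpool2 5
 zpool iostat -l 5    # adds per-pool latency columns (OpenZFS 0.7+)

If the bandwidth column is modest while the operation counts and latencies are high, you’re IOPS-bound, which lines up with the PSI readings earlier in the thread.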

This is flat-out wrong. Most SSDs will last long past their rated lifespans, which have more to do with warranty limitations than anything else. If your write amplification factor is low enough, a good SSD will last practically forever. But it’s a moot point in a production system as I’ll discuss below.

Also flat-out wrong. HDDs have endurance ratings as well and it might surprise you how short the rated lifespans of most drives are these days. They will die if you write them enough.

Ultimately you should be completely replacing your production drives—spinning rust or solid state—every three, four, or five years anyway (depending on your depreciation schedule). Otherwise you’ll fall off a failure rate cliff and it won’t matter how many parity disks you have in each vdev. Spec your drives appropriately and replace them on a schedule and you won’t have to worry about any of this shit.

Gee, it sounds like that would be really easy to plan around.

Sure, with something larger, faster, cheaper, and with more endurance.

If that’s too pricey there are plenty of threads here and on reddit, Serve The Home, unRAID forums, etc. discussing cheap enterprise drive options. I have a mix of Samsung PM983 and PM1725a and Intel Optane P1600X drives in my homelab fileserver. Two PM983 960GB drives came brand new for $75 each on eBay and are rated for 1.3 DWPD for three years—more than enough for a special vdev. Two P1600X 58GB drives came brand new for $40 each from Newegg and make absolutely perfect little SLOG devices.

Any risers from reputable brands should work fine. It’s a dumb card, pretty hard for even the clumsiest manufacturers to goof up. I have two SuperMicro AOC-SLG3-2M2 and one StarTech PEX4SFF8639, all sourced from eBay and r/homelabsales.

So you can snap and send it independently…

Yes, it will distribute processes evenly, which is exactly the problem. You don’t want the two halves of your application running in different NUMA nodes.

You can’t look at one number and arbitrarily decide that you don’t have an inter-socket bottleneck. Think about how much longer each ARC hit/hit-miss might take. Think about ten different data streams fighting for priority over QPI. It’s not so cut and dry and you can’t just hand-wave it away.

Data still has to get from point A to point B, e.g. from a buffer in node 0 to a buffer in node 1, regardless of whether the CPU is involved or not.

I can’t wrap my head around your engineering goals here and quite frankly my mind is so boggled by this whole conversation that I think I’m out. You keep saying you can’t use datasets to send between pools because a tape can only hold 12TB. That’s the craziest thing I’ve heard in weeks. It’s like saying you can’t go for a walk today because your car has four wheels.

  1. Architect your primary pool to service your VMs and other use-cases. Use the features of ZFS as they were intended and get over your irrational fears of SSDs dying or generating e-waste or whatever the problem is.
  2. Architect your secondary pool to reliably receive snapshots from the primary pool.
  3. Design your tape strategy to read data from the secondary pool and archive it however is appropriate for the medium. A single backup can span multiple tapes if it has to.

Good luck. Toss out your manuals from 2006 and do some more reading.
