ZFS advice sought - hopefully without XY problem

I was looking for some pointers/ideas on how to manage a storage buffer, and because of @wendell I know there is an experienced ZFS community here, so I thought I’d pick your brains.

TL;DR

  • Data acquisition system with 2x 25GbE in and 1x 10GbE out.
  • Input is bursty, output is continuous (but may have occasional outages).
  • Over dozens of minutes, the average input rate is lower than average output rate (but it is useful for data to be stored locally for a couple of hours).
  • Q: How to best configure the local SSD/HDD buffer?

What we have:

  • AMD Ryzen Threadripper PRO 5955WX.
  • 8x 16GB Kingston DDR4-3200 Registered DIMM CL22 2Rx8 1.2V Hynix D 9965754-006.C00G.
  • ASUS Pro WS WRX80E-SAGE SE WIFI with 7 full x16 PCIe slots.
  • 4x 2TB Seagate FireCuda 530 ZP2000GM30013 in a x16 PCIe board so that each disk has x4 PCIe directly to the CPU.
  • 2x Mellanox MT27710 ConnectX-4 Lx dual 10/25G SFP28.
  • 3x 4TB Seagate IronWolf Pro ST4000NE001-2MA101 (and we have 5 more if that makes a difference).
  • Alma Linux 9 running kernel 6.3.8-1.el9.elrepo.x86_64.
  • Other SSDs for OS and other motherboard devices and NICs.

Details:
This is a data acquisition server for a particle detector.
The inputs come from an FPGA-based custom ATCA board via 2x 25 GbE links, and the output from the server is sent to a computer centre via a regular 10 GbE link.
We receive 50G of data in bursts of up to 4 seconds and can output basically continuously at 10G, so the average data rates are balanced.
The input data are UDP custom packets that need to be processed and aggregated into separate files prior to further analysis. I.e., we cannot “just” have the FPGA send data to the computer centre.
The goal for this server is two-fold:

  • buffer between the FPGA’s bursty 50G and the 10G to the data centre, and
  • allow for some immediate data analysis to happen directly in the server.
    Therefore, it needs some form of storage that can hold more than a few seconds of data, but need not hold more than a few hours’ worth.

From my experience with Linux, I would naively have used some combination of a ramdisk, mdraid, and inotifywait to move things around, but I thought it was time to join the 21st century, and ZFS looked like a great solution.
With that in mind, I thought I’d go with the following (a rough sketch of the corresponding commands is below the list):

  • a generous allowance of dirty_data as write cache,
  • a generous allowance of ARC as read cache,
  • SSDs as ZIL, and
  • HDDs in raidz1, in case the output link goes down for too long.
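
For illustration, here is roughly what that plan would have looked like as commands. This is only a sketch: the pool name, device names, and values are made up, and the layout simply mirrors the bullet points above.

    # HDDs in raidz1, with the SSDs as a mirrored SLOG ("ZIL") device
    # (device names are placeholders; use /dev/disk/by-id paths in practice)
    zpool create buffer raidz1 sda sdb sdc log mirror nvme0n1 nvme1n1

    # generous write and read caches via module parameters (illustrative values)
    echo $((32 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max
    echo $((96 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max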

My hope was that the data would be absorbed by ZFS at RAM speed for the most part and would then trickle out through the ZIL and onto the HDDs as bandwidth allowed.
Clearly, I knew nothing about ZFS and the first thing I dropped was the idea of a huge ZIL (2TB when all the SSDs were mirrored).

Right now, I am looking at using the SSDs in a ZFS pool of two mirrors with two SSDs each.
Of course, that means some form of inotifywait to move data from the SSD pool to the HDD pool, and I would still have no idea how to tune dirty_data or the ARC, so…
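
A minimal sketch of that layout and the mover, with made-up pool names and paths:

    # fast SSD landing pool (two 2-way mirrors) and the HDD pool behind it
    zpool create fast mirror nvme0n1 nvme1n1 mirror nvme2n1 nvme3n1
    zpool create slow raidz1 sda sdb sdc

    # naive mover: wait until a file is fully written on the SSD pool,
    # then shift it to the HDD pool (needs inotify-tools)
    inotifywait -m -e close_write --format '%w%f' /fast/incoming |
    while read -r f; do
        mv "$f" /slow/archive/
    done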

I was wondering if someone has experience with this sort of high-speed bursty buffering use for ZFS and how to tune the plethora of parameters available.
Of course, it may be that ZFS is not the right tool and suggestions for alternatives are equally welcome.

Thanks in advance!


What volume of data are you working with? You mention NIC size but not amount of data and how quickly it arrives - would help to know.

I wrote a guide covering the most interesting tuning parameters, including dirty_data_max. Where you tune it depends on your system (avoid TrueNAS Scale). I use a 30GB dirty_data_max on my pool, along with a higher txg timeout, and aside from occasional complications with high compression (short lock-ups, because "compress 30GB NOW please" means 100% CPU on all cores), it runs fine.
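
On a plain Linux install, those knobs would typically be set roughly like this (the values just mirror the numbers above and are illustrative, not a recommendation):

    # runtime (resets on reboot)
    echo 32212254720 > /sys/module/zfs/parameters/zfs_dirty_data_max   # ~30 GiB
    echo 40 > /sys/module/zfs/parameters/zfs_txg_timeout

    # persistent, e.g. in /etc/modprobe.d/zfs.conf
    options zfs zfs_dirty_data_max=32212254720
    options zfs zfs_txg_timeout=40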

You need the memory for it though. That’s 30GB in your ARC, and the usual cached data will be evicted if the ARC has to store dirty data, so keep this in mind. But it works like a write cache. It runs fine with my 10Gbit networking: 1.17 GiB/s sequential writes according to htop, and with 4k random writes, my HDDs write faster than the NVMe can read them.
Eventually, this is limited by my HDDs (3x mirrors, they write at 600-750MB/s) once the 30GB are full. I rarely have more than 50-80GB to write at once, so this 30GB buffer is perfect.

That’s async writes. Not sure how sync writes work on NVMe pools. Another option would be sync=disabled (yet another heretical feature of ZFS) to effectively use ARC as your LOG.

This is almost certainly not the recommended way of doing things, but it works for me and gets the job done (fast).

Scroll down to the second post to find the tuning stuff, but the basics are in the top post and may be useful too.


Added an edit:

We receive 50G of data in bursts of up to 4 seconds and can output basically continuously at 10G, so the average data rates are balanced.

Multiplying through, that’s 200 GB in one burst, which is more than the RAM we have.

Thanks for engaging and HTH.

Thanks @Exard3k.

We have looked into quite a few of the things that you mentioned, like sync=disabled and zfs_txg_timeout=40.
We’re also using compression=off and recordsize=1M, and could probably go higher on the latter.
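
For reference, and assuming a hypothetical dataset called tank/daq, that is roughly:

    zfs set compression=off tank/daq
    zfs set recordsize=1M tank/daq
    zfs set sync=disabled tank/daq
    echo 40 > /sys/module/zfs/parameters/zfs_txg_timeout
    # going beyond recordsize=1M may also require raising the zfs_max_recordsize
    # module parameter, depending on the OpenZFS version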

We did see stalls during write tests when the base pool was HDDs, even with the SSD ZIL.
That’s when we decided to just make an SSD pool and try that out. The best we could get with that was ~18 Gb/s of writes; during that, I could see the regular flushes to the SSD pool, so the flushing itself did not seem to be the problem.
Interestingly, this was the case regardless of the pool configuration. We got 18 Gb/s for each of these (spelled out as zpool create invocations below the list):

  • ssd1 ssd2 ssd3 ssd4
  • mirror ssd1 ssd2 ssd3 ssd4
  • mirror ssd1 ssd2 mirror ssd3 ssd4
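
With placeholder device names for the four FireCudas, those layouts correspond to:

    zpool create fast nvme0n1 nvme1n1 nvme2n1 nvme3n1                # plain 4-wide stripe
    zpool create fast mirror nvme0n1 nvme1n1 nvme2n1 nvme3n1         # one 4-way mirror
    zpool create fast mirror nvme0n1 nvme1n1 mirror nvme2n1 nvme3n1  # two striped 2-way mirrors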

This was puzzling and our conclusion was that it pointed at the RAM, but we’re no experts.

Now, there’s one thing you mentioned that is very interesting:

[…] cached data will be evicted if ARC has to store dirty data, so keep this in mind.

Does this mean that zfs_dirty_data_max and zfs_arc_max are not separate memory pools?
I mean, it would make sense to me that the ARC is both read- and write-cache but I did not get that feeling from the docs.

Thanks again!

It’s the “last line of defense”, so to speak. ZFS will commit the TXG earlier if other thresholds are met. That’s why you see a gradual decrease in dirty data once it hits the max, and corresponding writes to the disks.

There is only one ARC and all data has to fit into zfs_arc_max. If you have an ARC of like 64GB, usually this consists of like 55-60GB of compressed data and metadata, with the rest being dirty data, L2ARC headers and other misc stuff. If you have an influx of 30GB of new data, the ARC will evict stuff to make it fit. Dirty data is top priority, and by setting dirty_data_max you tell the ARC it can evict everything else.
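
If you want to see this on a live system, the ARC breakdown is exposed in /proc/spl/kstat/zfs/arcstats (dirty data shows up in the anonymous buffers; exact field names can vary a little between OpenZFS versions):

    # total ARC size, its ceiling, and the anonymous (dirty) portion
    grep -E '^(size|c_max|anon_size|mru_size|mfu_size) ' /proc/spl/kstat/zfs/arcstats
    cat /sys/module/zfs/parameters/zfs_dirty_data_max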

So get more memory if you need more memory. Simple wisdom.

I’ve seen multiple graphs from major companies as well as from OpenZFS themselves where ZFS throughput converges to that number. It’s a ZFS limitation on NVMe, related to the ARC and general ZFS architecture. Software isn’t ready for NVMe in general, and ZFS is no exception. 18GB/s is still amazing for proper storage in a single server.

If you have compressible data, you can probably boost the write speed with compression. But at ~15 GB/s, I doubt that any CPU and memory can handle this (prove me wrong! lz4 might work). It works very well for HDDs, where we see gains in throughput: if you have to write 50% less data to disk, it’s usually faster. But with NVMe, the drives are rarely the limiting factor.


How do you cope with insufficient RAM at the moment?

The payload is larger than the RAM buffer, so you must be dropping data? It doesn’t look like you can write it out fast enough to keep up with the input?

Or would the ideal ingest be 50 GB/s?

No. If you reach the dirty data max or any other limit, it will throttle until the disks are done writing, while accepting new dirty data in parallel. It works very transparently in my experience.

Basically the same as with default settings. But the defaults are a dirty data max of 10% of physical memory (capped by zfs_dirty_data_max_max) and a txg timeout of 5 seconds, so you won’t usually perceive it working like that on a default pool.

ZFS aside, I was asking how OP copes currently, right now.

Data is coming in at 200 GB over 4 seconds, over UDP, so it’s coming whether it can be handled or not.

Total system memory is 128 GB.

Three seconds of input is already more than total memory, and the system can’t write out enough to the available flash to free up space before more data comes in, regardless of the medium or filesystem used for storage in RAM / flash.


Yeah, 50GB/s ain’t gonna happen in ZFS. And 128GB is what I’ve got on my homeserver, and I didn’t build it to buffer 200G of data :wink: You need 512GB+ for that kind of work.

And probably a RAMDisk for temporary storage and write it to the pool afterwards. Unclean solution, but throughput requirements are insane.

At this point you can just use HDDs again and write the 200G to disks in a couple of minutes and wipe the RAMDisk after writes are committed.
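
A crude sketch of that idea, with made-up paths and pool name, and assuming the machine actually has enough RAM to back the tmpfs:

    # RAM-backed staging area for the burst
    mount -t tmpfs -o size=220G tmpfs /staging

    # ... the acquisition process writes the burst into /staging ...

    # drain the staging area into the pool, then free the RAM
    rsync -a --remove-source-files /staging/ /tank/buffer/
    zpool sync tank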


I could be wrong, but it looks like OP wants a system to act as a buffer: so perhaps a bunch more RAM, a copy to some existing flash for a safe buffered copy, and then slowly transfer the buffer over the network to the main systems/servers.

So it’s kind of using the in-machine flash as a glorified ZIL/SLOG.

It will murder the flash drives, but it’s a good use for them.

Edit: I just looked up the flash speeds. It looks like RAID 0 should be able to cope with nearly 50 GB/s (4 × 6.9 GB/s), so I got my maths wrong.

I apologise

I never played with it, but a system with a Xeon and Optane persistent memory would be more akin to what OP lists, had Intel not killed the project.

Of course, I am not in charge of OP’s budget…

Just to be clear, I meant 18 Gbit/s, not 18 GByte/s. The latter would be more than enough for now :smiley:

So this is what is puzzling me: wouldn’t ZFS-managed dirty pages be better than a manually managed ramdisk?

You’re looking at multiple issues and you need to tackle each one of them…

Looking at this, you probably need to care about NUMA.
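
A few quick checks along those lines (the interface name and receiver binary are placeholders; on a single-socket Threadripper PRO the reported node layout also depends on the NPS setting in the BIOS):

    numactl --hardware                                   # nodes, CPUs, and memory per node
    cat /sys/class/net/enp65s0f0/device/numa_node        # which node a NIC is attached to
    numactl --cpunodebind=0 --membind=0 ./daq_receiver   # keep the receiver near its NIC and memory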

On top of that, you likely need to use some kind of compression (or a ton of RAM) to handle the bursts without dropping data, which will also be a challenge. https://github.com/openzfs/zfs/pull/5846


Really, at least until Direct IO is released, the biggest thing to consider as a non-default setting is

primarycache=metadata

So that your ARC doesn’t hold data read from the drives.
Proxmox VE: ZFS primarycache=all ou metadata | IKUS (ikus-soft.com)
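
For a hypothetical dataset called tank/daq, that would be:

    zfs set primarycache=metadata tank/daq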


This topic was automatically closed 273 days after the last reply. New replies are no longer allowed.