TrueNAS Memleak

Greetings Humans!

I’ve run into an issue I cannot explain, and I hope someone here has seen this before and has some advice.

I have TrueNAS CORE running on a Supermicro chassis: 2x Xeon 4241 CPUs, 96GB of RAM, and ~100TB of usable storage, with deduplication enabled (it’s a backup target).

The machine runs SMB, SMART, NFS, SSH, and the web UI; absolutely nothing else.

After booting and before accessing any disks, it uses about 8GB of RAM. Within the hour, it’s run out of memory and ground to a halt. If I keep rebooting it every 45 minutes it’s fine, but that makes it hard to use as a backup destination!

I’ve tried to find the culprit myself, but htop isn’t reporting anything using gigabytes and gigabytes of memory; the most was middlewared at 0.3% - and my BSD-fu isn’t that excellent.

Has anyone seen/heard of something similar, and/or have something I can try?

Right now the machine isn’t even staying up long enough to copy out the data that’s on it, let alone use it for its intended purpose.

I have half a mind to attempt a SCALE upgrade, but I don’t know if that’ll fit inside my 45 minute runtime window before whatever is breaking, breaks.

This is most probably the culprit.

Can you elaborate / how do I verify this?

In the GUI the RAM usage shows up under “Services”, not ZFS/Cache - and it seems to load the dedupe data into RAM while booting. I have a second system doing dedupe with slightly less storage and slightly more RAM which also runs into this issue, just on longer timelines.

TL;DR: you probably shouldn’t be using dedupe in ZFS at all… at least for now. Work is underway on that topic…

ZFS Deduplication | (truenas.com)

Cause: Continuous DDT access is limiting the available RAM, or RAM usage is generally very high. This can also slow memory access if the system uses swap space on disks to compensate.
Diagnose: Open the command line and enter top. The header indicates ARC and other memory usage statistics. Additional commands for investigating RAM or ARC usage and performance: arc_summary and arcstat (see the examples after this list).
Solutions:
Install more RAM.
Add a new System > Tunable: vfs.zfs.arc.meta_min with Type=LOADER and Value=bytes. This specifies the minimum RAM that is reserved for metadata use and cannot be evicted from RAM when new file data is cached.
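
If it helps, this is roughly what checking that looks like from a CORE shell (a minimal sketch; exact output varies by version):

# quick memory overview; the top(1) header on FreeBSD/CORE shows ARC size alongside wired/free memory
top -b | head -n 12
# detailed ARC breakdown, including metadata vs. data usage
arc_summary
# rolling ARC size and hit/miss stats, sampled every 5 seconds
arcstat 5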

For now this may also be helpful:
My experiments in building a home server capable of handling fast + consistent deduplication | TrueNAS Community

Also this may be interesting if you are willing to move to Linux or SCALE.
TrueNAS SCALE Cobia Has Reached BETA
*ZFS Block Cloning (Pseudo Deduplication) for SMB & NFS file copies

Also, this is “coming soon”™

and

Fast ZFS Dedup - Sponsorship Request | TrueNAS Community



You certainly can elaborate, I appreciate that! :joy:

Reading along, it sounds like the DDT reads/writes overwhelm the systems, not that they’re running out of RAM.

I’m waiting on the system to come back up so I can check things (it flat-out does not seem to want to boot today), but from memory there’s a pair (maybe?) of SSDs for the metadata, as well as a different pair of SSDs for the DDT.

The 128GB RAM system (which also crashes, it just takes longer) currently reports this for the DDT:

dedup: DDT entries 415924268, size 560B on disk, 181B in core

which works out to about 70GB held in RAM - the 96GB system might simply be out of RAM.
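
For anyone following along, that 70GB is just the entry count times the in-core bytes per entry from zpool status -D (the pool name here is made up):

# in-core DDT footprint = entries x bytes “in core”
zpool status -D tank | grep 'DDT entries'
#  dedup: DDT entries 415924268, size 560B on disk, 181B in core
echo '415924268 * 181' | bc
# 75282292508 bytes, i.e. roughly 70GiB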

Buuut it won’t come up right now, which is slightly concerning haha


Feel free to send me a debug file if you want. I’ll look to see what the kernel logs are complaining about. Or you can just post the contents of /var/log/messages here.

Create Debug Core | (truenas.com)

I have two machines: “little/old” (dual Xeon CPUs, 128GB RAM, SAS SSDs for DDT, rust for storage) and “big/new” (two better Xeons, 96GB RAM, mirrored NVMEs for DDT and metadata, rust for storage) - this thread was started about big/new.

I finally managed to get it to boot to a point where I could log in to it! (I did nothing aside from power it on and walk away; it eventually booted.)

zpool status -D puts the DDT at 641357337 entries x 210 bytes in memory, about 125GB, which is decidedly more than the 96GB of RAM in the system, so on the assumption that this wasn’t helping, we put another 96GB in there.

While we were waiting for the second 96GB to show up for the main system, the little/old machine started running into “the same” issue.

little/old ended up with more RAM from a sibling, so it’s now at 256GB, and big/new is at 192GB of RAM.

After some digging around on little/old, mostly via gstat, I found the two DDT disks were indeed being overwhelmed (they’re old SAS SSDs, I’m not really surprised), which is kinda good news. little/old was the proof-of-concept, and while it only had 60-70TB of data on it, it was fine. Now it’s at 100TB used, and it’s much less fine.
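
For reference, this is the kind of thing I was watching (device names here are made up); a DDT/special vdev pinned near 100% busy while the data disks idle is the giveaway:

# live per-disk I/O stats on FreeBSD/CORE; watch the %busy column
gstat -p
# or filter to just the suspect SSDs
gstat -f 'da[45]'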

It has enough disk under it to disable deduplication, so that’s what I’ll do for that one. It does not need it, but it’s where I was testing.
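
In case anyone lands here later, turning it off is just a property change, and it only affects newly written blocks - the existing DDT hangs around until the old data is rewritten or deleted (pool name is made up):

# stop deduplicating new writes on the whole pool
zfs set dedup=off tank
# verify the property took
zfs get dedup tank
# the existing DDT is still there until the old blocks are freed
zpool status -D tank | grep 'DDT entries'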

big/new has been less cranky since we put more RAM in it. Its uptime has been days so far, instead of hours.

Compressed ARC? Not sure how this works for the DDT, but those are blocks which can be compressed, after all. And I certainly had 300GB of data sitting in my 112GB ARC at some point. CARC ftw.

It just means that ZFS has to evict and fetch the DDT from disk all the time because your memory is too low. If your memory is about the same size as the data in your DDT/special vdev, ZFS will be happy. And ZFS will remind you of a certain girlfriend from your past; this is normal.

And yeah, dedup is a memory hog. Nice to experiment and play with, but cost and opportunity cost are outrageous. I do see the point in very special niche datasets. But pool-wide DDT is suicide.

“Friends don’t let friends use dedup” :slight_smile:

The machines in question are purely for long-term backups. Local regulations require that, at the very least, we can pull up 7 years of data for audit purposes. The rate of change of a lot of our data is pretty low, there’s just lots of it. Which seemed perfect for dedup.

We were in a situation where we were buying more storage every year just to hold archive data, and I was hoping that would stop with some dedup, and testing was very promising with just 50-100TB of data. Not as fast as a non-deduped array, but as long as the backups finish before the next round kicks off, I don’t have any more need for speed.

In hindsight, “let’s just do dedup” may not have been the silver bullet it looked like. I knew going in we’d need a lot of resources for the machines, but clearly underestimated how much “a lot” was!

I’m pleased to report both machines are staying up now.

One was falling over because the DDT SSDs are sorta-old “generic workload” SAS jobbies and were being absolutely overwhelmed by the DDT reads/writes/whatever it’s doing. This machine didn’t actually need dedup, so I’ve turned it off. One problem solved.

The second one was running out of RAM for the in-RAM DDT. The dedup NVMEs in this machine also work very hard, but aren’t being swamped by the workload. So yeah, don’t use SAS SSDs for dedupe, and gen4 NVMEs are pretty borderline.

Lesson learned: don’t do dedup. :joy:



Don’t snapshots work? You can store like 1000 versions of a particular file via snapshots at a fraction of the capacity. That’s the magic of CoW, after all. 7-year-old snapshots certainly are a challenge by themselves… the number of referenced blocks and potential fragmentation can be staggering too.
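
Roughly like this, with made-up names; each snapshot only pins the blocks that changed since the previous one:

# snapshot an archive dataset (e.g. nightly via a periodic snapshot task)
zfs snapshot tank/archive@2024-01-31
# see how little unique space each snapshot actually holds
zfs list -r -t snapshot -o name,used,referenced tank/archive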

Otherwise, just get more HDDs and plan ahead. Capacity requirements are usually somewhat predictable for the coming years. It’s always good to just plug in new drives / a new shelf in a solution that has spare bays. Adding more memory all the time certainly isn’t as easy as adding drives :wink:

Edit: cloned datasets are another possibility for branching data.

Honestly, I didn’t even look at snapshots - I’m “pretty sure” our backup solution has no idea how to deal with them, though.

Once you have dedupe enabled on a pool, it is always present. It’s like a virus that cannot be eradicated.

I can’t stress this enough… back up your stuff with ZFS send (a replication task) and just stop doing dedupe. I fear you are past the point of no return and your dataset is just too much for dedupe right now, given normal, sane amounts of RAM. Crank up a higher ZSTD compression level (or hell, GZIP9) for now and fight CPU bottlenecks instead; you’ll save a lot of your time IMO.
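
As a rough sketch (dataset and target names are made up, and a compression change only applies to data written afterwards):

# heavier compression on the backup dataset; affects new writes only
zfs set compression=zstd-9 tank/backups
# one-off replication of a snapshot to another pool (or pipe it over SSH)
zfs snapshot tank/backups@migrate
zfs send tank/backups@migrate | zfs recv -u otherpool/backups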

Your only other real option is to dump a TB of RAM in this system and cross your fingers your dataset doesn’t ever exceed that.

Dedupe has a very narrow band of usability where you have a predetermined dataset that you only grow in a very controlled manner. It’s unfortunate, but this is true.

EDIT: Just for a bit of empathy here, because I don’t want to come off as in any way cold or rude or whatever…

I ran datacenter ops at a large K-12 district for many years. I don’t know who you are or what you do, but you seem to be coming from a similar place of “no budget, but we have to do our best to do the right thing”, so I hear you.

All I am saying here is that dedupe ain’t your answer :stuck_out_tongue:

I’d be happy to help you come up with a better one; if not, more power to you and good luck. There be more monsters ahead. If I’ve learned any lessons in my life, it’s “ain’t nuttin that easy” and “nuttin’s for free” :stuck_out_tongue: