Filesystem for storing lots of rsync snapshots

TL;DR I have a (fun) problem - I’m trying to avoid writing my own filesystem.


Being the data hoarder that I am, I have about a year's worth of hourly snapshots of Alpine Linux package repositories - this gives me reproducible build capabilities for my Alpine-based container images.

For entirely unrelated reasons, I'm actually kind of running out of disk space - in the middle of the Chia craze, perfect timing (not!) - so I thought "hold on, I'll run ncdu and figure out if I'm wasting space on something stupid", which revealed a problem.

I've been using rsync with --link-dest to incrementally sync the repository and hard link unchanged files into a dated directory.
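For reference, each pull is roughly the following - a minimal sketch with placeholder paths and a placeholder mirror URL, not my exact script:

```python
# Sketch of the sync step. MIRROR and DEST_ROOT are placeholders, not my real setup.
import subprocess
from datetime import datetime, timezone

MIRROR = "rsync://rsync.example.org/alpine/"   # assumption: substitute a real Alpine mirror
DEST_ROOT = "/srv/alpine-mirror"

stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
latest = f"{DEST_ROOT}/rsync.latest"
target = f"{DEST_ROOT}/rsync.{stamp}"

subprocess.run(
    ["rsync", "-a", "--delete",
     f"--link-dest={latest}",   # unchanged files become hard links to the previous snapshot
     MIRROR, target],
    check=True,
)
```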

The part of the repository I care about (x86_64 for a single release) is about 80G spread across 13k packages, and because I have dated directories, this ends up as about 100M directory entries (or somewhere around that order of magnitude - hard to tell).

In case you're curious, a gzip-compressed ncdu inventory file takes up 1.4 GiB.

So basically, to figure out how much space it's all using, I need ncdu to issue a hundred-ish million stat calls - which doesn't really work.

However, as far as its intended use case goes, this works great. I have a local mirror for my containers, it's space efficient, and my builds can travel forwards / backwards in time almost arbitrarily - I can easily pin a build to a particular version.

However, simple things like ncdu as mentioned above, or just a find across all versions - they take forever.
I don't even know how much overhead / space the filesystem ends up using for this; I'm afraid to look.


So I was thinking: storing separate "directories" per timestamp is probably the stupidest way to store this data on disk. It works, and it makes sense from a compatibility and simplicity perspective, but on a given day/week I only need 10 or 20 of the thousands of versions available - I could garbage collect the ones I don't need, but that's not as fun - what if I want to roll back, or bisect a container?

There’s probably a better way.


Naive solution:

We have files that appear at one timestamp and disappear at another; conceivably a file with the same name could appear again…, but that's unlikely for a software repo.
Surely there's a way to store this more efficiently, in some mostly read-only form that doesn't pollute the btrfs filesystem with so many actual directories.
For example, I could store this inventory of 13k rows/entries in some kind of CSV / JSON / SQLite / Apache Arrow / Protocol Buffers or other "structured data format", and for each path in the file tree we'd have a pair of appeared/disappeared timestamps. This directory data (a few megs at most, probably) could be presented as a FUSE mount somehow, and could proxy reads to the underlying real filesystem.
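Something like this is what I have in mind for the inventory side - just a sketch, and the table/column names are made up:

```python
# Sketch: one row per path with appeared/disappeared timestamps (names are made up).
import sqlite3

db = sqlite3.connect("inventory.sqlite")
db.executescript("""
CREATE TABLE IF NOT EXISTS entries (
    path        TEXT NOT NULL,   -- e.g. v3.13/community/x86_64/foo-1.2.3-r0.apk
    appeared    TEXT NOT NULL,   -- snapshot timestamp where the file first shows up
    disappeared TEXT,            -- NULL while the file is still present in "latest"
    PRIMARY KEY (path, appeared)
);
""")

# "What did the tree look like at time T?" becomes a range query instead of millions of stat calls.
def tree_at(ts: str) -> list[str]:
    rows = db.execute(
        "SELECT path FROM entries "
        "WHERE appeared <= ? AND (disappeared IS NULL OR disappeared > ?)",
        (ts, ts),
    )
    return [path for (path,) in rows]
```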

I’d still need to make it rsync compatible somehow.

There goes my weekend - ugh, I don’t like this solution.


Better solution:

Surely I can't be the first person to think of this - why reinvent the wheel and write yet another poorly written filesystem that I'd have to keep maintaining? All I need is something rsync-compatible so I can keep hard linking - or similar.


Probably what I’ll do:

Don't care about filesystems much. I'll keep rsync, and I'll keep the "latest" dir, but I'll also hard link files into an archive dir where nothing ever disappears. Whenever rsync pulls down a new version of "latest", my script will update the inventory: any files that appeared get a "created" timestamp, and any files that disappeared get a "deleted" timestamp.
I can then write a small webserver to reconstruct the directory tree when access to a file is needed.
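Roughly, the bookkeeping after each sync could look like this (a sketch only - the archive path is a placeholder, and the entries table is the one from the earlier inventory sketch):

```python
# Sketch of the post-rsync bookkeeping: hard link new files into an archive that never
# shrinks, and record created/deleted timestamps in the inventory. Paths are placeholders.
import os
import sqlite3

ARCHIVE = "/srv/alpine-mirror/archive"      # nothing ever disappears from here
LATEST = "/srv/alpine-mirror/rsync.latest"  # symlink to the newest snapshot

def update_inventory(db: sqlite3.Connection, stamp: str) -> None:
    current = set()
    for root, _dirs, files in os.walk(LATEST):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, LATEST)
            current.add(rel)
            dst = os.path.join(ARCHIVE, rel)
            if not os.path.exists(dst):
                # new file: hard link it into the archive and give it a "created" timestamp
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                os.link(src, dst)
                db.execute("INSERT INTO entries (path, appeared) VALUES (?, ?)", (rel, stamp))
    # anything that was live but is no longer in "latest" gets a "deleted" timestamp
    live = [path for (path,) in db.execute("SELECT path FROM entries WHERE disappeared IS NULL")]
    for path in live:
        if path not in current:
            db.execute(
                "UPDATE entries SET disappeared = ? WHERE path = ? AND disappeared IS NULL",
                (stamp, path),
            )
    db.commit()
```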

Any ideas for a similar solution?


Your "problem" screams BTRFS to me. Also, the interval is quite short; I'd suggest a 4 or 6 hour cycle, if not longer. This greatly reduces the amount of data collected/processed but doesn't compromise on availability/flexibility for your use case. IMO, that is.

Good morning, thank you for reading through my long post.

Good guess about BTRFS - I'm already on it, and actually it mostly works.

It's just that things like find and ncdu, or any kind of locate database build - anything that actually traverses everything (sometimes accidentally) - turns out to be stupidly slow. It does usually finish (except e.g. ncdu, which just runs out of RAM), but usually it's OK.

This does in fact sit on BTRFS, on top of LUKS volumes, on top of spinning rust - which also happens to be in a 10-year-old HP N40L microserver with 6GB of RAM.

I wouldn't expect the speediest behavior from any filesystem really, but it should still work, and it kind of mostly does.

In my mind, this kind of thing should really be possible on a Raspberry Pi-class device, or similar.
Which gives me an idea for another project - get a $40 SBC and a $40 microsd, and turn it into a distro cache that can saturate a gigabit link when the weather is right. (I’ll start another thread asking folks about microSD cards in 2021)

Do you think other filesystems would do any better somehow? (ext4/xfs/zfs/jfs/reiser4… and why/how?)


The directory structure I have in place currently is:

  • alpine-mirror
    • rsync.latest -> (symlink) rsync.<latest timestamp>
    • rsync.<timestamp> (x8000 give or take)
      • v3.12
        • community (8k packages / dirents)
        • main (5k packages / dirents)
        • releases (300 dirents + more subdirs - 500 total)
      • v3.13
        • …similar to v3.12 above…

Per timestamp, this makes for 40 directories and about 30k files, which across ~8000 timestamps expands into about 240 million files / stat calls - and those should be mostly cacheable. I figure I can just drop the old 3.12 version of Alpine and cut this in half - probably - so only 120 million files.


As for frequency - hourly vs. 6-hourly: it's not strictly hourly as it is, it's "as needed", but it averages out to hourly.

Dropping the frequency would probably help, but what I'm looking for is some 10x - 100x improvement if possible (basic computer science suggests it should be possible by reorganizing these directories somehow).

Is there a filesystem, a tool, or some kind of automounting solution for this?

How is the interval variable, exactly?

Alpine has an mqtt-exec tool/thingie that can listen for a message from an MQTT broker and invoke a script whenever a new version of the repo is available - which is what triggers the sync in my case. They also suggest an hourly cron in case MQTT gets stuck or there are issues. I do both, as suggested.
My sync script always starts by invoking rsync into a staging directory - and sometimes there are no changes. If the staging directory ends up with an updated last-updated file, it gets promoted to rsync.<timestamp> (the timestamp comes from that file), and the rsync.latest symlink is updated to point to it. Otherwise it just remains there - maybe it gets updated next time around. That way, only "changed" states are saved. Sometimes there's just a single changed package, sometimes there's more, and sometimes there are a few hours with no new packages (probably due to issues, or just people being asleep I guess - these are stable repos), but on average there's about one update an hour.
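In rough pseudo-Python, the promote step is something like this (not the literal script, and the paths are placeholders):

```python
# Sketch of the promote step: if staging has a new last-updated stamp, it becomes
# rsync.<timestamp> and rsync.latest is repointed. Paths are placeholders.
import os

ROOT = "/srv/alpine-mirror"
STAGING = os.path.join(ROOT, "staging")

def promote() -> None:
    with open(os.path.join(STAGING, "last-updated")) as f:
        stamp = f.read().strip()                    # upstream's own timestamp names the snapshot
    target = os.path.join(ROOT, f"rsync.{stamp}")
    if os.path.exists(target):
        return                                      # nothing changed; staging waits for next run
    os.rename(STAGING, target)                      # promote staging into a dated snapshot
    tmp = os.path.join(ROOT, "rsync.latest.tmp")
    if os.path.lexists(tmp):
        os.remove(tmp)                              # clean up a leftover temp link if any
    os.symlink(target, tmp)
    os.replace(tmp, os.path.join(ROOT, "rsync.latest"))  # atomically repoint the symlink
```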

I was hoping this "browsable snapshots" use case was common enough that there'd be an efficient solution for it out there somewhere - maybe that was optimistic.

E.g. should I just leave it as-is and chmod the alpine-mirror directory traversable (+x) but not readable, to prevent random things from enumerating it by accident - could it really be that simple?
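i.e. something as dumb as this (0711 is just one option - the owner keeps full access, everyone else can only traverse by exact path):

```python
# Make the snapshot root traversable (x) but not listable (r) for non-owners, so casual
# find/updatedb runs skip it while direct paths still resolve. The path is a placeholder.
import os

os.chmod("/srv/alpine-mirror", 0o711)
```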

Reading your script flow, I reckon you'd be better off setting up Git instead. It does exactly what you want with less overhead. Still, it might be a good idea to upgrade the hardware; just 6GB isn't cutting it these days. More like 64GB and an entry-level EPYC 3000 series system sounds better :wink: IMO the 3151 SoC is the sweet spot for most home labs. Board + SoC come in at around 500-600 USD from various suppliers. I'm eyeing up the Gigabyte MJ11-EC0 for a similar project, but other solutions are available as well.

As for alternative file systems, you mentioned a few.

  • ZFS is fine, but not native to Linux. It works and has good support (via OpenZFS), but it's still an out-of-tree port of what was originally a Sun/Oracle Solaris filesystem.
  • XFS: good performance for large files, less so for smaller files.
  • JFS: good performance with all file sizes overall.
  • Ext4: continuation of the ext fs line (ext, ext2, ext3), with increasingly complex troubleshooting. It's the default choice for many distros. 'Nuff said, I s'pose :wink:

Reiser is basically dead. Its main (and mostly sole) developer is in jail for murder and won't get out anytime soon, if at all (no idea what exactly his sentence was, but it was substantial).

My choice for this use case, other than BTRFS, would be JFS, as it handles large numbers of small files better than XFS. There are comparisons between the two somewhere on the web and I've read them, but no idea if they're still available online. Google should know :wink:

Anyway, my mention of BTRFS was about taking advantage of its CoW (copy-on-write) capabilities. Essentially, you write the file once, then only store the modified parts, saving considerably on storage. If you haven't explored that avenue, it might be worth your time, as most things (or at least a sizeable chunk) you now do in your script can probably be offloaded to the underlying FS.
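As a very rough sketch (and assuming the "latest" tree were converted into a btrfs subvolume - that part is on you), each sync could then take a read-only snapshot instead of building a hard-linked copy:

```python
# Sketch: one read-only btrfs snapshot per sync instead of a --link-dest directory.
# Assumes /srv/alpine-mirror/latest is a btrfs subvolume that rsync updates in place.
import subprocess
from datetime import datetime, timezone

stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
subprocess.run(
    ["btrfs", "subvolume", "snapshot", "-r",
     "/srv/alpine-mirror/latest",
     f"/srv/alpine-mirror/snap.{stamp}"],
    check=True,
)
```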

HTH!

I'm considering it - and I'm looking into Git LFS. Currently, I have a single copy of the data itself (hard linked many times over), and it's servable over HTTP. The way to mirror the data locally is through rsync, so I wonder: if I rsync into a workdir, is there a way to use LFS to avoid making a second copy in the local repo?

Right now it's hard linking all the files between versions - rsync is creating these directories and doing the hard linking for me. It's something that's supported on many, if not all, filesystems.

risk,

Just do ZFS with dedup and zstd compression.
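Something like this, for instance (an untested sketch - the dataset name is a placeholder, and zstd compression needs OpenZFS 2.0 or newer):

```python
# Sketch: enable dedup and zstd compression on an existing dataset (name is a placeholder).
import subprocess

for prop in ("dedup=on", "compression=zstd"):
    subprocess.run(["zfs", "set", prop, "tank/alpine-mirror"], check=True)
```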

Best regards,

vhns


I don't need dedup, and I'm already doing btrfs with zstd compression - why would ZFS be faster?

What about using the locate command instead of find? You might need the cron job that updates the locate database to run more frequently for your application, but it should find the files faster, as it's not looking through the filesystem directly.
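Something along these lines, assuming mlocate/plocate's updatedb (the paths and package name are placeholders):

```python
# Sketch: build a dedicated locate database for the mirror, then query it instead of
# walking the tree with find. Run the first step from cron as often as needed.
import subprocess

DB = "/var/lib/mlocate/alpine-mirror.db"   # placeholder database path

subprocess.run(["updatedb", "--output", DB, "--database-root", "/srv/alpine-mirror"], check=True)
subprocess.run(["locate", "--database", DB, "foo-1.2.3-r0.apk"])
```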