TL;DR I have a (fun) problem - I’m trying to avoid writing my own filesystem.
Being the data hoarder that I am, I have about a years worth of hourly snapshots of alpine linux package repositories - this helps me get reproducible build capabilities with my alpine based container images.
Due to entirely unrelated reasons, I’m actually kind of running out of disk space, in the middle of Chia craze - perfect timing (not!) , and so I thought “hold on, I’ll run ncdu
and figure out if I’m wasting space on something stupid” which revealed a problem.
I’ve been using rsync
with --link-dest
to incrementally sync and hard link the repository into a dated directory.
The part of the repository I care about / x86_64 for a single release has about 80G spread around 13k packages, and because I have dated directories, this ended up as about 100M directory entries (or somewhere around that order of magnitude - hard to tell).
In case you’re curious, a gzip compressed ncdu
inventory file takes up 1.4GiB .
So basically, to figure out how much space it’s all using, I need ncdu to issue hundred-ish million stat calls - which basically doesn’t really work.
However, as far as its intended use case, this works great. I have a local mirror for my containers, it’s space efficient, and my builds can travel forwards / backwars in time almost arbitrarily - I can easily pin a build to a particular version.
However, simple things like ncdu
as I mentioned above, or just find
across all versions - they just take forever.
I don’t even know how much overhead / space the filesystem end up using for this, I’m afraid to look.
So I was thinking, storing separate “directories” per timestamp is probably the stupidest way to store this data on disk. It works, it makes sense from a compatibility and simplicity perspective, but on a given day/week, I need the 10 or 20 versions out of thousands available - I could garbage collect the ones I don’t need, but that’s not as fun - what if I want to rollback? or bisect a container.
There’s probably a better way.
Naive solution:
We have files that appear at one timestamp, disappear at another, perhaps it might be conceivable that file with same name appears again…, but that’s unlikely for a software repo.
Surely there’s a way to store this more efficiently, in some mostly readonly form that does not pollute btrfs filesystem with so many actual directories.
For example, I could store this inventory of 13k rows/entries in some kind of csv / json / sqlite / apache arrow / protocol buffers or some other “structured data format”, and for each path in the file tree, we’d have a pair of appeared/disappeared timestamps. This directory data (a few megs at most - probably) could be presented as a fuse mount somehow, and could proxy reads to the underlying real filesystem.
I’d still need to make it rsync compatible somehow.
There goes my weekend - ugh, I don’t like this solution.
Better solution:
Surely I can’t be the first person to think of this - why reinvent the wheel and write yet another poorly written file-system that I’d have to keep maintaining. As long as it’s rsync compatible so I can keep hard linking - or similar.
Probably what I’ll do:
Don’t care about filesystems much. I’ll keep rsync, and I’ll keep the “latest” dir, but I’ll also hard link files into an archive dir where nothing ever disappears. Whenever rsync pulls down a new version of “latest”. My script will update the inventory, any files that appeared will get a “created” timestamp associated.
Any files that disappeared will get a “deleted” timestamp associated.
I can then write a small webserver to reconstruct the directory tree when access to a file is needed.
Any ideas for a similar solution?