Least bad filesystem for millions of files in a directory

In ML work, I occasionally end up needing to temporarily wrangle millions of files between 0 B and 2 MB in a single directory. This is generally temporary, since such datasets can be packaged into other formats like webdataset that are a little friendlier to I/O in general, but I’m wondering if this can be made less painful by choosing a specific filesystem? I imagine the answer will be ext4 or XFS, but if anyone knows whether it matters and, if so, which handles this better, please share your wisdom!

Edit: I’m aware that high-IOPS NVMe is the preferred hardware for this sort of thing…

1 Like

What software are you dealing with? The directory syscall interface itself is probably the most unfriendly part of millions of files in a directory.

If you have any control of the software, turning “filename” into “f/i/lename” (or “h/a/sh”) can make a big difference.
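A minimal sketch of that fan-out idea in Python, using a hash prefix so the shard directories stay balanced regardless of the filenames (the names and paths here are made up for illustration):

```python
import hashlib
import tempfile
from pathlib import Path

def sharded_path(root: Path, name: str, depth: int = 2) -> Path:
    """Map a flat filename to a nested path via a hash prefix,
    so no single directory ends up holding all the entries."""
    digest = hashlib.sha1(name.encode()).hexdigest()
    return root.joinpath(*digest[:depth], name)

root = Path(tempfile.mkdtemp())
p = sharded_path(root, "image123.jpg")
p.parent.mkdir(parents=True, exist_ok=True)  # create shard dirs lazily
p.write_bytes(b"...")                        # file lands in root/<x>/<y>/
```

With depth 2 and hex digits, each leaf directory holds roughly 1/256th of the files, which keeps directory operations fast.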

As a side thought, dealing with millions of files in a flat namespace is something object storage is great at.

3 Likes

Generally when I’m in this situation I’m using shell scripts or self-written python scripts, shuffling data around or downloading lots of files from around the internet before things are packaged into a better format.

Interesting points about f/i/lename, etc.

Re: object storage, I have no experience with this; can you name some examples that aren’t too much of a hassle to self-administer? I can go read the documentation. Thanks

IMO a ram disk is going to be ideal. Maybe you could create a dedicated filesystem (or btrfs subvolume) to use just for the temporary files, and turn off sync/barriers? You lose crash safety, but you’re no longer bound by sync I/O speeds.

The S3 API would probably be the standard example. But if it’s all local, it doesn’t make a lot of sense here.

I’ve also seen sqlite databases used as a way to store many tiny files more efficiently. I believe lightroom does this with thumbnails.
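A rough sketch of that pattern with Python’s built-in sqlite3 module (table name and sizes are made up). Batching inserts into one transaction means one fsync for the whole batch instead of one per file:

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "blobs.db")
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB)")

# Insert many small "files" in a single transaction.
with con:
    con.executemany(
        "INSERT OR REPLACE INTO files VALUES (?, ?)",
        ((f"sample_{i}.bin", os.urandom(64)) for i in range(1000)),
    )

# Random access by name, no directory listing involved.
data = con.execute(
    "SELECT data FROM files WHERE name = ?", ("sample_42.bin",)
).fetchone()[0]
```

For files under a few tens of KB, storing them as blobs like this also avoids per-file inode and allocation overhead entirely.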

It’s all horrible; POSIX ensures that it has to be, with its user-friendly atomic, sequential interfaces.

Nested directories get around that problem.


You might be able to write your own fuse filesystem and just never materialize the directory, but if you’re doing that you might as well just fix the software to not use a filesystem for storage.

Anything that has a limited number of inodes might leave you scratching your head over why the filesystem is full when there’s still a double-digit percentage of space available.

Yes, ext{3,4}, I’m looking at you.

Anything aimed at HPC/scientific compute, I would think. They are all kind of a PITA to use and maintain, but it may be worth it.

I used to run BeeGFS with 2 combined metadata & storage nodes and 3 storage nodes. It performs wonderfully for many parallel empty-file creates and reads/writes. However, maintenance is a bit, well, 5 nodes…

Ugh, I remember this pain so well. Type ls and wait 15 minutes for a result, on a fast system!

+1 for a functionally nested directory structure.

Data pipeline accelerator tools like Nvidia Dali may help. https://developer.nvidia.com/dali

Dependent on your code and data pipeline, perhaps you could create a file index right at the start of the pipeline, so you only ever need to list the dir contents once, and then are only reading/writing specific files. Could hold the index in a pandas dataframe?

But that is maybe starting to sound like a DB.
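A sketch of the index-once idea, assuming a scratch directory of small files. Python’s os.scandir returns entries with cached stat information, so one pass can collect name and size without a separate stat() call per file on most filesystems:

```python
import os
import tempfile
from pathlib import Path

# Stand-in for the real dataset: a directory of small files.
root = Path(tempfile.mkdtemp())
for i in range(100):
    (root / f"sample_{i}.bin").write_bytes(b"x" * i)

# Single directory scan at pipeline start; after this, only
# open specific files by name, never list the directory again.
index = [
    (entry.name, entry.stat().st_size)
    for entry in os.scandir(root)
    if entry.is_file()
]
# Optionally: pandas.DataFrame(index, columns=["name", "size"])
```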

FWIW, we used to use MongoDB for image storage. A simple flat JSON schema, so it’s easy to work with and manage, and it’s pretty fast and reliable, with some cluster options for increasing read or write performance if needed. Backups are pretty easy too. Not sure where MongoDB is with free licensing now, but there is a Docker container that might allow easy setup of a single instance or a cluster.

There is also this. SQLite Archive Files

Thanks everyone, there are several ideas here that I wasn’t familiar with. I’m currently in a rabbit hole learning about object storage… Seems like it can be hosted locally with MinIO and probably other projects as well. I’m not sure if I’ll end up adopting this or not, I’ll come back to the other suggestions when I have time.

1 Like

Whatever you do, I suggest doing some simple benchmarking to see whether one method is significantly faster than another.
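A minimal benchmarking sketch along those lines, timing empty-file creates in a flat directory versus a hash-sharded layout (counts and names are arbitrary; run it on the actual target filesystem for meaningful numbers):

```python
import hashlib
import tempfile
import time
from pathlib import Path

def bench_create(root: Path, n: int, fanout: bool) -> float:
    """Time creating n empty files, flat vs. hash-sharded layout."""
    start = time.perf_counter()
    for i in range(n):
        name = f"f{i}.bin"
        if fanout:
            shard = root / hashlib.sha1(name.encode()).hexdigest()[:2]
            shard.mkdir(exist_ok=True)
            (shard / name).touch()
        else:
            (root / name).touch()
    return time.perf_counter() - start

flat = bench_create(Path(tempfile.mkdtemp()), 5_000, fanout=False)
nested = bench_create(Path(tempfile.mkdtemp()), 5_000, fanout=True)
print(f"flat: {flat:.2f}s  nested: {nested:.2f}s")
```

The same harness can be pointed at different mount options or filesystems to compare them under your real workload shape.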

4 Likes

Update: When I made the original post, I was anticipating having to do a bunch of work with many small files and remembering painful past experiences. Well, the time has come. I did some reading about the various options suggested here, but due to time pressure I decided to just try it with plain old XFS, and I’m finding that it performs okay so far. The underlying hardware is enterprise-grade SATA SSDs at the moment, pooled via LVM (no redundancy). Metadata operations like listing directory contents aren’t too bad; I can work with this.

Speaking from experience, Linux handles millions of files in a folder vastly better than it handles millions of folders… Don’t go too far down the path of breaking up into folders or you’ll really put yourself in a terrible spot.

As for performance tuning, mount your filesystem with noatime. On ext4, test with and without the dir_index option (see /etc/mke2fs.conf). On spinning rust you’ll want to make extensive use of spd_readdir.so wherever possible.

1 Like