I’m running into a recurring issue with my Docker setup and could really use some help.
Docker’s root directory (/var/lib/docker by default) lives on a dedicated 240GB SSD formatted with BTRFS in single profile.
I’m using Docker’s default overlay2 storage driver.
About once a week, the system reports the drive as full (df -h shows 100% usage), and Docker containers start failing or shutting down because they can’t write any data. However, when I check the actual disk usage with ncdu, it only shows around 30GB used.
To temporarily fix it, I stop Docker with systemctl stop docker, wait a little while, then start it again. After that, disk space is reported correctly and everything goes back to normal. A quick restart doesn’t help; it has to be a proper stop/start cycle.
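For reference, the workaround looks roughly like this sketch, logging df at each step so there is evidence to compare next time it happens (DOCKER_MNT and SETTLE are my own placeholder variables, not anything Docker defines):

```shell
#!/usr/bin/env bash
# Sketch of the stop/start workaround, logging df before and after each step.
# DOCKER_MNT and SETTLE are placeholders; adjust to your setup.
set -euo pipefail

DOCKER_MNT=${DOCKER_MNT:-/var/lib/docker}
SETTLE=${SETTLE:-10}   # seconds to wait between stop and start

cycle_and_log() {
  echo "== before stop ==";  df -h "$DOCKER_MNT"
  systemctl stop docker
  sleep "$SETTLE"
  echo "== after stop ==";   df -h "$DOCKER_MNT"
  systemctl start docker
  echo "== after start ==";  df -h "$DOCKER_MNT"
}

# Run as root: cycle_and_log
```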
I’ve already tried scrubbing and rebalancing the BTRFS filesystem, but no issues were found.
I’m planning to move to a mirrored BTRFS setup using multiple drives soon, and I really don’t want to bring this issue over to the new setup.
Has anyone seen something like this?
Any tips on how to fix it or prevent it from happening again?
PS: I have no idea why, but I cannot receive emails from this forum platform. I’ve tried a password reset but didn’t receive a thing. This has been going on for a while. I checked my Gmail filters and there’s nothing in there. The winraid forum seems to be having the same issue too.
What does sudo btrfs filesystem df /my/disks/mountpoint show?
Maybe even sudo btrfs subvolume list (or show) /my/disks/mountpoint; there could be millions of snapshots for all we know.
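To gather those numbers in one go, something like this wrapper works (the subcommands are the standard btrfs ones; the function name is made up):

```shell
#!/usr/bin/env bash
# Collect the btrfs numbers that actually reflect CoW usage.
# btrfs_report is a made-up helper name; the subcommands are standard.
set -euo pipefail

btrfs_report() {
  local mnt=$1
  btrfs filesystem df "$mnt"      # per-block-group usage (data/metadata/system)
  btrfs filesystem usage "$mnt"   # overall allocated vs. used vs. free
  btrfs subvolume list "$mnt"     # every subvolume/snapshot on the fs
}

# Run as root: btrfs_report /my/disks/mountpoint
```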
A lot of non-CoW-aware tools on Linux have trouble making sense of CoW mechanics. I wouldn’t trust anything other than the BTRFS CLI commands. I remember Nautilus telling me 80TB was used out of 1.9TB… it was just adding every .snap directory to the total. Utterly useless, though technically correct: the snapshots are browsable, POSIX-compliant directories with real file sizes, so they get counted.
Are you perhaps using -x (search single file system)? In that case it won’t be accurate since it stops at btrfs subvolume boundaries. Also make sure to run as root since otherwise not all files can be accounted for due to permissions.
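To illustrate the difference: on a plain directory with no submounts the two agree, but across a mount point or btrfs subvolume boundary -x stops counting (helper name is mine):

```shell
#!/usr/bin/env bash
# du -x stays on one filesystem, so on btrfs it also stops at subvolume
# boundaries; without -x everything below the path is counted.
set -euo pipefail

du_both() {
  du -xs "$1"   # stops at filesystem/subvolume boundaries
  du -s  "$1"   # descends into everything
}

# e.g. as root: du_both /var/lib/docker
```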
When the issue happens, df shows the disk as full.
As of now, no snapshots or subvolumes have been created, because of the issue above.
That is why I also tried btdu, and the space it reports is still around 30GB; the rest of the space is marked as errors/unreachable.
Yep, running as root and with -x, but I’m not using subvolumes/snapshots.
I was thinking the same, but btdu shows that the fs has errors, and restarting Docker makes them go away… I’d also expect Docker temp files to show up in the disk tree.
I’m confused… Your opening post says /var/lib/docker is running out of space and now you show /mnt/storage-docker?
What does docker info say? If I read the documentation correctly, overlay2 is not supported on btrfs, so it might just ignore that setting. Or… perhaps that’s the issue? Would it be a lot of work to switch to the btrfs driver?
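To see which storage driver is actually in effect (rather than what the config asks for), docker info can print just that field; the helper name here is made up:

```shell
#!/usr/bin/env bash
# Print the storage driver Docker is actually using.
set -euo pipefail

active_driver() {
  docker info --format '{{ .Driver }}'
}

# Expect something like "overlay2" or "btrfs": active_driver
```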
Next time it happens, try to stop containers sequentially while logging disk usage. It could help to find the culprit that is holding all that space.
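A rough sketch of that idea (DOCKER_MNT and SETTLE are placeholder variables for the Docker filesystem and a settle delay; run as root):

```shell
#!/usr/bin/env bash
# Stop running containers one by one and log how much space each releases.
# DOCKER_MNT and SETTLE are placeholders; adjust to your setup.
set -euo pipefail

DOCKER_MNT=${DOCKER_MNT:-/var/lib/docker}
SETTLE=${SETTLE:-2}

avail_kb() {
  df -k --output=avail "$DOCKER_MNT" | tail -n 1 | tr -d ' '
}

stop_and_log() {
  local c before after
  for c in $(docker ps -q); do
    before=$(avail_kb)
    docker stop "$c" >/dev/null
    sleep "$SETTLE"
    after=$(avail_kb)
    echo "container $c freed $(( after - before )) KiB"
  done
}

# Run as root: stop_and_log
```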
Maybe a process is keeping some big deleted files open. Until every process closes such a file, the space it occupies won’t be given back to the filesystem, even though du can no longer see it.
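One way to check for that, sketched via /proc (needs root to see other users’ processes; the function name is made up):

```shell
#!/usr/bin/env bash
# List files that are deleted but still held open by some process.
# Their space is counted by df but invisible to du/ncdu.
set -euo pipefail

deleted_open_files() {
  local fd pid size
  # /proc/<pid>/fd entries are symlinks; a deleted target ends in " (deleted)"
  { find /proc/[0-9]*/fd -lname '* (deleted)' 2>/dev/null || true; } |
  while read -r fd; do
    pid=${fd#/proc/}; pid=${pid%%/*}
    size=$(stat -Lc %s "$fd" 2>/dev/null || echo '?')
    printf 'pid %s\t%s bytes\t%s\n' "$pid" "$size" "$(readlink "$fd")"
  done
}

# Run as root to see every process: deleted_open_files
```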
If this magically goes away by stopping and starting containers, it’s probably a Docker (or configuration) issue. Btrfs is either fragmented or it isn’t; things don’t just get fixed by stopping applications.
The last two posts in that thread mention that this seems to be a fundamental issue between Docker and btrfs (with the btrfs driver, anyway), and that using the overlay2 storage driver or Podman (another tool for running containers) works differently and does not have this issue on btrfs.
I use docker routinely on btrfs too, and apparently also with overlay2 (which is the default on my fedora machines and I’ve never changed it). I’ve never seen such issues though…
It does seem like something strange is going on. Maybe another driver could help? But it seems worthy of a bug report, perhaps…