Strange Docker issue in Ubuntu 22

Let me start by saying I am a potato and my understanding of things may be wrong, so I apologize for that, but I’m doing my best here.

I am running Ubuntu 22.04 on a Dell PowerEdge R620. The OS runs off an ESXi dual SD card add-in board thingie. The server also currently has three disks: two are 136 GB HDDs in a RAID 1 array presented as a virtual disk, and the other is a 500 GB HDD. These show up as, and are mounted at boot as, /media/<username>/RAID and /media/<username>/DATA. This happens normally, and I’ve not made any edits to /etc/fstab or anything, other than for my SMB share from my NAS.
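
(If it helps anyone picture it, this is just how I check what's mounted where; standard util-linux tools, with my paths:)

lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT
findmnt /media/<username>/RAID
findmnt /media/<username>/DATA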

Among other things, I am running a number of game servers on Docker using Portainer and these are configured to use bind mounts to dirs on the DATA disk (I know volumes are better but since I needed to add a lot of files and want to do backups and such, bind mounts seemed easier).
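
For context, the bind mounts look roughly like this; the image name and paths are placeholders, not my exact config (my real containers are set up through Portainer):

# placeholder image and paths; the in-container path (/data here) depends on the image
docker run -d --name gameserver \
  -p 25565:25565 \
  -v /media/<username>/DATA/gameserver:/data \
  <game-server-image>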

Additionally, due to the relatively small size of the SD cards the OS is using, I also configured Docker to use the RAID disk(s) as its storage location for things like volumes and images using a /etc/docker/daemon.json file with { "data-root": "/media/<username>/RAID/docker" }.
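
In case anyone wants to do the same, the whole change is basically the following (the last line just confirms the new location took effect; stop Docker first and copy /var/lib/docker across if you want to keep existing images and volumes):

sudo mkdir -p /media/<username>/RAID/docker
echo '{ "data-root": "/media/<username>/RAID/docker" }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
docker info | grep "Docker Root Dir"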

Things were going okay, but gradually the system slowed down, especially Docker, and Portainer got really, really slow. It would hang, I’d get weird errors, containers would refuse to start or stop and couldn’t even be killed, and I would struggle to even get the Docker service to restart.

But the main issue was that, after the system was rebooted, I would see apparent duplicates of the RAID and DATA disks in the /media/<username> folder, so there might be RAID and RAID1, DATA and DATA1, and so on. These folders sometimes appeared to contain a copy of at least some of the files, but the game servers would be running fresh worlds and such.

However, if I click into one of the number-appended copies, it shows as being the original non-appended disk.

Gradually I realized (or at least came to suspect) that these issues are all related, and that the root cause is that the disks are not yet mounted when the containers in question start. The disks do take a few minutes to become available: it’s an older machine, they’re older disks, they’re spinning rust, and they’re in RAID. But since the containers use bind mounts, Docker creates the missing source paths as plain directories (including the RAID or DATA parent folder), and when the real disks finally mount, those names are already taken, so they get pushed to RAID1 or DATA1 (or even DATA2 sometimes). And this has all kinds of bad knock-on effects.
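
You can actually see the directory-creation half of this on any machine: with the -v syntax, if the host path of a bind mount doesn't exist when the container starts, Docker just creates it as an empty, root-owned directory (the newer --mount syntax refuses and errors out instead):

# /tmp/does-not-exist is absent before this runs
docker run --rm -v /tmp/does-not-exist:/data alpine true
ls -ld /tmp/does-not-exist   # now exists: an empty directory owned by root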

So I need Docker to not start until the disks are mounted. I looked online and saw you can run sudo systemctl edit docker.service and add something like:

[Unit]
Requires=fs.target
After=local-fs.target

But will fs.target work? Do I need to use something else instead, or keep this but also add something else? I am questioning it because the filesystem on the SD card is often ready and usable well before the disks are mounted and available. I am finding analogues online (the same issue but with an external HDD, a USB drive, or an SMB share) but nothing exactly like my situation, and I don’t feel like I have a good enough grasp of what’s going on to make changes without fully understanding them.

I also saw you could make Docker depend on the specific mount unit, found via systemctl list-unit-files, but I searched through there and didn’t see anything that jumped out.
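
(For anyone else looking, these seem to be the more reliable ways to find the mount unit for a path: list the active mount units, or let systemd-escape convert the path into the unit name it would use. Substitute your real username:)

systemctl list-units --type=mount
systemd-escape -p --suffix=mount /media/<username>/DATA
# prints the unit name, e.g. media-<username>-DATA.mount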

One final solution I saw suggested was:

[Service]
ExecStartPre=/bin/sleep 10

However, this seems kind of janky and could be unreliable. I guess I could just set it to 30 or 60 seconds or something and live with it, but I’d rather have a more intentional solution.
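
(If I did end up going the wait route, a loop that polls for the mount would at least be less arbitrary than a fixed sleep. This is only a sketch with my path, and you would probably also want to raise the start timeout, since the disks can take a few minutes:)

[Service]
TimeoutStartSec=10min
ExecStartPre=/bin/bash -c 'until mountpoint -q /media/<username>/DATA; do sleep 2; done'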

I am also wondering if using volumes would fix this, but I tried messing with them and couldn’t really get them working in a way where I could easily copy files (in some cases, A LOT of them) into or out of the volume, which is basically required to use an existing world, or mods, with most game servers. Also, since the volumes would live on RAID anyway (that’s where the data-root is), I think they might just cause the same issue.
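
For reference, the standard trick for bulk-copying into or out of a named volume seems to be a throwaway container that mounts both the volume and a host directory; names and paths below are placeholders:

# host dir -> named volume
docker run --rm -v /media/<username>/DATA/world:/from -v gameserver_data:/to alpine cp -a /from/. /to/

# named volume -> host dir (e.g. for backups)
docker run --rm -v gameserver_data:/from -v /media/<username>/DATA/backup:/to alpine cp -a /from/. /to/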

Any suggestions are welcome. Pls remember I am classified as a meat popsicle and don’t roast me too hard.

Thanks!

This is an extremely bad idea: your SD cards will fail suddenly and you will lose all your OS data, or they will slow down with read/write errors and your system will become unusable, and then you will lose most of your OS data…

Since you need to wait for a specific mount point to come online, you can use:

[Unit]
Requires=fs.target
RequiresMountsFor=/media/DATA

Make sure the fstab entry has the ‘auto’ option, otherwise it will be ignored.
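
For example, a line along these lines (UUID and filesystem type are placeholders, get the real values from blkid):

UUID=<uuid-of-the-disk>  /media/DATA  ext4  defaults,auto  0  2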

There are industrial SD cards for embedded applications that use SLC NAND, but even then, there are better choices among industrial/enterprise SSDs on the market in terms of performance and robustness.

They’re in RAID 1, so that’s less of a concern for me. It’s something the server is designed to use, because it’s an official product: Dell made the card, it shipped with the server, and the server even has a dedicated slot and BIOS/iDRAC tooling for it. So it’s probably safer than using an SD card in a Raspberry Pi, which people seem to get at least a couple of years out of.

I don’t store important data on it anyway; I use the HDDs (hence the issue I am having), which are also in RAID 1 and periodically backed up to my TrueNAS SCALE ZFS server. And the server is not doing anything important, just local-network stuff in Docker like game servers, Home Assistant, Jellyfin and Navidrome (with the actual music and videos stored on the ZFS NAS, btw), etc. It also has an identical backup Ubuntu install on a USB key in the internal USB port, so if both cards were to fail at the same time and I absolutely needed the server back up and running immediately, I could just boot to that as a failover.

Since you need to wait for a specific mount point to come online, you can use:

[Unit]
Requires=fs.target
RequiresMountsFor=/media/DATA

Make sure the fstab entry has the ‘auto’ option, otherwise it will be ignored.

I will give this a shot, thank you so much!

EDIT: It’s best to do that as an override file and not directly edit the systemd unit file, correct?

There definitely are. But they are a lot more expensive than two SanDisk 64 GB SD cards and an add-in board that came with the server. :slight_smile:

I know, and it is supposed to be running ESXi, not an OS that constantly writes to the card…
TrueNAS would be OK, but plain Ubuntu is not…

Absolutely, given the same amount of writes, it will take more time for an SD to fail, but it will fail, and these modules are infamous for reporting failures even when the cards are still ok…

In the long run, yes.
Just for testing whether it does what you need, editing the unit directly is fine, and it’s one less thing to check if it doesn’t behave during testing…

I’m just not that assed about it, to be honest. The cards were brand new when I installed them a few months ago, they are name brand, they’re in RAID 1, the server is doing nothing critical, nothing on it is irreplaceable, and even the data I care about like my family’s Minecraft world and such is stored on an HDD, backed up to a local RAID 1 HDD array, and further backed up to a ZFS server running a RAIDZ1. At worst, it’s an hour’s headache replacing the cards and reconfiguring a few things. But I do understand your concern.

If anyone finds this having the same problem, the solution actually turned out to be slightly different.

As I went to implement the suggested solution, I noticed @MadMatt said to use RequiresMountsFor=/media/DATA, but the drive was mounted at /media/magilla/DATA, since the path includes the username (mine is ‘magilla’, as in the gorilla). So I started looking into whether that matters, because I suspected it did, and if so, which was the correct path.

I found this post discussing the difference between /media and /mnt, and then, as I was looking at the folder, I realized that the drives might not be mounting correctly on boot at all. I found this post about how to make sure the drives are mounted automatically, and followed the solution given in the first answer, which appears to have worked and fixed the issue entirely.
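
For anyone else ending up here, the end state is essentially two fstab entries so the disks are mounted at boot, independent of any user session. The UUIDs and the filesystem type below are placeholders (get the real values from blkid), and mount -a lets you test the entries without rebooting:

# /etc/fstab
UUID=<uuid-of-RAID>  /mnt/RAID  ext4  defaults  0  2
UUID=<uuid-of-DATA>  /mnt/DATA  ext4  defaults  0  2

sudo mount -a
findmnt /mnt/RAID
findmnt /mnt/DATA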

As a bonus, things are running a lot better now, so that’s great.

Just in case, I did still implement the solution @MadMatt provided in this thread, but instead used:

[Unit]
Requires=fs.target
RequiresMountsFor=/mnt/DATA /mnt/RAID

I did that by creating an override file: /etc/systemd/system/docker.service.d/override.conf
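
For reference, the steps were roughly: run systemctl edit (which creates that override.conf for you), paste the [Unit] block above, then restart Docker; systemctl cat just confirms the drop-in is being picked up:

sudo systemctl edit docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
systemctl cat docker.service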

I don’t know if it helped, but at worst it does not appear to have caused any issues so far.

After that, I went back and edited the daemon.json file to use the new RAID mount location, and changed the bind mounts for my containers to use the new DATA mount location.
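
Since the disks themselves didn’t move (only the mount points changed), that part was just pointing things at the new paths and restarting, roughly:

# /etc/docker/daemon.json
{ "data-root": "/mnt/RAID/docker" }

sudo systemctl restart docker
docker info | grep "Docker Root Dir"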

I appreciate the help!

-magilla