Why no Docker in LXC?

At the tail end of a recent video from Wendell, A (Re)Certifiably INSANE 330tb RAW Home Server Build (~23:30 minute mark), he talks about a bug he is hitting when running Docker in an LXC container on Proxmox that causes a crash. Almost as a throwaway comment, he says you should not be running Docker in LXC. I've never heard this before, and it's contrary to a lot of information I've come across.

Is anyone able to expand on this thought? What is the suggested alternative? Docker in a VM or directly on the Proxmox host?

Thanks in advance.

1 Like

It's not the best idea, because both do somewhat the same thing. Security in particular is harder to get right, as an unprivileged LXC doesn't always play well with Docker.

Doesn't mean you can't, though. I'm running 3 LXC containers with a Docker agent on each, to separate resources between sets of Docker containers. But it is definitely not the safest way to deploy them. A VM is safer and will cause fewer problems.
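For context, the usual way to get Docker going inside a Proxmox LXC is to enable the nesting feature (plus keyctl for unprivileged containers). A minimal sketch, with 101 as a stand-in container ID rather than anything from my actual setup:

```
# Allow nested containers; keyctl is needed for Docker in unprivileged LXCs
pct set 101 --features nesting=1,keyctl=1

# Apply the change, then install Docker inside the container
pct reboot 101
pct exec 101 -- sh -c "apt update && apt install -y docker.io"
```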

1 Like

Wrapping a bun in two layers of foil… Can’t you just have a bun in the VM?

Will this VM have multiple buns that we need to separate for the greater good?

Does double separation make sense?

1 Like

If you only have one bun, that somewhat makes sense; as soon as there are two or more buns, the extra layer isolates one bun from the others… You don't want your hot barbacoa taco to mix with your salad and your mac and cheese…

You want to use the exterior foil to isolate the inner buns from the basket that may change at any point in time?

2 Likes

since some update from October, if you've got a heavy load in docker in lxc the host crashes hard. confirmed it doesn't happen outside lxc. thought I had a hardware problem, but no, it happens on multiple proxmox hosts.

the docker issue tracker intermittently has reports of similar happenings… it's baaaaack, whatever it is.

only reason I do it is to pass through zfs, but NFS is often just as fast or faster, depending on your nic
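For anyone wanting to try either route, the ZFS passthrough and the NFS alternative look roughly like this; the dataset paths, container ID, NAS address and volume name below are all placeholders, not anything from the setup described above:

```
# Proxmox side: bind-mount a host ZFS dataset into LXC container 101
pct set 101 --mp0 /tank/appdata,mp=/srv/appdata

# NFS alternative: skip the bind mount and give Docker an NFS-backed volume instead
docker volume create --driver local \
  --opt type=nfs --opt o=addr=192.168.1.10,rw \
  --opt device=:/tank/appdata appdata
```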

4 Likes

Mhm that’s what I always thought too, it sounds reasonable logically.

What if we used a VM per bun? It would take a lot more resources than a container, but we would still get that isolation per bun.

You can also argue about whether a bun per container in a VM makes that much sense in terms of security.
If a container is considered good isolation for a bun, then you could just as well skip the VM and keep the containers with buns on the host directly.

What I want to point out is that neither a VM nor a container provides extreme isolation for our buns, and that's why paranoia forces double wrapping of the bun.
So we're back to the beginning, and we can ask ourselves about the quality of containerization. :slight_smile:

2 Likes

We live in a free IT world, so nobody / no external feeling forces anything on anyone :smile:
Security is a rabbit hole that will take you to multiple places, at once, repeatedly; some people like that for the sake of feeling more secure, some don't.
I can't speak for others, but for me, having multiple containers (about 60 at last count) running on a VM that runs on a hypervisor is convenience…
Sure, I could have 60 VMs, a resource requirement ten times bigger than the 8GB I give to my container VM, and God forbid a month's worth of work when patching…
With the bun in double foil I need to patch exactly twice: once the hypervisor, once the VM; then image updates take care of the rest.
Also, my Docker VM was originally created on a Synology using VirtualBox, then moved to FreeNAS using bhyve, then to Scale using KVM, then to Proxmox; the only changes were the hypervisor-specific drivers…
Is it super secure? Definitely not. Is it easy to maintain? Absolutely. Is it easy to add/try different services without breaking things? Absolutely, and that's what has worked for me over the past years…
Will it work for you? No idea :man_shrugging:

2 Likes

I agree. :slight_smile:

I look at things more from the network side… If I tell someone you need two firewalls in a row, they’ll look at me like I’m wearing a tinfoil hat. :slight_smile:
But the devil is in the details. Two firewalls, but completely different from each other: one x86, the other IBM Power, two different brands of network cards, two different OSes, one based on BSD, the other on Linux.
Although both do exactly the same thing, the philosophy behind it is that they have nothing in common, in terms of both software and certain hardware components, in order to eliminate deeply hidden 0-days.

I’m mainly thinking here about a firewall that stands directly at the LAN-WAN interface and is constantly exposed to everything the Internet can aim at it. If an attacker finds a way to get through No. 1 or take it over, the second FW should be different enough to not allow this method.

Whelp, I need to go rework a reverse proxy now I guess.

Welcome to the forum!

It depends. To counter some of the arguments brought up (or rather, to add to the discussion, as I’m not directly debating “which one is best” flame war bait questions), here are some examples of why I would use one or the other.

I'll start with the obvious, i.e. running OCI containers in VMs. This offers the security and isolation of a VM and the stability of running containers directly on an OS. I'd use this if I don't trust the workload and want to isolate it from other workloads, but without the investment of dedicated hardware for it. It's also a good idea if you want to secure the workload from other workloads, like, say, a container that runs something that's supposed to stay secret (like a VaultWarden setup for the ultra-paranoid).
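As a sketch of that last point, running the secret workload inside its dedicated VM could look something like the below; the container name, port and data directory are placeholders, not a recommendation:

```
# Inside the dedicated VM -- adjust names, paths and ports to taste
docker run -d --name vaultwarden \
  --restart unless-stopped \
  -v /srv/vaultwarden:/data \
  -p 127.0.0.1:8080:80 \
  vaultwarden/server:latest
```

Binding to 127.0.0.1 and putting a TLS-terminating reverse proxy in front is the usual pattern; the point is that even if this container gets popped, the blast radius stops at the VM.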

On the plus side, you get actual live-migration support (if you cluster), which helps with host maintenance without downtime, and if you run a small OS in the VMs (like Alpine) or a slow-moving stable distro (like Debian), you won't be getting lots of updates very often (but updates are still something to consider).

The big disadvantage of this is scalability. You have to update the hypervisor, all the VMs and the containers (which can be automated, so it’s not all doom-and-gloom), but the more important part is the hardware requirements. Virtualization is expensive if you plan to run a bunch of workloads, especially if you’re into the low-end hardware that sips power at idle and isn’t a power-hog at full blast (generally the really low-end PC and the SBC market).
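To give an idea of what the "can be automated" part might look like inside each Debian/Ubuntu-based VM, here's a rough sketch; the /srv/stack compose project path is just a placeholder:

```
# Let the OS patch itself
apt install -y unattended-upgrades

# Refresh the containers on a schedule -- a weekly cron job is usually enough
cat <<'EOF' > /etc/cron.weekly/update-containers
#!/bin/sh
cd /srv/stack && docker compose pull && docker compose up -d && docker image prune -f
EOF
chmod +x /etc/cron.weekly/update-containers
```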

That gets us to running OCI containers on bare metal. You get none of the security (other than the built-in cgroups stuff from the kernel and whatever user permissions are handled by Unix groups), but you waste no resources on virtualization. If I trust all my workloads (and if I run them, that almost always means I trust them), then this is an OK setup. You update the containers as usual and the OS too, but when you update the OS, you'll need to reboot and take downtime on your workloads.

If you use something like k8s / k3s / k0s / microk8s (and whatever other k*s there are), then you will get a bit of downtime while the service gets started on a different host, and it's up to your software to handle the failover (i.e. something like vaultwarden will definitely be down for a couple of seconds, while something like a personal website might not even have time to give you an HTTP 500 or a site-unreachable error).
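For what it's worth, a tiny sketch of that behaviour with plain kubectl (the node and deployment names are made up):

```
# A throwaway deployment with two replicas
kubectl create deployment whoami --image=traefik/whoami --replicas=2
kubectl expose deployment whoami --port=80

# Draining a node (e.g. for host maintenance) evicts its pods and reschedules them;
# with a single replica you'd eat a few seconds of downtime, with two you usually won't
kubectl drain node-b --ignore-daemonsets --delete-emptydir-data
kubectl uncordon node-b
```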

But now we have a problem. If your host OS is a hypervisor, where you're not technically supposed to mess with the main OS, then you should try your best to sandbox your workloads, to protect the main system from real and / or potential failures (although I still encourage people to exercise that freedom, which is why I suggested in the past that people run Proxmox for VMs and LXC and install Portainer directly on Proxmox and run containers on the host - just be mindful that this is not a supported configuration and it's on you to troubleshoot things).
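In case anyone wants to try that unsupported route, it boils down to roughly this (Proxmox VE is Debian-based, so docker.io comes straight from the Debian repos; the Portainer command is its standard CE install, quoted from memory, so double-check against their docs):

```
# On the Proxmox host itself -- unsupported, you own the fallout
apt update && apt install -y docker.io

docker volume create portainer_data
docker run -d --name portainer --restart=always \
  -p 8000:8000 -p 9443:9443 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v portainer_data:/data \
  portainer/portainer-ce:latest
```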

To get around that "problem" one can simply run OCI containers inside LXC. You waste a tiny amount of resources on LXC to jump-start your OCI containers, but it's nowhere near as much as virtualization (it's mostly a couple of MB of RAM overhead). One advantage of this method, as Wendell mentioned, is that you can pass some file systems through to LXC, and the OCI container will just overlayfs on top of that.
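Concretely, the passthrough side is just a bind mount in the container's config; a hypothetical excerpt (the container ID, dataset and mount paths are placeholders):

```
# /etc/pve/lxc/101.conf (illustrative excerpt)
features: nesting=1,keyctl=1
mp0: /tank/media,mp=/mnt/media
```

Inside the LXC, something like `docker run -v /mnt/media:/media jellyfin/jellyfin` then sees the host's dataset directly, while the image layers live in the LXC's own rootfs as usual.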

You still get some downsides of VMs, in that you need to update the OS inside the LXC, and you get none of the benefits like live migration or security confinement. LXC and OCI containers both make use of cgroups, meaning you overlay groups on top of other groups. Sometimes they don't play along well, sometimes they do. The real advantage here is that your host OS is now separate from your OCI container workloads, and a misbehaving workload will affect the OS inside the LXC, not your host OS. You also get some portability (you can transfer the LXC to another host). I believe Proxmox is working on LXC live migration through CRIU, but it's nowhere near ready yet.

But the lack of live migration for both LXC and OCI containers isn't all bad. Some people have reported Jellyfin in LXC restarting on another host without any noticeable stream interruption, so even that small restart delay isn't really a problem, depending on the workload (for Jellyfin, that's likely because the stream was already buffered far enough ahead). However, keep in mind that if the LXC is highly available and gets started on another host, then once it starts, it still needs to start the OCI containers inside it, meaning you have a slightly bigger delay. Probably still not noticeable, but it's there.
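For reference, moving a container between nodes today is a "restart mode" migration, roughly along these lines (container ID and node name are placeholders, and I'd double-check the ha-manager syntax against the docs):

```
# Restart-mode migration of container 101 to node pve2
pct migrate 101 pve2 --restart

# Or register it with the HA stack so it gets restarted elsewhere after a node failure
ha-manager add ct:101 --state started
```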

The disadvantage of OCI in LXC is the complexity. A VM is arguably more complex, because you're virtualizing a whole set of hardware and running a full OS that interacts with said virtual hardware. But from a software standpoint, that's actually easier and less of a headache. In LXC, you are enabling nested containerization, which is somewhat of a security risk if your LXC container is untrusted. An LXC container with the nesting option enabled has read and write access to the host OS's /proc and /sys file systems. The underlying OCI containers won't have the same privilege, but this is already something to be wary of, from both a security and a stability standpoint. If the LXC container messes with these too much, your host OS can crash.

LXC with nesting is not inherently insecure, but it's a less secure option than the default (non-nesting) one. Just make sure that you trust the workload in the LXC (both the LXC itself and the OCI containers). But I'd say it's a trade-off, mostly around stability, that everyone should decide on for themselves when doing it.

1 Like