I've been running a few containers for several years. They were all running under rootless Docker with a single user.
Initially, I planned to migrate the containers to VMs, but I couldn't get a stable workflow after about two months of effort. Later, gVisor caught my attention, and I decided to migrate to Podman with gVisor instead.
The new plan is to run each container with --userns=auto and use Quadlet for systemd integration. This approach provides better isolation and makes writing firewall rules easier.
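As a reference point, a Quadlet unit under this plan looks roughly like the sketch below; the file path, image, and install target are placeholder examples, not my actual setup.

```
# e.g. /etc/containers/systemd/myapp.container (path and values are examples)
[Container]
Image=docker.io/library/alpine:latest
Exec=sleep infinity
# One dedicated user namespace per container, allocated automatically by Podman.
UserNS=auto

[Install]
WantedBy=multi-user.target
```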
I'm now close to migrating all my containers. Here are a couple of rough edges I'd like to share.
Network Layout
I compared various networking options and spent a few hours trying the one-interface-per-group approach before giving up. I settled on a single macvlan network and decided to use static IP addresses for my containers.
To prevent a randomly assigned IP address from conflicting with a predefined one, I allocated a large IP range for my containers and assigned random addresses from that range.
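For context, the setup looks roughly like the following; the parent interface, subnet, and addresses are made-up examples rather than my real values.

```
# Create a single macvlan network attached to the host's NIC (example values).
podman network create -d macvlan -o parent=eth0 \
  --subnet 192.0.2.0/24 --gateway 192.0.2.1 ctr-net

# Containers that need a predefined address get a static IP from the range.
podman run -d --name web --network ctr-net --ip 192.0.2.100 \
  docker.io/library/nginx:alpine
```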
Routing Issues
I ran into a tricky routing problem. Let's say my host has a network interface eth0 and a veth pair where veth0-host is on the host and veth0-ctr is in the container.
Here are the IP addresses:
- eth0: 192.168.0.1
- veth0-host: 192.168.13.1
- veth0-ctr: 192.168.13.100
To allow an external client to talk to a service on 192.168.13.100:1234, I set up a prerouting DNAT rule in nftables to forward traffic from 192.168.0.1:1234 to 192.168.13.100:1234. To my surprise, the host itself couldn't access 192.168.0.1:1234.
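The rule was along these lines (a simplified sketch, with the table and chain reduced to the essentials):

```
table ip nat {
    chain prerouting {
        type nat hook prerouting priority dstnat; policy accept;
        # Rewrite traffic aimed at the host's address to the container.
        ip daddr 192.168.0.1 tcp dport 1234 dnat to 192.168.13.100:1234
    }
}
```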
It turned out there were two issues:
- DNAT in the prerouting chain doesn't apply to traffic originating from the host itself, since locally generated packets traverse the output hook rather than prerouting.
- Masquerade and SNAT also didn't work, likely because the kernel has a short-circuit mechanism for local traffic, so the transport happens at Layer 2.
Because of this, the service at 192.168.13.100 would send reply packets to 192.168.0.1 instead of 192.168.13.1. The packet would still be received on veth0-host, but my firewall would complain that this violates the "strong host model".
I didn't have this issue before, probably because port forwarding was handled at Layer 3 by slirp4netns. This seems to be a "hairpin NAT" issue, but it's more complicated with a veth pair.
In the end, I just configured the host to talk to the container at 192.168.13.100 directly.
File Permissions
Because I'm using --userns=auto, the :U flag is almost a must when mounting volumes. Surprisingly, this didn't cause too many problems as long as I set up the correct user and group.
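In practice the mounts look something like this (the host path is a placeholder); the :U option makes Podman chown the source to match the container's mapped user:

```
# :U recursively chowns /srv/app-data on the host to match the
# UID/GID mapping chosen by --userns=auto for this container.
podman run --rm --userns=auto -v /srv/app-data:/data:U \
  docker.io/library/alpine ls -ln /data
```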
Sometimes a container needs to access files on the host. If a group HOST_GID has access to a file, we can grant access to the container's primary user with --userns=auto:gidmapping=$CONTAINER_GID:$HOST_GID:1 --group-add $CONTAINER_GID. Here, CONTAINER_GID is an unused GID inside the container.
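Concretely, something like the following, where the GIDs are made up: host group 1005 owns the files, and 2000 is a GID that nothing inside the container uses.

```
# Map host GID 1005 to GID 2000 inside the container and add that
# group to the container's user so it can read /srv/shared.
podman run --rm \
  --userns=auto:gidmapping=2000:1005:1 \
  --group-add 2000 \
  -v /srv/shared:/shared \
  docker.io/library/alpine ls -ln /shared
```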
However, this only works well with the default crun runtime. With gVisor, I found two problems:
First, the permission only works if CONTAINER_GID is the primary group of the container's main user. It doesn't work if it's a supplementary group.
Second, gVisor does not seem to support POSIX ACLs. This means the dac_override capability is needed if CONTAINER_GID doesn't appear to have permission according to standard Unix permissions.
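So with gVisor, ACL-protected host files end up needing something like this (a sketch; the paths and GIDs are placeholders, and runsc is assumed to be registered as a Podman runtime):

```
podman --runtime runsc run --rm \
  --userns=auto:gidmapping=2000:1005:1 --group-add 2000 \
  --cap-add DAC_OVERRIDE \
  -v /srv/shared:/shared \
  docker.io/library/alpine cat /shared/report.txt
```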
It's not too bad in practice, but it was surprising until I figured out what was going on.
Default DNS Server
Docker provides a DNS server at 127.0.0.11 for each container. Podman, however, creates a dedicated DNS server for each bridge network. Some of my containers relied on the Docker behavior and had this IP address hard-coded, which caused quite a bit of trouble.
DNS Servers for Multiple Networks
If a container joins both an internal bridge network and an external macvlan network, the container only sees the DNS server from the bridge network. IP routing still works, meaning the container can access the internet by IP, but it can't resolve external domains.
This is a bug in Podman that has been fixed in the latest version, but it still exists in the version I'm using from Debian. Below are the hacks I considered for this old version.
Option 1: Override DNS
If the container doesn't need to resolve internal container names, we can force it to use an external DNS server. However, the --dns flag didn't work as I expected.
According to Issue #17500, the server from --dns is added to the network's internal DNS server as an upstream, so that internal container names can still be resolved first. This doesn't work in my case because the internal DNS server can't access the internet, so it can't forward the query.
A simple workaround is to override /etc/resolv.conf using a bind mount.
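Something like this (the file location and nameserver are arbitrary examples):

```
# Point the container at an external resolver by shadowing /etc/resolv.conf.
printf 'nameserver 9.9.9.9\n' | sudo tee /etc/containers/resolv-external.conf
podman run --rm \
  -v /etc/containers/resolv-external.conf:/etc/resolv.conf:ro \
  docker.io/library/alpine nslookup example.com
```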
Option 2: HTTP Proxy
If the container supports HTTP proxy, we can remove it from the external network. Instead, we can add a proxy container to both the internal and external networks. The proxy can use its own external DNS server, and the original container can use this proxy via its internal IP.
This should work in theory, but it felt like too much effort. It also surprised me that nginx doesn't support proxying HTTPS traffic (the CONNECT method) without extra effort. If I had to go this route, I would probably use mitmproxy.
Option 3: Transparent HTTP Proxy
I have set up a transparent HTTP proxy in my network. The container thinks it is talking to an external server, but my firewall redirects the traffic to an nginx server, which can forward or reject the traffic. Unlike a normal HTTP proxy, this is easy to do with nginx and is completely transparent to the client.
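For the HTTPS side, the core of such a setup can be sketched with nginx's stream module and SNI preread; this is a rough sketch with an arbitrary resolver, and it omits my actual allow/reject rules.

```
stream {
    resolver 9.9.9.9;

    server {
        # The firewall redirects outbound port 443 traffic here.
        listen 443;
        ssl_preread on;
        # Forward to whatever host the client named in the TLS SNI;
        # the destination IP the client originally dialed is ignored.
        proxy_pass $ssl_preread_server_name:443;
    }
}
```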
Now, if a container only needs to access a few known HTTP/S domains, I just add entries for those domains to the container's /etc/hosts file with an arbitrary IP (like 1.1.1.1), using AddHost=. Nginx completely ignores this IP; it reads the domain from the request and resolves it on its own. It's very hacky but also very practical.
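In the Quadlet unit that looks something like this (the domain and image are placeholders):

```
[Container]
Image=docker.io/library/alpine:latest
# Any IP works here; the transparent proxy only looks at the requested domain.
AddHost=api.example.com:1.1.1.1
```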
Namespaces
Some containers assume they share the same user namespace, which is common when they are running under docker-compose.
Podman has --userns=container:id to join an existing container's user namespace, but this doesn't work with gVisor. From what I've learned, this is related to gVisor's sandbox model.
The solution is to put containers into a pod. However, with gVisor, containers will not join the network namespace of the pod due to the same security model. This wasn't a big deal for my use case, but it was unexpected.
The same thing happens with the UTS namespace. When running with gVisor, a container's hostname becomes empty if it joins a pod. Issue #7995 is relevant here. Apparently, some binaries (like busybox's sendmail) don't like an empty hostname. The solution is to give the container a private UTS namespace: --uts=private.
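Put together, the pod setup ends up looking roughly like this (the names are placeholders, and runsc is assumed to be configured as a Podman runtime):

```
# Containers in the pod share its user namespace, but each keeps a
# private UTS namespace so the hostname isn't empty under gVisor.
podman pod create --name mypod --userns auto
podman --runtime runsc run -d --pod mypod --uts private \
  docker.io/library/alpine sleep infinity
```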
Shared Volume
Some of my containers use flock on a shared volume to communicate. I think this is a bad design. And guess what? It doesn't work with gVisor.
I thought I might need to set up some mount hint annotations, but that didn't help, so I guess inotify wasn't the issue. Turning on shared file access didn't work either.
Other Notes
I had planned to use socket activation extensively, but it turned out I didn't. Most containers need networking anyway, and it's much easier to manage port forwarding with simple firewall rules.
It is possible to mount an empty volume, which acts as an upper overlay on existing files at the mount destination. This is very useful for read-only containers.
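For example (a sketch; as I understand it, the image's existing content is copied into the fresh volume on first use, so the path stays populated and writable even with --read-only):

```
# Mount a brand-new named volume over a directory that already exists in
# the image; its files remain visible, and the path stays writable even
# though the container's root filesystem is read-only.
podman run --rm --read-only -v app-etc:/etc docker.io/library/alpine ls /etc
```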
Final Thoughts
This wasn't a trivial migration, but I guess that was expected since I was changing so many variables at once. In any case, I'm quite happy with the new setup.