A Rocky Migration: Moving from docker-compose to Podman and gVisor

I've been running a few containers for several years, all under rootless Docker with a single user.

Initially, I planned to migrate the containers to VMs, but I couldn't get a stable workflow after about two months of effort. Later, gVisor caught my attention, and I decided to migrate to Podman with gVisor instead.

The new plan is to run each container with --userns=auto and use Quadlet for systemd integration. This approach provides better isolation and makes writing firewall rules easier.
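A minimal Quadlet unit along these lines might look as follows (a sketch: the image name is a placeholder, and I'm assuming runsc is registered as a Podman runtime):

    # myapp.container -- a minimal Quadlet sketch
    [Container]
    Image=docker.io/library/myapp:latest
    UserNS=auto
    # assuming gVisor's runsc is registered as a Podman runtime
    GlobalArgs=--runtime=runsc

    [Install]
    WantedBy=default.target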

I'm now close to migrating all my containers. Here are some rough edges I'd like to share.

Network Layout

I compared various networking options and spent a few hours trying the one-interface-per-group approach before giving up. I settled on a single macvlan network and decided to use static IP addresses for my containers.

To prevent a randomly assigned IP address from conflicting with a predefined one, I allocated a large IP range for my containers and picked the predefined addresses at random from that range, which makes collisions unlikely.
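Sketched out (the parent interface and all addresses are placeholders):

    # A large subnet leaves plenty of room, so static addresses picked
    # at random from it are unlikely to collide with auto-assigned ones.
    podman network create -d macvlan -o parent=eth0 \
      --subnet 10.89.0.0/16 ctr-net

    # A container with a predefined (randomly picked) static address:
    podman run --network ctr-net --ip 10.89.211.37 ...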

Routing Issues

I ran into a tricky routing problem. Let's say my host has a network interface eth0 and a veth pair where veth0-host is on the host and veth0-ctr is in the container.

Here are the IP addresses:

  • eth0: 192.168.0.1
  • veth0-host: 192.168.13.1
  • veth0-ctr: 192.168.13.100

To allow an external client to talk to a service on 192.168.13.100:1234, I set up a prerouting DNAT rule in nftables that rewrites traffic destined for 192.168.0.1:1234 to 192.168.13.100:1234. To my surprise, the host itself couldn't access 192.168.0.1:1234.
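For reference, the rule was along these lines, expressed as nft commands (the table and chain names are mine):

    nft add table ip nat
    nft 'add chain ip nat prerouting { type nat hook prerouting priority dstnat; }'
    nft add rule ip nat prerouting ip daddr 192.168.0.1 tcp dport 1234 \
        dnat to 192.168.13.100:1234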

It turned out there were two issues:

  1. DNAT in the prerouting chain doesn't apply to traffic generated by the host itself: locally originated packets go through the output hook, not prerouting.
  2. Masquerade and SNAT didn't work either, likely because the kernel short-circuits locally destined traffic, so delivery effectively happens at Layer 2.

Because of this, the service at 192.168.13.100 would send reply packets to 192.168.0.1 instead of 192.168.13.1. The packet would still be received on veth0-host, but my firewall would complain that this violates the "strong host model".

I didn't have this issue before, probably because port forwarding was handled at Layer 3 by slirp4netns. This seems to be a "hairpin NAT" issue, but it's more complicated with a veth pair.

In the end, I just configured the host to talk to the container at 192.168.13.100 directly.

File Permissions

Because I'm using --userns=auto, the :U flag is almost a must when mounting volumes. Surprisingly, this didn't cause too many problems as long as I set up the correct user and group.

Sometimes a container needs to access files on the host. If a group HOST_GID has access to a file, we can grant access to the container's primary user with --userns=auto:gidmapping=$CONTAINER_GID:$HOST_GID:1 --group-add $CONTAINER_GID. Here, CONTAINER_GID is an unused GID inside the container.
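Put together, the run command looks something like this (the volume path is a placeholder):

    # HOST_GID is the group that owns the files on the host;
    # CONTAINER_GID is an unused GID inside the container.
    podman run \
      --userns=auto:gidmapping=$CONTAINER_GID:$HOST_GID:1 \
      --group-add $CONTAINER_GID \
      -v /srv/shared:/shared \
      ...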

However, this only works well with the default crun runtime. With gVisor, I found two problems:

First, the group permission only takes effect if CONTAINER_GID is the primary group of the container's main user. It doesn't work as a supplementary group.

Second, gVisor does not seem to support POSIX ACLs. This means the CAP_DAC_OVERRIDE capability is needed whenever CONTAINER_GID is granted access only through an ACL rather than the standard Unix permission bits.
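Under gVisor, then, the working variant looks roughly like this (a sketch: --user forces CONTAINER_GID to be the primary group, and CAP_DAC_OVERRIDE covers files where only an ACL would grant access):

    podman run --runtime runsc \
      --userns=auto:gidmapping=$CONTAINER_GID:$HOST_GID:1 \
      --user $CONTAINER_UID:$CONTAINER_GID \
      --cap-add DAC_OVERRIDE \
      -v /srv/shared:/shared \
      ...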

It's not too bad in practice, but it was surprising until I figured out what was going on.

Default DNS Server

Docker provides a DNS server at 127.0.0.11 for each container. Podman, however, creates a dedicated DNS server for each bridge network. Some of my containers relied on the Docker behavior and had this IP address hard-coded, which caused quite a bit of trouble.

DNS Servers for Multiple Networks

If a container joins both an internal bridge network and an external macvlan network, the container only sees the DNS server from the bridge network. IP routing still works, meaning the container can access the internet by IP, but it can't resolve external domains.

This is a bug in Podman that has been fixed in the latest version, but it still exists in the version I'm using from Debian. Below are the hacks I considered for this old version.

Option 1: Override DNS

If the container doesn't need to resolve internal container names, we can force it to use an external DNS server. However, the --dns flag didn't work as I expected.

According to Issue #17500, the server passed via --dns is added as an upstream of the internal DNS server, so that internal container names are still resolved first. This doesn't work in my case because the internal DNS server can't reach the internet, so it can't forward the query.

A simple workaround is to override /etc/resolv.conf using a bind mount.
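For example (the resolver address and file path are placeholders):

    # a fixed resolv.conf maintained on the host
    printf 'nameserver 9.9.9.9\n' > /etc/containers/myapp.resolv.conf
    podman run -v /etc/containers/myapp.resolv.conf:/etc/resolv.conf:ro ...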

Option 2: HTTP Proxy

If the container supports an HTTP proxy, we can remove it from the external network and instead attach a proxy container to both the internal and external networks. The proxy can use its own external DNS server, and the original container reaches the proxy via its internal IP.

This should work in theory, but it felt like too much effort. It also surprised me that nginx doesn't support proxying HTTPS traffic (the CONNECT method) without extra effort. If I had to go this route, I would probably use mitmproxy.

Option 3: Transparent HTTP Proxy

I have set up a transparent HTTP proxy in my network. The container thinks it is talking to an external server, but my firewall redirects the traffic to an nginx server, which can forward or reject the traffic. Unlike a normal HTTP proxy, this is easy to do with nginx and is completely transparent to the client.

Now, if a container only needs to access a few known HTTP(S) domains, I just add entries for those domains to the container's /etc/hosts file with an arbitrary IP (like 1.1.1.1), using AddHost=. Nginx ignores this IP entirely; it reads the domain from the request and resolves it on its own. It's very hacky but also very practical.
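As an illustration, the container side is just an AddHost= entry in the Quadlet unit (the domain and dummy IP are placeholders):

    [Container]
    AddHost=api.example.com:1.1.1.1

On the proxy side, a minimal nginx sketch for HTTPS can recover the real domain from the TLS SNI, assuming the firewall redirects port 443 to this listener (the resolver is a placeholder):

    stream {
        resolver 9.9.9.9;
        server {
            listen 443;
            # read the real destination from the client's SNI
            ssl_preread on;
            proxy_pass $ssl_preread_server_name:443;
        }
    }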

Namespaces

Some containers assume they share the same user namespace, which is common when they are running under docker-compose.

Podman has --userns=container:id to join an existing container's user namespace, but this doesn't work with gVisor. From what I've learned, this is related to gVisor's sandbox model.

The solution is to put containers into a pod. However, with gVisor, containers will not join the network namespace of the pod due to the same security model. This wasn't a big deal for my use case, but it was unexpected.

The same thing happens with the UTS namespace. When running with gVisor, a container's hostname becomes empty if it joins a pod. Issue #7995 is relevant here. Apparently, some binaries (like busybox's sendmail) don't like an empty hostname. The solution is to give the container a private UTS namespace: --uts=private.
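Putting the two together, the setup looks roughly like this (the pod and container names are placeholders):

    podman pod create --userns=auto --name mypod
    # with gVisor, give each member a private UTS namespace so it
    # gets a hostname at all:
    podman run --pod mypod --uts=private --name myapp ...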

Shared Volume

Some of my containers use flock on a shared volume to communicate. I think this is a bad design. And guess what? It doesn't work with gVisor.

I thought I might need to set up mount-hint annotations, but that didn't help, so I guess inotify wasn't the issue. Turning on shared file access (runsc's --file-access=shared) didn't work either.
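For the record, the mount-hint attempt looked roughly like this (annotation keys as I understood them from gVisor's documentation; the volume name and paths are placeholders, and it didn't help in my case):

    podman run --runtime runsc \
      --annotation dev.gvisor.spec.mount.sharedvol.source=/srv/sharedvol \
      --annotation dev.gvisor.spec.mount.sharedvol.type=bind \
      --annotation dev.gvisor.spec.mount.sharedvol.share=shared \
      -v /srv/sharedvol:/shared \
      ...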

Other Notes

I had planned to use socket activation extensively, but it turned out I didn't. Most containers need networking anyway, and it's much easier to manage port forwarding with simple firewall rules.

It is possible to mount an empty volume over a path that already contains files; the existing files are copied up into the volume, so it acts like a writable upper layer. This is very useful for read-only containers.
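For example (the path is a placeholder):

    # --read-only locks the image down; the anonymous volume at
    # /var/lib/myapp stays writable, and the image's existing files
    # there are copied into the empty volume on first use.
    podman run --read-only -v /var/lib/myapp ...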

Final Thoughts

This wasn't a trivial migration, but I guess that was expected since I was changing so many variables at once. In any case, I'm quite happy with the new setup.
