
VM Networking From Scratch

Now that I've settled on my VM image pipeline, the next logical step is to tackle networking.

My Requirements

So far, I've been using QEMU's default user-mode networking. It's convenient for quick tasks, allowing for easy port forwarding, Samba shares, and DNS with just a few flags. However, this setup is ultimately insufficient for my needs for a couple of key reasons:

  • Security and Isolation: In the default user-mode setup, a VM can access the host's services via localhost. Worse, because it uses NAT, the VM can also access the host's entire LAN using the host's IP address. Ideally, VMs should have their own identifiable IP addresses, and more importantly, there should be strong network isolation between the host and the VMs.

  • Centralized Auditing: I want to audit all network traffic from my VMs through a centralized solution. This means I need a way to route all VM traffic through a single point of control.

Choosing the Right Tool

For most people, tools like libvirt or Incus are the best choice for this task. They are well-maintained, thoroughly tested, and have well-designed command-line interfaces that are less error-prone. I should probably just choose one of them and be done with it.

...except that I'm genuinely interested in learning the underlying building blocks and terminology. This is the main reason I chose to write QEMU scripts manually in the first place. Meanwhile, I find myself constantly referring to the documentation for these tools anyway when I'm studying security options.

Maybe one day, when I'm satisfied with my knowledge, I'll migrate my scripts to one of these excellent tools. But for now, let's learn by suffering, er, doing.

Bridge + TAP

As many guides suggest, for anything beyond basic networking, the place to start is with a Linux bridge and TAP devices. A bridge acts like a virtual network switch, and a TAP device acts like a virtual network port for a VM to connect to that switch.

Thankfully, systemd-networkd makes this setup fairly easy. In the .network file for my bridge, setting IPv4Forwarding=yes and IPMasquerade=ipv4 saves me from writing custom nftables rules for NAT, which is a huge time-saver. QEMU also makes it simple to attach a VM's network interface to an existing TAP device.
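For reference, the bridge side of this boils down to two small files. A minimal sketch, assuming a bridge named vmbr0 on 10.0.0.0/24 (the name and addresses are hypothetical placeholders, not my real values):

    # vmbr0.netdev: define the bridge itself
    [NetDev]
    Name=vmbr0
    Kind=bridge

    # vmbr0.network: give the bridge an address, enable forwarding and NAT
    [Match]
    Name=vmbr0

    [Network]
    Address=10.0.0.1/24
    IPv4Forwarding=yes
    IPMasquerade=ipv4

(Note that IPv4Forwarding= is relatively new; older systemd releases spelled it IPForward=.)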

To keep things tidy, I decided to automatically generate the systemd-networkd configuration files (e.g., .netdev and .network) directly from my VM configuration files. I save these generated files to /run/systemd/network/. This ensures I don't have to manually keep two sets of configurations in sync.
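For a single VM, the generated files might look roughly like this (the TAP name and MAC address are hypothetical), together with the QEMU flags that attach the guest NIC to the pre-created TAP device:

    # vm0-tap.netdev: a persistent TAP device for the VM
    [NetDev]
    Name=vm0-tap
    Kind=tap

    # vm0-tap.network: enslave the TAP device to the bridge
    [Match]
    Name=vm0-tap

    [Network]
    Bridge=vmbr0

    # attach the guest to the existing TAP device; script=no stops QEMU
    # from running its own ifup helper
    qemu-system-x86_64 ... \
      -netdev tap,id=net0,ifname=vm0-tap,script=no,downscript=no \
      -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:56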

IP Addresses

The easiest way to assign IP addresses to VMs is to run a DHCP server on the bridge. Most standard cloud images, including bootc images, are configured to use DHCP by default.

However, I ultimately decided to use static IP addresses. Setting up a DHCP server securely, whether on the host or in a dedicated VM, takes some effort. Even with a DHCP server, I would likely configure static reservations to make it easier to write firewall rules to prevent IP address spoofing.

So, my process for each VM looks like this:

  1. Generate a unique MAC address and a static IP address, and store them in the VM's configuration file.

  2. Before starting the VM, generate a temporary systemd-networkd .network file that matches the VM's MAC address and configures its static IP, gateway, and DNS settings.

  3. Pass these configuration files into the VM at boot time using systemd's network.* credentials feature.
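Concretely, steps 2 and 3 might look like the sketch below (MAC, addresses, and file names are hypothetical). On recent systemd versions, systemd-network-generator inside the guest picks up network.network.* credentials and writes them into /run/systemd/network/:

    # 10-vm0.network: generated on the host from the VM's configuration
    [Match]
    MACAddress=52:54:00:12:34:56

    [Network]
    Address=10.0.0.10/24
    Gateway=10.0.0.1
    DNS=10.0.0.1

    # pass the file as a systemd credential via an SMBIOS OEM string
    qemu-system-x86_64 ... \
      -smbios "type=11,value=io.systemd.credential.binary:network.network.10-vm0=$(base64 -w0 10-vm0.network)"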

This should work perfectly... right?

Wrong! I quickly discovered that CentOS does not ship systemd-networkd (1, 2, 3).

After looking through the official options for bootc images, I settled on using NetworkManager. This requires me to generate a NetworkManager keyfile and embed it into the container image. This isn't ideal because updating the network configuration requires rebuilding the image, which is slow. In the future, I might explore better options, such as:

  • Separating the Linux kernel from the image and booting it directly with QEMU, allowing me to pass network configuration via kernel parameters.

  • Using a different base image that includes systemd-networkd.
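For the record, the keyfile approach I settled on looks roughly like this (interface name and addresses are hypothetical); NetworkManager requires keyfiles to be root-owned with mode 600:

    # vm0.nmconnection
    [connection]
    id=vm0-static
    type=ethernet
    interface-name=enp1s0

    [ipv4]
    method=manual
    address1=10.0.0.10/24,10.0.0.1
    dns=10.0.0.1;

    # Containerfile: bake the keyfile into the bootc image
    COPY vm0.nmconnection /etc/NetworkManager/system-connections/
    RUN chmod 600 /etc/NetworkManager/system-connections/vm0.nmconnection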

Inter-VM Traffic

By default, all VMs connected to the same bridge can communicate with each other freely. This isn't what I want; my goal is to enforce a "default deny" policy and only allow traffic that is explicitly permitted.

After some research (with a bit of help from AI), I learned a few key terms: port isolation, private VLAN, and proxy ARP. It turns out these concepts are perfect for my use case.

Here’s what I discovered when I put it into practice:

I started with a standard bridge and TAP setup, with host firewall rules in nftables that block all traffic. As expected, the VMs could not connect to the internet. However, they could still talk to each other. Why?

A quick debugging session with nft monitor revealed that packets traveling between VMs on the same bridge never hit my inet family firewall rules. This is because the bridge was forwarding the traffic at Layer 2 (like a real switch), so the host's Layer 3 IP-level firewall was never consulted. nftables has a bridge family specifically for filtering this kind of traffic.
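For completeness, a default-deny chain in the bridge family looks like this; a minimal sketch, not my actual ruleset:

    table bridge vm_l2 {
        chain forward {
            # filter frames the bridge forwards between ports at Layer 2
            type filter hook forward priority 0; policy drop;
        }
    }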

Next, I enabled port isolation on the bridge. Now, even the bridge family rules couldn't see the packets between VMs. This confirmed that port isolation operates at an even lower level, preventing the bridge from forwarding frames between isolated ports altogether.
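Port isolation itself is one extra section in each TAP's .network file (or, equivalently, `bridge link set dev vm0-tap isolated on` with iproute2):

    # vm0-tap.network: enslave to the bridge as an isolated port
    [Match]
    Name=vm0-tap

    [Network]
    Bridge=vmbr0

    [Bridge]
    Isolated=true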

This gave me the perfect foundation. Now, if I want to allow two VMs to communicate, I have to do it explicitly. I have two main options:

  1. Force Gateway Routing: I can remove the local subnet route inside each VM, forcing them to send all packets (even to other VMs on the same subnet) to the bridge's gateway IP address. The host's routing stack will then receive the packets, which can be filtered by my standard inet family nftables rules.

  2. Use Proxy ARP: I can enable IPv4ProxyARPPrivateVLAN=yes on the bridge's network configuration. The host will then respond to ARP requests on behalf of the VMs. This tricks the VMs into sending all their packets to the host's MAC address.
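In networkd terms, option #2 is a single extra line in the bridge's .network file (again with my hypothetical names):

    [Match]
    Name=vmbr0

    [Network]
    Address=10.0.0.1/24
    IPv4ProxyARPPrivateVLAN=yes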

Ultimately, both options achieve the same goal: they force Layer 2 traffic up to Layer 3, where it can be inspected by a central firewall. Option #2 is more elegant and less hacky, since it doesn't require custom network configuration inside the VMs.

Notes:

  • My initial assumption was that with Proxy ARP (Option #2), the traffic would be captured by the bridge family in nftables. This is incorrect. The ARP resolution happens at Layer 2, but the subsequent IP packets are routed, so they are captured by the ip or inet families.

  • Proxy ARP doesn't remove the need for a Layer 3 firewall. A malicious VM could simply add its own static routes (as in Option #1) to try and communicate directly. The key is to have a firewall at the gateway that inspects all traffic, ensuring that even if a VM tries to bypass the intended path, the traffic is still filtered. The main benefit of port isolation is preventing direct, unfiltered Layer 2 communication.
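As a sketch, that central Layer 3 policy could look like the following inet-family ruleset (the addresses and the allowed service are hypothetical):

    table inet vm_l3 {
        chain forward {
            type filter hook forward priority 0; policy drop;
            ct state established,related accept
            # hypothetical: vm0 may reach vm1 over SSH, nothing else
            ip saddr 10.0.0.10 ip daddr 10.0.0.11 tcp dport 22 accept
        }
    }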

Outgoing Traffic

For controlling traffic leaving the host, I have a draft plan that provides strong isolation. The idea is to create a dedicated firewall VM.

  1. On the host, I'll set up two bridges: bridge-internal and bridge-external.

  2. All my regular VMs will be connected to bridge-internal. The host itself will not have an IP address on this bridge. This ensures the VMs cannot directly talk to the host. If needed, I can set up SSH access over vsock.

  3. I will set up a special firewall VM that has two network interfaces: one connected to bridge-internal and the other to bridge-external.

  4. The host's physical network interface will be connected to bridge-external.
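In QEMU terms, the firewall VM simply gets two NICs, each backed by a TAP enslaved to a different bridge (all names here are hypothetical):

    # fw-int-tap is enslaved to bridge-internal, fw-ext-tap to bridge-external
    qemu-system-x86_64 ... \
      -netdev tap,id=int0,ifname=fw-int-tap,script=no,downscript=no \
      -device virtio-net-pci,netdev=int0,mac=52:54:00:aa:00:01 \
      -netdev tap,id=ext0,ifname=fw-ext-tap,script=no,downscript=no \
      -device virtio-net-pci,netdev=ext0,mac=52:54:00:aa:00:02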

With this setup, all outgoing traffic from the VMs must pass through the firewall VM, giving me a single place to manage all rules. It also isolates the host's network stack from the VMs by default.

As for the services my VMs need to reach, I can configure the firewall VM accordingly:

  • For non-HTTP services like NTP, I can set up forwarding or proxy rules.

  • For HTTP/HTTPS traffic, I can set up a transparent proxy using Nginx. Previously, I thought this would require a separate proxy configuration for each domain (e.g., a dedicated proxy and DNS entry per domain), but AI showed me a much better way:

    • Nginx's ngx_stream_ssl_preread_module allows it to inspect the SNI (Server Name Indication) in the TLS handshake without decrypting the traffic.

    • I can use firewall rules to redirect all outgoing HTTPS traffic from bridge-internal to this Nginx stream proxy.

    • In the proxy, I can maintain a simple allowlist of domains and block everything else.
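A minimal sketch of that proxy, running inside the firewall VM (the listen port, resolver, and allowlist entries are all hypothetical):

    stream {
        # map the SNI to a backend; anything not on the allowlist maps to an
        # empty string, which makes proxy_pass fail and drops the connection
        map $ssl_preread_server_name $backend {
            example.com   $ssl_preread_server_name:443;
            default       "";
        }

        server {
            listen 8443;        # firewall rules redirect VM port-443 traffic here
            resolver 10.0.0.1;  # resolve the SNI hostname at runtime
            ssl_preread on;
            proxy_pass $backend;
        }
    }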

I plan to explore this design further. For example, is it better to use the host as the firewall? Or split the firewall services into multiple VMs? Could macvlan be useful here? These are questions for a future post.

Conclusion

In the end, I've replaced QEMU's basic networking with a much more secure, custom setup. Using a Linux bridge and port isolation, I can now force all VM traffic through a central firewall for inspection.

While it was more work than using a tool like libvirt, building this from scratch was a fantastic way to learn the fundamentals of VM networking and gain complete control over my environment.
