Today, various services (native, LXC, Docker) are running on my server. I'm mostly happy with the setup, but I decided to revisit my server's defenses under the assumption that a remote attacker or malicious code could compromise my services. A service might break out of its container or even gain root privilege.
VMs are a stronger security boundary than containers and can limit the damage if an attacker gains root privilege. I cannot afford a dedicated VM per service, so I need to group the services carefully and run one VM per group. Each group is designed around the data it accesses and the features/capabilities it requires. For example, some VMs may have access to my photos, while others may have no network access at all.
The Goal
There are two particular issues I want to address:
First, I want VM images to be easily reproducible, which makes backup and restore trivial. NixOS and GNU Guix System are great examples, where you only need to back up the configuration file. However, I don't really like them because of their domain-specific languages and overall design.
Second, I want to seal the system as much as possible: even a compromised root user inside a VM should not be able to permanently infect it. Many so-called "immutable" Linux distributions are not truly immutable; often "immutable" just means a read-only /usr. Some can be trivially bypassed via `mount -o remount,rw`, and most allow self-upgrade, meaning a malicious root user can still inject persistent code via "upgrade and reboot".
The Approach
I build the system as a bootc (bootable container) image. This lets me define the whole system with standard container tooling and shell scripts, and it offers the standard read-only-/usr kind of "immutability".
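As a rough sketch, the whole build is a Containerfile plus bootc-image-builder. The base image tag, package list, and paths below are placeholders, and the bootc-image-builder flags may differ between versions:

```sh
# Sketch: build a bootc image, then turn it into a bootable qcow2.
# Base image tag, packages, and paths are placeholders.
cat > Containerfile <<'EOF'
FROM quay.io/fedora/fedora-bootc:41
RUN dnf -y install nginx && dnf clean all
EOF
podman build -t localhost/my-vm .

# bootc-image-builder converts the container image into a disk image;
# --local takes the image from local container storage.
sudo podman run --rm --privileged \
  --security-opt label=type:unconfined_t \
  -v ./output:/output \
  -v /var/lib/containers/storage:/var/lib/containers/storage \
  quay.io/centos-bootc/bootc-image-builder:latest \
  --type qcow2 --local localhost/my-vm
```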
Furthermore, I run QEMU with `--no-reboot --snapshot`: `--snapshot` discards all writes to the disk images when QEMU exits, and `--no-reboot` turns a guest-initiated reboot into an exit. Together, they mean the system cannot update itself even with root privilege; an "upgrade and reboot" lands right back on the pristine image.
Lastly, I'll regularly build new images and restart the VM to pick up the latest security fixes.
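A systemd timer is enough for that; a transient one could look roughly like this (unit, image, and path names are placeholders):

```sh
# Sketch: rebuild the image and restart the VM weekly.
# Unit, image, and path names are placeholders.
systemd-run --on-calendar=weekly --unit=vm-rebuild \
  sh -c 'podman build -t localhost/my-vm /srv/my-vm && systemctl restart my-vm.service'
```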
This approach is essentially managing VMs like containers. It's not a new idea; frood and gokrazy are good examples of this principle.
On a side note, I also plan to learn more about KubeVirt and Nix VMs. In particular, I like the idea (from NixOS) of letting the guest directly use the Nix store from the host.
Notes about QEMU
Permanent machine-local data is stored in /var, which lives on a separate disk image.
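Putting the pieces together, the invocation looks roughly like this (paths, sizes, and devices are placeholders; the key detail is `snapshot=off` on the /var drive, which should exempt that one image from the global `--snapshot`):

```sh
# Sketch of the invocation; paths, sizes, and devices are placeholders.
# --snapshot discards all disk writes when QEMU exits; snapshot=off on the
# /var drive opts that image out, so machine-local state persists.
qemu-system-x86_64 \
  -machine q35,accel=kvm \
  -m 2G \
  --snapshot --no-reboot \
  -drive file=root.qcow2,format=qcow2,if=virtio \
  -drive file=var.qcow2,format=qcow2,if=virtio,snapshot=off
```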
Secrets are sent to QEMU via systemd credentials.
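systemd in the guest can import credentials from SMBIOS Type 11 strings (the `io.systemd.credential` prefix), so the host-side unit loads the secret with `LoadCredential=` and forwards it with one extra QEMU flag. Roughly, with a placeholder credential name:

```sh
# Sketch: one extra flag on the invocation above. The host unit declares
#   LoadCredential=vm.secret:/etc/vm.secret
# so the plaintext appears under $CREDENTIALS_DIRECTORY; systemd in the
# guest then picks it up from the SMBIOS Type 11 string. The credential
# name is a placeholder.
-smbios "type=11,value=io.systemd.credential.binary:vm.secret=$(base64 -w0 "$CREDENTIALS_DIRECTORY/vm.secret")"
```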
I tried virtiofsd, but didn't like it. I ended up with Samba anyway. Maybe I'll revisit virtiofsd later.
To shut down the VM (e.g., via systemd), I created a dedicated admin user with a narrow privilege defined in the sudoers file, so that I can run `ssh admin@vm sudo poweroff`. The SSH key pair is regenerated before each VM boot. Related: in a systemd unit, ExecStop= does not have access to LoadCredential=.
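Roughly, assuming the admin user may run nothing but poweroff (user name, paths, and the exact allow-list are placeholders):

```sh
# Guest side sketch, baked into the image: a sudoers drop-in that lets the
# admin user run poweroff and nothing else.
#   /etc/sudoers.d/admin:  admin ALL=(root) NOPASSWD: /usr/sbin/poweroff

# Host side sketch: regenerate the key pair before each boot (the public
# key is handed to the guest, e.g. as a credential like above), then use
# the private key from ExecStop=.
ssh-keygen -t ed25519 -N '' -f /run/my-vm/id_ed25519
ssh -i /run/my-vm/id_ed25519 admin@vm sudo poweroff
```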
I use `-chardev socket,logfile=...` together with `-serial` so that the systemd journal is not flooded with console output, while I can still view the log or attach to the serial console later.
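Concretely, something like the following two extra flags on the invocation above (paths and the chardev id are placeholders):

```sh
# Sketch: the serial console is exposed on a Unix socket and everything is
# also appended to a log file. Paths and the chardev id are placeholders.
-chardev socket,id=con0,path=/run/my-vm/console.sock,server=on,wait=off,logfile=/var/log/my-vm/console.log \
-serial chardev:con0

# Attach to the live console later, e.g.:
#   socat - UNIX-CONNECT:/run/my-vm/console.sock
```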
I plan to learn more about virtio-balloon and pmem later.
Conclusion
Deploying services in VMs has proven very beneficial. It allows me to shrink and harden the host OS (e.g., disable unprivileged user namespaces), and it allows me to design fine-grained access control.
Next, I'll start investigating how to organize the containers inside VMs.