Layer 1: The Immutable VM — Your First Wall
A virtual machine running an OS that cannot be modified, even with root access, puts a locked room inside a locked building.
Why This Layer Exists
Your host machine runs Ubuntu or Debian. It has a filesystem, SSH, package managers, log files, and dozens of ways to change its own configuration. If someone breaks into the k3s cluster at the application layer, the usual path is: escape the container → escalate to root inside the VM → pivot to the host.
This layer installs an additional room with a very specific property: even root cannot permanently change the operating system on disk. The OS partition is either cryptographically verified on every read (so tampering causes a kernel panic and reboot) or mounted completely read-only by design.
Imagine a thief gets into your home safe. Inside is another safe. The inner safe is made of a different material: no matter what tools the thief has, they cannot install a backdoor inside it — every attempt to write to its walls bounces off. That is what an immutable OS is. The “walls” are the filesystem. Even the administrator cannot permanently change them without triggering a reboot and a fresh re-download of the original.
The VM boundary (KVM/QEMU) adds a second wall: the hypervisor sits between the guest OS and your real hardware. A bug inside the guest VM cannot directly touch your host’s memory or devices without first escaping the virtual machine emulator — which is itself a much smaller attack surface than a full Linux distribution.
What you get from combining these two:
- An attacker who compromises a running container can write to memory, but those changes do not survive a reboot.
- An attacker who compromises the VM OS cannot modify the OS on disk.
- An attacker who compromises the VM’s QEMU process is confined by sVirt (per-VM AppArmor labels) and cannot reach the host or other VMs.
The Bottlerocket Dead End — Read This Before You Search
You may have read that Amazon Bottlerocket is the gold standard immutable container OS. It is — on AWS. For a self-hosted KVM setup, it is a dead end in 2026, and this guide will not lead you down that path.
Bottlerocket is not viable for self-hosted KVM + k3s as of June 2026. Using it here will cost you significant time for zero payoff.
The facts, from the official Bottlerocket GitHub:
- Bottlerocket ships in variants. The only variant designed for bare-metal/KVM is metal-k8s. The last metal-k8s variant released was metal-k8s-1.29, which reached its end-of-life in December 2024. GitHub issue #3794 explicitly states: “Bare metal variants for Kubernetes 1.29 and beyond will not be released.” There is no metal-k8s-1.30 and there will never be one.
- Bottlerocket has no k3s support. The official documentation states: “There is no variant that includes K3S at this time. Bottlerocket supports Kubernetes, but not K3S.” The Bottlerocket API manages a kubeadm-style kubelet; k3s’s embedded server model is fundamentally incompatible.
- The VMware variant (vmware-k8s) produces OVA files that require VMware’s GuestInfo API, which is unavailable on KVM.
- metal-dev (the only other bare-metal variant) is a development-only image with Docker and debug tools. It does not include Kubernetes and cannot join a cluster.
Building your own metal-k8s-equivalent using Twoliter (Bottlerocket’s build system) means compiling the entire OS from source. That is a full-time engineering project, not a homelab step.
Bottlerocket’s security model is genuinely excellent: immutable root via dm-verity, always-on SELinux, no shell, TUF-signed updates. If Bottlerocket ever gains a k3s variant, or you move to an AWS deployment, revisit it. For now: move on.
The Two Viable Paths
Two immutable operating systems work cleanly on KVM with a homelab setup. They represent different points on the security-vs-operational-flexibility spectrum.
Comparison Table
| | Flatcar Container Linux | Talos Linux v1.13 |
|---|---|---|
| Runs k3s | Yes, first-class | No — runs vanilla k8s |
| KVM image format | Official qcow2 from flatcar.org | Official qcow2 from Image Factory |
| Root filesystem | /usr read-only (overlayfs root) | SquashFS — inherently read-only at format level |
| SSH access | Yes (core user, key-based) | None — API-only via talosctl (mTLS gRPC) |
| Shell on host | Yes (bash via SSH) | No shell; 12 internal binaries only |
| SELinux | Optional, not default | Experimental permissive default in 1.13 |
| Atomic updates | A/B dual /usr partition | Image-based (talosctl upgrade) |
| ARM64 / Pi | Official arm64 qcow2 | Pi 4 official; Pi 5 community-tested |
| Provisioning | Butane → Ignition (YAML → JSON) | talosctl + machine-config YAML |
| Operational complexity | Low — familiar Linux | High — new mental model required |
Which to Choose
Choose Flatcar if:
- You specifically need k3s (its lightweight SQLite backend, edge agent model, or the k3s CLI)
- You want SSH access to the node for debugging
- You have existing CoreOS/Container Linux experience
Choose Talos if:
- You want the strongest possible security posture and are willing to learn a new operational model
- You are willing to run vanilla upstream Kubernetes instead of k3s
- You never want a shell or SSH exposed on your cluster nodes
The primary path in this guide is Flatcar + k3s. The Talos path is documented as a clearly-marked alternative. Both paths are production-quality.
openSUSE MicroOS is a third option — it uses btrfs snapshots for rollback and officially supports k3s. It is heavier than Flatcar and provides a more traditional mutable Linux experience with atomic updates rather than a truly read-only OS partition. This guide does not cover it; the openSUSE wiki does.
Raspberry Pi: Skip the VM Layer
If you are on a Raspberry Pi, the VM layer described in this chapter does not apply. Run the immutable OS directly on bare metal instead.
The reason is architectural, not a software limitation. The Pi 5’s CPU (Cortex-A76) implements ARMv8.2-A, which does not include the ARM Nested Virtualization extension (NV/NV2, which requires ARMv8.3+). This means a VM running inside the Pi’s KVM hypervisor cannot itself run KVM. That does not break k3s: containerd uses Linux kernel namespaces, not a nested hypervisor, so k3s inside a VM works fine. The security benefit of the VM layer comes from the KVM boundary itself, and on a Pi that boundary is paid for in memory and CPU that the hardware can rarely spare (see Option C below).
For Pi, jump to the "Raspberry Pi: Bare-Metal Immutable OS" section near the end of this chapter.
How the VM Is Provisioned (x86_64)
The script scripts/host/13-vm-provision.sh automates this. What it does, step by step:
Step 1: Download the Flatcar qcow2
Flatcar publishes signed qcow2 images for every stable release at https://stable.release.flatcar-linux.net/amd64-usr/. The image is verified against a GPG-signed digest before use.
A qcow2 file is a “virtual hard drive in a file” — the VM’s entire disk, compressed, sitting on your host as a single file. It is efficient: unused space is not allocated until written. You download this file, resize it, and hand it to the VM as its disk.
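Step 1 can be sketched roughly as follows. This is not the contents of scripts/host/13-vm-provision.sh; it is a minimal illustration that assumes the image is published as flatcar_production_qemu_image.img with a detached .sig file under the channel's current/ directory, that the Flatcar image signing key is already in your GPG keyring, and that 20G is a sensible disk size. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch of Step 1: fetch, verify, and resize the Flatcar image.
# Assumptions (not from the guide): the file names, the current/ path,
# the 20G default size, and a pre-imported Flatcar signing key.
set -eu

BASE_URL="https://stable.release.flatcar-linux.net/amd64-usr/current"
IMAGE="flatcar_production_qemu_image.img"

# Compare a file's SHA-512 with an expected hex digest.
# $1 = file, $2 = expected digest. Returns non-zero on mismatch.
verify_sha512() {
  actual_digest="$(sha512sum "$1" | awk '{print $1}')"
  [ "$actual_digest" = "$2" ]
}

# Download the image plus its detached signature and check the signature.
fetch_and_verify() {
  curl -fLO "${BASE_URL}/${IMAGE}"
  curl -fLO "${BASE_URL}/${IMAGE}.sig"
  gpg --verify "${IMAGE}.sig" "${IMAGE}"
}

# Copy the pristine image to the VM's disk path and grow it.
# $1 = destination path, $2 = size (optional).
prepare_disk() {
  cp "${IMAGE}" "$1"
  qemu-img resize "$1" "${2:-20G}"
}
```

A run would look like `fetch_and_verify && prepare_disk /var/lib/libvirt/images/k3s-node.qcow2` (the destination path is only an example). verify_sha512 is an optional extra: once a download has passed the GPG check, you can record its digest and use the helper to pin later copies of the same file.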
Step 2: Render the Ignition Config from Butane
Flatcar uses a provisioning system called Ignition to configure the OS on its very first boot. (Talos does not use Ignition; it takes a machine config via talosctl, covered in the alternative path below.) You write a human-readable YAML file called a Butane config; a tool called butane compiles it to the exact JSON format Ignition expects. Ignition runs once, on first boot, and never again (unless you re-image).
The Butane template lives at scripts/config/vm/flatcar-butane.yaml. It does one primary job: drop a systemd unit file that installs and starts k3s, pointing k3s at a config file that Layer 2 will have already written to /etc/rancher/k3s/config.yaml inside the VM.
The Layer 1 / Layer 2 boundary: This chapter (Layer 1) provisions the VM and wires the plumbing. The k3s config.yaml — with all the hardening flags, admission controller settings, etcd encryption keys, audit policy — is authored entirely by Layer 2 (Chapter 12). Layer 1 tells k3s: “your config lives at /etc/rancher/k3s/config.yaml”. Layer 2 writes that file. Neither layer touches the other’s domain.
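The compile step described above might look like the sketch below. The template path is the one named in this chapter; the output file name config.ign and both function names are assumptions for illustration. The sanity check is deliberately crude: it only confirms the output looks like an Ignition document, not that it is valid.

```shell
#!/bin/sh
# Sketch of Step 2: compile the Butane template to Ignition JSON.
# Assumptions: the output name config.ign and the helper names.
set -eu

BUTANE_SRC="${BUTANE_SRC:-scripts/config/vm/flatcar-butane.yaml}"
IGNITION_OUT="${IGNITION_OUT:-config.ign}"

# Crude check: an Ignition config is a JSON object with an "ignition"
# section that carries a "version" field. $1 = file to inspect.
looks_like_ignition() {
  grep -q '"ignition"' "$1" && grep -q '"version"' "$1"
}

# --strict turns Butane warnings into errors; --pretty indents the JSON.
render_ignition() {
  butane --strict --pretty "$BUTANE_SRC" > "$IGNITION_OUT"
  looks_like_ignition "$IGNITION_OUT"
}
```

Failing on warnings is a deliberate choice here: a typo in a Butane key otherwise produces an Ignition file that boots but silently skips the setting.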
Step 3: virt-install with Hardened Virtio Devices
virt-install creates the KVM virtual machine definition (a libvirt “domain”). The domain is configured with:
- virtio-blk: the virtual disk, using the paravirtualized virtio protocol instead of emulated SATA/IDE. Smaller attack surface, better performance.
- virtio-net: the virtual network interface. Same rationale.
- No USB controllers: USB emulation (UHCI/EHCI/XHCI) in QEMU has a long history of escape vulnerabilities. A k3s node does not need USB. Remove it entirely.
- No sound, no serial, no ISA devices: every emulated legacy device is an additional QEMU code path that an attacker inside the VM can probe.
- host-passthrough CPU: the guest sees the host’s real CPU feature flags instead of a generic emulated CPU. This enables CPU-level mitigations (Spectre/Meltdown/IBRS) to function correctly inside the guest.
- Q35 machine type: modern PCIe-based chipset emulation (replaces the ancient i440FX/PIIX). Still smaller than the real thing, but the right baseline.
Choosing virtio-only devices is like replacing a room full of old plug-in adapters, converters, and jury-rigged cabling with one clean modern cable. Fewer adapters = fewer ways for someone to find a hidden gap to crawl through. Each emulated legacy device in QEMU is a legacy code path that might contain a bug. Remove the devices you don’t need and the code path ceases to exist.
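As a rough illustration of the device choices listed above, the following sketch assembles a virt-install command line. The VM name, file paths, the "default" libvirt network, and the 4096 MB / 2 vCPU defaults are assumptions; FK_VM_RAM_MB and FK_VM_VCPUS are the variables this chapter mentions. It does not try to remove serial or other default devices, which the real domain XML would also need to handle.

```shell
#!/bin/sh
# Sketch of Step 3: a virtio-only, USB-free VM definition.
# DRY_RUN=1 (the default here) prints one argument per line instead of
# calling virt-install, so the result can be inspected first.
set -eu

FK_VM_RAM_MB="${FK_VM_RAM_MB:-4096}"
FK_VM_VCPUS="${FK_VM_VCPUS:-2}"

# $1 = VM name, $2 = qcow2 disk path, $3 = Ignition ISO path
create_vm() {
  set -- \
    --name "$1" \
    --memory "$FK_VM_RAM_MB" \
    --vcpus "$FK_VM_VCPUS" \
    --machine q35 \
    --cpu host-passthrough \
    --import \
    --disk "path=$2,format=qcow2,bus=virtio" \
    --disk "path=$3,device=cdrom" \
    --network "network=default,model=virtio" \
    --controller usb,model=none \
    --sound none \
    --graphics none \
    --os-variant generic \
    --noautoconsole
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$@"
  else
    virt-install "$@"
  fi
}
```

For example, `create_vm k3s-node /var/lib/libvirt/images/k3s-node.qcow2 ignition.iso` prints the argument list; setting DRY_RUN=0 runs virt-install with the same arguments.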
sVirt: Per-VM Isolation via AppArmor
libvirt automatically applies a unique AppArmor profile to every QEMU process it launches. The profile name embeds the VM’s UUID, so the k3s node VM’s QEMU process has an AppArmor label like libvirt-a3f82c1d-.... This label:
- Prevents the QEMU process from reading any file that does not belong to that specific VM.
- Prevents it from writing to the host filesystem outside its own disk images and socket paths.
- Means a bug in QEMU’s virtio-net device emulation that gives an attacker code execution inside the QEMU process cannot be used to reach host files belonging to a different VM or to the host OS itself.
This is called sVirt (security-enhanced virtualization). On RHEL/Fedora it uses SELinux with MCS categories; on Ubuntu/Debian it uses AppArmor profiles with the same isolation effect.
Verify sVirt is active after provisioning:
sudo aa-status | grep libvirt
You should see entries like libvirt-<uuid> in the enforce mode list.
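If you want something more precise than a plain grep, the check can be wrapped in small helpers. This is a sketch with made-up function names; it assumes profile names follow the libvirt-<uuid> pattern described above and that the host exposes /proc/<pid>/attr/current.

```shell
#!/bin/sh
# Sketch: list per-VM sVirt profile names and inspect a process label.
set -eu

# Reads aa-status text on stdin and prints each libvirt-<uuid> profile
# name once, ignoring other profiles such as libvirtd itself.
libvirt_profiles() {
  grep -oE 'libvirt-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | sort -u
}

# Prints the confinement label of a running process. $1 = PID.
process_label() {
  cat "/proc/$1/attr/current"
}
```

Typical use would be `sudo aa-status | libvirt_profiles`, followed by `process_label <qemu-pid>` for the VM's QEMU process. `virsh dominfo <vm-name>` should also report the security model and label for a running domain.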
Primary Path: Flatcar + k3s via Butane/Ignition
The Butane Config
The template at scripts/config/vm/flatcar-butane.yaml does the following on first boot:
- Creates the directory /etc/rancher/k3s/ (the drop location Layer 2 uses).
- Installs a systemd unit k3s-install.service that runs a one-shot script to download and install the k3s binary via the official installer, then enables k3s.service.
- Creates k3s.service (the actual running k3s server) with ExecStart=/usr/local/bin/k3s server --config /etc/rancher/k3s/config.yaml.
The --config /etc/rancher/k3s/config.yaml flag is the injection point. Layer 2 writes this file (via a second Ignition pass or a provisioning step) before k3s starts. Do not put k3s flags directly in the systemd unit — put them in config.yaml so Layer 2 owns them cleanly.
The Butane config also sets the core user’s SSH authorized keys (replace YOUR_SSH_PUBLIC_KEY with your actual key), configures a static hostname, and disables password authentication system-wide.
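To make the shape of such a config concrete, here is an illustrative Butane file written out by a small shell function. It is not the guide's actual template. The hostname k3s-node, the sshd drop-in path, and the unit details are assumptions, and k3s-install.service is left out to keep the sketch short. The ExecStart line is copied from this chapter; because /usr is read-only on Flatcar, confirm where the k3s installer actually places the binary on your image.

```shell
#!/bin/sh
# Illustrative Butane config, NOT the guide's real template.
# Assumptions: hostname, sshd drop-in path, and unit layout.
set -eu

# $1 = output path. The quoted 'EOF' stops the shell expanding anything.
write_butane_sketch() {
  cat > "$1" <<'EOF'
variant: flatcar
version: 1.0.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - YOUR_SSH_PUBLIC_KEY
storage:
  directories:
    # Drop location that Layer 2 writes config.yaml into.
    - path: /etc/rancher/k3s
      mode: 0755
  files:
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: k3s-node
    - path: /etc/ssh/sshd_config.d/10-no-password.conf
      mode: 0600
      contents:
        inline: |
          PasswordAuthentication no
          KbdInteractiveAuthentication no
systemd:
  units:
    # Not enabled here: per the guide, k3s-install.service enables it.
    - name: k3s.service
      contents: |
        [Unit]
        Description=k3s server
        Wants=network-online.target
        After=network-online.target
        ConditionPathExists=/etc/rancher/k3s/config.yaml
        [Service]
        ExecStart=/usr/local/bin/k3s server --config /etc/rancher/k3s/config.yaml
        Restart=always
        [Install]
        WantedBy=multi-user.target
EOF
}
```

The ConditionPathExists line is one way to express the Layer 1 / Layer 2 contract: the unit does nothing until Layer 2 has written its config file.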
How Ignition Gets Passed to the VM
On KVM/libvirt, Ignition config is passed to the guest via a small virtual CDROM image mounted at boot. The guest’s initrd contains Ignition, which reads the CDROM, applies the config, then unmounts it. virt-install handles this with the --disk path=ignition.iso,device=cdrom flag.
Alternatively, the config can be embedded in the qcow2 disk label (the flatcar-install method). The provision script uses the CDROM method as it is simpler and does not require modifying the downloaded image.
Alternative Path: Talos Linux (No k3s)
This path runs vanilla upstream Kubernetes, not k3s. The k3s-specific features (embedded SQLite, k3s CLI, Helm controller) are not available, and the Layer 2 k3s hardening chapter applies differently. Choose this only if you want the strongest possible security posture and are comfortable replacing k3s with full Kubernetes.
The provision script includes a clearly-marked Talos branch. The Talos path:
- Downloads a qcow2 from the Talos Image Factory (currently v1.13.4).
- Generates a machine config bundle via talosctl gen config.
- Boots the VM with the controlplane config injected via virtual CDROM (Talos reads a talos-config label from the CDROM).
- After boot, runs talosctl bootstrap to initialize etcd and start the control plane.
The machine config template is at scripts/config/vm/talos-controlplane.yaml. It is heavily commented.
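The talosctl steps in the list above can be sketched as a dry-run command plan. The cluster name, node address (192.0.2.10 is a documentation-only address), and output directory are placeholders, and the helper names are invented for this example. Downloading the image and building the config CDROM are not shown.

```shell
#!/bin/sh
# Sketch of the Talos branch: generate configs, then bootstrap.
# Assumptions: cluster name, node IP, output directory, helper names.
set -eu

CLUSTER_NAME="${CLUSTER_NAME:-homelab}"
NODE_IP="${NODE_IP:-192.0.2.10}"
OUT_DIR="${OUT_DIR:-_out}"

# Print the command in dry-run mode (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Writes controlplane.yaml, worker.yaml and talosconfig into OUT_DIR.
talos_generate() {
  run talosctl gen config "$CLUSTER_NAME" "https://${NODE_IP}:6443" \
    -o "$OUT_DIR"
}

# Run once, after the VM has booted with the controlplane config.
talos_bootstrap() {
  run talosctl --talosconfig "${OUT_DIR}/talosconfig" bootstrap \
    --nodes "$NODE_IP" --endpoints "$NODE_IP"
  run talosctl --talosconfig "${OUT_DIR}/talosconfig" kubeconfig \
    --nodes "$NODE_IP" --endpoints "$NODE_IP"
}
```

Bootstrap is issued once against a single control plane node; the final kubeconfig call fetches credentials so kubectl can reach the new cluster.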
Key Talos security properties:
- Root filesystem is SquashFS — a format that is literally impossible to remount writable; the kernel enforces this at the format level.
- No SSH daemon, no shell binaries. The core user does not exist. There is no /bin/bash. Management is entirely via talosctl over gRPC with mutual TLS.
- Since Talos 1.10: systemd-boot + Unified Kernel Images (UKIs). The kernel command line is signed into the UKI binary; an attacker cannot tamper with it after the image is built.
- talosctl logs auditd exposes the audit log stream without requiring SSH.
Talos is like a vending machine that only has one button: the one you need. There is no panel to open, no way to insert a screwdriver, no maintenance port. You talk to it through one heavily monitored window. Flatcar is like a normal server room — access-controlled, logged, restricted — but you can still open a terminal and look around. Talos removes the terminal. Whether that is a feature or a limitation depends on your comfort with unfamiliarity.
Raspberry Pi: Bare-Metal Immutable OS
Because nested virtualization is architecturally impossible on the Pi 5’s Cortex-A76 CPU, the VM layer offers much less security benefit on Pi hardware. The recommended Pi path skips KVM entirely.
Option A: Talos Linux Directly on Pi Hardware
Replace the host OS entirely with Talos. Flash a Talos arm64 image to a USB drive or NVMe; boot the Pi from it. Talos Pi 4 support is official; Pi 5 is community-tested as of v1.13.
Pi hardware
└── Talos Linux (bare metal, arm64)
└── vanilla upstream Kubernetes
This gives you the full Talos security model (immutable SquashFS, no shell, API-only) directly on the Pi hardware, with zero VM overhead. The tradeoff: no k3s.
Option B: k3s on Hardened Pi OS (Simplest)
Run k3s directly on a hardened Raspberry Pi OS (Debian Bookworm 64-bit) using all the host hardening from Chapter 10. This is the simplest option.
Pi hardware
└── Raspberry Pi OS 64-bit (hardened per Chapter 10, AppArmor)
└── k3s server (bare metal, no VM)
No immutable OS layer here — you accept that trade-off in exchange for the lowest operational complexity.
Option C: KVM + Flatcar/Talos on Pi (Viable, Not Recommended)
KVM does work on Pi 5 — the Cortex-A76 has hardware virtualization. You can run Flatcar or Talos inside a KVM VM on the Pi. The penalty is real: 10–20% CPU/memory overhead from virtualization, on already constrained hardware (8GB on Pi 5), with the k3s workloads competing for the same resources. Consider this only if you need hypervisor-level isolation and have budgeted the hardware accordingly.
On Pi, the nested-virtualization limitation cannot be removed, and KVM still adds overhead. Test memory pressure carefully before running production workloads.
The Libvirt Domain XML
The template at scripts/config/vm/k3s-vm.libvirt.xml defines the VM structure. Key choices:
- <cpu mode='host-passthrough'/>: pass real CPU feature flags to the guest so mitigations work correctly.
- <controller type='usb' ... /> removed entirely: no USB controllers exposed to the guest.
- <disk type='block' device='disk'> with <driver name='qemu' type='raw' discard='unmap'/> for the virtio-blk disk.
- A placeholder <!-- USB_PASSTHROUGH_PLACEHOLDER --> marks where Layer 5 (Storage) attaches the encrypted USB device as a virtio-blk passthrough. Layer 5 fills this in; Layer 1 leaves it empty.
- Memory and vCPU counts use FK_VM_RAM_MB and FK_VM_VCPUS environment variables so the script can be tuned without editing the XML.
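The way the environment variables feed the XML could look something like the sketch below. It renders a heavily trimmed domain definition, not the real template at scripts/config/vm/k3s-vm.libvirt.xml: the disk and network elements are omitted, and the defaults are assumptions. One deliberate difference from the description above: instead of deleting the USB controller element, the sketch declares it with model='none', which is the libvirt way of asking that no default USB controller be added.

```shell
#!/bin/sh
# Sketch: render a trimmed domain XML from FK_VM_RAM_MB / FK_VM_VCPUS.
# Assumptions: default values, and model='none' for the USB controller.
set -eu

FK_VM_RAM_MB="${FK_VM_RAM_MB:-4096}"
FK_VM_VCPUS="${FK_VM_VCPUS:-2}"

# $1 = VM name. Unquoted EOF so the shell substitutes the variables.
render_domain_xml() {
  cat <<EOF
<domain type='kvm'>
  <name>$1</name>
  <memory unit='MiB'>${FK_VM_RAM_MB}</memory>
  <vcpu>${FK_VM_VCPUS}</vcpu>
  <os>
    <type arch='x86_64' machine='q35'>hvm</type>
  </os>
  <cpu mode='host-passthrough'/>
  <devices>
    <!-- model='none' asks libvirt not to add a default USB controller -->
    <controller type='usb' model='none'/>
    <!-- USB_PASSTHROUGH_PLACEHOLDER -->
  </devices>
</domain>
EOF
}
```

For example, `FK_VM_RAM_MB=8192 render_domain_xml k3s-node > /tmp/k3s-node.xml` followed by `virsh define /tmp/k3s-node.xml` would register a domain with 8 GiB of RAM, once the disk and network elements have been added.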
What This Layer Bought You
| Threat | Before This Layer | After This Layer |
|---|---|---|
| Attacker escapes container, gets VM root | One hop to host | Writes to disk bounce off read-only /usr (Flatcar) or immutable SquashFS (Talos) |
| Attacker finds bug in QEMU virtio device emulation | Full host access | sVirt AppArmor label confines QEMU to only that VM’s files |
| Attacker plants persistent backdoor | File write to host FS | No persistent writable OS partition to plant it in; re-imaging the VM re-applies the Ignition state from scratch |
| VM “noisy neighbor” memory overflow | Unconstrained | Hard memory limits via libvirt cgroup enforced by kernel |
| Cross-VM data exfiltration | Direct if QEMU unconfined | AppArmor labels prevent QEMU process from reading other VM disk files |
| Host pivot after VM compromise | Direct file system access | KVM boundary — escaping requires exploiting the KVM/QEMU process |
The wall is not impenetrable — no layer is. But an attacker must now chain: container escape → VM kernel exploit → KVM/QEMU exploit → sVirt AppArmor escape → host. Each additional step dramatically reduces the realistic threat pool from “any script kiddie” to “motivated, resourced attacker with a zero-day.”
The next layer (Chapter 12) hardens k3s itself — admission controllers, audit logging, etcd encryption, and the API server lockdown. By the time both layers are in place, the runtime attack surface inside the cluster has shrunk from “default open” to “only what you explicitly allowed.”