Layer 3: Sealed Runtime
Give every container its own locked box — and put the dangerous ones inside a second box inside that.
What this layer does
By the time a container actually runs, Layers 1 and 2 have hardened the host OS and locked down the Kubernetes control plane. This layer seals the runtime — the moment code inside a container starts executing. It answers one question: if an attacker gets code execution inside a container, how much damage can they do before hitting a wall?
The answer is: very little. We layer five independent controls on top of each other. All five apply to every pod, all the time. High-risk workloads get a sixth control, a software or hardware sandbox, that doubles the wall.
Think of a container like a rented office in a skyscraper. Without this layer, the tenant can wander into the server room if they find an unlocked door. This layer: locks every door, removes the skeleton key, puts the tenant in a glass box they can’t leave, and for the really sketchy tenants, moves them into a separate building with its own security guards.
The baseline every pod gets
These five controls are not opt-in. They apply cluster-wide, to every container, automatically.
1. Pod Security Admission: the admission gate
Before any pod starts, Kubernetes checks whether its configuration meets the restricted standard. If it does not comply, the cluster refuses to run it. This is the enforcement point — nothing unsafe gets through the gate.
The restricted policy is the bouncer at the door. It checks a list: no root, no extra privileges, no dangerous host access. If your pod spec doesn’t comply, it doesn’t get in.
The restricted profile requires, at minimum:
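- runAsNonRoot: true
- allowPrivilegeEscalation: false
- capabilities.drop: [ALL]
- A seccomp profile (RuntimeDefault or a custom tight one)
- readOnlyRootFilesystem: true (part of this baseline, although Pod Security Admission itself does not check this field)
- No hostPath, hostPID, hostIPC, or hostNetwork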
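As a rough illustration of what the gate refuses, the hypothetical pod below breaks several of these rules at once. The pod name, namespace, and image reference are placeholders and are not taken from this repository.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rejected-example                # hypothetical name
  namespace: my-app                     # assumed to enforce the restricted level
spec:
  hostNetwork: true                     # violation: shares the host network namespace
  containers:
    - name: app
      image: registry.example/app:1.0   # placeholder image
      securityContext:
        privileged: true                # violation: privileged container
        allowPrivilegeEscalation: true  # violation: must be false
        # Also missing: capabilities.drop, runAsNonRoot, seccompProfile
```

In a namespace that enforces restricted, the API server should refuse this pod at admission and name the failed checks in the error message.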
Apply the restricted label to every namespace that runs workloads:
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.36
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
System namespaces (kube-system, kube-public) cannot comply with restricted — they legitimately run privileged infrastructure. Apply audit and warn modes there so you get visibility without breaking the cluster.
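A minimal sketch of that arrangement for kube-system, assuming the cluster-wide default enforce level is left permissive so that nothing is blocked:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kube-system
  labels:
    # No enforce label: violations are reported, never rejected.
    pod-security.kubernetes.io/audit: restricted   # recorded in the API audit log
    pod-security.kubernetes.io/warn: restricted    # returned as a warning to the client
```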
2. Seccomp RuntimeDefault: shrinking the system call surface
Every program running inside a container talks to the host operating system using system calls — requests like “open this file”, “create a network socket”, “launch a new process”. Linux has over 400 of them. Many of the most dangerous ones (ptrace, keyctl, personality, raw socket creation) are never needed by typical applications.
Seccomp is a Linux kernel feature that filters system calls. The RuntimeDefault profile (defined by containerd) blocks approximately 50 dangerous ones while allowing everything a normal application needs.
This is already set cluster-wide. Layer 2 added seccomp-default=true to the kubelet configuration at /etc/rancher/k3s/config.yaml. Every pod gets RuntimeDefault automatically, without needing to declare it. The per-pod securityContext.seccompProfile field shown in examples is explicit confirmation — it enforces the right value rather than relying on the default.
securityContext:
  seccompProfile:
    type: RuntimeDefault
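For reference, the Layer 2 setting mentioned above has roughly this shape in the k3s configuration file. This is a sketch of the relevant key only; the real file is expected to contain other settings as well.

```yaml
# /etc/rancher/k3s/config.yaml (excerpt)
kubelet-arg:
  - "seccomp-default=true"   # kubelet falls back to RuntimeDefault when a pod sets no seccomp profile
```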
3. AppArmor RuntimeDefault: restricting what the container can touch
AppArmor is a Linux security module that applies a mandatory access control policy to every process. The runtime’s built-in RuntimeDefault AppArmor profile restricts file access, network operations, and capability use to a safe subset.
Unlike seccomp (which filters system calls), AppArmor filters what those calls can act on — for example, blocking writes to /proc or /sys even if the syscall itself is allowed.
If seccomp is “you can only use these tools”, AppArmor is “even with those tools, you can only work in this room”. Two independent guards, not one.
securityContext:
  appArmorProfile:
    type: RuntimeDefault
4. User namespaces: UID 0 inside ≠ root outside
User namespaces are a Linux kernel feature, GA in Kubernetes 1.36. Setting hostUsers: false tells Kubernetes to map UID 0 inside the container to an unprivileged UID on the host — something like UID 100000.
What this means in practice: if an attacker achieves a container escape and “breaks out” to the host, they arrive as an unprivileged user with no special access. The escape becomes far less valuable.
Being “root” inside the container is like being the CEO of a toy company. If you escape the container and land on the host without user namespaces, you are still CEO — of the real company. With user namespaces, you land outside as an intern with no badge, no access, and no keys.
spec:
  hostUsers: false   # GA in k8s 1.36; no feature gate needed
Requirement: Linux kernel 5.19 or later (standard on current Ubuntu/Debian releases). The kernel flag /proc/sys/kernel/unprivileged_userns_clone must be 1 (default on modern distros).
5. Capabilities and filesystem hardening
Linux capabilities are fine-grained slices of the traditional root privilege. A container that starts with no capabilities cannot bind to ports below 1024, cannot modify network interfaces, cannot load kernel modules. Combined with a read-only root filesystem, an attacker who compromises the container lands in a read-only, no-privilege environment.
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
    add: []   # add back only what the app provably requires
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 65534   # nobody; no special significance, just non-root
  runAsGroup: 65534
Never add capabilities back unless you have verified the application actually needs them and have a written record of why. Capability creep undoes this entire layer.
The hardened pod template
This is the complete baseline securityContext for any workload. Every new deployment starts here.
# See the full file at scripts/config/runtime/hardened-pod-template.yaml
spec:
  hostUsers: false   # user namespace isolation
  hostPID: false
  hostIPC: false
  hostNetwork: false
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    runAsGroup: 65534
    seccompProfile:
      type: RuntimeDefault
    appArmorProfile:
      type: RuntimeDefault
  containers:
    - name: app   # placeholder; Kubernetes requires a container name
      image: cgr.dev/chainguard/static:latest   # distroless: no shell, no package manager
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir:
        medium: Memory   # tmpfs: no disk persistence; wiped when the pod is removed
emptyDir with medium: Memory is the correct way to give a read-only-filesystem container a writable scratch area. The data lives in RAM and disappears when the pod stops — no host disk involvement.
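Because a memory-backed emptyDir consumes RAM, it can be worth capping it. A minimal sketch, assuming 64Mi is enough scratch space for the workload (the figure is illustrative, not a recommendation from this guide):

```yaml
volumes:
  - name: tmp
    emptyDir:
      medium: Memory
      sizeLimit: 64Mi   # illustrative cap on the tmpfs scratch area
```

Data written to a memory-backed emptyDir counts toward the container's memory limit, so the two values are best sized together.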
Hardened images: no shell, no tools, no CVEs
The container image is the attack surface inside the sandbox. An image with bash, curl, and a package manager gives an attacker everything they need once inside. A distroless image gives them nothing to work with.
A normal container image is a furnished apartment: tools everywhere, bash shell, package manager. An attacker who gets in can redecorate, install things, phone home. A distroless image is an empty concrete room: just your application binary, nothing else. An attacker who gets in has nothing to use.
The hierarchy from smallest to largest attack surface:
| Image base | Shell? | Package manager? | Notes |
|---|---|---|---|
| FROM scratch | No | No | Only for fully static binaries (Go/Rust with CGO disabled) |
| cgr.dev/chainguard/static | No | No | Wolfi-based; CA certs + tz data only; best for Go/Rust |
| cgr.dev/chainguard/<runtime> | No | No | Python, Node, Java, etc.; runs as non-root by default |
| gcr.io/distroless/<runtime> | No | No | Google; Debian-stripped; slightly slower CVE patching |
| debian-slim / ubuntu-minimal | Yes | Yes | Still widely used; much larger attack surface |
Use Chainguard or distroless images. Both ship with SBOM (software bill of materials) attestations and Sigstore signatures so you can verify exactly what is inside.
FROM cgr.dev/chainguard/go:latest AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o /app/server .
FROM cgr.dev/chainguard/static:latest
COPY --from=builder /app/server /server
ENTRYPOINT ["/server"]
Runtime sandboxes: the second wall for untrusted workloads
The baseline above applies to every pod. For workloads that process untrusted input — public APIs, user-uploaded content, third-party code execution, anything internet-facing — we add a runtime sandbox. This is a second, independent isolation boundary around the entire container.
There are two sandbox options:
Option A: gVisor — a software kernel in userspace
gVisor’s runsc binary intercepts every system call the container makes and handles it inside a userspace kernel called the Sentry — a Go program that implements about 250 Linux syscalls. The host Linux kernel only ever sees the narrow set of calls the Sentry itself needs. An attacker who exploits a kernel vulnerability inside the container reaches the Sentry, not the host kernel.
Normally, a container shares the host’s operating system kernel — like tenants sharing the building’s plumbing. gVisor gives each tenant their own fake plumbing system. Attacks on the plumbing only break the fake one; the real building plumbing is untouched.
Platform choice: gVisor can operate using two mechanisms. The one you want for homelab nodes running inside VMs is systrap: it uses the seccomp SECCOMP_RET_TRAP action to intercept system calls without needing hardware virtualisation. The alternative, KVM, requires direct access to hardware virtualisation extensions (/dev/kvm), which are unavailable or severely degraded inside a VM.
Use the systrap platform when your k3s nodes are themselves VMs (Proxmox, VMware, cloud instances). Do not use the KVM platform inside a VM — it requires nested virtualisation and performs poorly under it.
Raspberry Pi / ARM64 caveat: gVisor requires a 48-bit virtual-address kernel (CONFIG_ARM64_VA_BITS_48=y). Raspberry Pi OS ships a 39-bit VA kernel. gVisor will fail on Raspberry Pi OS. The fix is to run Ubuntu on your Pi — Ubuntu for Pi ships a 48-bit VA kernel out of the box. Ubuntu on Pi 5 works with gVisor without any kernel customisation.
If you are on a Pi running Raspberry Pi OS, do not install gVisor. Switch to Ubuntu first, or skip gVisor and rely on the baseline controls only.
Option B: Kata Containers — a full hardware microVM per pod
Kata runs each pod inside a real lightweight virtual machine using KVM hardware virtualisation. The container filesystem and processes live inside a guest VM; the host sees only a KVM virtual machine process. The isolation is as strong as it gets — a full hardware boundary between the container and the host.
gVisor gives the tenant a fake plumbing system. Kata gives them their own entirely separate building on a separate plot of land. The separation is real and physical (in hardware), not just logical.
The catch: Kata requires /dev/kvm (hardware virtualisation support on the CPU). If your k3s nodes are VMs themselves, Kata needs nested virtualisation — VMs inside VMs. This works on modern x86 Intel/AMD CPUs with nested virt enabled in your hypervisor (Proxmox, VMware), but many cloud providers do not offer it on standard instances. It also carries significant overhead: ~600 ms pod cold-start time, 8–12% steady-state CPU/memory cost, and 5–30% additional I/O overhead from double-translation.
Kata is the right choice when: your k3s nodes are bare metal (no nesting needed, full performance), or when you need the strongest possible isolation and can accept the overhead.
Latest stable Kata release: 3.31.0 (May 2025).
Comparison table
| | runc + seccomp/AppArmor | gVisor (systrap) | Kata (QEMU/KVM) |
|---|---|---|---|
| Isolation model | Linux namespaces + cgroups + MAC | Userspace kernel (Sentry) intercepts all syscalls | Full hardware microVM per pod |
| Host kernel exposure | High — full kernel via namespaces | Low — Sentry filters to narrow interface | Very low — only KVM/VMM interface |
| Syscall overhead | ~70 ns | ~800 ns (≈11× slower per call) | Near-native inside VM |
| Steady-state CPU overhead | Baseline | <1% for most apps; worse for I/O-heavy | 8–12% |
| Memory overhead | ~0 | ~100 MB per sandbox | ~64–128 MB per pod (guest kernel) |
| App compatibility | Full | ~90% — no SELinux, no io_uring, limited ioctl | Near-full — full Linux in guest |
| Needs KVM | No | No | Yes — mandatory |
| Works inside a VM | Yes | Yes (systrap) | Needs nested virt |
| ARM64 / Pi support | Yes | Needs 48-bit VA — Ubuntu on Pi; not Raspberry Pi OS | Yes (Pi 4/5 have KVM; nested Pi VM: not practical) |
| Confidential computing | No | No | Optional (CoCo + TDX/SEV-SNP) |
| Installation complexity | None | Medium | Medium–High (kata-deploy DaemonSet) |
For most homelab setups (k3s inside a VM): use gVisor with systrap for untrusted workloads. It requires no KVM, installs via apt, and provides a strong second isolation layer. Kata is optional unless you need the absolute strongest isolation and have either bare metal nodes or nested virt available.
How to mark a workload as untrusted (use the sandbox)
To route pods in a namespace through the gVisor sandbox, label the namespace with the exact key and value that the Kyverno policy enforces:
kubectl label namespace <ns> fortress.k3s/trust=untrusted
The Kyverno policy (scripts/config/supplychain/kyverno-policies/require-runtimeclass-untrusted.yaml) matches on fortress.k3s/trust: untrusted — pods in any namespace carrying that label must declare runtimeClassName: gvisor or admission is denied.
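For orientation, a policy of this kind could look roughly like the sketch below. It is not the repository's actual file: the rule name, message text, and failure-action settings are assumptions, and newer Kyverno releases may prefer a different place for the failure action.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-runtimeclass-untrusted
spec:
  validationFailureAction: Enforce     # assumed: reject rather than only audit
  background: false
  rules:
    - name: require-gvisor-runtimeclass   # hypothetical rule name
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaceSelector:
                matchLabels:
                  fortress.k3s/trust: untrusted
      validate:
        message: "Pods in untrusted namespaces must set runtimeClassName: gvisor."
        pattern:
          spec:
            runtimeClassName: gvisor
```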
One field in the pod spec activates the sandbox:
spec:
  runtimeClassName: gvisor   # routes container creation to runsc instead of runc
  hostUsers: false
  # ... rest of hardened securityContext unchanged
The runtimeClassName field tells Kubernetes which registered runtime to hand this pod to. The gvisor and kata RuntimeClass objects (installed by script 22-runtime.sh) tell containerd which binary to use.
runtimeClassName: gvisor is like a label on a package: “handle with extra care”. The cluster sees the label, looks up “gvisor” in its runtime registry, and sends the container to runsc instead of the normal runc launcher.
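The RuntimeClass objects themselves are small. A sketch of what the two might look like, assuming the containerd runtime handlers are registered under the names runsc and kata (the handler value must match whatever name the containerd configuration uses):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor      # the value pods reference in runtimeClassName
handler: runsc      # assumed containerd runtime handler name
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata       # assumed containerd runtime handler name
```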
What to sandbox: any workload that processes input you do not control. Public-facing HTTP services, file parsers, code execution environments, anything that talks to third-party systems. Internal-only workloads talking only to trusted services can stay on the baseline.
gVisor does not support SELinux inside the sandbox, and has limited ioctl support. A small number of applications (mainly those using io_uring for I/O performance, or applications that need in-container raw block devices) will need the runc runtime. Test before deploying to production.
Installing gVisor: step by step
This is handled by scripts/cluster/22-runtime.sh. For reference, the steps are:
- Add the gVisor APT repository and install runsc (the gVisor binary).
- Write /etc/containerd/runsc.toml selecting the systrap platform.
- Write the containerd template extension at /var/lib/rancher/k3s/agent/etc/containerd/config-v3.toml.tmpl. This file extends the k3s-managed base template, so it survives k3s upgrades cleanly. Never edit the rendered config at /var/lib/rancher/k3s/agent/etc/containerd/config.toml, because k3s overwrites it on every restart.
- Restart k3s so containerd picks up the new runtime.
- Apply the RuntimeClass objects (gvisor, kata).
k3s ships containerd 2.0 (from v1.31.6+k3s1 / v1.32.2+k3s1 onward), which uses the v3 TOML schema. The template file must be config-v3.toml.tmpl, not config.toml.tmpl. The plugin key in the v3 schema is plugins.'io.containerd.cri.v1.runtime' — if you see examples using plugins."io.containerd.grpc.v1.cri", those are for the older v2 schema and will not work.
Tight per-workload seccomp profiles with the Security Profiles Operator
RuntimeDefault seccomp blocks ~50 dangerous syscalls. For high-value workloads you can do much better: generate a profile that only allows the exact syscalls your application actually makes. An attacker who exploits a memory-corruption vulnerability cannot pivot via any syscall the application never calls.
The Security Profiles Operator (SPO) automates this. It is a Kubernetes operator (a program running inside the cluster) that:
- Distributes custom seccomp and AppArmor profiles to all nodes as CRDs (custom resources).
- Records profiles from running workloads using eBPF — a safe, kernel-native tracing mechanism.
- Binds profiles to pods without requiring changes to pod specs.
Instead of using a standard “block the 50 most dangerous syscalls” filter, SPO watches your application running normally and builds a custom list: “this app only ever calls these 37 specific system calls — block everything else”. If an attacker tries a technique that requires a system call your app never uses, the kernel refuses it outright.
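Binding a profile without touching the pod spec is done through a ProfileBinding resource. A minimal sketch, assuming a SeccompProfile named my-app-recording already exists in the namespace; the binding name and image reference are placeholders:

```yaml
apiVersion: security-profiles-operator.x-k8s.io/v1alpha1
kind: ProfileBinding
metadata:
  name: my-app-binding                # hypothetical name
  namespace: default
spec:
  profileRef:
    kind: SeccompProfile
    name: my-app-recording            # assumed to exist in this namespace
  image: registry.example/my-app:1.0  # placeholder; containers using this image get the profile
```

Depending on the SPO version, the namespace may also need an opt-in label before bindings take effect, so check the operator documentation for the release you install.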
How profile recording works
- Create a ProfileRecording resource pointing at a pod.
- Run the pod through its normal workload.
- SPO’s eBPF recorder captures every system call made.
- SPO generates a SeccompProfile CRD with the allow-list.
- Reference the generated profile from your pod’s securityContext.
Requirement for eBPF recording: the node kernel must expose /sys/kernel/btf/vmlinux — available on kernels built with CONFIG_DEBUG_INFO_BTF=y, which is standard on Ubuntu 22.04+ and Debian 12+.
apiVersion: security-profiles-operator.x-k8s.io/v1alpha1
kind: ProfileRecording
metadata:
  name: my-app-recording
  namespace: default
spec:
  kind: SeccompProfile
  recorder: bpf   # eBPF recorder
  podSelector:
    matchLabels:
      app: my-app
After recording, reference the generated profile:
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: operator/default/my-app-recording.json
Record in a staging environment under realistic load. Profiles generated from idle pods will be too narrow and will cause failures in production when the app tries a syscall it did not make during recording. Always test the generated profile before enforcing in production.
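When reviewing a recording, it helps to know the general shape of the generated resource. The sketch below is illustrative only: the syscall names are a made-up subset rather than output from a real recording, and the architecture is assumed to be x86_64.

```yaml
apiVersion: security-profiles-operator.x-k8s.io/v1beta1
kind: SeccompProfile
metadata:
  name: my-app-recording
  namespace: default
spec:
  defaultAction: SCMP_ACT_ERRNO   # anything not listed below fails with an error
  architectures:
    - SCMP_ARCH_X86_64            # assumed node architecture
  syscalls:
    - action: SCMP_ACT_ALLOW
      names:                      # illustrative subset, not a real recording
        - read
        - write
        - openat
        - close
        - epoll_wait
        - futex
        - exit_group
```

A list that looks suspiciously short is usually a sign that the recording ran while the pod was idle.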
SPO is installed via Helm by 22-runtime.sh. Values are in scripts/config/runtime/spo-values.yaml.
What this layer bought you
Before this layer, a container escape gave an attacker the same privileges as the container process on the host. After:
- PSA restricted blocks misconfigured pods before they start — the most common escape vectors (privileged containers, hostPath mounts, host networking) never get scheduled.
- Seccomp RuntimeDefault (set cluster-wide in Layer 2) removes ~50 dangerous system calls. SPO lets you tighten this further to exactly what each application needs.
- AppArmor RuntimeDefault adds a second, independent MAC filter on what the syscalls can access, blocking reads from /proc, writes to /sys, and more.
- User namespaces (hostUsers: false) mean that a successful container escape lands the attacker as an unprivileged user on the host, not as root.
- Drop ALL capabilities removes every sliver of elevated Linux privilege. No capability means no kernel module loading, no raw network sockets, no filesystem ACL manipulation.
- Read-only root filesystem means there is no writable disk surface inside the container to drop tools, persist payloads, or modify the running application.
- Distroless images remove the shell and tools an attacker would use to act on any of the above.
- gVisor (for untrusted workloads) interposes a second kernel between the application and the host — a kernel exploit inside the container reaches the gVisor Sentry, not the host.
- SPO tight profiles narrow the seccomp surface from “everything except 50 bad calls” to “only the 30-something calls this app actually makes”.
The result: an attacker has to defeat every one of these independent controls at the same time, which makes a full container escape an unrealistic attack path.