Layer 3: Sealed Runtime

Give every container its own locked box — and put the dangerous ones inside a second box inside that.

[Figure: How tightly is the container sealed? Each column shows what stands between a container and your real kernel; more walls mean more isolation. runc (default, weakest isolation): the container process makes direct syscalls to the host kernel, so the full kernel surface is reachable and one bug exposes the host. gVisor (our default, strong, no special CPU): the container process talks to the gVisor Sentry, a userspace kernel that intercepts syscalls and reaches the host kernel only through a narrow, guarded interface. Kata (optional, strongest, needs nested virt): the container process runs in a micro-VM on its own guest kernel, which reaches the host kernel via KVM across a hardware-enforced boundary.]
Three runtimes, three levels of isolation. We default to gVisor.

What this layer does

By the time a container actually runs, Layers 1 and 2 have hardened the host OS and locked down the Kubernetes control plane. This layer seals the runtime — the moment code inside a container starts executing. It answers one question: if an attacker gets code execution inside a container, how much damage can they do before hitting a wall?

The answer is: very little. We layer five independent controls on top of each other, and all five apply to every pod, all the time. High-risk workloads get a sixth control, a software or hardware sandbox, that doubles the wall.

In plain English: Think of a container like a rented office in a skyscraper. Without this layer, the tenant can wander into the server room if they find an unlocked door. This layer: locks every door, removes the skeleton key, puts the tenant in a glass box they can’t leave, and for the really sketchy tenants, moves them into a separate building with its own security guards.

The baseline every pod gets

These five controls are not opt-in. They apply cluster-wide, to every container, automatically.

1. Pod Security Admission: the admission gate

Before any pod starts, Kubernetes checks whether its configuration meets the restricted standard. If it does not comply, the cluster refuses to run it. This is the enforcement point — nothing unsafe gets through the gate.

In plain English: The restricted policy is the bouncer at the door. It checks a list: no root, no extra privileges, no dangerous host access. If your pod spec doesn’t comply, it doesn’t get in.

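The restricted profile requires, at minimum:

  • runAsNonRoot: true
  • allowPrivilegeEscalation: false
  • capabilities.drop: [ALL] (only NET_BIND_SERVICE may be added back)
  • A seccomp profile (RuntimeDefault or a custom Localhost one)
  • No hostPath, hostPID, hostIPC, or hostNetwork

readOnlyRootFilesystem: true is not part of the restricted standard, so Pod Security Admission will not reject a pod that omits it. We still set it on every container through the hardened pod template; enforcing it at admission takes a separate policy.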

Apply the restricted label to every namespace that runs workloads:

Namespace labels for PSA restricted
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.36
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Note: System namespaces (kube-system, kube-public) cannot comply with restricted because they legitimately run privileged infrastructure. Apply audit and warn modes there so you get visibility without breaking the cluster.
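As a sketch of the audit-and-warn setup for a system namespace, the labels might look like the fragment below. It reuses only the label keys already shown for my-app and deliberately omits the enforce label; whether you apply it as a manifest or with kubectl label is your choice, and nothing here is taken from the project's scripts.

```yaml
# Sketch: visibility-only Pod Security Admission labels for a system namespace.
# No "enforce" label is set, so non-compliant pods are reported but still admitted.
apiVersion: v1
kind: Namespace
metadata:
  name: kube-system
  labels:
    pod-security.kubernetes.io/audit: restricted   # violations recorded in the API audit log
    pod-security.kubernetes.io/warn: restricted    # violations returned as warnings to the client
```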

2. Seccomp RuntimeDefault: shrinking the system call surface

Every program running inside a container talks to the host operating system using system calls — requests like “open this file”, “create a network socket”, “launch a new process”. Linux has over 400 of them. Many of the most dangerous ones (ptrace, keyctl, personality, raw socket creation) are never needed by typical applications.

Seccomp is a Linux kernel feature that filters system calls. The RuntimeDefault profile (defined by containerd) blocks approximately 50 dangerous ones while allowing everything a normal application needs.

Important: This is already set cluster-wide. Layer 2 added seccomp-default=true to the kubelet configuration at /etc/rancher/k3s/config.yaml, so every pod runs under RuntimeDefault even if its spec does not declare a seccomp profile. The per-pod securityContext.seccompProfile field shown in examples states that value explicitly instead of relying on the default. In namespaces that enforce restricted it is also required, because Pod Security Admission checks the pod spec itself, not the kubelet default.

Per-pod seccomp (explicit confirmation of the cluster default)
securityContext:
  seccompProfile:
    type: RuntimeDefault

3. AppArmor RuntimeDefault: restricting what the container can touch

AppArmor is a Linux security module that applies a mandatory access control policy to every process. The runtime’s built-in RuntimeDefault AppArmor profile restricts file access, network operations, and capability use to a safe subset.

Unlike seccomp (which filters system calls), AppArmor filters what those calls can act on — for example, blocking writes to /proc or /sys even if the syscall itself is allowed.

In plain English: If seccomp is “you can only use these tools”, AppArmor is “even with those tools, you can only work in this room”. Two independent guards, not one.

Per-pod AppArmor (stable API from Kubernetes 1.31)
securityContext:
  appArmorProfile:
    type: RuntimeDefault

4. User namespaces: UID 0 inside ≠ root outside

User namespaces are a Linux kernel feature; Kubernetes support for them is GA as of 1.36. Setting hostUsers: false tells Kubernetes to map UID 0 inside the container to an unprivileged UID on the host, something like UID 100000.

What this means in practice: if an attacker achieves a container escape and “breaks out” to the host, they arrive as an unprivileged user with no special access. The escape becomes far less valuable.

In plain English: Being “root” inside the container is like being the CEO of a toy company. If you escape the container and land on the host without user namespaces, you are still CEO, but of the real company. With user namespaces, you land outside as an intern with no badge, no access, and no keys.

User namespace isolation (pod-level)
spec:
  hostUsers: false    # GA in k8s 1.36; no feature gate needed

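Requirement: Linux kernel 5.19 or later (standard on current Ubuntu/Debian releases). Idmapped mounts on tmpfs, which memory-backed emptyDir volumes, Secrets, and the default service account token rely on, only arrived in kernel 6.3, so prefer 6.3 or later when pods mount those volumes with hostUsers: false. On distros that expose the sysctl /proc/sys/kernel/unprivileged_userns_clone (a distro-specific addition found on Debian and Ubuntu kernels), it must be 1, which is the default on modern releases.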

5. Capabilities and filesystem hardening

Linux capabilities are fine-grained slices of the traditional root privilege. A container that starts with no capabilities cannot modify network interfaces, open raw sockets, or load kernel modules, and by default cannot bind to ports below 1024. Combined with a read-only root filesystem, an attacker who gains code execution inside the container lands in a read-only, no-privilege environment.

Per-container security context (drop everything)
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
    add: []            # add back only what the app provably requires
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 65534     # nobody — no special significance, just non-root
  runAsGroup: 65534

Hardened: Never add capabilities back unless you have verified the application actually needs them and have a written record of why. Capability creep undoes this entire layer.
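One way to keep that written record next to the configuration is an annotation on the pod itself. The fragment below is a sketch under stated assumptions: the annotation key is hypothetical (any convention your team audits will do), the container name is a placeholder, and NET_BIND_SERVICE is used only because it is the one capability the restricted profile allows to be added.

```yaml
# Sketch: a pod fragment that adds back exactly one capability and records why.
metadata:
  annotations:
    # Hypothetical annotation key; not defined by Kubernetes or by this project.
    security.example.com/capability-justification: "NET_BIND_SERVICE: legacy listener binds port 443 directly"
spec:
  containers:
    - name: app                       # placeholder container name
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]               # always drop everything first
          add: ["NET_BIND_SERVICE"]   # the single, documented exception
        readOnlyRootFilesystem: true
        runAsNonRoot: true
```

Any other capability in the add list would be rejected in a namespace that enforces restricted, which is a useful backstop against capability creep.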

The hardened pod template

This is the complete baseline securityContext for any workload. Every new deployment starts here.

scripts/config/runtime/hardened-pod-template.yaml
# See the full file at scripts/config/runtime/hardened-pod-template.yaml
spec:
  hostUsers: false        # user namespace isolation
  hostPID: false
  hostIPC: false
  hostNetwork: false
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    runAsGroup: 65534
    seccompProfile:
      type: RuntimeDefault
    appArmorProfile:
      type: RuntimeDefault
  containers:
    - name: app             # placeholder; use your workload's container name
      image: cgr.dev/chainguard/static:latest   # distroless: no shell, no package manager
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir:
        medium: Memory    # tmpfs — no disk persistence; wiped on pod restart

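Note: emptyDir with medium: Memory is the correct way to give a read-only-filesystem container a writable scratch area. The data lives in RAM and disappears when the pod is removed, with no host disk involvement. Because that RAM counts against the container’s memory limit, consider setting sizeLimit on the volume.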
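A memory-backed scratch volume can be capped so that a runaway writer cannot consume the pod's whole memory budget. The fragment below is a sketch; the 64Mi figure is an arbitrary illustration, not a value from the project's template.

```yaml
# Sketch: the same tmp volume as in the template, with an upper bound on its size.
volumes:
  - name: tmp
    emptyDir:
      medium: Memory     # tmpfs, backed by RAM
      sizeLimit: 64Mi    # illustrative cap; size it to what the app really writes to /tmp
```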

Hardened images: no shell, no tools, far fewer CVEs

The container image is the attack surface inside the sandbox. An image with bash, curl, and a package manager gives an attacker everything they need once inside. A distroless image gives them nothing to work with.

In plain English: A normal container image is a furnished apartment: tools everywhere, bash shell, package manager. An attacker who gets in can redecorate, install things, phone home. A distroless image is an empty concrete room: just your application binary, nothing else. An attacker who gets in has nothing to use.

The hierarchy from smallest to largest attack surface:

Image base | Shell? | Package manager? | Notes
FROM scratch | No | No | Only for fully static binaries (Go with CGO disabled, or statically linked Rust)
cgr.dev/chainguard/static | No | No | Wolfi-based; CA certs + tz data only; best for Go/Rust
cgr.dev/chainguard/<runtime> | No | No | Python, Node, Java, etc.; runs as non-root by default
gcr.io/distroless/<runtime> | No | No | Google; Debian-stripped; slightly slower CVE patching
debian-slim / ubuntu-minimal | Yes | Yes | Still widely used; much larger attack surface

Use Chainguard or distroless images. Both ship with SBOM (software bill of materials) attestations and Sigstore signatures so you can verify exactly what is inside.

Multi-stage build targeting distroless
FROM cgr.dev/chainguard/go:latest AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o /app/server .

FROM cgr.dev/chainguard/static:latest
COPY --from=builder /app/server /server
ENTRYPOINT ["/server"]

Runtime sandboxes: the second wall for untrusted workloads

The baseline above applies to every pod. For workloads that process untrusted input — public APIs, user-uploaded content, third-party code execution, anything internet-facing — we add a runtime sandbox. This is a second, independent isolation boundary around the entire container.

There are two sandbox options:

Option A: gVisor — a software kernel in userspace

gVisor’s runsc binary intercepts every system call the container makes and handles it inside a userspace kernel called the Sentry — a Go program that implements about 250 Linux syscalls. The host Linux kernel only ever sees the narrow set of calls the Sentry itself needs. An attacker who exploits a kernel vulnerability inside the container reaches the Sentry, not the host kernel.

In plain English: Normally, a container shares the host’s operating system kernel, like tenants sharing the building’s plumbing. gVisor gives each tenant their own fake plumbing system. Attacks on the plumbing only break the fake one; the real building plumbing is untouched.

Platform choice: gVisor can intercept system calls through two main platforms. The one you want for homelab nodes running inside VMs is systrap: it relies on seccomp’s SECCOMP_RET_TRAP action, which makes the kernel raise a SIGSYS signal whenever the sandboxed process issues a system call, so interception works without hardware virtualisation. The alternative, KVM, requires direct access to hardware virtualisation extensions (/dev/kvm), which is unavailable or severely degraded inside a VM.

Important: Use the systrap platform when your k3s nodes are themselves VMs (Proxmox, VMware, cloud instances). Do not use the KVM platform inside a VM: it requires nested virtualisation and performs poorly under it.

Raspberry Pi / ARM64 caveat: gVisor requires a 48-bit virtual-address kernel (CONFIG_ARM64_VA_BITS_48=y). Raspberry Pi OS ships a 39-bit VA kernel. gVisor will fail on Raspberry Pi OS. The fix is to run Ubuntu on your Pi — Ubuntu for Pi ships a 48-bit VA kernel out of the box. Ubuntu on Pi 5 works with gVisor without any kernel customisation.

Warning: If you are on a Pi running Raspberry Pi OS, do not install gVisor. Switch to Ubuntu first, or skip gVisor and rely on the baseline controls only.

Option B: Kata Containers — a full hardware microVM per pod

Kata runs each pod inside a real lightweight virtual machine using KVM hardware virtualisation. The container filesystem and processes live inside a guest VM; the host sees only a KVM virtual machine process. The isolation is as strong as it gets — a full hardware boundary between the container and the host.

In plain English: gVisor gives the tenant a fake plumbing system. Kata gives them their own entirely separate building on a separate plot of land. The separation is real and physical (in hardware), not just logical.

The catch: Kata requires /dev/kvm (hardware virtualisation support on the CPU). If your k3s nodes are VMs themselves, Kata needs nested virtualisation — VMs inside VMs. This works on modern x86 Intel/AMD CPUs with nested virt enabled in your hypervisor (Proxmox, VMware), but many cloud providers do not offer it on standard instances. It also carries significant overhead: ~600 ms pod cold-start time, 8–12% steady-state CPU/memory cost, and 5–30% additional I/O overhead from double-translation.

Kata is the right choice when: your k3s nodes are bare metal (no nesting needed, full performance), or when you need the strongest possible isolation and can accept the overhead.

Latest stable Kata release: 3.31.0 (May 2025).


Comparison table

Property | runc + seccomp/AppArmor | gVisor (systrap) | Kata (QEMU/KVM)
Isolation model | Linux namespaces + cgroups + MAC | Userspace kernel (Sentry) intercepts all syscalls | Full hardware microVM per pod
Host kernel exposure | High: full kernel via namespaces | Low: Sentry filters to narrow interface | Very low: only KVM/VMM interface
Syscall overhead | ~70 ns | ~800 ns (≈11× slower per call) | Near-native inside VM
Steady-state CPU overhead | Baseline | <1% for most apps; worse for I/O-heavy | 8–12%
Memory overhead | ~0 | ~100 MB per sandbox | ~64–128 MB per pod (guest kernel)
App compatibility | Full | ~90%: no SELinux, no io_uring, limited ioctl | Near-full: full Linux in guest
Needs KVM | No | No | Yes, mandatory
Works inside a VM | Yes | Yes (systrap) | Needs nested virt
ARM64 / Pi support | Yes | Needs 48-bit VA: Ubuntu on Pi, not Raspberry Pi OS | Yes (Pi 4/5 have KVM; nested Pi VM: not practical)
Confidential computing | No | No | Optional (CoCo + TDX/SEV-SNP)
Installation complexity | None | Medium | Medium–High (kata-deploy DaemonSet)

Tip: For most homelab setups (k3s inside a VM), use gVisor with systrap for untrusted workloads. It requires no KVM, installs via apt, and provides a strong second isolation layer. Kata is optional unless you need the absolute strongest isolation and have either bare metal nodes or nested virt available.

How to mark a workload as untrusted (use the sandbox)

To route pods in a namespace through the gVisor sandbox, label the namespace with the exact key and value that the Kyverno policy enforces:

Label a namespace as untrusted (triggers gVisor requirement)
kubectl label namespace <ns> fortress.k3s/trust=untrusted

The Kyverno policy (scripts/config/supplychain/kyverno-policies/require-runtimeclass-untrusted.yaml) matches on fortress.k3s/trust: untrusted — pods in any namespace carrying that label must declare runtimeClassName: gvisor or admission is denied.
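A policy with that behaviour could look roughly like the sketch below. It is not the contents of the repository file, which may differ: the policy name, rule name, and message are assumptions, and only the label key/value and the required runtimeClassName come from the description above. Newer Kyverno releases prefer a per-rule failureAction over the spec-level field used here.

```yaml
# Sketch of a Kyverno policy that requires gVisor in namespaces labelled untrusted.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-runtimeclass-untrusted     # assumed name, mirrors the file name
spec:
  validationFailureAction: Enforce         # reject non-compliant pods instead of only auditing
  rules:
    - name: require-gvisor                 # assumed rule name
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaceSelector:
                matchLabels:
                  fortress.k3s/trust: untrusted   # only namespaces carrying this label are checked
      validate:
        message: "Pods in untrusted namespaces must set runtimeClassName: gvisor."
        pattern:
          spec:
            runtimeClassName: gvisor       # any other value, or a missing field, fails validation
```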

One field in the pod spec activates the sandbox:

Activating gVisor for an untrusted workload
spec:
  runtimeClassName: gvisor   # routes container creation to runsc instead of runc
  hostUsers: false
  # ... rest of hardened securityContext unchanged

The runtimeClassName field tells Kubernetes which registered runtime to hand this pod to. The gvisor and kata RuntimeClass objects (installed by script 22-runtime.sh) tell containerd which binary to use.

In plain English: runtimeClassName: gvisor is like a label on a package: “handle with extra care”. The cluster sees the label, looks up “gvisor” in its runtime registry, and sends the container to runsc instead of the normal runc launcher.

What to sandbox: any workload that processes input you do not control. Public-facing HTTP services, file parsers, code execution environments, anything that talks to third-party systems. Internal-only workloads talking only to trusted services can stay on the baseline.

Note: gVisor does not support SELinux inside the sandbox, and has limited ioctl support. A small number of applications (mainly those using io_uring for I/O performance, or applications that need in-container raw block devices) will need the runc runtime. Test before deploying to production.

Installing gVisor: step by step

This is handled by scripts/cluster/22-runtime.sh. For reference, the steps are:

  1. Add the gVisor APT repository and install runsc (the gVisor binary).
  2. Write /etc/containerd/runsc.toml selecting the systrap platform.
  3. Write the containerd template extension at /var/lib/rancher/k3s/agent/etc/containerd/config-v3.toml.tmpl — this file extends the k3s-managed base template, so it survives k3s upgrades cleanly. Never edit the rendered config at /var/lib/rancher/k3s/agent/etc/containerd/config.toml — k3s overwrites it on every restart.
  4. Restart k3s so containerd picks up the new runtime.
  5. Apply the RuntimeClass objects (gvisor, kata).
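
For step 5, the RuntimeClass objects are small. The sketch below shows their general shape; the handler values are assumptions and must match whatever runtime names the containerd template registers, so treat the names here as placeholders, not as what 22-runtime.sh actually writes.

```yaml
# Sketch: RuntimeClass objects mapping a pod's runtimeClassName to a containerd runtime handler.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor      # the value pods put in spec.runtimeClassName
handler: runsc      # assumed handler; must equal the runtime name in the containerd config
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata       # assumed handler; Kata installs often register variants such as kata-qemu
```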

Important: k3s ships containerd 2.0 (from v1.31.6+k3s1 / v1.32.2+k3s1 onward), which uses the v3 TOML schema. The template file must be config-v3.toml.tmpl, not config.toml.tmpl. The plugin key in the v3 schema is plugins.'io.containerd.cri.v1.runtime'. If you see examples using plugins."io.containerd.grpc.v1.cri", those are for the older v2 schema and will not work.

Tight per-workload seccomp profiles with the Security Profiles Operator

RuntimeDefault seccomp blocks ~50 dangerous syscalls. For high-value workloads you can do much better: generate a profile that only allows the exact syscalls your application actually makes. An attacker who exploits a memory-corruption vulnerability cannot pivot via any syscall the application never calls.

The Security Profiles Operator (SPO) automates this. It is a Kubernetes operator (a program running inside the cluster) that:

  • Distributes custom seccomp and AppArmor profiles to all nodes as custom resources (instances of the operator’s CRDs).
  • Records profiles from running workloads using eBPF — a safe, kernel-native tracing mechanism.
  • Binds profiles to pods without requiring changes to pod specs.
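
The third point, binding without editing pod specs, is done with a ProfileBinding resource that matches pods by container image. The sketch below assumes a SeccompProfile named my-app-recording already exists in the namespace and uses a hypothetical image reference. SPO's documentation also describes a namespace label (spo.x-k8s.io/enable-binding) that has to be present for bindings to take effect; check the behaviour of the version you run.

```yaml
# Sketch: attach an existing SeccompProfile to every pod in the namespace that runs a given image.
apiVersion: security-profiles-operator.x-k8s.io/v1alpha1
kind: ProfileBinding
metadata:
  name: my-app-binding            # hypothetical name
  namespace: default
spec:
  profileRef:
    kind: SeccompProfile
    name: my-app-recording        # assumed profile name; must exist in this namespace
  image: registry.example.com/my-app:1.0.0   # hypothetical image; pods using it get the profile injected
```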

In plain English: Instead of using a standard “block the 50 most dangerous syscalls” filter, SPO watches your application running normally and builds a custom list: “this app only ever calls these 37 specific system calls, so block everything else”. If an attacker tries a technique that requires a system call your app never uses, the kernel refuses it outright.

How profile recording works

  1. Create a ProfileRecording resource whose podSelector matches the pod’s labels, before the pod is created.
  2. Run the pod through its normal workload.
  3. SPO’s eBPF recorder captures every system call made.
  4. When the recorded pod is deleted, SPO writes a SeccompProfile custom resource containing the allow-list.
  5. Reference the generated profile from your pod’s securityContext.

Requirement for eBPF recording: the node kernel must expose /sys/kernel/btf/vmlinux — available on kernels built with CONFIG_DEBUG_INFO_BTF=y, which is standard on Ubuntu 22.04+ and Debian 12+.

Triggering a profile recording
apiVersion: security-profiles-operator.x-k8s.io/v1alpha1
kind: ProfileRecording
metadata:
  name: my-app-recording
  namespace: default
spec:
  kind: SeccompProfile
  recorder: bpf              # eBPF recorder
  podSelector:
    matchLabels:
      app: my-app
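
For orientation, a generated profile has roughly the following shape. This is a hand-written sketch, not recorder output: the resource name and the syscall list are illustrative assumptions, and a real recording will contain whatever the application actually called.

```yaml
# Sketch of a SeccompProfile resource with an allow-list; values are illustrative only.
apiVersion: security-profiles-operator.x-k8s.io/v1beta1
kind: SeccompProfile
metadata:
  name: my-app-recording          # assumed name; check the name SPO actually generates
  namespace: default
spec:
  defaultAction: SCMP_ACT_ERRNO   # any syscall not listed below fails with an error
  architectures:
    - SCMP_ARCH_X86_64
  syscalls:
    - action: SCMP_ACT_ALLOW      # only these calls are permitted
      names:                      # illustrative subset, not a working allow-list
        - read
        - write
        - close
        - futex
        - epoll_wait
        - accept4
        - exit_group
```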

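After recording, reference the generated profile. The path below is illustrative; the exact value is reported in the generated SeccompProfile’s status.localhostProfile field: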

Using a recorded custom seccomp profile
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: operator/default/my-app-recording.json

Tip: Record in a staging environment under realistic load. Profiles generated from idle pods will be too narrow and will cause failures in production when the app tries a syscall it did not make during recording. Always test the generated profile before enforcing in production.

SPO is installed via Helm by 22-runtime.sh. Values are in scripts/config/runtime/spo-values.yaml.


What this layer bought you

Before this layer, a container escape gave an attacker the same privileges on the host as the container process had. After:

  • PSA restricted blocks misconfigured pods before they start — the most common escape vectors (privileged containers, hostPath mounts, host networking) never get scheduled.
  • Seccomp RuntimeDefault (set cluster-wide in Layer 2) removes ~50 dangerous system calls. SPO lets you tighten this further to exactly what each application needs.
  • AppArmor RuntimeDefault adds a second, independent MAC filter on what the syscalls can access, such as blocking writes to sensitive paths under /proc and /sys.
  • User namespaces (hostUsers: false) mean that a successful container escape lands the attacker as an unprivileged user on the host, not as root.
  • Drop ALL capabilities removes every sliver of elevated Linux privilege. No capability means no kernel module loading, no raw network sockets, no filesystem ACL manipulation.
  • Read-only root filesystem means there is no writable disk surface inside the container to drop tools, persist payloads, or modify the running application.
  • Distroless images remove the shell and tools an attacker would use to act on any of the above.
  • gVisor (for untrusted workloads) interposes a second kernel between the application and the host — a kernel exploit inside the container reaches the gVisor Sentry, not the host.
  • SPO tight profiles narrow the seccomp surface from “everything except 50 bad calls” to “only the 30-something calls this app actually makes”.

The result: a container escape that has to defeat every one of these independent controls at the same time is not a realistic attack path.