Operations & Maintenance

Day-2: the cluster is built. Here’s how to actually use it, change it, and keep it healthy — safely.

Building the fortress was the one-time part. This chapter is the part you’ll return to: deploying apps, giving them storage, opening the exact doors they need, scaling them, patching everything, and recovering when something breaks. Every recipe keeps the security model intact — you open things deliberately, never broadly.

Important: Two habits make everything here safe. One: every change goes through the same default-deny gates, so a new app gets nothing until you grant it. Two: keep your changes in Git (Layer ✦ / Flux) so the cluster is reproducible and every change is reviewable. Treat the live cluster as disposable; treat the Git repo as the truth.

Deploying an application (the secure baseline)

Every workload you deploy must satisfy the restricted Pod Security Standard and the Kyverno policies, or admission control will reject it. Start from the hardened template so it passes the first time:

On your admin laptop
# The fully-locked-down reference Deployment (drop caps, read-only rootfs, non-root, seccomp):
kubectl apply -f scripts/config/runtime/hardened-pod-template.yaml

In plain English: Think of the template as a pre-approved form. If you fill out your app’s details on this form, the gatekeeper waves you through. Invent your own form and you’ll be turned away until it meets the rules, which is exactly what you want.
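
For orientation, here is a minimal sketch of what a Deployment that satisfies the restricted Pod Security Standard tends to look like. It is not a copy of hardened-pod-template.yaml: the app name, image, user ID, and resource numbers are hypothetical, and your Kyverno policies may demand more (labels, limits, image sources) than this shows.

```yaml
# Sketch only. The authoritative version is
# scripts/config/runtime/hardened-pod-template.yaml.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # hypothetical app name
  namespace: my-app
spec:
  replicas: 1
  selector:
    matchLabels: { app: my-app }
  template:
    metadata:
      labels: { app: my-app }
    spec:
      securityContext:
        runAsNonRoot: true                         # required by "restricted"
        runAsUser: 10001                           # hypothetical non-root UID
        seccompProfile: { type: RuntimeDefault }   # required by "restricted"
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0   # hypothetical image
          securityContext:
            allowPrivilegeEscalation: false        # required by "restricted"
            capabilities: { drop: ["ALL"] }        # required by "restricted"
            readOnlyRootFilesystem: true           # extra hardening from the template
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { memory: 256Mi }
          volumeMounts:
            - name: tmp
              mountPath: /tmp                      # writable scratch space
      volumes:
        - name: tmp
          emptyDir: {}
```

The emptyDir at /tmp is there because a read-only root filesystem leaves the app nowhere to write temporary files otherwise.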

To run an untrusted or internet-facing app inside the gVisor sandbox, add one line to its pod spec:

Put a workload in the sandbox
spec:
  runtimeClassName: gvisor   # this pod now runs in the user-space sandbox
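
The runtimeClassName only works if a RuntimeClass object with that name exists in the cluster, which the build layers presumably created. A sketch of what that object usually looks like, assuming the containerd handler is named runsc (the conventional name for gVisor, but an assumption here):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor     # must match runtimeClassName in the pod spec
handler: runsc     # assumption: the containerd runtime handler configured for gVisor
```

If a sandboxed pod is rejected or stays Pending, kubectl get runtimeclass is a quick way to confirm the name matches.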

Provisioning persistent storage

When an app needs storage that survives restarts, you create a PersistentVolumeClaim against the encrypted StorageClass. The volume is LUKS-encrypted on the USB drive automatically.

my-app-data.yaml — a 10Gi encrypted volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
  namespace: my-app
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-encrypted   # per-volume LUKS2; see Layer 5
  resources:
    requests:
      storage: 10Gi

Create it and attach it
kubectl apply -f my-app-data.yaml
kubectl get pvc -n my-app                 # STATUS should become 'Bound'

Then mount it in your pod with a volumes: + volumeMounts: block (full example in scripts/config/storage/example-pvc.yaml).
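
As a rough illustration of that volumes: + volumeMounts: pairing (the repo's example-pvc.yaml remains the reference), a pod mounting my-app-data could look like this. The image, mount path, and user/group IDs are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: my-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    fsGroup: 10001                        # lets the non-root user write to the volume
    seccompProfile: { type: RuntimeDefault }
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.0.0   # hypothetical image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities: { drop: ["ALL"] }
      volumeMounts:
        - name: data                      # refers to the volume below
          mountPath: /var/lib/my-app      # hypothetical path inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-app-data            # the PVC created above
```

Because the claim is ReadWriteOnce, only pods on a single node can mount it at a time.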

Tip: Need an S3 bucket instead (for backups, large files, app object storage)? Create one in Garage (see Layer 5). Your apps then talk plain S3 to the in-cluster endpoint, encrypted in transit and at rest.

Autoscaling

There are two kinds of scaling. Be clear-eyed about which applies to a homelab.

In plain English: Scaling the app (more copies of a pod when busy) works great on one machine. Scaling the cluster (adding machines automatically) needs a cloud to summon new machines from, so it doesn’t apply to a single box. On a homelab you add nodes by hand when you buy more hardware.

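The Horizontal Pod Autoscaler (HPA) automatically adds or removes pod copies based on CPU or memory. It needs the resource metrics API, served by metrics-server (or Prometheus Adapter); the observability stack provides the metrics. A Utilization target is measured against each container’s resource request, so the target Deployment must set CPU requests.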

hpa.yaml — scale my-app between 2 and 8 pods at 70% CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Apply and watch it scale
kubectl apply -f hpa.yaml
kubectl get hpa -n my-app -w
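
The 70% target is a percentage of each container's CPU request, so the HPA can only compute it if the Deployment declares requests. A sketch of the relevant fragment, with illustrative numbers rather than recommendations:

```yaml
# Fragment of the my-app Deployment's pod template (values are illustrative)
spec:
  template:
    spec:
      containers:
        - name: my-app
          resources:
            requests:
              cpu: 250m        # 70% utilization means about 175m average per pod
              memory: 256Mi
            limits:
              memory: 512Mi
```

Without a CPU request, kubectl get hpa typically shows the target as unknown and no scaling happens.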

Note: Vertical Pod Autoscaler (VPA) can right-size CPU/memory requests over time, which helps avoid waste on small nodes. Install it from the Kubernetes autoscaler project and run it in recommendation-only mode (updateMode: "Off") first. Cluster Autoscaler is for clouds; on bare metal, you add capacity by provisioning another node and joining it (see below).
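
A minimal sketch of a recommendation-only VPA for the same Deployment, assuming the VPA components and CRDs are already installed:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app
  namespace: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"    # compute recommendations only; never evict or resize pods
```

In this mode kubectl describe vpa my-app -n my-app shows the suggested requests, and you copy the ones you agree with into Git yourself.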

Adding a node by hand

On the new machine (after running Phases 0-1)
# Get the join token from the first server:
#   sudo cat /var/lib/rancher/k3s/server/node-token
curl -sfL https://get.k3s.io | K3S_URL=https://<server-wg-ip>:6443 \
  K3S_TOKEN=<token> sh -s - agent --node-label fortress=true
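
If you would rather keep the token out of your shell history, the same settings can live in the k3s config file on the new machine. This is a sketch under the assumption that the file is created before running the installer and that the installer is still invoked in agent mode (for example with sh -s - agent); the placeholders are the same ones used above.

```yaml
# /etc/rancher/k3s/config.yaml on the new machine (root-owned, mode 600)
server: "https://<server-wg-ip>:6443"   # same value as K3S_URL
token: "<token>"                        # same value as K3S_TOKEN
node-label:
  - fortress=true
```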

Opening the network — ports, IPs, and hostnames

The network is default-deny: a new pod can’t talk to anything until you allow it. Granting each path explicitly, and nothing wider, is the single most important operational habit. Three common cases:

Allow a specific port from a specific IP/CIDR

Copy and edit the annotated template — it allows only the port and source you name:

scripts/config/network/example-allow-port-ip.yaml (excerpt)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-db-from-app
  namespace: my-app
spec:
  endpointSelector:
    matchLabels: { app: postgres }     # the pod RECEIVING traffic
  ingress:
    - fromCIDR:
        - 10.0.0.0/24                    # only this source range
      toPorts:
        - ports:
            - port: "5432"               # only this port
              protocol: TCP

Apply it
kubectl apply -f example-allow-port-ip.yaml
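
One caveat worth checking against the Cilium documentation: CIDR rules are intended for peers outside the cluster, while traffic between pods that Cilium manages is matched by label identity rather than by IP. If the client is another pod in the same namespace, a label-based variant along these lines is likely what you want. The policy name and the app: my-app client label are hypothetical.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-db-from-app-pods     # hypothetical name
  namespace: my-app
spec:
  endpointSelector:
    matchLabels: { app: postgres }       # the pod RECEIVING traffic
  ingress:
    - fromEndpoints:
        - matchLabels: { app: my-app }   # hypothetical label on the client pods
      toPorts:
        - ports:
            - port: "5432"
              protocol: TCP
```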

Allow egress to specific hostnames (FQDN policy)

To let a pod reach only api.github.com and nothing else on the internet, use Cilium’s DNS-aware (FQDN) policy:

scripts/config/network/example-fqdn-egress.yaml (excerpt)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: egress-github-only
  namespace: my-app
spec:
  endpointSelector:
    matchLabels: { app: my-app }
  egress:
    - toEndpoints:                        # must allow DNS first
        - matchLabels:
            io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports: [{ port: "53", protocol: ANY }]
          rules:
            dns: [{ matchPattern: "*" }]
    - toFQDNs:
        - matchName: "api.github.com"     # the ONLY external host allowed
      toPorts:
        - ports: [{ port: "443", protocol: TCP }]

Hardened: This is far stronger than an IP allow-list: it follows the name, survives the service changing IPs, and blocks data exfiltration to anywhere you didn’t name.
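
The DNS rule in that excerpt allows lookups for any name (matchPattern: "*"). A tighter variant, sketched here as a replacement for the first egress rule, limits lookups to the one host the pod is meant to reach. This is an optional hardening step and not part of the repo's example file.

```yaml
# Sketch: stricter first rule for the egress-github-only policy
egress:
  - toEndpoints:
      - matchLabels:
          io.kubernetes.pod.namespace: kube-system
          k8s-app: kube-dns
    toPorts:
      - ports: [{ port: "53", protocol: ANY }]
        rules:
          dns:
            - matchName: "api.github.com"   # only this name may be resolved
```

The trade-off is that lookups for any other name, including in-cluster service names, are refused, so a pod that also calls internal services needs those names listed too.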

Exposing a service to the outside with TLS

To publish a web service safely, use Cilium’s Gateway API support with an automatic Let’s Encrypt certificate (from Layer 4):

Expose a service over HTTPS
kubectl apply -f scripts/config/network/gateway-tls.yaml   # edit host + service first
kubectl get gateway -A                                      # PROGRAMMED = True when ready
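
For a sense of what you are editing, a Gateway API setup of this kind usually has the shape below. This is a sketch, not the contents of gateway-tls.yaml: it assumes cert-manager issues the certificate through a ClusterIssuer named letsencrypt, and the hostname, Secret name, Service name, and port are all placeholders.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-app
  namespace: my-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt   # assumption: issuer name from Layer 4
spec:
  gatewayClassName: cilium
  listeners:
    - name: https
      hostname: app.example.com       # placeholder host
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate
        certificateRefs:
          - name: my-app-tls          # Secret that will hold the issued certificate
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
  namespace: my-app
spec:
  parentRefs:
    - name: my-app                    # attach to the Gateway above
  hostnames: ["app.example.com"]
  rules:
    - backendRefs:
        - name: my-app                # placeholder Service name
          port: 8080                  # placeholder Service port
```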

Warning: Only expose what you must, and put a sandbox (runtimeClassName: gvisor) on anything reachable from the internet. Everything you expose enlarges the attack surface, so keep the list short and reviewed.

Managing secrets

Never put a plaintext secret in Git or a pod’s environment. Encrypt it with SOPS+age; only a holder of the age private key can decrypt it, and Flux, which keeps a copy of that key inside the cluster, decrypts it on the way in:

Encrypt a secret for Git (from Layer ✦)
sops --encrypt --age "$AGE_PUBLIC_KEY" secret.yaml > secret.enc.yaml
git add secret.enc.yaml          # safe to commit — it's ciphertext
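
On the Flux side, decryption is switched on per Kustomization. A sketch of what that typically looks like, assuming the age private key is stored in a Secret named sops-age in flux-system and that the repo path is ./apps/my-app (both are assumptions; Layer ✦ defines the real names):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/my-app          # hypothetical path in the Git repo
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops
    secretRef:
      name: sops-age           # assumption: Secret holding the age private key
```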

Rotate a secret by re-encrypting a new value and letting Flux roll it out. Rotate the cluster’s at-rest encryption key periodically:

On the server (root)
sudo k3s secrets-encrypt rotate-keys
sudo k3s secrets-encrypt status

Backups & restore

Two things must be backed up: the cluster state (etcd) and your data (volumes/buckets). Both go to the encrypted object store.

On-demand etcd snapshot to Garage/S3 (on the server)
# k3s takes scheduled etcd snapshots on its own; this takes an extra, named one and ships it to Garage/S3:
sudo k3s etcd-snapshot save --s3 --s3-bucket=cluster-backups \
  --s3-endpoint=<garage-endpoint> --name pre-change
sudo k3s etcd-snapshot ls
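
To have the scheduled snapshots go to the bucket as well, the same settings can be set once in the server's k3s config file. The keys below are standard k3s server options; the schedule, retention, and credential placeholders are illustrative.

```yaml
# /etc/rancher/k3s/config.yaml on the server (values are illustrative)
etcd-snapshot-schedule-cron: "0 */6 * * *"   # every six hours
etcd-snapshot-retention: 12                  # keep the 12 most recent snapshots
etcd-s3: true
etcd-s3-endpoint: "<garage-endpoint>"
etcd-s3-bucket: cluster-backups
etcd-s3-access-key: "<access-key>"           # placeholder: Garage key ID
etcd-s3-secret-key: "<secret-key>"           # placeholder: Garage secret
```

k3s reads this file at startup, so restart the k3s service after editing it.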
Back up volume data with Velero + Kopia (see Layer 5)
# Trigger a named on-demand backup; Velero uses Kopia to snapshot PVs into the Garage bucket:
velero backup create pre-change --include-namespaces my-app --wait
velero backup describe pre-change   # confirm Phase: Completed
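
For recurring backups, Velero has a Schedule resource that wraps the same backup spec. A sketch, assuming Velero runs in the velero namespace; the name, cron expression, and retention are illustrative.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: my-app-nightly           # hypothetical name
  namespace: velero              # assumption: Velero's install namespace
spec:
  schedule: "0 3 * * *"          # every night at 03:00
  template:
    includedNamespaces: ["my-app"]
    defaultVolumesToFsBackup: true   # back up pod volumes with the file-system (Kopia) path
    ttl: 720h0m0s                    # keep each backup for 30 days
```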

Important: Test your restores. A backup you’ve never restored is a hope, not a backup. Once a quarter, restore an etcd snapshot to a throwaway VM and confirm it boots. Restore docs: https://docs.k3s.io/datastore/backup-restore.

Updating & patching everything

What                               | How                                                            | Cadence
Host OS security patches           | unattended-upgrades (automatic, from L0)                       | Automatic, nightly
Immutable VM (Flatcar/Talos)       | Atomic auto-update + reboot into new image                     | Automatic, with rollback
k3s                                | Re-run installer pinned to the new version, one node at a time | Monthly, after reading release notes
Helm add-ons (Cilium, monitoring…) | helm repo update then helm upgrade with your saved values      | Monthly
Container images                   | Trivy flags CVEs; rebuild/redeploy from a patched base         | As Trivy reports

Upgrade k3s safely (server)
# read the release notes first; pass the same installer flags and env vars you used at install time:
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.36.2+k3s1 sh -
sudo k3s kubectl get nodes        # confirm Ready on the new version

Hardened: Because the OS is immutable and the config is in Git, a bad upgrade is not a disaster: roll the VM back to the previous image, or rebuild from the runbook.

Routine health checks

Your daily/weekly glance
kubectl get nodes                                     # every node should be Ready
kubectl get pods -A | grep -Ev 'Running|Completed'    # anything not healthy?
kubectl top nodes                              # resource pressure?
# Grafana (over WireGuard): cluster + security dashboards
# Tetragon/Falco alerts in Loki: any runtime detections?

Set the alerts from Layer 6 to reach you (ntfy/Discord/email) so you don’t have to remember to look. The whole point of Layer 6 is that the cluster tells you.
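
If Layer 6 routes alerts through Alertmanager (an assumption; this chapter does not say which component does the routing), pointing everything at one notification endpoint can be as small as the sketch below. The receiver name and URL are placeholders for whatever ntfy, Discord, or email bridge you set up.

```yaml
# Alertmanager configuration fragment (sketch)
route:
  receiver: homelab-notify
  group_by: ["alertname", "namespace"]
  repeat_interval: 4h            # re-notify while an alert keeps firing
receivers:
  - name: homelab-notify         # hypothetical receiver name
    webhook_configs:
      - url: "https://ntfy.example.com/fortress-alerts"   # placeholder endpoint
        send_resolved: true
```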

When something looks wrong

  1. Don’t panic, and don’t kubectl delete blindly. Snapshot first (etcd-snapshot save).
  2. Check Tetragon/Falco events and the audit log — what happened and who did it.
  3. If a node is compromised, treat it as cattle: cordon, drain, wipe, and rebuild from the runbook. The immutable design makes this minutes, not hours.
  4. Rotate any credential that could have been exposed (WireGuard keys, k3s token, secrets-encrypt key).

Full symptom-by-symptom help is in Troubleshooting & FAQ.

What this chapter bought you

The day-2 muscle memory: deploy within the rules, grant storage and network access one deliberate step at a time, scale the app, patch everything on a schedule, back up to encrypted object storage, and rebuild fearlessly when needed. The fortress isn’t just secure — it’s operable.