Operations & Maintenance
Day-2: the cluster is built. Here’s how to actually use it, change it, and keep it healthy — safely.
Building the fortress was the one-time part. This chapter is the part you’ll return to: deploying apps, giving them storage, opening the exact doors they need, scaling them, patching everything, and recovering when something breaks. Every recipe keeps the security model intact — you open things deliberately, never broadly.
Two habits make everything here safe. One: every change goes through the same default-deny gates, so a new app gets nothing until you grant it. Two: keep your changes in Git (Layer ✦ / Flux) so the cluster is reproducible and every change is reviewable. Treat the live cluster as disposable; treat the Git repo as the truth.
Deploying an application (the secure baseline)
Every workload you deploy must satisfy the restricted Pod Security Standard and the Kyverno policies, or admission control will reject it. Start from the hardened template so it passes the first time:
# The fully-locked-down reference Deployment (drop caps, read-only rootfs, non-root, seccomp):
kubectl apply -f scripts/config/runtime/hardened-pod-template.yaml
Think of the template as a pre-approved form. If you fill out your app’s details on this form, the gatekeeper waves you through. Invent your own form and you’ll be turned away until it meets the rules — which is exactly what you want.
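The template file itself is not reproduced in this chapter, so the following is only a sketch of what a Deployment that meets the restricted standard typically looks like. The name, namespace, image, port, user ID, and resource figures are hypothetical placeholders, and scripts/config/runtime/hardened-pod-template.yaml remains the authoritative version (your Kyverno policies may require more than is shown here).

```yaml
# Illustrative only: a Deployment shaped to pass the "restricted" Pod Security Standard.
# All names and the image are placeholders, not the contents of the real template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
spec:
  replicas: 1
  selector:
    matchLabels: { app: my-app }
  template:
    metadata:
      labels: { app: my-app }
    spec:
      automountServiceAccountToken: false # no API token unless the app needs one
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001 # any non-zero UID; placeholder value
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0 # placeholder image
          ports:
            - containerPort: 8080
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { memory: 256Mi }
          volumeMounts:
            - name: tmp
              mountPath: /tmp # writable scratch space despite the read-only rootfs
      volumes:
        - name: tmp
          emptyDir: {}
```

The emptyDir mount is there because many apps expect a writable /tmp, which a read-only root filesystem would otherwise deny.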
To run an untrusted or internet-facing app inside the gVisor sandbox, add one line to its pod spec:
spec:
  runtimeClassName: gvisor # this pod now runs in the user-space sandbox
Provisioning persistent storage
When an app needs storage that survives restarts, you create a PersistentVolumeClaim against the encrypted StorageClass. The volume is LUKS-encrypted on the USB drive automatically.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
  namespace: my-app
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-encrypted # per-volume LUKS2; see Layer 5
  resources:
    requests:
      storage: 10Gi
kubectl apply -f my-app-data.yaml
kubectl get pvc -n my-app # STATUS should become 'Bound'
Then mount it in your pod with a volumes: + volumeMounts: block (full example in scripts/config/storage/example-pvc.yaml).
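For orientation, a mount of the claim above might look roughly like the fragment below. It is a sketch, not the contents of example-pvc.yaml: the container name, image, and mount path are assumptions.

```yaml
# Pod template fragment (sits under spec.template.spec of a Deployment).
# Container name, image, and mount path are placeholders.
containers:
  - name: my-app
    image: registry.example.com/my-app:1.0.0
    volumeMounts:
      - name: data
        mountPath: /var/lib/my-app # where the app sees the encrypted volume
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-app-data # must match the PVC name
```

Because the claim is ReadWriteOnce, the volume can be attached to only one node at a time, so a workload using it usually runs as a single replica.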
Need an S3 bucket instead (for backups, large files, app object storage)? Create one in Garage — see Layer 5. Your apps then talk plain S3 to the in-cluster endpoint, encrypted in transit and at rest.
Autoscaling
There are two kinds of scaling. Be clear-eyed about which applies to a homelab.
Scaling the app (more copies of a pod when busy) works great on one machine. Scaling the cluster (adding machines automatically) needs a cloud to summon new machines from — it doesn’t apply to a single box. On a homelab you add nodes by hand when you buy more hardware.
Horizontal Pod Autoscaler (HPA) — recommended
Automatically adds/removes pod copies based on CPU or memory. Requires the metrics-server (or Prometheus Adapter); the observability stack provides metrics.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
kubectl apply -f hpa.yaml
kubectl get hpa -n my-app -w
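A Utilization target is measured against each container's CPU request, so the HPA can only compute a percentage if the target Deployment declares requests. A minimal container fragment, with illustrative values:

```yaml
# Container fragment in the my-app Deployment; the figures are placeholders.
resources:
  requests:
    cpu: 250m # a 70% target then means roughly 175m average CPU per pod
    memory: 256Mi
  limits:
    memory: 512Mi
```

Without a CPU request the HPA typically reports its target as unknown and does not scale.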
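Vertical Pod Autoscaler (VPA) can right-size CPU/memory requests over time, which is useful on small nodes to avoid waste. Install it from the Kubernetes autoscaler project and run it in recommendation-only mode (updateMode: "Off") first, so it reports suggested requests without restarting pods. Avoid letting VPA and HPA both act on CPU for the same workload, since they will work against each other. Cluster Autoscaler is for clouds; on bare metal, to add capacity you provision another node and join it (see below).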
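A recommendation-only VPA for the same Deployment could be sketched as follows, assuming the VPA components and CRDs are already installed. The object name is a placeholder.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app
  namespace: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off" # recommend only; never evict or resize pods
```

Once it has observed some traffic, kubectl describe vpa my-app -n my-app shows the suggested requests, which you can then copy into the Deployment in Git.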
Adding a node by hand
# Get the join token from the first server:
# sudo cat /var/lib/rancher/k3s/server/node-token
curl -sfL https://get.k3s.io | K3S_URL=https://<server-wg-ip>:6443 \
K3S_TOKEN=<token> sh -s - agent --node-label fortress=true
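The same join settings can also live in the k3s configuration file instead of on the command line, which keeps the token out of shell history. A sketch of that file on the new node, using the same placeholders as the command above:

```yaml
# /etc/rancher/k3s/config.yaml on the joining node (read by the k3s agent at start-up).
# Restrict the file to root, since it holds the join token.
server: https://<server-wg-ip>:6443
token: <token>
node-label:
  - fortress=true
```

With this file in place, the installer only needs to be told to install the agent role (sh -s - agent), and the remaining settings are read from the file.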
Opening the network — ports, IPs, and hostnames
The network is default-deny: a new pod can’t talk to anything until you allow it. This is the single most important operational habit. Three common cases:
Allow a specific port from a specific IP/CIDR
Copy and edit the annotated template — it allows only the port and source you name:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-db-from-app
  namespace: my-app
spec:
  endpointSelector:
    matchLabels: { app: postgres } # the pod RECEIVING traffic
  ingress:
    - fromCIDR:
        - 10.0.0.0/24 # only this source range
      toPorts:
        - ports:
            - port: "5432" # only this port
              protocol: TCP
kubectl apply -f example-allow-port-ip.yaml
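A CIDR rule is meant for sources outside the cluster; Cilium does not apply CIDR rules to traffic between pods it manages. When the client is another pod, a label-based variant is the usual shape. The sketch below assumes the client pods carry the label app: my-app and run in the same namespace; the policy name is a placeholder.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-db-from-app-pods
  namespace: my-app
spec:
  endpointSelector:
    matchLabels: { app: postgres } # the pod RECEIVING traffic
  ingress:
    - fromEndpoints:
        - matchLabels: { app: my-app } # only pods with this label, same namespace
      toPorts:
        - ports:
            - port: "5432" # only this port
              protocol: TCP
```

Selecting by label rather than by IP means the rule keeps working when pods are rescheduled and receive new addresses.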
Allow egress to specific hostnames (FQDN policy)
To let a pod reach only api.github.com and nothing else on the internet, use Cilium’s L7 DNS-aware policy:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: egress-github-only
  namespace: my-app
spec:
  endpointSelector:
    matchLabels: { app: my-app }
  egress:
    - toEndpoints: # must allow DNS first
        - matchLabels:
            io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports: [{ port: "53", protocol: ANY }]
          rules:
            dns: [{ matchPattern: "*" }]
    - toFQDNs:
        - matchName: "api.github.com" # the ONLY external host allowed
      toPorts:
        - ports: [{ port: "443", protocol: TCP }]
This is far stronger than an IP allow-list: it follows the name, survives the service changing IPs, and blocks data exfiltration to anywhere you didn’t name.
Exposing a service to the outside with TLS
To publish a web service safely, use the Cilium Gateway API with an automatic Let’s Encrypt certificate (from Layer 4):
kubectl apply -f scripts/config/network/gateway-tls.yaml # edit host + service first
kubectl get gateway -A # PROGRAMMED = True when ready
Only expose what you must, and put a sandbox (runtimeClassName: gvisor) on anything reachable from the internet. Everything you expose enlarges the attack surface, so keep the list short and reviewed.
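The contents of gateway-tls.yaml are not shown in this chapter, so the following is only a rough sketch of what a TLS-terminating Gateway and its route generally look like. It assumes cert-manager issues the Let’s Encrypt certificate through a ClusterIssuer named letsencrypt; that issuer name, the hostname, the Secret name, and the backend port are all placeholders.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-app
  namespace: my-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt # assumed issuer name
spec:
  gatewayClassName: cilium
  listeners:
    - name: https
      hostname: app.example.com # placeholder host
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate
        certificateRefs:
          - name: my-app-tls # Secret expected to hold the issued certificate
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
  namespace: my-app
spec:
  parentRefs:
    - name: my-app # attach to the Gateway above
  hostnames: ["app.example.com"]
  rules:
    - backendRefs:
        - name: my-app # the Service to publish
          port: 8080
```

Only the one hostname and the one Service named in the route become reachable; anything not listed stays behind the default-deny policy.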
Managing secrets
Never put a plaintext secret in Git or a pod’s environment. Encrypt it with SOPS+age; only your private key can decrypt it, and Flux decrypts it on the way into the cluster:
sops --encrypt --age "$AGE_PUBLIC_KEY" secret.yaml > secret.enc.yaml
git add secret.enc.yaml # safe to commit — it's ciphertext
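On the cluster side, Flux needs to be told to decrypt with SOPS. A minimal sketch of a Flux Kustomization that does so is shown below; the repository path and the Secret name sops-age are assumptions, and the Secret is expected to hold the age private key.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/my-app # hypothetical path inside the Git repo
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops
    secretRef:
      name: sops-age # Secret in flux-system containing the age private key
```

The private key lives only in that Secret and in your own offline copy; it never goes into Git.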
Rotate a secret by re-encrypting a new value and letting Flux roll it out. Rotate the cluster’s at-rest encryption key periodically:
sudo k3s secrets-encrypt rotate-keys
sudo k3s secrets-encrypt status
Backups & restore
Two things must be backed up: the cluster state (etcd) and your data (volumes/buckets). Both go to the encrypted object store.
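Scheduled etcd snapshots and their S3 destination can be declared in the k3s server configuration file rather than passed as flags each time. The sketch below assumes the embedded etcd datastore; the schedule and retention are illustrative, and the endpoint stays a placeholder.

```yaml
# /etc/rancher/k3s/config.yaml on a server node
etcd-snapshot-schedule-cron: "0 */6 * * *" # every six hours
etcd-snapshot-retention: 10 # keep the ten most recent snapshots
etcd-s3: true
etcd-s3-endpoint: <garage-endpoint>
etcd-s3-bucket: cluster-backups
# S3 credentials are supplied separately via etcd-s3-access-key / etcd-s3-secret-key
```

Restart the k3s service after editing the file so the new schedule takes effect.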
# k3s takes scheduled etcd snapshots; ship them to Garage/S3:
sudo k3s etcd-snapshot save --s3 --s3-bucket=cluster-backups \
--s3-endpoint=<garage-endpoint> --name pre-change
sudo k3s etcd-snapshot ls
# Trigger a named on-demand backup; Velero uses Kopia to snapshot PVs into the Garage bucket:
velero backup create pre-change --include-namespaces my-app --wait
velero backup describe pre-change # confirm Phase: Completed
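On-demand backups cover planned changes; for everything else a recurring schedule is the usual complement. A sketch of a Velero Schedule, assuming Velero runs in the velero namespace with file-system backup enabled (the schedule name, time, and retention are placeholders):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: my-app-nightly
  namespace: velero
spec:
  schedule: "0 3 * * *" # 03:00 every day
  template:
    includedNamespaces: ["my-app"]
    defaultVolumesToFsBackup: true # back up pod volumes at file level
    ttl: 720h0m0s # keep each backup for 30 days
```

Backups created by the schedule appear in velero backup get with the schedule name as a prefix.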
Test your restores. A backup you’ve never restored is a hope, not a backup. Once a quarter, restore an etcd snapshot to a throwaway VM and confirm it boots. Restore docs: https://docs.k3s.io/datastore/backup-restore.
Updating & patching everything
| What | How | Cadence |
|---|---|---|
| Host OS security patches | unattended-upgrades (automatic, from L0) | Automatic, nightly |
| Immutable VM (Flatcar/Talos) | Atomic auto-update + reboot into new image | Automatic, with rollback |
| k3s | Re-run installer pinned to the new version, one node at a time | Monthly, after reading release notes |
| Helm add-ons (Cilium, monitoring…) | helm repo update then helm upgrade with your saved values | Monthly |
| Container images | Trivy flags CVEs; rebuild/redeploy from a patched base | As Trivy reports |
# read the release notes first, then:
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.36.2+k3s1 sh -
sudo k3s kubectl get nodes # confirm Ready on the new version
Because the OS is immutable and the config is in Git, a bad upgrade is not a disaster — roll the VM back to the previous image, or rebuild from the runbook.
Routine health checks
kubectl get nodes # every node should be Ready
kubectl get pods -A | grep -Ev 'Running|Completed' # anything not healthy?
kubectl top nodes # resource pressure?
# Grafana (over WireGuard): cluster + security dashboards
# Tetragon/Falco alerts in Loki: any runtime detections?
Set the alerts from Layer 6 to reach you (ntfy/Discord/email) so you don’t have to remember to look. The whole point of Layer 6 is that the cluster tells you.
When something looks wrong
- Don’t panic, and don’t kubectl delete blindly. Snapshot first (etcd-snapshot save).
- Check Tetragon/Falco events and the audit log: what happened and who did it.
- If a node is compromised, treat it as cattle: cordon, drain, wipe, and rebuild from the runbook. The immutable design makes this minutes, not hours.
- Rotate any credential that could have been exposed (WireGuard keys, k3s token, secrets-encrypt key).
Full symptom-by-symptom help is in Troubleshooting & FAQ.
What this chapter bought you
The day-2 muscle memory: deploy within the rules, grant storage and network access one deliberate step at a time, scale the app, patch everything on a schedule, back up to encrypted object storage, and rebuild fearlessly when needed. The fortress isn’t just secure — it’s operable.