Troubleshooting & FAQ

When something won’t work, start here. Symptom → cause → fix, in plain language.

Hardened systems fail closed — when something’s misconfigured, it blocks rather than opens. That’s safer, but it means “it’s not working” usually means “a guard is doing its job and you haven’t told it to allow this yet.” This chapter maps the common symptoms to their real cause.

Tip: Two commands solve half of everything: kubectl get events -A --sort-by=.lastTimestamp | tail -20 (what just happened) and kubectl describe pod <name> (why this pod is unhappy). Read the message, not just the status.

I locked myself out of the host

Caution: This is the most common Day-0 mistake. Layer 0 sets the firewall to default-deny and SSH to keys-only. If you ran it over an SSH session without first allowing your key and your source IP, you’re out.

Fix: use the machine’s physical console (monitor + keyboard). You can’t fix this remotely, which is exactly why the runbook tells you to do the first run at the console. To prevent it next time, confirm that your SSH key works and that your admin IP is in the nftables allow-list before enabling the firewall.
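
The pre-flight check described above could be sketched as two small helpers. Both names (ip_in_ruleset, sshd_keys_only) are hypothetical and not part of the runbook's scripts. They read command output on stdin, and the allow-list check is a literal match only, so it will not recognise an address that is covered by a CIDR range.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; not part of the runbook's scripts.

# ip_in_ruleset IP: succeeds if IP appears as a whole token in the nftables
# ruleset read from stdin. Literal match only; CIDR ranges are not evaluated.
ip_in_ruleset() {
    grep -qwF -- "$1"
}

# sshd_keys_only: succeeds if the effective sshd configuration read from stdin
# (the output of `sshd -T`) disables passwords and enables public keys.
sshd_keys_only() {
    cfg=$(cat)
    printf '%s\n' "$cfg" | grep -qix 'passwordauthentication no' &&
        printf '%s\n' "$cfg" | grep -qix 'pubkeyauthentication yes'
}
```

One way to use them, with 203.0.113.7 standing in for your admin IP: sudo nft list ruleset | ip_in_ruleset 203.0.113.7 && sudo sshd -T | sshd_keys_only && echo "safe to enable the firewall". A passing check is not a substitute for opening a second SSH session and confirming you can still log in.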

kubectl says “connection refused” or “timeout”

Almost always: the WireGuard tunnel isn’t up. The API server (port 6443) is only reachable through the VPN — that’s by design.

On your admin laptop
sudo wg show                      # is the tunnel up and showing a recent handshake?
sudo wg-quick up wg0              # bring it up if not
ping 10.100.0.1                   # can you reach the host's WireGuard IP?
kubectl config view --minify | grep server   # does it point at the WireGuard IP, not a public one?

If wg show has no handshake: check the host’s firewall actually allows the WireGuard UDP port, and that your router forwards it.
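
To make "a recent handshake" concrete, the sketch below turns the output of wg show wg0 latest-handshakes into an age in seconds. The helper name handshake_age is hypothetical, and the interface name wg0 is taken from the commands above.

```shell
#!/bin/sh
# handshake_age NOW: reads `wg show <interface> latest-handshakes` output on
# stdin (one "<peer-public-key> <unix-timestamp>" line per peer) and prints the
# age in seconds of the most recent handshake, or "never" if no peer has
# completed one (WireGuard reports a timestamp of 0 in that case).
handshake_age() {
    awk -v now="$1" '
        $2 + 0 > latest + 0 { latest = $2 + 0 }
        END {
            if (latest > 0) print now - latest
            else print "never"
        }'
}
```

Example: sudo wg show wg0 latest-handshakes | handshake_age "$(date +%s)". While traffic is flowing, WireGuard re-handshakes roughly every two minutes, so an age of many minutes on a tunnel you are actively using, or "never", points at the firewall or port-forwarding checks above.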

A pod is stuck Pending or ContainerCreating

Find the real reason
kubectl describe pod <name> | sed -n '/Events:/,$p'
Message contains | Cause | Fix
FailedScheduling ... Insufficient cpu/memory | Node is full | Lower requests, or add a node
failed to provision volume | StorageClass/PVC issue | Check Layer 5; is the USB unlocked & mounted?
failed to create pod sandbox | Runtime/containerd issue | Check runtimeClassName exists; see gVisor below
network: ... not allowed | Default-deny network policy | Add an allow rule (Layer 4 / Operations)
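
The same lookup can be written as a small shell function, which is handy when you are triaging several pods at once. This is only a sketch: classify_pending is a hypothetical helper, and the patterns are the message fragments from the table, so wording that differs between Kubernetes versions will fall through to the default branch.

```shell
#!/bin/sh
# classify_pending MESSAGE: maps one event message to the likely cause,
# using the message fragments listed in the table.
classify_pending() {
    case "$1" in
        *FailedScheduling*Insufficient*)  echo "node is full" ;;
        *"failed to provision volume"*)   echo "storage class or PVC issue" ;;
        *"failed to create pod sandbox"*) echo "runtime or containerd issue" ;;
        *network:*"not allowed"*)         echo "default-deny network policy" ;;
        *)                                echo "unknown; read the full event" ;;
    esac
}
```

Example: classify_pending "$(kubectl describe pod <name> | grep Warning | tail -n 1)".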

My pod was REJECTED at creation (admission denied)

This is a success, not a bug — a guard refused a pod that breaks the rules.

See exactly which rule
kubectl get events -A | grep -i 'denied\|violat'
The error mentions | Which guard | What to do
PodSecurity ... restricted | Pod Security Admission (L2) | Add the restricted securityContext; start from hardened-pod-template.yaml
Kyverno ... disallow-privileged etc. | Kyverno policy (✦) | Remove the privileged/hostPath setting; you almost never need it
validation failure ... image is not signed | cosign verifyImages (✦) | Sign the image, or use one from an allowed, signed source
registry ... not allowed | restrict-registries (✦) | Push to your allow-listed registry, or add the registry to the policy
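
For the first row, it helps to see what a restricted securityContext looks like. The sketch below prints a minimal Pod manifest with the fields the Kubernetes "restricted" Pod Security Standard asks for. It is an illustration, not the contents of hardened-pod-template.yaml; the function name restricted_pod and the container name app are made up for the example.

```shell
#!/bin/sh
# restricted_pod NAME IMAGE: prints a Pod manifest whose securityContext
# satisfies the "restricted" Pod Security Standard. Illustrative only; this is
# not the contents of hardened-pod-template.yaml.
restricted_pod() {
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: $2
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
EOF
}
```

Piping the result through kubectl apply --dry-run=server -f - asks the API server to run admission without creating anything, so you can see whether a guard would still refuse the pod. The image itself must be able to run as a non-root user, and the signing and registry rules in the other rows still apply.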

In plain English: The fortress refuses anything that doesn’t meet its rules. “Denied” means it’s working. The fix is almost never to weaken the rule; it is to bring your workload up to the standard.

gVisor pods won’t start

On the node (root)
sudo k3s kubectl get runtimeclass            # is 'gvisor' present?
sudo journalctl -u k3s -e | grep -i runsc    # what does containerd say?
  • On a Raspberry Pi: gVisor needs a 48-bit-address kernel. Raspberry Pi OS ships a 39-bit kernel and gVisor will refuse. Switch the Pi to Ubuntu Server for gVisor workloads (see Hardware).
  • Inside a VM: make sure runsc uses the systrap platform, not KVM (Layer 3 installs it this way by default).
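
To check the address-size point on an arm64 board before switching operating systems, you can read the kernel's build configuration. The helper below is a sketch with a hypothetical name (va_bits); it relies on the kernel config symbol CONFIG_ARM64_VA_BITS, and where the config file lives varies by distribution.

```shell
#!/bin/sh
# va_bits: reads a kernel config on stdin and prints the value of
# CONFIG_ARM64_VA_BITS (for example 39 or 48). Prints nothing if the symbol
# is absent, as on a non-arm64 kernel.
va_bits() {
    sed -n 's/^CONFIG_ARM64_VA_BITS=\([0-9][0-9]*\)$/\1/p'
}
```

On distributions that install the config under /boot, va_bits < "/boot/config-$(uname -r)" is enough; on kernels that expose /proc/config.gz, use zcat /proc/config.gz | va_bits. A result of 39 means gVisor will refuse to run on that kernel.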

Storage / the USB drive

On the host (root)
lsblk -o NAME,FSTYPE,MOUNTPOINT          # is the LUKS device open and mounted?
sudo cryptsetup status usb-data          # is the encrypted mapping active?
  • Won’t auto-unlock after reboot: TPM unlock only works if the boot measurements (PCRs) match. After a firmware/kernel update they can change — re-enrol with 12-tpm-luks-enroll.sh, or unlock manually with your passphrase.
  • Longhorn won’t start on a Pi 4: it needs ~4 GB RAM it doesn’t have. Use TopoLVM instead — 24-storage.sh selects it automatically on low-RAM nodes.
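
If you want the cryptsetup check in a script, for example to decide whether a manual unlock is needed after a reboot, the first line of its output is enough. This sketch uses a hypothetical helper name (mapping_state) and assumes the usual wording of cryptsetup status output.

```shell
#!/bin/sh
# mapping_state: reads `cryptsetup status <name>` output on stdin and prints a
# one-line summary based on its first line.
mapping_state() {
    first=$(head -n 1)
    case "$first" in
        *"is active and is in use"*) echo "open and in use" ;;
        *"is active"*)               echo "open but not in use" ;;
        *"is inactive"*)             echo "locked" ;;
        *)                           echo "unknown" ;;
    esac
}
```

Example: sudo cryptsetup status usb-data | mapping_state. "locked" after a reboot is the case the auto-unlock bullet above describes; "open but not in use" usually means the mapping was unlocked but the filesystem was never mounted.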

The monitoring stack is eating the machine

On small nodes the full monitoring tier can starve workloads.

Check pressure
kubectl top pods -n monitoring --sort-by=memory | head   # heaviest pods first

Fix: switch to the light tier (VictoriaMetrics + Fluent Bit) — re-run 25-observability.sh with the light-tier values, or set tighter resource limits. On a Raspberry Pi the light tier is selected automatically.
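
To judge whether the tier is too heavy for the node, it can help to see the namespace's total memory rather than a per-pod list. The helper below is a sketch with a hypothetical name (total_mem_mi); it assumes every value in the MEMORY column carries the Mi suffix, which is how kubectl top normally prints it.

```shell
#!/bin/sh
# total_mem_mi: reads `kubectl top pods --no-headers` output on stdin and
# prints the sum of the MEMORY column in Mi. Assumes every value ends in "Mi".
total_mem_mi() {
    awk '{ sub(/Mi$/, "", $3); total += $3 } END { print total + 0 }'
}
```

Example: kubectl top pods -n monitoring --no-headers | total_mem_mi. Compare the result with the node's memory from kubectl top node before deciding between tighter limits and the light tier.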

I think I’ve been compromised

Stay calm and follow the incident steps from Operations:

  1. Snapshot first: sudo k3s etcd-snapshot save --name incident.
  2. Look at the evidence: Tetragon/Falco events in Loki, and the API audit log — they tell you what ran and who called the API.
  3. Contain: kubectl cordon + drain the node; cut its network if needed.
  4. Rebuild, don’t clean: wipe the node and rebuild from the runbook. The immutable design makes this a matter of minutes. Rotate every credential that could have been exposed.
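
Under stress it is easy to run these steps out of order, so one option is to print the first commands for review before executing anything. The sketch below does only that. The function name incident_plan is hypothetical, and the drain flags (--ignore-daemonsets, --delete-emptydir-data) are an assumption about what a typical node needs, not something the runbook specifies.

```shell
#!/bin/sh
# incident_plan NODE: prints the snapshot and containment commands in the
# order given in the steps above. It runs nothing itself.
incident_plan() {
    node=$1
    cat <<EOF
sudo k3s etcd-snapshot save --name incident
kubectl cordon $node
kubectl drain $node --ignore-daemonsets --delete-emptydir-data
EOF
}
```

Example: incident_plan <node-name>, then run each printed line by hand once you have read it. Reviewing the evidence, cutting the network and the rebuild itself stay manual and follow Operations.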

Frequently asked questions

Q: Do I really need the VM layer? On a capable x86 machine, yes — it’s a cheap, huge isolation win. On a Raspberry Pi it’s impossible (no nested virtualization), so we run on bare metal and rely harder on the other six layers.

Q: Can I use Docker instead of containerd? No — k3s ships containerd, and the sandbox/runtime hardening is built around it. That’s a feature, not a limitation.

Q: Is this overkill for a homelab? Completely, and that’s the point — it’s a “how secure can it get” build. Every layer is optional; you can stop after any of them and still be far ahead of a default install.

Q: Will updates break it? Rarely, because the OS is immutable (atomic rollback) and the config is in Git (reproducible). Read release notes, update one node at a time, and keep a recent etcd snapshot.

Q: How do I add a second machine? See “Adding a node by hand” in Operations.

What this chapter bought you

A fast path from “it’s broken” to “oh, that guard is just doing its job — here’s how to tell it yes.” In a fortress, most failures are the walls working. Now you can read them.