Troubleshooting & FAQ

When something won’t work, start here. Symptom → cause → fix, in plain language.

Hardened systems fail closed — when something’s misconfigured, it blocks rather than opens. That’s safer, but it means “it’s not working” usually means “a guard is doing its job and you haven’t told it to allow this yet.” This chapter maps the common symptoms to their real cause.

Tip: Two commands solve half of everything: `kubectl get events -A --sort-by=.lastTimestamp | tail -20` (what just happened) and `kubectl describe pod <name>` (why this pod is unhappy). Read the message, not just the status.

I locked myself out of the host

Caution: This is the most common Day-0 mistake. Layer 0 sets the firewall to default-deny and SSH to keys-only. If you ran it over an SSH session without first allowing your key and your source IP, you’re locked out.

Fix: use the machine’s physical console (monitor + keyboard). You can’t recover from this remotely, which is exactly why the runbook tells you to do the first run at the console. To prevent it next time, confirm your SSH key works and your admin IP is in the nftables allow-list before enabling the firewall.
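The prevention step can be scripted as a quick preflight. This is a minimal sketch, not part of the runbook: the function name `ip_in_ruleset`, the address `203.0.113.7`, and the `admin@<host>` login are placeholders, and it assumes your admin IP appears in the ruleset as a literal address rather than inside a CIDR range.

```shell
# Sketch: preflight checks to run BEFORE enabling a default-deny firewall.
# ip_in_ruleset and the example address are hypothetical, not from the runbook.

# Succeeds if the given IP appears as a whole, literal address in the ruleset
# text on stdin. A CIDR range that merely contains the IP will NOT match, so
# check ranges by eye.
ip_in_ruleset() {
    grep -qwF -- "$1"
}

# On the host, at the console:
#   sudo nft list ruleset | ip_in_ruleset 203.0.113.7 && echo "admin IP allowed"
#
# From the admin laptop (must succeed without any password prompt):
#   ssh -o BatchMode=yes -o PasswordAuthentication=no admin@<host> true && echo "key works"
```

Only enable the firewall once both checks pass from the machine you actually administer from.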

kubectl says “connection refused” or “timeout”

Almost always: the WireGuard tunnel isn’t up. The API server (port 6443) is only reachable through the VPN — that’s by design.

On your admin laptop

sudo wg show                      # is the tunnel up and showing a recent handshake?
sudo wg-quick up wg0              # bring it up if not
ping 10.100.0.1                   # can you reach the host's WireGuard IP?
kubectl config view --minify | grep server   # does it point at the WireGuard IP, not a public one?

If wg show has no handshake: check the host’s firewall actually allows the WireGuard UDP port, and that your router forwards it.
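To make "recent handshake" concrete, the sketch below reads the output of `wg show wg0 latest-handshakes` (one `<public-key><TAB><unix-time>` line per peer, with `0` meaning no handshake yet) and labels each peer. The function name `handshake_report` and the 180-second staleness threshold are assumptions chosen for illustration, not values from this guide.

```shell
# Sketch: label each WireGuard peer by handshake age.
# stdin : output of `wg show wg0 latest-handshakes`
# $1    : current unix time
# $2    : staleness threshold in seconds (assumed default: 180)
handshake_report() {
    awk -v now="$1" -v max="${2:-180}" '
        $2 == 0          { print $1, "never"; next }                  # no handshake yet
        (now - $2) > max { print $1, "stale", (now - $2) "s"; next }  # tunnel likely down
                         { print $1, "ok", (now - $2) "s" }
    '
}

# Usage on the admin laptop:
#   sudo wg show wg0 latest-handshakes | handshake_report "$(date +%s)"
```

A peer reported as `never` usually points at the firewall or port-forwarding checks above; `stale` more often means the tunnel was up and has since dropped.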

A pod is stuck `Pending` or `ContainerCreating`

Find the real reason

kubectl describe pod <name> | sed -n '/Events:/,$p'

Message contains	Cause	Fix
`FailedScheduling ... Insufficient cpu/memory`	Node is full	Lower requests, or add a node
`failed to provision volume`	StorageClass/PVC issue	Check Layer 5; is the USB unlocked & mounted?
`failed to create pod sandbox`	Runtime/containerd issue	Check `runtimeClassName` exists; see gVisor below
`network: ... not allowed`	Default-deny network policy	Add an allow rule (Layer 4 / Operations)
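If you look these up often, the table can be turned into a small filter. This is only a sketch: `classify_pending` is a hypothetical helper, and the patterns are the message fragments from the table above, so real event text may need looser matching.

```shell
# Sketch: map pod event messages (stdin) to the causes in the table above.
# Lines that match none of the known fragments are silently skipped.
classify_pending() {
    while IFS= read -r line; do
        case "$line" in
            *FailedScheduling*Insufficient*)  echo "node full: lower requests or add a node" ;;
            *"failed to provision volume"*)   echo "storage: check StorageClass/PVC and the USB mount" ;;
            *"failed to create pod sandbox"*) echo "runtime: check runtimeClassName and gVisor" ;;
            *network:*"not allowed"*)         echo "network policy: add an allow rule" ;;
        esac
    done
}

# Usage:
#   kubectl describe pod <name> | sed -n '/Events:/,$p' | classify_pending
```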

My pod was REJECTED at creation (admission denied)

This is a success, not a bug — a guard refused a pod that breaks the rules.

See exactly which rule

kubectl get events -A | grep -iE 'denied|violat'

The error mentions	Which guard	What to do
`PodSecurity ... restricted`	Pod Security Admission (L2)	Add the restricted `securityContext` — start from `hardened-pod-template.yaml`
`Kyverno ... disallow-privileged` etc.	Kyverno policy (✦)	Remove the privileged/hostPath setting; you almost never need it
`validation failure ... image is not signed`	cosign verifyImages (✦)	Sign the image, or use one from an allowed, signed source
`registry ... not allowed`	restrict-registries (✦)	Push to your allow-listed registry, or add the registry to the policy

In plain English: The fortress refuses anything that doesn’t meet its rules. “Denied” means it’s working. The fix is almost never to weaken the rule; it is to bring your workload up to the standard.
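You can ask the guards for their verdict before creating anything by using a server-side dry run, which sends the manifest through admission without persisting it. The sketch below wraps that in a hypothetical helper, `would_admit`; it assumes the policy webhooks in your cluster support dry-run requests, and note that a non-zero exit also covers plain connection errors, not only denials.

```shell
# Sketch: preview the admission decision for a manifest without creating it.
# $1 = path to a manifest file. Returns 0 if accepted, 1 if rejected or failed.
would_admit() {
    if out=$(kubectl apply --dry-run=server -f "$1" 2>&1); then
        echo "admitted: $1"
    else
        # The server's message names the guard and the rule that was violated.
        echo "denied: $out"
        return 1
    fi
}

# Usage:
#   would_admit my-pod.yaml
```

Fix the manifest until this reports `admitted`, then apply it for real.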

gVisor pods won’t start

On the node (root)

sudo k3s kubectl get runtimeclass            # is 'gvisor' present?
sudo journalctl -u k3s -e | grep -i runsc    # what does containerd say?

On a Raspberry Pi: gVisor needs a 48-bit-address kernel. Raspberry Pi OS ships a 39-bit kernel and gVisor will refuse. Switch the Pi to Ubuntu Server for gVisor workloads (see Hardware).
Inside a VM: make sure runsc uses the systrap platform, not KVM (Layer 3 installs it this way by default).

Storage / the USB drive

On the host (root)

lsblk -o NAME,FSTYPE,MOUNTPOINT          # is the LUKS device open and mounted?
sudo cryptsetup status usb-data          # is the encrypted mapping active?

Won’t auto-unlock after reboot: TPM unlock only works if the boot measurements (PCRs) match. After a firmware/kernel update they can change — re-enrol with 12-tpm-luks-enroll.sh, or unlock manually with your passphrase.
Longhorn won’t start on a Pi 4: it needs ~4 GB RAM it doesn’t have. Use TopoLVM instead — 24-storage.sh selects it automatically on low-RAM nodes.
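The two storage checks above can be folded into one answer: locked, open but not mounted, or mounted. This is a sketch under assumptions: `luks_state` is a hypothetical helper, and it assumes the filesystem sits directly on the `usb-data` mapping (if LVM is layered on top, the mapping itself will show no mount point).

```shell
# Sketch: summarise the state of a LUKS mapping.
# stdin : output of `lsblk -rno NAME,TYPE,MOUNTPOINT`
# $1    : mapping name (usb-data in this chapter)
luks_state() {
    awk -v name="$1" '
        $1 == name && $2 == "crypt" { found = 1; mp = $3 }
        END {
            if (!found)        print "locked"             # no active mapping by that name
            else if (mp == "") print "open, not mounted"
            else               print "mounted at " mp
        }
    '
}

# Usage on the host:
#   lsblk -rno NAME,TYPE,MOUNTPOINT | luks_state usb-data
```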

The monitoring stack is eating the machine

On small nodes the full monitoring tier can starve workloads.

Check pressure

kubectl top pods -n monitoring --sort-by=memory | head

Fix: switch to the light tier (VictoriaMetrics + Fluent Bit) — re-run 25-observability.sh with the light-tier values, or set tighter resource limits. On a Raspberry Pi the light tier is selected automatically.
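To judge whether the stack is too heavy for the node, it helps to see its total memory use as one number. The sketch below sums the memory column of `kubectl top pods`; `total_mem_mi` is a hypothetical helper, and it only understands the `Ki`, `Mi`, and `Gi` suffixes (other values are ignored).

```shell
# Sketch: total memory used by a namespace, in Mi.
# stdin : output of `kubectl top pods -n monitoring --no-headers`
#         (columns: NAME  CPU  MEMORY)
total_mem_mi() {
    awk '
        { v = $3 }
        v ~ /Gi$/ { sub(/Gi$/, "", v); sum += v * 1024; next }
        v ~ /Mi$/ { sub(/Mi$/, "", v); sum += v;        next }
        v ~ /Ki$/ { sub(/Ki$/, "", v); sum += v / 1024; next }
        END { printf "%d Mi\n", sum }
    '
}

# Usage:
#   kubectl top pods -n monitoring --no-headers | total_mem_mi
```

Compare the result with the node's physical RAM; if monitoring takes a large share, the light tier is the better fit.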

I think I’ve been compromised

Stay calm and follow the incident steps from Operations:

1. Snapshot first: `sudo k3s etcd-snapshot save --name incident`
2. Look at the evidence: Tetragon/Falco events in Loki and the API audit log tell you what ran and who called the API.
3. Contain: `kubectl cordon <node>`, then `kubectl drain <node>`; cut the node’s network if needed.
4. Rebuild, don’t clean: wipe the node and rebuild from the runbook. The immutable design makes this a matter of minutes. Rotate every credential that could have been exposed.
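Under stress it is easy to run these out of order, so the snapshot and containment steps can be kept in a small script. The sketch below is an illustration only: `contain_node`, the `RUN` switch, and the drain flags are assumptions rather than the Operations procedure, and it glosses over the fact that the snapshot runs on the k3s server while `kubectl` may run from your laptop. By default it only prints the commands.

```shell
# Sketch: snapshot first, then cordon and drain the suspect node.
# $1 = node name. Prints the commands unless RUN=1 is set in the environment.
contain_node() {
    node="$1"
    run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "+ $*"; fi; }

    run sudo k3s etcd-snapshot save --name incident &&
    run kubectl cordon "$node" &&
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
}

# Usage:
#   contain_node <node>          # preview the commands
#   RUN=1 contain_node <node>    # execute them; stops at the first failure
```

Review the evidence before draining if you need the running pods for forensics, since draining evicts them.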

Frequently asked questions

Q: Do I really need the VM layer? On a capable x86 machine, yes — it’s a cheap, huge isolation win. On a Raspberry Pi it’s impossible (no nested virtualization), so we run on bare metal and rely harder on the other six layers.

Q: Can I use Docker instead of containerd? No — k3s ships containerd, and the sandbox/runtime hardening is built around it. That’s a feature, not a limitation.

Q: Is this overkill for a homelab? Completely, and that’s the point — it’s a “how secure can it get” build. Every layer is optional; you can stop after any of them and still be far ahead of a default install.

Q: Will updates break it? Rarely, because the OS is immutable (atomic rollback) and the config is in Git (reproducible). Read release notes, update one node at a time, and keep a recent etcd snapshot.

Q: How do I add a second machine? See “Adding a node by hand” in Operations.

What this chapter bought you

A fast path from “it’s broken” to “oh, that guard is just doing its job — here’s how to tell it yes.” In a fortress, most failures are the walls working. Now you can read them.