See Everything, Always
Layer 6 wires up metrics, logs, eBPF-powered runtime threat detection, and compliance scanning — so nothing hides on your cluster.
A locked-down cluster with no visibility is just a locked box you can’t see inside. This layer adds three things: a dashboard that shows what every workload is doing right now; a log trail that records what happened and when; and a trip-wire that watches every system call your containers make and can literally kill a malicious process before it finishes. Then we bolt on automated scanners that report CVEs and mis-configurations every few hours. If something goes wrong — at 3 am — an alert lands on your phone.
What you’re building
Two tiers, one architecture:
| Signal type | Where it goes | Who queries it |
|---|---|---|
| Metrics (CPU, memory, Kubernetes state, Falco event counts) | Prometheus or VictoriaMetrics | Grafana dashboards + Alertmanager rules |
| Logs (container stdout, k3s audit log, host journal) | Loki | Grafana Explore + Grafana alert rules |
| eBPF runtime events (syscalls, network connects) | Tetragon JSON → Loki or Falco → Loki | Grafana Falco dashboard |
| Vulnerability & compliance reports | Trivy Operator CRDs | kubectl get vulnerabilityreports, Grafana |
Two-tier decision
Pick your tier before running the install script. The script auto-detects, but you can force it.
| Criterion | Full tier | Light tier |
|---|---|---|
| RAM available for monitoring | ≥ 2 GB headroom | < 2 GB headroom |
| Node target | 8 GB+ x86/arm64 | Pi 4 / 4 GB arm64 |
| Metrics engine | kube-prometheus-stack (Prometheus 3.x + Alertmanager + Grafana 11.x + node-exporter + kube-state-metrics) | VictoriaMetrics single (vmsingle + vmagent + Grafana) |
| Log collector | Grafana Alloy DaemonSet (replaces deprecated Promtail; ships logs + metrics + traces) | Fluent Bit DaemonSet (C binary, ~15 MB RAM) |
| Runtime security | Tetragon (Cilium CNI path, PRIMARY) or Falco (Flannel/any CNI) | Falco only (lean rules, ERROR+ priority) |
| Posture scanning | Trivy Operator + Kubescape Operator + kube-bench weekly | Trivy Operator + kube-bench weekly |
| Total monitoring RAM | ~3–4 GB peak | ~800 MB–1.2 GB |
Promtail is end-of-life (EOL March 2026). Do not install it. The light tier uses Fluent Bit; the full tier uses Grafana Alloy.
The metrics pipeline
What gets scraped
k3s exposes a single combined Prometheus endpoint on each node at :10250/metrics. It aggregates kubelet, scheduler, controller-manager, and kube-proxy metrics in one place — you don’t need four separate scrape targets as you would on vanilla Kubernetes.
Think of the metrics endpoint as a speedometer you can plug any gauge into. Prometheus is the recorder that plugs in every 15–30 seconds and writes the reading to disk. Grafana is the dashboard that turns those recordings into charts you can actually read.
Full tier — kube-prometheus-stack
kube-prometheus-stack (chart 86.x) bundles Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics into a single Helm release. It is the most battle-tested option with the largest library of pre-built alert rules and dashboards.
Resource reality on a capable node: Prometheus needs 1–2 GB RAM for 30–60 scrape targets; kube-state-metrics ~150 MB; node-exporter ~50 MB; Grafana ~200 MB.
k3s-specific: k3s does not expose separate etcd, controller-manager, or scheduler endpoints — disable those scrapers in the Helm values or Prometheus will log unreachable-target warnings forever.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install kube-prom prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--version 86.x \
-f /path/to/kube-prometheus-stack-values.yaml
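A minimal sketch of the k3s-specific part of the values file referenced above. It assumes the chart's standard top-level toggles (kubeEtcd, kubeControllerManager, kubeScheduler); check the key names against the chart version you actually install.

```yaml
# kube-prometheus-stack-values.yaml (fragment, illustrative)
# k3s does not expose these components on separate endpoints,
# so their dedicated scrape targets would stay unreachable.
kubeEtcd:
  enabled: false
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
```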
Light tier — VictoriaMetrics single
VictoriaMetrics (vmsingle) uses up to 7× less RAM than Prometheus for the same number of targets. On a Pi 4 it runs comfortably in 256 MB. The victoria-metrics-k8s-stack chart is a drop-in replacement that reuses the same PrometheusRule CRDs and Grafana dashboards.
VictoriaMetrics is a more fuel-efficient engine doing the same job. Same dashboard, same alert rules — just lighter on RAM, which matters when your node only has 4 GB total.
helm repo add vm https://victoriametrics.github.io/helm-charts/
helm repo update
helm upgrade --install vmstack vm/victoria-metrics-k8s-stack \
--namespace monitoring --create-namespace \
--version 0.38.x \
-f /path/to/victoriametrics-values.yaml
The logging pipeline
All logs — container stdout, the k3s API audit log, and systemd journal entries from the host — funnel through a DaemonSet (one pod per node) that pushes to Loki, a log aggregation store. Grafana then queries Loki the same way it queries Prometheus.
Without a logging pipeline, when something goes wrong and the container has restarted, the logs are gone. Loki is a permanent filing cabinet. The DaemonSet is the filing clerk that stuffs every line in as it happens.
Loki — monolithic mode
Loki runs as a single pod in monolithic mode. It stores log chunks on a local PersistentVolume (local-path storage class). Grafana’s documentation positions monolithic mode for small volumes, up to roughly 20 GB of logs per day, which is still far more than a typical homelab generates.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install loki grafana/loki \
--namespace logging --create-namespace \
--version 6.x \
-f /path/to/loki-values.yaml
Install Loki first — Alloy and Fluent Bit need the Loki endpoint to be ready to accept pushes.
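A sketch of what loki-values.yaml might contain for monolithic mode on local-path storage. The key names follow the grafana/loki chart's 6.x layout as far as I know it; the schema start date and the PVC size are placeholders chosen for illustration, not values from this guide.

```yaml
# loki-values.yaml (fragment, illustrative)
deploymentMode: SingleBinary
loki:
  auth_enabled: false          # single-tenant homelab assumption
  commonConfig:
    replication_factor: 1      # one pod, so no replication
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: "2024-01-01"     # placeholder start date
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
singleBinary:
  replicas: 1
  persistence:
    storageClass: local-path
    size: 20Gi                 # placeholder; size to your retention
# Scale the distributed-mode components to zero
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0
```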
Full tier — Grafana Alloy (log collector)
Alloy runs as a DaemonSet and ships three streams to Loki:
- Container logs: everything written to stdout/stderr by every pod on the node (/var/log/pods/)
- k3s audit log: every API server request, recorded at /var/log/k3s/audit.log
- Host journal: systemd unit logs via journald
The audit log is the most security-critical stream. It records who ran kubectl exec, who created a privileged pod, who changed an RBAC role. Without it, you’re flying blind during an incident.
helm upgrade --install alloy grafana/alloy \
--namespace logging \
--version 0.12.x \
-f /path/to/alloy-config.yaml
Alloy needs read access to host paths. It runs as UID 0 (required for journald access) but is not privileged. It holds only the DAC_READ_SEARCH capability: enough to read log files, nothing more.
Light tier — Fluent Bit (log collector)
Fluent Bit is a C binary. Its DaemonSet pod idles at ~15 MB RAM. It ships the same three streams as Alloy but has no metrics or tracing capability — that’s fine on a Pi where we’re counting every megabyte.
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm upgrade --install fluent-bit fluent/fluent-bit \
--namespace logging \
--version 0.47.x \
-f /path/to/fluent-bit-values.yaml
Runtime security — the trip-wire layer
Metrics and logs tell you what was happening. Runtime security watches what your containers are doing right now at the kernel level, using eBPF — a technology that lets programs run safely inside the Linux kernel without a kernel module.
Every action a container takes — opening a file, making a network connection, spawning a process — goes through the Linux kernel. eBPF lets us put a camera inside the kernel and watch every one of those actions in real time. If a container starts doing something it has no business doing — like opening a shell or connecting outbound to an unknown IP — we see it immediately. With Tetragon, we can also stop it in its tracks before the action completes.
Tetragon (primary — Cilium CNI)
If your cluster uses Cilium as the network plugin (the hardened default in this guide), deploy Tetragon. Tetragon and Cilium share the same eBPF kernel subsystem, which means:
- Lower total overhead than running two separate eBPF agents
- Tetragon can SIGKILL a malicious process in-kernel, before the syscall completes — a reverse shell is an aborted connection, not a live incident
- Network policy violations and runtime violations come from one coherent audit trail
helm repo add cilium https://helm.cilium.io
helm repo update
helm upgrade --install tetragon cilium/tetragon \
--namespace kube-system \
--version 1.3.x \
-f /path/to/tetragon-values.yaml
kubectl apply -f /path/to/tetragon-tracing-policies.yaml # TracingPolicy CRs live in their own manifest (placeholder name), not in the Helm values file
Tetragon emits JSON events to stdout, picked up by Alloy/Fluent Bit and shipped to Loki. No sidecar required.
What Tetragon detects and enforces:
| TracingPolicy | What it catches | Action |
|---|---|---|
| detect-reverse-shell | Shell binary (bash/sh) making an outbound TCP connection | SIGKILL |
| detect-sensitive-file-write | Writes to /etc, /bin, /usr/bin inside a container | Alert + optional SIGKILL |
| detect-secret-read | Container reading /var/run/secrets/kubernetes.io/serviceaccount/token | Alert |
| Process execution tracking | Every exec() call cluster-wide | Alert |
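To make the first row concrete, here is a hedged sketch of what a reverse-shell TracingPolicy could look like. It hooks the kernel's tcp_connect function and kills the process when the calling binary is a shell. The binary paths are assumptions (images often ship shells under /usr/bin as well), and this is not necessarily the policy this guide ships.

```yaml
# Illustrative TracingPolicy, not the guide's actual manifest
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: detect-reverse-shell
spec:
  kprobes:
    - call: "tcp_connect"      # kernel function behind outbound TCP connects
      syscall: false
      args:
        - index: 0
          type: "sock"
      selectors:
        - matchBinaries:
            - operator: "In"
              values:          # assumed shell paths; extend for your images
                - "/bin/sh"
                - "/bin/bash"
          matchActions:
            - action: Sigkill  # terminate the process in-kernel
```

Start with the matchActions block removed (alert-only) and enable Sigkill once you have confirmed that no legitimate workload shells out to the network.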
Falco (alternative — Flannel/non-Cilium CNI, or documented fallback)
If you’re on Flannel (the k3s default CNI) or are not using Cilium, deploy Falco instead. Falco is the CNCF-graduated standard for runtime security, with 25 stable detection rules out of the box and Falcosidekick to route alerts to 50+ destinations.
Running both Tetragon and Falco simultaneously doubles eBPF overhead. On a capable 8 GB+ node it is possible if you want Falco’s mature rules library alongside Tetragon’s enforcement. On a Pi, run exactly one.
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm upgrade --install falco falcosecurity/falco \
--namespace falco --create-namespace \
--version 9.x \
-f /path/to/falco-values.yaml
k3s-specific: Falco must point at the k3s containerd socket (/run/k3s/containerd/containerd.sock), not the default Docker socket.
What Falco’s built-in rules detect (25 stable rules, highlights):
| Rule | Severity |
|---|---|
| Terminal shell opened in container | Warning |
| Container launched with full privileges | Critical |
| Write to sensitive directory (/etc, /bin, /sbin) | Error |
| Package manager (apt, yum) run inside container | Warning |
| Outbound connection to EC2 instance metadata service | Warning |
| Netcat spawned in container | Error |
| RBAC secrets read from container | Warning |
| nsenter called inside a container (namespace escape attempt) | Critical |
Falco exposes Prometheus metrics natively on port 8765 since version 0.38. Do not install falco-exporter — it is deprecated.
The audit log — your legal and forensic trail
k3s does not enable API audit logging by default. Enable it by adding these flags to the k3s server config, then restarting k3s:
kube-apiserver-arg:
- "audit-log-path=/var/log/k3s/audit.log"
- "audit-policy-file=/etc/rancher/k3s/audit-policy.yaml"
- "audit-log-maxage=30"
- "audit-log-maxbackup=10"
- "audit-log-maxsize=100"
The audit policy (at /etc/rancher/k3s/audit-policy.yaml) controls what gets logged. The policy in this layer:
- Logs every RBAC change (create/update/delete on roles, bindings)
- Logs kubectl exec, attach, and port-forward at request level
- Logs pod and deployment mutations (create/delete/patch)
- Logs secret and ConfigMap access at metadata level (records who accessed what, never the values)
- Suppresses noisy read-only get/list/watch on events
Log secret metadata only: the audit log records which secret was accessed and by whom, but never the secret’s value. The level: Metadata setting for secrets is intentional and non-negotiable.
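The policy described above could be sketched roughly as follows. The rule order matters because the first matching rule wins. The RequestResponse level for RBAC changes and the omitStages entry are my assumptions; the guide only states the levels for exec/attach/port-forward and for secrets and ConfigMaps.

```yaml
# /etc/rancher/k3s/audit-policy.yaml (illustrative sketch)
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - RequestReceived            # assumption: skip the duplicate early stage
rules:
  # Suppress noisy read-only traffic on events
  - level: None
    verbs: ["get", "list", "watch"]
    resources:
      - group: ""
        resources: ["events"]
      - group: "events.k8s.io"
        resources: ["events"]
  # Secrets and ConfigMaps: who touched what, never the payload
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Interactive access into pods
  - level: Request
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach", "pods/portforward"]
  # RBAC changes (level is an assumption)
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  # Pod and deployment mutations
  - level: Request
    verbs: ["create", "delete", "patch"]
    resources:
      - group: ""
        resources: ["pods"]
      - group: "apps"
        resources: ["deployments"]
```

Requests that match none of these rules are not logged at all, which keeps the file small on a single-node cluster.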
Compliance and vulnerability scanning
Trivy Operator — both tiers
Trivy Operator (chart 0.33.x) runs as a controller inside the cluster. When a new pod starts, it automatically scans the container image for CVEs and checks the workload configuration for security mis-configurations. Results are stored as Kubernetes custom resources (CRDs) queryable with kubectl.
Every time a new container starts, Trivy Operator checks its ingredients list for known poison (CVEs) and its configuration for obvious mistakes (like “runs as root” or “no CPU limits”). The results stay in the cluster as records you can query or alert on.
Reports it generates:
| CRD | What it captures |
|---|---|
| VulnerabilityReport | CVEs by severity per container image |
| ConfigAuditReport | Missing securityContext, host path mounts, privilege escalation flags |
| ExposedSecretReport | Credentials baked into images or environment variables |
| RbacAssessmentReport | Over-permissive service accounts and role bindings |
| SbomReport | Full software bill of materials |
| ClusterComplianceReport | NSA/CISA Hardening Guidance, CIS Benchmark |
helm repo add aqua https://aquasecurity.github.io/helm-charts/
helm repo update
helm upgrade --install trivy-operator aqua/trivy-operator \
--namespace trivy-system --create-namespace \
--version 0.33.x \
-f /path/to/trivy-operator-values.yaml
Trivy Operator does continuous scanning (watching for workload changes). Image signing and supply-chain verification live in the Supply Chain chapter (✦) — this layer owns runtime scanning only.
Kubescape Operator — full tier only
Kubescape (chart 1.22.x, CNCF incubating) scans against NSA/CISA, MITRE ATT&CK, CIS, and SOC 2 frameworks continuously. It is heavier than Trivy Operator (~150–300 MB), so it is full-tier only.
k3s-specific: k3s ships its own runc binary at a non-standard path. You must pass the override at install time or Kubescape cannot inspect container runtime configuration.
helm repo add kubescape https://kubescape.github.io/helm-charts/
helm repo update
helm upgrade --install kubescape kubescape/kubescape-operator \
--namespace kubescape --create-namespace \
--version 1.22.x \
--set global.overrideRuntimePath="/host/var/lib/rancher/k3s/data/current/bin/runc" \
-f /path/to/kubescape-values.yaml
kube-bench — both tiers (weekly CronJob)
kube-bench runs the CIS Kubernetes Benchmark as a batch job. It natively understands k3s. Run it as a weekly CronJob so you catch drift from the CIS baseline without the constant overhead of a running controller.
kubectl apply -f /path/to/kube-bench-cronjob.yaml
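# Results are in the Job's pod log. Jobs spawned by a CronJob get a generated name suffix, so look the name up first:
kubectl get jobs -n kube-system
kubectl logs job/<job-name> -n kube-system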
The CronJob runs at 02:00 Sunday; kube-bench auto-detects the correct CIS profile for the running k3s version and writes JSON output to the pod log.
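A sketch of what kube-bench-cronjob.yaml might look like for the schedule described above. The host paths mounted into the pod are assumptions about where k3s keeps its data and config; the image tag is deliberately left for you to pin.

```yaml
# kube-bench-cronjob.yaml (illustrative sketch)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kube-bench
  namespace: kube-system
spec:
  schedule: "0 2 * * 0"          # 02:00 every Sunday
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true          # kube-bench inspects host processes
          restartPolicy: Never
          containers:
            - name: kube-bench
              image: docker.io/aquasec/kube-bench:latest  # pin a real tag
              command: ["kube-bench", "--json"]
              volumeMounts:
                - name: var-lib-rancher
                  mountPath: /var/lib/rancher
                  readOnly: true
                - name: etc-rancher
                  mountPath: /etc/rancher
                  readOnly: true
          volumes:
            - name: var-lib-rancher   # assumed k3s data directory
              hostPath:
                path: /var/lib/rancher
            - name: etc-rancher       # assumed k3s config directory
              hostPath:
                path: /etc/rancher
```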
Dashboards — what to provision
Provision dashboards as ConfigMaps labelled grafana_dashboard: "1". Grafana’s sidecar picks them up automatically.
| Dashboard | Grafana ID | What it shows |
|---|---|---|
| K3s Monitoring | 19972 | Cluster-level CPU / memory / disk / pod counts |
| Kubernetes Views / K3s Cluster | 16450 | Per-node and per-container view via cAdvisor |
| Node Exporter Full | 1860 | Deep OS-level metrics (disk I/O, network, filesystem) |
| Falco Events | 17319 | Runtime security events from Falco → Loki |
Import dashboards from the Grafana UI under Dashboards → Import. Paste the numeric ID and select your Prometheus/VictoriaMetrics data source. Grafana fetches the dashboard definition from grafana.com.
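For the ConfigMap route, a minimal sketch of a provisioned dashboard. The ConfigMap name and the tiny dashboard JSON are placeholders; in practice you would paste the exported JSON of a real dashboard under the data key.

```yaml
# Illustrative dashboard ConfigMap picked up by the Grafana sidecar
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-dashboard        # hypothetical name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"       # label the sidecar watches for
data:
  example-dashboard.json: |
    {
      "title": "Example",
      "uid": "example",
      "panels": []
    }
```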
Alerting — what fires and where it goes
Alert destinations
Alertmanager (full tier) and Grafana Alerting (both tiers) route to:
- ntfy — push notifications to your phone (self-hosted or ntfy.sh)
- Discord — webhook to a private channel
- Email — SMTP (configure in Alertmanager global)
Critical alerts (severity: critical) fire immediately and go to all channels. Warning-level alerts are grouped, deduplicated, and fire at most once every 12 hours.
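The routing behaviour described above might be expressed in Alertmanager configuration roughly like this. Receiver names and webhook URLs are hypothetical, and the grouping intervals other than the 12-hour repeat are assumptions.

```yaml
# Alertmanager routing (illustrative sketch)
route:
  receiver: grouped-warnings
  group_by: ["alertname", "namespace"]
  routes:
    - matchers:
        - severity="critical"
      receiver: all-channels
      group_wait: 0s             # send immediately
      repeat_interval: 1h        # assumption
    - matchers:
        - severity="warning"
      receiver: grouped-warnings
      group_wait: 30s            # assumption
      group_interval: 5m         # assumption
      repeat_interval: 12h       # at most one repeat every 12 hours
receivers:
  - name: all-channels
    webhook_configs:
      - url: http://ntfy-bridge.monitoring.svc:8080/alerts   # hypothetical endpoint
  - name: grouped-warnings
    webhook_configs:
      - url: http://ntfy-bridge.monitoring.svc:8080/warnings # hypothetical endpoint
```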
Essential security alerts
These are the alerts that matter most. All of them are in alert-rules.yaml.
Node health:
| Alert | Fires when | Severity |
|---|---|---|
| NodeHighCPU | CPU > 90% for 15 minutes | Warning |
| NodeHighMemory | Memory > 90% for 5 minutes | Critical |
| NodeDiskFull | Filesystem < 10% free | Critical |
Kubernetes workloads:
| Alert | Fires when | Severity |
|---|---|---|
| PodCrashLooping | Pod in CrashLoopBackOff for 5 min | Critical |
| PodOOMKilled | Container terminated by OOM killer | Warning |
| DeploymentReplicasMismatch | Desired ≠ available replicas for 5 min | Warning |
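A few of the node and workload alerts above, sketched as a PrometheusRule. The expressions are one plausible way to write them using standard node-exporter and kube-state-metrics series; they are not copied from alert-rules.yaml. The release label assumes the chart's default rule selector and the kube-prom release name used earlier.

```yaml
# Illustrative PrometheusRule, not the guide's alert-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: health-alerts-example    # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prom           # assumption: matches the rule selector
spec:
  groups:
    - name: node-health
      rules:
        - alert: NodeHighMemory
          expr: |
            (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
          for: 5m
          labels:
            severity: critical
        - alert: NodeDiskFull
          expr: |
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
          labels:
            severity: critical
    - name: workloads
      rules:
        - alert: PodCrashLooping
          expr: |
            kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
          for: 5m
          labels:
            severity: critical
```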
Security (the ones that wake you up):
| Alert | Fires when | Severity |
|---|---|---|
| FalcoCriticalEvent | Falco fires a CRITICAL-priority rule | Critical |
| FalcoErrorEvent | Falco fires an ERROR-priority rule | Warning |
| TrivyCriticalVulnerabilities | Any CRITICAL CVE present in cluster images | Warning |
| KubectlExecDetected | kubectl exec into any pod (Loki rule) | Warning |
| RBACModification | Any RBAC role/binding created, updated, or deleted | Critical |
| PrivilegedPodCreated | Pod created with privileged: true | Critical |
| K3sAPIServerDown | k3s metrics endpoint unreachable for 2 min | Critical |
RBACModification and PrivilegedPodCreated are the two highest-signal security alerts. An attacker who has broken into your cluster will almost certainly need to escalate privileges, and both actions show up here immediately.
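The audit-log-driven alerts can be sketched as Loki ruler rules. This assumes the audit stream carries a job="k3s-audit" label (a hypothetical label set by the log collector) and relies on Loki's json parser flattening nested fields with underscores, so objectRef.subresource becomes objectRef_subresource.

```yaml
# Loki ruler rules (illustrative sketch)
groups:
  - name: k3s-audit
    rules:
      - alert: KubectlExecDetected
        expr: |
          sum(count_over_time(
            {job="k3s-audit"} | json | objectRef_subresource="exec" [5m]
          )) > 0
        labels:
          severity: warning
        annotations:
          summary: "kubectl exec seen in the API audit log"
      - alert: RBACModification
        expr: |
          sum(count_over_time(
            {job="k3s-audit"} | json
              | objectRef_apiGroup="rbac.authorization.k8s.io"
              | verb=~"create|update|patch|delete" [5m]
          )) > 0
        labels:
          severity: critical
        annotations:
          summary: "RBAC role or binding changed"
```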
Uptime Kuma — service reachability
Deploy Uptime Kuma as a StatefulSet for HTTP/TCP/cert-expiry monitoring. Point it at:
- Grafana (internal URL via WireGuard)
- Loki health endpoint
- Falco/Falcosidekick health endpoint
- The k3s API server (https://localhost:6443)
- Any application endpoints you care about
Uptime Kuma stores its state in SQLite on a 1 GiB PVC. It is not wired into Alertmanager — it has its own notification channels (configure ntfy/Discord there too).
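A minimal StatefulSet sketch for Uptime Kuma with the 1 GiB volume mentioned above. The namespace, labels, and image tag are assumptions; the container listens on port 3001 and keeps its SQLite database under /app/data.

```yaml
# Illustrative Uptime Kuma StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: uptime-kuma
  namespace: monitoring          # assumption
spec:
  serviceName: uptime-kuma       # expects a Service of this name to exist
  replicas: 1
  selector:
    matchLabels:
      app: uptime-kuma
  template:
    metadata:
      labels:
        app: uptime-kuma
    spec:
      containers:
        - name: uptime-kuma
          image: louislam/uptime-kuma:1   # pin a specific version in practice
          ports:
            - containerPort: 3001
          volumeMounts:
            - name: data
              mountPath: /app/data        # SQLite state lives here
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: local-path
        resources:
          requests:
            storage: 1Gi
```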
Hardening the monitoring stack itself
The monitoring stack is a high-value target. Prometheus has read access to every container’s metrics; Grafana holds credentials for all data sources; Loki holds your audit log. Compromise of the monitoring stack is effectively compromise of the cluster.
Never expose Prometheus, Alertmanager, Loki, Falcosidekick, or Tetragon endpoints outside the cluster. Expose only Grafana (via the ingress, behind WireGuard). Access everything else via kubectl port-forward when you need it from your workstation, never via NodePort or LoadBalancer.
Namespace isolation
All metrics components live in monitoring; all logging components in logging; Falco in falco; Trivy in trivy-system; Kubescape in kubescape. Tetragon is the exception: its chart installs into kube-system. None of the dedicated namespaces hosts application workloads.
NetworkPolicy isolation
Each monitoring component gets a NetworkPolicy that restricts ingress and egress to exactly what it needs:
- Grafana: ingress from ingress controller and home LAN CIDR only; egress to monitoring and logging namespaces + DNS
- Prometheus/VictoriaMetrics: egress to all namespaces on scrape ports; ingress from Grafana only
- Alertmanager: egress to configured webhook/SMTP endpoints only
- Loki: ingress from Alloy/Fluent Bit pods and from Grafana (which queries it) only; egress to Alertmanager (for ruler alerts)
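As one worked instance of the list above, here is a sketch of the Prometheus ingress rule. The pod labels are assumptions about what the charts set; confirm them with kubectl get pods --show-labels before applying.

```yaml
# Illustrative NetworkPolicy: only Grafana may query Prometheus
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-ingress-from-grafana   # hypothetical name
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus  # assumed pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:                    # same-namespace Grafana pods
            matchLabels:
              app.kubernetes.io/name: grafana   # assumed pod label
      ports:
        - protocol: TCP
          port: 9090
```

Because the policy selects the Prometheus pods and lists only one allowed source, all other ingress to those pods is dropped once a NetworkPolicy-enforcing CNI is in place.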
Secrets management
No passwords in Helm values files. Every credential is a Kubernetes Secret referenced by name:
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-secret
  namespace: monitoring
type: Opaque
stringData:
  admin-user: admin
  admin-password: "<generate with: openssl rand -base64 32>"
Reference in values: grafana.admin.existingSecret: grafana-admin-secret.
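In values-file form, that reference might look like the fragment below. The userKey and passwordKey entries spell out which Secret keys Grafana reads; they match the key names used in the Secret above.

```yaml
# kube-prometheus-stack-values.yaml (fragment, illustrative)
grafana:
  admin:
    existingSecret: grafana-admin-secret
    userKey: admin-user
    passwordKey: admin-password
```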
Grafana security settings
The Helm values in this layer enforce these Grafana security settings:
- Anonymous access: disabled (explicit, not left at default)
- Cookie flags: Secure + SameSite=Strict
- X-Content-Type-Options, X-XSS-Protection, and Content Security Policy: all enabled
- External snapshot sharing (snapshot.raintank.io): disabled
- Grafana version check / analytics phoning home: disabled
- Sign-up: disabled
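These settings map onto grafana.ini options, which the Grafana chart accepts under a grafana.ini key. A sketch, assuming Grafana is deployed as the grafana subchart of kube-prometheus-stack:

```yaml
# kube-prometheus-stack-values.yaml (fragment, illustrative)
grafana:
  grafana.ini:
    auth.anonymous:
      enabled: false
    security:
      cookie_secure: true
      cookie_samesite: strict
      x_content_type_options: true
      x_xss_protection: true
      content_security_policy: true
    snapshots:
      external_enabled: false
    analytics:
      reporting_enabled: false
      check_for_updates: false
    users:
      allow_sign_up: false
```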
Resource limits — prevent monitoring DoS
Without memory limits, a runaway Prometheus can OOM the node and take the cluster offline — the monitoring system becoming the attack vector. Every component in this layer has both requests and limits defined.
Total footprint:
| Tier | Monitoring RAM (peak) | Monitoring CPU (peak) |
|---|---|---|
| Full | ~3–4 GB | ~1.5–2 vCPU |
| Light | ~800 MB–1.2 GB | ~0.5 vCPU |
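A sketch of how requests and limits could be set for the two largest full-tier components. The numbers are illustrative and derived from the rough figures quoted earlier in this layer (Prometheus 1–2 GB, Grafana ~200 MB); tune them to what you observe.

```yaml
# kube-prometheus-stack-values.yaml (fragment, illustrative numbers)
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 250m
        memory: 1Gi
      limits:
        memory: 2Gi        # hard cap so Prometheus cannot OOM the node
grafana:
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      memory: 256Mi
```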
Reaching Grafana safely (over WireGuard)
Grafana has no public exposure. The correct access path:
- Your workstation connects to the cluster via WireGuard VPN (Layer 4 of this guide)
- Your browser hits https://grafana.internal.home, resolved via split-DNS to the cluster’s WireGuard IP
- The ingress controller terminates TLS (cert-manager + self-signed CA or Let’s Encrypt DNS-01)
- Grafana receives the request over plain HTTP on its cluster-internal port
If WireGuard is not connected, Grafana is unreachable. That is the design.
For quick one-off access without the ingress: kubectl -n monitoring port-forward svc/kube-prom-grafana 3000:80
then open http://localhost:3000. Close the port-forward when done.
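The ingress step in the access path above could be sketched as follows. The ClusterIssuer name and TLS secret name are hypothetical; the backend service name assumes the kube-prom release used in this layer.

```yaml
# Illustrative Grafana Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: homelab-ca   # hypothetical issuer name
spec:
  tls:
    - hosts:
        - grafana.internal.home
      secretName: grafana-tls                    # hypothetical secret name
  rules:
    - host: grafana.internal.home
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kube-prom-grafana
                port:
                  number: 80                     # plain HTTP inside the cluster
```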
Install order
The order matters — Loki must be ready before Alloy/Fluent Bit starts pushing, and Prometheus must be ready before alert rules are applied.
1. Create namespaces
2. Install Loki
3. Install Alloy (full) or Fluent Bit (light)
4. Install kube-prometheus-stack (full) or VictoriaMetrics (light)
5. Install Tetragon (Cilium path) or Falco (Flannel/any path)
6. Install Trivy Operator
7. Install Kubescape Operator (full only)
8. Apply kube-bench CronJob
9. Apply PrometheusRule alert-rules.yaml
10. Apply Loki alert ConfigMap
11. Deploy Uptime Kuma
The install script 25-observability.sh handles this in order, idempotently.
What this layer bought you
Visibility: Every container’s stdout, the k3s API audit trail, and host journal entries are in Loki, queryable up to 30 days back. Prometheus (or VictoriaMetrics) stores 15–30 days of cluster and node metrics.
Detection: Tetragon watches every syscall and network event in-kernel — a reverse shell is killed before the connection completes. Falco’s 25 stable rules cover the most common container escape and privilege escalation patterns. Both fire alerts to ntfy/Discord within seconds.
Compliance: Trivy Operator reports every CRITICAL/HIGH CVE in running images within minutes of a pod starting. kube-bench runs the CIS k3s benchmark weekly and flags any drift. On capable nodes, Kubescape adds MITRE ATT&CK and NSA/CISA framework coverage.
Alerting: Seven high-signal security alerts and six node/workload health alerts are wired up. RBACModification and PrivilegedPodCreated fire immediately on any permission escalation attempt. FalcoCriticalEvent fires the moment a container escape is attempted.
Attack surface: Grafana is the only monitoring component reachable outside the cluster, and only over WireGuard. Every other component is cluster-internal, NetworkPolicy-isolated, and resource-capped.