See Everything, Always

Layer 6 wires up metrics, logs, eBPF-powered runtime threat detection, and compliance scanning — so nothing hides on your cluster.

Three signal streams converge into one pane of glass — and alert your phone.

A locked-down cluster with no visibility is just a locked box you can’t see inside. This layer adds three things: a dashboard that shows what every workload is doing right now; a log trail that records what happened and when; and a trip-wire that watches every system call your containers make and can literally kill a malicious process before it finishes. Then we bolt on automated scanners that report CVEs and mis-configurations every few hours. If something goes wrong — at 3 am — an alert lands on your phone.

In plain English

What you’re building

Two tiers, one architecture:

Signal type	Where it goes	Who queries it
Metrics (CPU, memory, Kubernetes state, Falco event counts)	Prometheus or VictoriaMetrics	Grafana dashboards + Alertmanager rules
Logs (container stdout, k3s audit log, host journal)	Loki	Grafana Explore + Grafana alert rules
eBPF runtime events (syscalls, network connects)	Tetragon JSON → Loki or Falco → Loki	Grafana Falco dashboard
Vulnerability & compliance reports	Trivy Operator CRDs	`kubectl get vulnerabilityreports`, Grafana

Two-tier decision

Pick your tier before running the install script. The script auto-detects, but you can force it.

Criterion	Full tier	Light tier
RAM available for monitoring	≥ 2 GB headroom	< 2 GB headroom
Node target	8 GB+ x86/arm64	Pi 4 / 4 GB arm64
Metrics engine	kube-prometheus-stack (Prometheus 3.x + Alertmanager + Grafana 11.x + node-exporter + kube-state-metrics)	VictoriaMetrics single (vmsingle + vmagent + Grafana)
Log collector	Grafana Alloy DaemonSet (replaces deprecated Promtail; ships logs + metrics + traces)	Fluent Bit DaemonSet (C binary, ~15 MB RAM)
Runtime security	Tetragon (Cilium CNI path, PRIMARY) or Falco (Flannel/any CNI)	Falco only (lean rules, ERROR+ priority)
Posture scanning	Trivy Operator + Kubescape Operator + kube-bench weekly	Trivy Operator + kube-bench weekly
Total monitoring RAM	~3–4 GB peak	~800 MB–1.2 GB

Promtail is deprecated, with end-of-life in March 2026. Do not install it. The light tier uses Fluent Bit; the full tier uses Grafana Alloy.

Important

The metrics pipeline

What gets scraped

k3s exposes a single combined Prometheus endpoint on each node at :10250/metrics. It aggregates kubelet, scheduler, controller-manager, and kube-proxy metrics in one place — you don’t need four separate scrape targets as you would on vanilla Kubernetes.

Think of the metrics endpoint as a speedometer you can plug any gauge into. Prometheus is the recorder that plugs in every 15–30 seconds and writes the reading to disk. Grafana is the dashboard that turns those recordings into charts you can actually read.

In plain English

Full tier — kube-prometheus-stack

kube-prometheus-stack (chart 86.x) bundles Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics into a single Helm release. It is the most battle-tested option with the largest library of pre-built alert rules and dashboards.

Resource reality on a capable node: Prometheus needs 1–2 GB RAM for 30–60 scrape targets; kube-state-metrics ~150 MB; node-exporter ~50 MB; Grafana ~200 MB.

k3s-specific: k3s does not expose separate etcd, controller-manager, or scheduler endpoints — disable those scrapers in the Helm values or Prometheus will log unreachable-target warnings forever.
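
A minimal sketch of the relevant fragment of kube-prometheus-stack-values.yaml, assuming the chart's standard per-component toggles. Only the three scrapers named above are switched off; the kube-proxy toggle is an assumption and is left commented out.

```yaml
# kube-prometheus-stack-values.yaml (fragment)
# k3s runs these components inside one process, so the chart's dedicated
# scrape targets for them would point at endpoints that do not exist.
kubeEtcd:
  enabled: false
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
# Assumption: enable this block only if kube-proxy targets also show as down.
# kubeProxy:
#   enabled: false
```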

Install kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install kube-prom prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --version 86.x \
  -f /path/to/kube-prometheus-stack-values.yaml

Light tier — VictoriaMetrics single

VictoriaMetrics (vmsingle) uses up to 7× less RAM than Prometheus for the same number of targets. On a Pi 4 it runs comfortably in 256 MB. The victoria-metrics-k8s-stack chart is a drop-in replacement that reuses the same PrometheusRule CRDs and Grafana dashboards.

VictoriaMetrics is a more fuel-efficient engine doing the same job. Same dashboard, same alert rules — just lighter on RAM, which matters when your node only has 4 GB total.

In plain English

Install VictoriaMetrics stack

helm repo add vm https://victoriametrics.github.io/helm-charts/
helm repo update
helm upgrade --install vmstack vm/victoria-metrics-k8s-stack \
  --namespace monitoring --create-namespace \
  --version 0.38.x \
  -f /path/to/victoriametrics-values.yaml

The logging pipeline

All logs — container stdout, the k3s API audit log, and systemd journal entries from the host — funnel through a DaemonSet (one pod per node) that pushes to Loki, a log aggregation store. Grafana then queries Loki the same way it queries Prometheus.

Without a logging pipeline, when something goes wrong and the container has restarted, the logs are gone. Loki is a permanent filing cabinet. The DaemonSet is the filing clerk that stuffs every line in as it happens.

In plain English

Loki — monolithic mode

Loki runs as a single pod in monolithic mode and stores log chunks on a local PersistentVolume (local-path storage class). Grafana's documentation rates this mode for small read/write volumes, on the order of tens of gigabytes of logs per day, which is still far more than a typical homelab generates.
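
As a rough illustration, a loki-values.yaml for this mode might look like the sketch below. It assumes the grafana/loki 6.x chart layout; the schema start date and the 20Gi volume size are placeholder assumptions, not values from this guide.

```yaml
# loki-values.yaml (sketch for monolithic / single-binary mode)
deploymentMode: SingleBinary
loki:
  auth_enabled: false            # single-tenant homelab assumption
  commonConfig:
    replication_factor: 1        # one pod, so no replication
  storage:
    type: filesystem             # chunks live on the PersistentVolume
  schemaConfig:
    configs:
      - from: "2024-04-01"       # placeholder start date
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
singleBinary:
  replicas: 1
  persistence:
    storageClass: local-path
    size: 20Gi                   # placeholder; size to your retention
# Scale the distributed-mode components to zero.
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0
```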

Install Loki

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install loki grafana/loki \
  --namespace logging --create-namespace \
  --version 6.x \
  -f /path/to/loki-values.yaml

Install Loki first — Alloy and Fluent Bit need the Loki endpoint to be ready to accept pushes.

Full tier — Grafana Alloy (log collector)

Alloy runs as a DaemonSet and ships three streams to Loki:

Container logs — everything written to stdout/stderr by every pod on the node (/var/log/pods/)
k3s audit log — every API server request recorded at /var/log/k3s/audit.log
Host journal — systemd unit logs via journald

The audit log is the most security-critical stream. It records who ran kubectl exec, who created a privileged pod, who changed an RBAC role. Without it, you’re flying blind during an incident.

Install Grafana Alloy

helm upgrade --install alloy grafana/alloy \
  --namespace logging \
  --version 0.12.x \
  -f /path/to/alloy-config.yaml

Alloy needs read access to host paths. It runs as UID 0 (required for journald access) but is not privileged. It holds only the DAC_READ_SEARCH capability — enough to read log files, nothing more.

Hardened
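
The security posture described above could be expressed in the Alloy Helm values roughly as follows. This is a sketch: the key names follow the grafana/alloy chart as commonly documented, so check them against the values.yaml of the chart version you install.

```yaml
# alloy values (fragment), assuming the grafana/alloy chart layout
controller:
  type: daemonset                # one collector pod per node
alloy:
  mounts:
    varlog: true                 # mounts the host's /var/log read-only
  securityContext:
    runAsUser: 0                 # root UID, needed for journald access
    privileged: false
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
      add: ["DAC_READ_SEARCH"]   # read log files regardless of file mode
```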

Light tier — Fluent Bit (log collector)

Fluent Bit is a C binary. Its DaemonSet pod idles at ~15 MB RAM. It ships the same three streams as Alloy but has no metrics or tracing capability — that’s fine on a Pi where we’re counting every megabyte.

Install Fluent Bit

helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm upgrade --install fluent-bit fluent/fluent-bit \
  --namespace logging \
  --version 0.47.x \
  -f /path/to/fluent-bit-values.yaml

Runtime security — the trip-wire layer

Metrics and logs tell you what was happening. Runtime security watches what your containers are doing right now at the kernel level, using eBPF — a technology that lets programs run safely inside the Linux kernel without a kernel module.

Every action a container takes — opening a file, making a network connection, spawning a process — goes through the Linux kernel. eBPF lets us put a camera inside the kernel and watch every one of those actions in real time. If a container starts doing something it has no business doing — like opening a shell or connecting outbound to an unknown IP — we see it immediately. With Tetragon, we can also stop it in its tracks before the action completes.

In plain English

Tetragon (primary — Cilium CNI)

If your cluster uses Cilium as the network plugin (the hardened default in this guide), deploy Tetragon. Tetragon and Cilium share the same eBPF kernel subsystem, which means:

Lower total overhead than running two separate eBPF agents
Tetragon can SIGKILL a malicious process in-kernel, before the syscall completes — a reverse shell is an aborted connection, not a live incident
Network policy violations and runtime violations come from one coherent audit trail

Install Tetragon

helm repo add cilium https://helm.cilium.io
helm repo update
helm upgrade --install tetragon cilium/tetragon \
  --namespace kube-system \
  --version 1.3.x \
  -f /path/to/tetragon-values.yaml
kubectl apply -f /path/to/tetragon-tracing-policies.yaml  # TracingPolicy CRs are Kubernetes manifests, kept separate from the Helm values file (placeholder file name)

Tetragon emits JSON events to stdout, picked up by Alloy/Fluent Bit and shipped to Loki. No sidecar required.

What Tetragon detects and enforces:

TracingPolicy	What it catches	Action
`detect-reverse-shell`	Shell binary (`bash`/`sh`) making an outbound TCP connection	SIGKILL
`detect-sensitive-file-write`	Writes to `/etc`, `/bin`, `/usr/bin` inside a container	Alert + optional SIGKILL
`detect-secret-read`	Container reading `/var/run/secrets/kubernetes.io/serviceaccount/token`	Alert
Process execution tracking	Every `exec()` call cluster-wide	Alert
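
To make the first row of the table concrete, here is a hedged sketch of what a detect-reverse-shell policy could look like. It hooks the kernel's tcp_connect function and kills the process when the calling binary is a shell. The list of shell paths is an assumption, and the policy shipped with this layer may differ.

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: detect-reverse-shell
spec:
  kprobes:
    - call: "tcp_connect"        # kernel function, not a syscall
      syscall: false
      args:
        - index: 0
          type: "sock"
      selectors:
        - matchBinaries:
            - operator: "In"
              values:            # assumed shell locations
                - "/bin/bash"
                - "/bin/sh"
                - "/usr/bin/bash"
                - "/usr/bin/sh"
          matchActions:
            - action: Sigkill    # kill the process in-kernel
```

Test a policy like this with the action removed first, so it only reports events. A legitimate script that shells out and opens a TCP connection would otherwise be killed too.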

Falco (alternative — Flannel/non-Cilium CNI, or documented fallback)

If you’re on Flannel (the k3s default CNI) or are not using Cilium, deploy Falco instead. Falco is the CNCF-graduated standard for runtime security, with 25 stable detection rules out of the box and Falcosidekick to route alerts to 50+ destinations.

Running both Tetragon and Falco simultaneously doubles eBPF overhead. On a capable 8 GB+ node it is possible if you want Falco’s mature rules library alongside Tetragon’s enforcement. On a Pi, run exactly one.

Note

Install Falco

helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm upgrade --install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --version 9.x \
  -f /path/to/falco-values.yaml

k3s-specific: Falco must point at the k3s containerd socket (/run/k3s/containerd/containerd.sock), not the default Docker socket.

What Falco’s built-in rules detect (25 stable rules, highlights):

Rule	Severity
Terminal shell opened in container	Warning
Container launched with full privileges	Critical
Write to sensitive directory (`/etc`, `/bin`, `/sbin`)	Error
Package manager (`apt`, `yum`) run inside container	Warning
Outbound connection to EC2 instance metadata service	Warning
Netcat spawned in container	Error
RBAC secrets read from container	Warning
`nsenter` called inside a container (namespace escape attempt)	Critical

Falco exposes Prometheus metrics natively on port 8765 since version 0.38. Do not install falco-exporter — it is deprecated.

The audit log — your legal and forensic trail

k3s does not enable API audit logging by default. Enable it by adding these flags to the k3s server config, then restarting k3s:

/etc/rancher/k3s/config.yaml (additions)

kube-apiserver-arg:
  - "audit-log-path=/var/log/k3s/audit.log"
  - "audit-policy-file=/etc/rancher/k3s/audit-policy.yaml"
  - "audit-log-maxage=30"
  - "audit-log-maxbackup=10"
  - "audit-log-maxsize=100"

The audit policy (at /etc/rancher/k3s/audit-policy.yaml) controls what gets logged. The policy in this layer:

Logs every RBAC change (create/update/delete on roles, bindings)
Logs kubectl exec, attach, and port-forward at request level
Logs pod and deployment mutations (create/delete/patch)
Logs secret and ConfigMap access at metadata level (records who accessed what, never the values)
Suppresses noisy read-only get/list/watch on events

Log secret metadata only — the audit log records which secret was accessed and by whom, but never the secret’s value. The level: Metadata setting for secrets is intentional and non-negotiable.

Hardened
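
The rules listed above map onto the Kubernetes audit policy format roughly as in the sketch below. Rules are evaluated top to bottom and the first match wins, so the order matters. The levels chosen for RBAC and workload mutations are assumptions. This sketch has no catch-all rule, so requests that match nothing are not logged.

```yaml
# /etc/rancher/k3s/audit-policy.yaml (sketch)
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - RequestReceived
rules:
  # Suppress noisy read-only traffic on events.
  - level: None
    verbs: ["get", "list", "watch"]
    resources:
      - group: ""
        resources: ["events"]
  # Secrets and ConfigMaps: who touched what, never the payload.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Every RBAC change (level is an assumption).
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  # kubectl exec, attach, and port-forward at request level.
  - level: Request
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach", "pods/portforward"]
  # Pod and deployment mutations.
  - level: Request
    verbs: ["create", "delete", "patch"]
    resources:
      - group: ""
        resources: ["pods"]
      - group: "apps"
        resources: ["deployments"]
```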

Compliance and vulnerability scanning

Trivy Operator — both tiers

Trivy Operator (chart 0.33.x) runs as a controller inside the cluster. When a new pod starts, it automatically scans the container image for CVEs and checks the workload configuration for security mis-configurations. Results are stored as Kubernetes custom resources (instances of the operator's CRDs), queryable with kubectl.

Every time a new container starts, Trivy Operator checks its ingredients list for known poison (CVEs) and its configuration for obvious mistakes (like “runs as root” or “no CPU limits”). The results stay in the cluster as records you can query or alert on.

In plain English

Reports it generates:

CRD	What it captures
`VulnerabilityReport`	CVEs by severity per container image
`ConfigAuditReport`	Missing `securityContext`, host path mounts, privilege escalation flags
`ExposedSecretReport`	Credentials baked into images or environment variables
`RbacAssessmentReport`	Over-permissive service accounts and role bindings
`SbomReport`	Full software bill of materials
`ClusterComplianceReport`	NSA/CISA Hardening Guidance, CIS Benchmark

Install Trivy Operator

helm repo add aqua https://aquasecurity.github.io/helm-charts/
helm repo update
helm upgrade --install trivy-operator aqua/trivy-operator \
  --namespace trivy-system --create-namespace \
  --version 0.33.x \
  -f /path/to/trivy-operator-values.yaml

Trivy Operator does continuous scanning (watching for workload changes). Image signing and supply-chain verification live in the Supply Chain chapter (✦) — this layer owns runtime scanning only.

Note

Kubescape Operator — full tier only

Kubescape (chart 1.22.x, CNCF incubating) scans against NSA/CISA, MITRE ATT&CK, CIS, and SOC 2 frameworks continuously. It is heavier than Trivy Operator (~150–300 MB), so it is full-tier only.

k3s-specific: k3s ships its own runc binary at a non-standard path. You must pass the override at install time or Kubescape cannot inspect container runtime configuration.

Install Kubescape Operator

helm repo add kubescape https://kubescape.github.io/helm-charts/
helm repo update
helm upgrade --install kubescape kubescape/kubescape-operator \
  --namespace kubescape --create-namespace \
  --version 1.22.x \
  --set global.overrideRuntimePath="/host/var/lib/rancher/k3s/data/current/bin/runc" \
  -f /path/to/kubescape-values.yaml

kube-bench — both tiers (weekly CronJob)

kube-bench runs the CIS Kubernetes Benchmark as a batch job. It natively understands k3s. Run it as a weekly CronJob so you catch drift from the CIS baseline without the constant overhead of a running controller.

Apply kube-bench CronJob

kubectl apply -f /path/to/kube-bench-cronjob.yaml
# Results are in the logs of each run's Job (named kube-bench-<timestamp>, not kube-bench):
# kubectl get jobs -n kube-system && kubectl logs -n kube-system job/<job-name>

The CronJob runs at 02:00 Sunday; kube-bench auto-detects the correct CIS profile for the running k3s version and writes JSON output to the pod log.
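
For orientation, a stripped-down version of such a CronJob might look like this. It is a sketch, not the manifest shipped with this layer: the image tag and the two host-path mounts are assumptions, and upstream kube-bench job manifests mount more paths.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kube-bench
  namespace: kube-system
spec:
  schedule: "0 2 * * 0"          # 02:00 every Sunday (controller time zone)
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true          # kube-bench inspects host processes
          restartPolicy: Never
          containers:
            - name: kube-bench
              image: docker.io/aquasec/kube-bench:latest  # pin a tag in practice
              command: ["kube-bench", "--json"]           # JSON to the pod log
              volumeMounts:
                - name: etc-rancher
                  mountPath: /etc/rancher
                  readOnly: true
                - name: var-lib-rancher
                  mountPath: /var/lib/rancher
                  readOnly: true
          volumes:
            - name: etc-rancher
              hostPath:
                path: /etc/rancher
            - name: var-lib-rancher
              hostPath:
                path: /var/lib/rancher
```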

Dashboards — what to provision

Provision dashboards as ConfigMaps labelled grafana_dashboard: "1". Grafana’s sidecar picks them up automatically.

Dashboard	Grafana ID	What it shows
K3s Monitoring	19972	Cluster-level CPU / memory / disk / pod counts
Kubernetes Views / K3s Cluster	16450	Per-node and per-container view via cAdvisor
Node Exporter Full	1860	Deep OS-level metrics (disk I/O, network, filesystem)
Falco Events	17319	Runtime security events from Falco → Loki

Import dashboards from the Grafana UI under Dashboards → Import. Paste the numeric ID and select your Prometheus/VictoriaMetrics data source. Grafana fetches the dashboard definition from grafana.com.

Tip
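
For dashboards provisioned as ConfigMaps rather than imported through the UI, the shape is roughly as follows. The dashboard JSON here is a deliberately empty placeholder, and the ConfigMap name and uid are hypothetical.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-dashboard        # hypothetical name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"       # the label the Grafana sidecar watches for
data:
  example-dashboard.json: |
    {
      "title": "Example dashboard",
      "uid": "example-dashboard",
      "panels": []
    }
```

In practice the JSON body would be the exported definition of one of the dashboards in the table above.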

Alerting — what fires and where it goes

Alert destinations

Alertmanager (full tier) and Grafana Alerting (both tiers) route to:

ntfy — push notifications to your phone (self-hosted or ntfy.sh)
Discord — webhook to a private channel
Email — SMTP (configure in Alertmanager global)

Critical alerts (severity: critical) fire immediately and go to all channels. Warning-level alerts are grouped, deduplicated, and fire at most once every 12 hours.
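
In Alertmanager terms, that routing behaviour could be sketched like this. The receiver name and webhook URL are hypothetical; ntfy typically needs a small bridge service in front of it, and the Discord and email receivers are omitted for brevity.

```yaml
# Alertmanager configuration (fragment)
route:
  receiver: all-channels
  group_by: ["alertname", "namespace"]
  routes:
    # Critical: notify as soon as the alert fires.
    - matchers:
        - severity="critical"
      receiver: all-channels
      group_wait: 0s
    # Warning: batch related alerts, re-notify at most every 12 hours.
    - matchers:
        - severity="warning"
      receiver: all-channels
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
receivers:
  - name: all-channels
    webhook_configs:
      - url: http://ntfy-bridge.monitoring.svc:8080/alerts  # hypothetical bridge
```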

Essential security alerts

These are the alerts that matter most. All of them are in alert-rules.yaml.

Node health:

Alert	Fires when	Severity
`NodeHighCPU`	CPU > 90% for 15 minutes	Warning
`NodeHighMemory`	Memory > 90% for 5 minutes	Critical
`NodeDiskFull`	Filesystem < 10% free	Critical
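
The three node-health alerts could be written as PrometheusRule entries along these lines. The expressions use standard node-exporter metrics, but the exact rules in alert-rules.yaml may differ. The release label is an assumption about how the operator selects rules.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-health
  namespace: monitoring
  labels:
    release: kube-prom           # assumed to match the Helm release name
spec:
  groups:
    - name: node-health
      rules:
        - alert: NodeHighCPU
          expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.90
          for: 15m
          labels:
            severity: warning
        - alert: NodeHighMemory
          expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
          for: 5m
          labels:
            severity: critical
        - alert: NodeDiskFull
          expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
          labels:
            severity: critical
```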

Kubernetes workloads:

Alert	Fires when	Severity
`PodCrashLooping`	Pod in CrashLoopBackOff for 5 min	Critical
`PodOOMKilled`	Container terminated by OOM killer	Warning
`DeploymentReplicasMismatch`	Desired ≠ available replicas for 5 min	Warning
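
A comparable sketch for the workload alerts, built on kube-state-metrics series. As with the node rules, treat the expressions as one plausible formulation rather than the contents of alert-rules.yaml.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-health
  namespace: monitoring
  labels:
    release: kube-prom           # assumed to match the Helm release name
spec:
  groups:
    - name: workload-health
      rules:
        - alert: PodCrashLooping
          # max_over_time smooths over the brief "running" gaps between crashes.
          expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
          for: 5m
          labels:
            severity: critical
        - alert: PodOOMKilled
          expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
          labels:
            severity: warning
        - alert: DeploymentReplicasMismatch
          expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
          for: 5m
          labels:
            severity: warning
```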

Security (the ones that wake you up):

Alert	Fires when	Severity
`FalcoCriticalEvent`	Falco fires a CRITICAL-priority rule	Critical
`FalcoErrorEvent`	Falco fires an ERROR-priority rule	Warning
`TrivyCriticalVulnerabilities`	Any CRITICAL CVE present in cluster images	Warning
`KubectlExecDetected`	`kubectl exec` into any pod (Loki rule)	Warning
`RBACModification`	Any RBAC role/binding created, updated, or deleted	Critical
`PrivilegedPodCreated`	Pod created with `privileged: true`	Critical
`K3sAPIServerDown`	k3s metrics endpoint unreachable for 2 min	Critical

RBACModification and PrivilegedPodCreated are the two highest-signal security alerts. An attacker who has broken into your cluster will almost certainly need to escalate privileges — both actions show up here immediately.

Important
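
The audit-log alerts are Loki ruler rules rather than Prometheus rules. A hedged sketch of two of them follows. The job="k3s-audit" stream label is an assumption about how the collector tags the audit log. The field names come from the Kubernetes audit event format, flattened by Loki's json parser.

```yaml
# Loki ruler rule group (sketch)
groups:
  - name: k3s-audit-security
    rules:
      - alert: KubectlExecDetected
        expr: |
          sum(count_over_time({job="k3s-audit"} | json | objectRef_subresource="exec" [5m])) > 0
        labels:
          severity: warning
      - alert: RBACModification
        expr: |
          sum(count_over_time({job="k3s-audit"} | json | objectRef_apiGroup="rbac.authorization.k8s.io" | verb=~"create|update|patch|delete" [5m])) > 0
        labels:
          severity: critical
```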

Uptime Kuma — service reachability

Deploy Uptime Kuma as a StatefulSet for HTTP/TCP/cert-expiry monitoring. Point it at:

Grafana (internal URL via WireGuard)
Loki health endpoint
Falco/Falcosidekick health endpoint
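The k3s API server (https://kubernetes.default.svc from inside the cluster; https://localhost:6443 only reaches the API server on the server node itself, not from within a pod)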
Any application endpoints you care about

Uptime Kuma stores its state in SQLite on a 1 GiB PVC. It is not wired into Alertmanager — it has its own notification channels (configure ntfy/Discord there too).

Hardening the monitoring stack itself

The monitoring stack is a high-value target. Prometheus has read access to every container’s metrics; Grafana holds credentials for all data sources; Loki holds your audit log. Compromise of the monitoring stack is effectively compromise of the cluster.

Never expose Prometheus, Alertmanager, Loki, Falcosidekick, or Tetragon endpoints outside the cluster. Expose only Grafana (via the ingress, behind WireGuard). Access everything else via kubectl port-forward when you need it from your workstation — never via NodePort or LoadBalancer.

Hardened

Namespace isolation

All metrics components live in monitoring; all logging components in logging; Falco in falco; Trivy in trivy-system; Kubescape in kubescape. No cross-namespace co-location with application workloads.

NetworkPolicy isolation

Each monitoring component gets a NetworkPolicy that restricts ingress and egress to exactly what it needs:

Grafana: ingress from ingress controller and home LAN CIDR only; egress to monitoring and logging namespaces + DNS
Prometheus/VictoriaMetrics: egress to all namespaces on scrape ports; ingress from Grafana only
Alertmanager: egress to configured webhook/SMTP endpoints only
Loki: ingress from Alloy/Fluent Bit pods (pushes) and Grafana (queries) only; egress to Alertmanager (for ruler alerts)
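
As one example of these policies, the Loki ingress restriction might look like the sketch below. The pod labels and port 3100 are assumptions based on common chart defaults. The Grafana rule is included because Grafana has to reach Loki to query it. Egress is left out here.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-ingress
  namespace: logging
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki          # assumed chart label
  policyTypes: ["Ingress"]
  ingress:
    - from:
        # Log collectors in the same namespace.
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: alloy
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: fluent-bit
        # Grafana, which lives in the monitoring namespace.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
          podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana
      ports:
        - protocol: TCP
          port: 3100                         # Loki's default HTTP port
```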

Secrets management

No passwords in Helm values files. Every credential is a Kubernetes Secret referenced by name:

Grafana admin secret (create before helm install)

apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin-secret
  namespace: monitoring
type: Opaque
stringData:
  admin-user: admin
  admin-password: "<generate with: openssl rand -base64 32>"

Reference in values: grafana.admin.existingSecret: grafana-admin-secret.
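
In the values file, that reference would sit under the grafana key roughly as follows. The userKey and passwordKey entries name the keys inside the Secret and match the manifest above.

```yaml
# kube-prometheus-stack-values.yaml (fragment)
grafana:
  admin:
    existingSecret: grafana-admin-secret
    userKey: admin-user
    passwordKey: admin-password
```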

Grafana security settings

The Helm values in this layer enforce these Grafana security settings:

Anonymous access: disabled (explicit, not left at default)
Cookie flags: Secure + SameSite=Strict
X-Content-Type-Options, X-XSS-Protection, and Content Security Policy: all enabled
External snapshot sharing (snapshot.raintank.io): disabled
Grafana version check / analytics phoning home: disabled
Sign-up: disabled
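
Expressed as grafana.ini settings passed through the Helm values, the list above corresponds approximately to this fragment. It is a sketch of the standard Grafana configuration keys, not a copy of the values file shipped with this layer.

```yaml
# kube-prometheus-stack-values.yaml (fragment)
grafana:
  grafana.ini:
    auth.anonymous:
      enabled: false
    users:
      allow_sign_up: false
    security:
      cookie_secure: true
      cookie_samesite: strict
      x_content_type_options: true
      x_xss_protection: true
      content_security_policy: true
    snapshots:
      external_enabled: false      # no sharing to snapshot.raintank.io
    analytics:
      reporting_enabled: false
      check_for_updates: false
```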

Resource limits — prevent monitoring DoS

Without memory limits, a runaway Prometheus can OOM the node and take the cluster offline — the monitoring system becoming the attack vector. Every component in this layer has both requests and limits defined.

Total footprint:

Tier	Monitoring RAM (peak)	Monitoring CPU (peak)
Full	~3–4 GB	~1.5–2 vCPU
Light	~800 MB–1.2 GB	~0.5 vCPU
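
For instance, the Prometheus container in the full tier could be capped like this. The numbers are illustrative assumptions derived from the 1–2 GB estimate earlier in this layer, not measured values.

```yaml
# kube-prometheus-stack-values.yaml (fragment)
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 250m
        memory: 1Gi
      limits:
        memory: 2Gi     # the container is OOM-killed here instead of starving the node
```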

Reaching Grafana safely (over WireGuard)

Grafana has no public exposure. The correct access path:

Your workstation connects to the cluster via WireGuard VPN (Layer 4 of this guide)
Your browser hits https://grafana.internal.home — resolved via split-DNS to the cluster’s WireGuard IP
The ingress controller terminates TLS (cert-manager + self-signed CA or Let’s Encrypt DNS-01)
Grafana receives the request over plain HTTP on its cluster-internal port

If WireGuard is not connected, Grafana is unreachable. That is the design.

For quick one-off access without the ingress: kubectl -n monitoring port-forward svc/kube-prom-grafana 3000:80 then open http://localhost:3000. Close the port-forward when done.

Tip

Install order

The order matters — Loki must be ready before Alloy/Fluent Bit starts pushing, and Prometheus must be ready before alert rules are applied.

1. Create namespaces
2. Install Loki
3. Install Alloy (full) or Fluent Bit (light)
4. Install kube-prometheus-stack (full) or VictoriaMetrics (light)
5. Install Tetragon (Cilium path) or Falco (Flannel/any path)
6. Install Trivy Operator
7. Install Kubescape Operator (full only)
8. Apply kube-bench CronJob
9. Apply PrometheusRule alert-rules.yaml
10. Apply Loki alert ConfigMap
11. Deploy Uptime Kuma

The install script 25-observability.sh handles this in order, idempotently.

What this layer bought you

Visibility: Every container’s stdout, the k3s API audit trail, and host journal entries are in Loki, queryable up to 30 days back. Prometheus (or VictoriaMetrics) stores 15–30 days of cluster and node metrics.

Detection: Tetragon watches every syscall and network event in-kernel — a reverse shell is killed before the connection completes. Falco’s 25 stable rules cover the most common container escape and privilege escalation patterns. Both fire alerts to ntfy/Discord within seconds.

Compliance: Trivy Operator reports every CRITICAL/HIGH CVE in running images within minutes of a pod starting. kube-bench runs the CIS k3s benchmark weekly and flags any drift. On capable nodes, Kubescape adds MITRE ATT&CK and NSA/CISA framework coverage.

Alerting: Seven high-signal security alerts and six node/workload health alerts are wired up. RBACModification and PrivilegedPodCreated fire immediately on any permission escalation attempt. FalcoCriticalEvent fires the moment a container escape is attempted.

Attack surface: Grafana is the only monitoring component reachable outside the cluster, and only over WireGuard. Every other component is cluster-internal, NetworkPolicy-isolated, and resource-capped.