observability / running

Grafana + Prometheus stack

Time-series metrics, historical graphs, threshold alerting. The 'why is the cluster slow?' answer engine.

What it is

The full Prometheus / Grafana / Alertmanager stack, plus a small constellation of exporters. Prometheus scrapes metrics from every node and container in the cluster; Grafana renders them as dashboards; Alertmanager fires Discord notifications when something crosses a threshold.

Why I run it

I already had Uptime Kuma for binary up/down monitoring and Portainer for container-state visibility. What I didn't have was time-series data: historical CPU, memory, disk-fill rate, and network throughput per LXC. When the cluster slowed down, the only way to answer "is something using more resources than usual?" was to SSH in and run top.

The trigger was asking that question one too many times. Grafana answers it: a single overview dashboard shows every host and every LXC on one screen, with sparklines and current values, and Alertmanager catches the things I'd otherwise have to remember to check.

Deliberately not InfluxDB or Beszel. Prometheus's exporter ecosystem covers everything I need (node_exporter, cAdvisor, pve_exporter, pbs_exporter); the pull model fits the homelab, since adding a target is one line in prometheus.yml with no push agent to configure; and PromQL is the lingua franca of self-hosted dashboards. Beszel would have been simpler to set up; Prometheus is what I want to be running in five years.
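To make the "one line in prometheus.yml" point concrete, here is a minimal Python sketch that mirrors a single scrape job as a dictionary and renders it the way it might appear under scrape_configs. The job name and hostnames are invented for illustration; only port 9100 (node_exporter's default) is a real convention. It is a sketch of the idea, not the actual config from this cluster.

```python
# Hypothetical mirror of one scrape job from prometheus.yml; hostnames are made up.
node_job = {
    "job_name": "node",
    "static_configs": [
        {"targets": ["pve1.lan:9100", "docker-lxc.lan:9100"]},
    ],
}


def add_target(job, target):
    """Append a host:port target to the job's first static_configs entry."""
    targets = job["static_configs"][0]["targets"]
    if target not in targets:  # keep the list free of duplicates
        targets.append(target)
    return job


def render_job(job):
    """Render the job as the YAML lines that would sit under scrape_configs."""
    lines = [
        f"  - job_name: {job['job_name']}",
        "    static_configs:",
    ]
    for group in job["static_configs"]:
        lines.append("      - targets:")
        lines.extend(f"          - '{target}'" for target in group["targets"])
    return "\n".join(lines)


# Adding a new LXC is a single extra entry in the targets list.
add_target(node_job, "media-lxc.lan:9100")
print(render_job(node_job))
```

The rendered output differs from the previous version of the job by exactly one line, which is the whole appeal of static targets in a small, slow-changing cluster.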

How I use it

Every Docker LXC runs cAdvisor for per-container metrics. Every host and LXC runs node_exporter natively under systemd for OS-level metrics. A pve-exporter on the observability LXC scrapes the Proxmox API for cluster-level state. A pbs-exporter polls Proxmox Backup Server for snapshot age and datastore usage. Alertmanager has a dedicated Discord webhook (deliberately separate from the Watchtower webhook so alert volume can't drown out update advisories).
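With this many exporters, a quick way to confirm they are all being scraped is Prometheus's own targets endpoint (GET /api/v1/targets), which reports a health value per active target. The sketch below works on a hand-written sample shaped like that response; the job names, instances, and error text are invented, and fetching the real payload over HTTP is left out so the example stays self-contained.

```python
# Sample shaped like the JSON returned by GET /api/v1/targets.
# All job names, instances, and error strings here are invented for illustration.
SAMPLE_TARGETS = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "node", "instance": "pve1.lan:9100"},
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "cadvisor", "instance": "docker-lxc.lan:8080"},
                "health": "down",
                "lastError": "connection refused",
            },
            {
                "labels": {"job": "pve", "instance": "obs-lxc.lan:9221"},
                "health": "up",
                "lastError": "",
            },
        ],
        "droppedTargets": [],
    },
}


def unhealthy_targets(payload):
    """Return (job, instance) pairs for active targets whose health is not 'up'."""
    return [
        (t["labels"].get("job", "?"), t["labels"].get("instance", "?"))
        for t in payload["data"]["activeTargets"]
        if t.get("health") != "up"
    ]


for job, instance in unhealthy_targets(SAMPLE_TARGETS):
    print(f"not scraping: job={job} instance={instance}")
```

The same information is visible on the Targets page of the Prometheus web UI; the function is only useful if you want the check scripted.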

The dashboards I built or imported:

There are ten alert rules, split across host, Proxmox, cAdvisor, and PBS rule files. The most important one is PBSStaleBackup, which fires if any live guest hasn't been backed up in 30 hours. That single alert would have caught a vzdump hang within hours instead of the next morning.
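The logic behind a stale-backup check can be sketched in a few lines of Python. This is not the actual PromQL rule, and it does not use real pbs-exporter metric names. The function name, the guest identifiers, and the choice to treat a guest with no backup at all as stale are all assumptions made for illustration.

```python
STALE_AFTER_SECONDS = 30 * 60 * 60  # the 30-hour window described above


def stale_guests(last_backup, live_guests, now):
    """Return the live guests whose newest backup is missing or older than 30 hours.

    last_backup maps guest id -> Unix timestamp of its newest snapshot.
    Guests that are no longer live are ignored, so deleted guests never alert.
    """
    stale = []
    for guest in live_guests:
        newest = last_backup.get(guest)
        # Assumption: a live guest with no snapshot at all counts as stale.
        if newest is None or now - newest > STALE_AFTER_SECONDS:
            stale.append(guest)
    return sorted(stale)


now = 1_700_000_000  # fixed "current time" so the example is deterministic
last_backup = {
    "ct/101": now - 5 * 3600,   # backed up 5 hours ago
    "vm/200": now - 31 * 3600,  # backed up 31 hours ago
    "ct/105": now - 40 * 3600,  # old snapshot of a guest that no longer exists
}
live = ["ct/101", "vm/200", "ct/110"]  # ct/110 has never been backed up

print(stale_guests(last_backup, live, now))
```

Assuming backups run once a day, a 30-hour window leaves roughly six hours of slack, so a job that merely runs long does not page while a job that hangs outright still does.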

The whole stack is themed deep purple, because the default Grafana chrome is too neutral. Vanilla Grafana has no supported way to load a custom theme, so the purple is achieved by NPM (Nginx Proxy Manager) injecting a <link> to a custom stylesheet via sub_filter. Reverse-proxy CSS injection is the most reversible option available, and because it never touches Grafana's own files it survives every Grafana upgrade.
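What the reverse proxy does here is a literal string substitution on the HTML response body. nginx's sub_filter replaces only the first match by default, and it only rewrites uncompressed responses. The Python sketch below emulates that behaviour under the assumption that the matched string is the closing </head> tag; the stylesheet path is hypothetical, and the real match string in this setup may differ.

```python
def inject_stylesheet(html, href):
    """Insert a stylesheet <link> before the first </head>, as a one-shot substitution."""
    link = f'<link rel="stylesheet" href="{href}">'
    # count=1 mirrors replacing only the first occurrence.
    return html.replace("</head>", link + "</head>", 1)


page = "<html><head><title>Grafana</title></head><body></body></html>"
themed = inject_stylesheet(page, "/custom/purple.css")  # hypothetical path
print(themed)
```

Because the substitution happens in the proxy, removing the rule restores stock Grafana immediately, which is what makes the approach easy to back out.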

Setup notes

Runbook