Grafana + Prometheus stack

What it is

The full Prometheus / Grafana / Alertmanager stack, plus a small constellation of exporters and the newer Loki/Alloy log lane. Prometheus scrapes metrics from every node and container in the cluster; Grafana renders them as dashboards; Alertmanager fires Discord notifications when something crosses a threshold; Loki/Alloy brings Docker logs into Grafana Explore.

Why I run it

I already had Uptime Kuma for binary up/down monitoring and Portainer for container-state visibility. What I didn't have was time-series data: historical CPU, memory, disk-fill rate, and network throughput per LXC. When the cluster slowed down, I had no way to answer "is something using more resources than usual?" except by SSHing in and running top.

The trigger was asking that "is something using more resources than usual?" question one too many times. Grafana solves it: a single overview dashboard shows every host and every LXC on one screen, with sparklines and current values, and Alertmanager catches the things I'd otherwise have to remember to check.

Deliberately not InfluxDB or Beszel: Prometheus's exporter ecosystem covers everything I need (node_exporter, cAdvisor, pve_exporter, pbs_exporter); the pull model fits the homelab, since adding a target is one line in prometheus.yml with no push agent to configure on the target; and PromQL is the lingua franca of self-hosted dashboards. Beszel would have been simpler to set up; Prometheus is what I want to be running in five years.

How I use it

Every Docker LXC runs cAdvisor for per-container metrics. Every host and LXC runs node_exporter natively under systemd for OS-level metrics. pve_exporter on the observability LXC queries the Proxmox API for cluster-level state, and pbs_exporter polls Proxmox Backup Server for snapshot age and datastore usage. Alertmanager has a dedicated Discord webhook (deliberately separate from the Watchtower webhook so alert volume can't drown out update advisories).
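
The exporter layout above maps onto scrape jobs fairly mechanically. As a rough sketch (not my actual prometheus.yml), the following Python derives a "node" job covering every machine and a "cadvisor" job covering only the Docker LXCs. The hostnames and the inventory structure are made up for illustration, and the ports are the exporters' stock defaults, assumed unchanged.

```python
import json

# Hypothetical inventory: hostnames are placeholders, not the real cluster.
HOSTS = ["pve1.lan", "pve2.lan"]
LXCS = {"media.lan": True, "observability.lan": True, "dns.lan": False}  # name -> runs Docker?

NODE_EXPORTER_PORT = 9100  # node_exporter's default listen port
CADVISOR_PORT = 8080       # cAdvisor's default port, assumed unchanged


def scrape_jobs(hosts, lxcs):
    """Build Prometheus-style scrape jobs from a host list and an LXC map."""
    every_machine = sorted(set(hosts) | set(lxcs))
    docker_lxcs = sorted(name for name, has_docker in lxcs.items() if has_docker)
    return [
        {
            "job_name": "node",
            "static_configs": [
                {"targets": [f"{name}:{NODE_EXPORTER_PORT}" for name in every_machine]}
            ],
        },
        {
            "job_name": "cadvisor",
            "static_configs": [
                {"targets": [f"{name}:{CADVISOR_PORT}" for name in docker_lxcs]}
            ],
        },
    ]


if __name__ == "__main__":
    # Printed as JSON only to show the shape of the scrape_configs section.
    print(json.dumps({"scrape_configs": scrape_jobs(HOSTS, LXCS)}, indent=2))
```

Adding a machine to the inventory adds exactly one entry to the relevant target list, which is the "one line per target" property of the pull model.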

As of the July 2026 observability pass, Loki is provisioned as a Grafana datasource and Alloy collectors push Docker logs from all six Docker LXCs. That turns "what did the container say around the alert?" into a Grafana Explore query instead of a round of SSH.
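
For a sense of what that Explore query amounts to, here is a minimal sketch that builds the equivalent request against Loki's query_range HTTP endpoint for a window around an alert. The Loki address is a placeholder, and the `container` label is an assumption: the real label names depend on how the Alloy collectors relabel Docker logs.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

LOKI_URL = "http://loki.lan:3100"  # hypothetical host; 3100 is Loki's default HTTP port


def loki_params(container, fired_at, window_minutes=10, limit=200):
    """Query parameters for one container's logs in a window around an alert."""
    window = timedelta(minutes=window_minutes)

    def to_ns(moment):
        # Loki accepts start/end as Unix epoch nanoseconds.
        return int(moment.timestamp()) * 10**9

    return {
        # Assumes Alloy attaches a `container` label; adjust to the real label set.
        "query": f'{{container="{container}"}}',
        "start": to_ns(fired_at - window),
        "end": to_ns(fired_at + window),
        "limit": limit,
        "direction": "forward",
    }


def loki_url(container, fired_at, **kwargs):
    """Full query_range URL, usable with curl or a browser."""
    params = loki_params(container, fired_at, **kwargs)
    return f"{LOKI_URL}/loki/api/v1/query_range?" + urlencode(params)


if __name__ == "__main__":
    fired = datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc)
    print(loki_url("plex", fired, window_minutes=5))
```

In Explore the same thing is the label selector plus a time-range pick; the sketch only spells out what goes over the wire.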

The dashboards I built or imported:

Home-Lab Overview — the org default home dashboard. Cluster CPU, RAM, storage, active alerts, per-host CPU line chart, per-LXC RAM bar gauge, PBS datastore trend.
Media-Kitchen — per-container metrics for Plex / Tautulli / Overseerr / Riven / Seanime / AniBridge, plus the media LXC's rootfs-vs-data-mount split.
Cluster I/O — per-host and per-LXC network and disk throughput. The "would have caught Plex's overnight burst at a glance" dashboard.
LXC Grid — sortable table of every LXC with CPU%, RAM%, disk%, uptime.
PBS Backups — datastore usage, dedup ratio, snapshot count, per-guest snapshot age. Red rows when a guest's last backup is older than 48 hours.
n8n + Uptime Kuma + Vaultwarden + Alerts Overview — focused dashboards for individual services that I look at often.
Plus the standard community dashboards: Node Exporter Full (1860), cAdvisor (14282), Proxmox (10347).

Ten alert rules are split across host, Proxmox, cAdvisor, and PBS rule files. The most important one is PBSStaleBackup, which fires if any live guest hasn't been backed up in 30 hours. That single alert would have caught a vzdump hang within hours instead of the next morning.
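
The condition behind PBSStaleBackup is simple enough to model outside Prometheus. The sketch below is not the rule itself (that is a PromQL expression in a rules file, and the metric names depend on the exporter); it only illustrates the staleness check, using the 30-hour alert threshold and the 48-hour dashboard threshold mentioned in this write-up. The guest IDs and the input dictionary are hypothetical, and the input is assumed to contain live guests only.

```python
from datetime import datetime, timedelta, timezone

ALERT_AFTER = timedelta(hours=30)          # PBSStaleBackup threshold
DASHBOARD_RED_AFTER = timedelta(hours=48)  # red-row threshold on the PBS Backups dashboard


def stale_guests(last_backup, now, max_age=ALERT_AFTER):
    """Return guests whose newest snapshot is older than max_age, oldest first.

    last_backup maps a guest ID to the timestamp of its most recent snapshot.
    """
    ages = {guest: now - taken for guest, taken in last_backup.items()}
    stale = [guest for guest, age in ages.items() if age > max_age]
    return sorted(stale, key=lambda guest: ages[guest], reverse=True)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    snapshots = {  # hypothetical guests
        "ct/101": now - timedelta(hours=6),
        "ct/102": now - timedelta(hours=31),
        "vm/200": now - timedelta(hours=50),
    }
    print("alerting:", stale_guests(snapshots, now))
    print("red rows:", stale_guests(snapshots, now, DASHBOARD_RED_AFTER))
```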

The whole stack is themed deep purple, because the default Grafana chrome is too neutral. Vanilla Grafana offers only its built-in themes and no supported way to load a custom one, so the purple is achieved by Nginx Proxy Manager (NPM) injecting a <link> to a custom stylesheet via sub_filter. Reverse-proxy CSS injection is the most reversible option available, and because it changes nothing inside Grafana itself, the injection survives Grafana upgrades.
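
To make the injection concrete: nginx's sub_filter does a literal string substitution on the response body, and by default it replaces only the first match. The Python below is a model of that behaviour, not the proxy configuration itself. The stylesheet path and the choice of </head> as the match string are assumptions for illustration.

```python
STYLESHEET_HREF = "/custom/purple.css"  # hypothetical path served by the proxy


def inject_stylesheet(html, href=STYLESHEET_HREF):
    """Insert a stylesheet <link> before the first </head>, like a one-shot sub_filter."""
    link = f'<link rel="stylesheet" href="{href}">'
    # count=1 mirrors sub_filter's default of replacing only the first occurrence.
    return html.replace("</head>", link + "</head>", 1)


if __name__ == "__main__":
    page = "<html><head><title>Grafana</title></head><body></body></html>"
    print(inject_stylesheet(page))
```

If the match string is absent, the page passes through untouched, which is part of why the approach is easy to back out: remove the directive and Grafana is stock again.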