What it is
The full Prometheus / Grafana / Alertmanager stack, plus a small constellation of exporters. Prometheus scrapes metrics from every node and container in the cluster; Grafana renders them as dashboards; Alertmanager fires Discord notifications when something crosses a threshold.
Why I run it
I already had Uptime Kuma for binary up/down monitoring and Portainer for container-state visibility. What I didn't have was time-series — historical CPU, memory, disk-fill rate, network throughput per LXC. When the cluster slowed down, I had no way to answer "is something using more resources than usual?" except by SSHing in and running top.
The trigger was asking that "is something using more resources than usual?" question one too many times. Grafana solves it: a single overview dashboard shows every host and every LXC on one screen, with sparklines and current values, and Alertmanager catches the things I'd otherwise have to remember to check.
Deliberately not InfluxDB or Beszel: Prometheus's exporter ecosystem covers everything I need (node_exporter, cAdvisor, pve_exporter, pbs_exporter), the pull model fits the homelab — adding a target is one line in prometheus.yml, no agent to deploy — and PromQL is the lingua franca of self-hosted dashboards. Beszel would have been simpler to set up; Prometheus is what I want to be running in five years.
How I use it
Every Docker LXC runs cAdvisor for per-container metrics. Every host and LXC runs node_exporter natively under systemd for OS-level metrics. A pve-exporter on the observability LXC scrapes the Proxmox API for cluster-level state. A pbs-exporter polls Proxmox Backup Server for snapshot age and datastore usage. Alertmanager has a dedicated Discord webhook (deliberately separate from the Watchtower webhook so alert volume can't drown out update advisories).
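With this many exporters, a quick way to see what Prometheus is actually scraping is its targets API. The following is a minimal sketch, not part of my setup: it groups the targets returned by /api/v1/targets by job and splits them into up and not-up. The instance names in SAMPLE and the base URL you would pass to fetch_targets are hypothetical.

```python
import json
from collections import defaultdict
from urllib.request import urlopen

# Hypothetical payload in the shape of a Prometheus /api/v1/targets response.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "host-a:9100"}, "health": "up"},
            {"labels": {"job": "node", "instance": "media-lxc:9100"}, "health": "down"},
            {"labels": {"job": "cadvisor", "instance": "media-lxc:8080"}, "health": "up"},
        ]
    },
}


def summarize_targets(payload: dict) -> dict:
    """Group active scrape targets by job and split them into up / down."""
    summary = defaultdict(lambda: {"up": [], "down": []})
    for target in payload["data"]["activeTargets"]:
        labels = target.get("labels", {})
        job = labels.get("job", "unknown")
        instance = labels.get("instance", target.get("scrapeUrl", "?"))
        # Prometheus reports health as "up", "down", or "unknown";
        # anything that is not "up" is counted as down here.
        bucket = "up" if target.get("health") == "up" else "down"
        summary[job][bucket].append(instance)
    return {job: dict(groups) for job, groups in summary.items()}


def fetch_targets(base_url: str) -> dict:
    """Fetch the targets list from a Prometheus server (needs network access)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Uses the offline sample; swap in fetch_targets("http://<prometheus>:9090")
    # to query a real server.
    for job, groups in summarize_targets(SAMPLE).items():
        print(f"{job}: {len(groups['up'])} up, {len(groups['down'])} not up")
```

Keeping the summary function separate from the HTTP call means it can be run against a saved response when the server itself is the thing that is down.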
The dashboards I built or imported:
- Home-Lab Overview — the org default home dashboard. Cluster CPU, RAM, storage, active alerts, per-host CPU line chart, per-LXC RAM bar gauge, PBS datastore trend.
- Media-Kitchen — per-container metrics for Plex / Tautulli / Overseerr / Riven / Seanime / AniBridge, plus the media LXC's rootfs-vs-data-mount split.
- Cluster I/O — per-host and per-LXC network and disk throughput. The "would have caught Plex's overnight burst at a glance" dashboard.
- LXC Grid — sortable table of every LXC with CPU%, RAM%, disk%, uptime.
- PBS Backups — datastore usage, dedup ratio, snapshot count, per-guest snapshot age. Red rows when a guest's last backup is older than 48 hours.
- n8n + Uptime Kuma + Vaultwarden + Alerts Overview — focused dashboards for individual services that I look at often.
- Plus the standard community dashboards: Node Exporter Full (1860), cAdvisor (14282), Proxmox (10347).
Ten alert rules are spread across the host, Proxmox, cAdvisor, and PBS rule files. The most important one is PBSStaleBackup, which fires if any live guest hasn't been backed up in 30 hours. That single alert would have caught a vzdump hang within hours instead of the next morning.
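The logic behind a stale-backup check is simple enough to sketch outside PromQL. This is an illustration of the idea only, not the actual alert rule: the guest names and timestamps are made up, and the 30-hour threshold is the one quoted above.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=30)  # threshold quoted for PBSStaleBackup


def stale_guests(last_backup, now, max_age=STALE_AFTER):
    """Return (guest, age) pairs whose newest backup is older than max_age,
    sorted with the oldest backup first."""
    stale = {
        guest: now - taken_at
        for guest, taken_at in last_backup.items()
        if now - taken_at > max_age
    }
    return sorted(stale.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    # Hypothetical guests and last-backup times.
    backups = {
        "lxc-101": now - timedelta(hours=10),
        "lxc-102": now - timedelta(hours=32),
        "vm-200": now - timedelta(hours=48),
    }
    for guest, age in stale_guests(backups, now):
        print(f"{guest}: last backup {age.total_seconds() / 3600:.0f}h ago")
```

Assuming a nightly backup job, 30 hours leaves a few hours of slack past the 24-hour mark, so a run that merely finishes late does not trip the check.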
The whole stack is themed deep purple, because the default Grafana chrome is too neutral. Vanilla Grafana offers no supported way to load a custom theme, so the purple comes from Nginx Proxy Manager (NPM) injecting a <link> to a custom stylesheet via sub_filter. Reverse-proxy CSS injection is the most reversible option available, and because it changes nothing inside the Grafana container it survives every Grafana upgrade.
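What the proxy does to each HTML response amounts to a single string substitution. The sketch below mimics that in Python to show the effect. It is not the nginx configuration itself, and the stylesheet path is a hypothetical placeholder.

```python
def inject_stylesheet(html: str, href: str) -> str:
    """Insert a stylesheet <link> just before the first </head> tag."""
    link = f'<link rel="stylesheet" href="{href}">'
    # Replacing only the first match mirrors nginx's default behaviour
    # for sub_filter (sub_filter_once on).
    return html.replace("</head>", link + "</head>", 1)


if __name__ == "__main__":
    page = "<html><head><title>Grafana</title></head><body></body></html>"
    # "/custom/purple.css" is a made-up path used only for illustration.
    print(inject_stylesheet(page, "/custom/purple.css"))
```

Because the substitution happens in transit, removing the rule from the proxy restores stock Grafana with no change to the container or its volumes.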
Setup notes
- Host: the dev-tools LXC. If the observability stack ever pushes that LXC past 70% RAM sustained, I'll split it out — but for now it shares space with Stirling PDF.
- Provisioning: datasources and dashboard providers live as bind-mounted YAML so they're immutable in the UI and survive a container recreate. Dashboards live as JSON files. PBS captures the bind-mount tree as part of the LXC rootfs, so restoring the LXC brings the entire Grafana state back, no UI re-config needed.
- Reverse proxy: Grafana itself is behind NPM. Prometheus and Alertmanager are intentionally LAN-IP-only — they're operator tools, not user surfaces.
- Authentication: Grafana admin login, anonymous auth off.
- Update cadence: manual. Every image is pinned. The Prometheus / Grafana / Alertmanager / exporter tags are tracked in the compose file and bumped deliberately.
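A pinning policy is easy to check mechanically. This is a minimal sketch of such a check, assuming "pinned" means the image reference carries either a digest or an explicit tag other than latest. The image references in the example are placeholders, not the versions I run.

```python
def is_pinned(image: str) -> bool:
    """True if an image reference has a digest or a tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    # A tag, if present, follows the last ':' in the final path component,
    # so a registry port such as "registry.local:5000/app" is not mistaken
    # for a tag.
    last_part = image.rsplit("/", 1)[-1]
    if ":" not in last_part:
        return False  # no tag: Docker falls back to "latest"
    tag = last_part.rsplit(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    # Placeholder references for illustration only.
    for ref in ("prom/prometheus:v1.2.3", "grafana/grafana", "grafana/grafana:latest"):
        print(f"{ref}: {'pinned' if is_pinned(ref) else 'NOT pinned'}")
```

Run over the image values from the compose file, a check like this catches a tag that was left off by accident before it turns into an unplanned upgrade.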
Runbook
- Healthy looks like: every scrape target shows UP in Prometheus, Grafana's API health endpoint returns OK, and Alertmanager has zero firing alerts on a baseline cluster.
- Trust metrics over logs for Alertmanager success. The version I run doesn't log successful Discord notifications at INFO level, only failures. alertmanager_notifications_total{integration="discord"} is the real source of truth.
- A target is DOWN but the service is up: check the exporter port is bound (ss -tlnp), and check that the LXC's Proxmox firewall flag is off. A firewall=1 flag on net0 makes scrapes fail silently, the same trap that hides a single LXC from monitoring entirely.
- Bind-mounted files break when the host inode changes: GNU sed -i rewrites files via temp-file-then-rename, which changes the inode. Docker bind mounts of single files snapshot the original inode on container start, so the container keeps seeing the old content. Fix: docker compose up -d --force-recreate <svc>. Directory bind mounts don't have this issue; single-file bind mounts do.
- A dashboard shows "No data" everywhere: probably an exporter that's silently broken. cAdvisor, for example, was broken against Docker 27+'s overlayfs snapshotter in older versions: --docker_only=true made it report only the root cgroup with no per-container metrics at all. Verify metric names against the live /metrics endpoint, not the README.
- Where logs live: per-container docker logs for the Grafana / Prometheus / Alertmanager containers; Alertmanager's metrics endpoint for delivery success/failure counts; Discord for the actual user-facing output.
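The inode trap from the bind-mount item is easy to reproduce without Docker. The sketch below, which assumes a POSIX filesystem, edits a scratch file in two ways: once by truncating and rewriting it in place, and once by writing a temp file and renaming it over the original, which is the pattern sed -i uses. The file name is only a label for the scratch file.

```python
import os
import tempfile


def inode_after_in_place_edit(path: str, new_text: str) -> int:
    """Truncate and rewrite the existing file; the inode stays the same."""
    with open(path, "w") as f:
        f.write(new_text)
    return os.stat(path).st_ino


def inode_after_rename_edit(path: str, new_text: str) -> int:
    """Write a temp file next to the target, then rename it over the target."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as tmp:
        tmp.write(new_text)
    os.replace(tmp_path, path)  # the path now refers to the temp file's inode
    return os.stat(path).st_ino


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "prometheus.yml")
        with open(path, "w") as f:
            f.write("original\n")
        before = os.stat(path).st_ino
        in_place = inode_after_in_place_edit(path, "edited in place\n")
        renamed = inode_after_rename_edit(path, "edited via rename\n")
        print("in-place edit kept inode:", in_place == before)
        print("rename edit kept inode:", renamed == before)
```

A container holding a single-file bind mount keeps pointing at the inode it was started with, which is why the rename-style edit is invisible to it until the container is recreated.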