What it is
Proxmox Backup Server (PBS) is the dedicated backup tool from the Proxmox project. It runs alongside Proxmox VE and stores incremental, deduplicated, content-addressable snapshots of every VM and container in the cluster. From PVE's perspective, it's "a storage pool you can back up to"; from PBS's perspective, it owns a datastore on disk where the actual backup chunks live.
Why I run it
The cluster has irreplaceable state: Home Assistant configuration, Vaultwarden's password vault, Nextcloud's data, my photo library, the n8n workflows, every service's config files. Losing any of that to a drive failure would range from annoying to catastrophic.
PBS gives me:
- Daily snapshots of every guest, automatic, no thinking required.
- Content-addressable chunk storage — unchanged data across snapshots takes near-zero additional space. A week of daily snapshots of Vaultwarden uses maybe 1.5× the size of a single snapshot, not 7×.
- Verify jobs that re-read every chunk and confirm the SHA-256 still matches. Catches silent disk corruption that ordinary backups can't see.
- Restore to a different VMID for non-destructive recovery drills.
It also enforces a rule I think is important: backups never live on the same physical disk as the data they protect. PBS's datastore sits on an external NVMe enclosure that stays plugged in but is a physically separate disk, so a failure of the laptop node's main NVMe doesn't take the backups with it.
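To make the chunk-store and verify bullets concrete, here is a toy sketch of content-addressable storage in Python. It is not PBS's implementation: `ChunkStore`, its methods, and the 4-byte chunk size are all made up for illustration (real chunks are far larger). It only shows why an unchanged chunk costs nothing the second time, and how re-hashing catches silent corruption.

```python
import hashlib

CHUNK_SIZE = 4  # illustrative only; real backup chunks are far larger


class ChunkStore:
    """Toy content-addressable store: each chunk is keyed by its SHA-256 digest."""

    def __init__(self):
        self.chunks = {}  # hex digest -> chunk bytes

    def put_snapshot(self, data: bytes) -> list[str]:
        """Split data into chunks, store unseen ones, return the snapshot's digest index."""
        index = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # A chunk that is already stored is not written again.
            self.chunks.setdefault(digest, chunk)
            index.append(digest)
        return index

    def restore(self, index: list[str]) -> bytes:
        """Rebuild a snapshot from its digest index."""
        return b"".join(self.chunks[d] for d in index)

    def verify(self) -> list[str]:
        """Re-hash every stored chunk; return digests whose content no longer matches."""
        return [d for d, c in self.chunks.items()
                if hashlib.sha256(c).hexdigest() != d]


store = ChunkStore()
day1 = store.put_snapshot(b"AAAABBBBCCCCDDDD")
day2 = store.put_snapshot(b"AAAABBBBCCCCEEEE")  # only the last chunk changed
print(len(store.chunks))  # 5 unique chunks for 8 chunks of snapshot data
print(store.verify())     # [] while nothing is corrupted
```

Two snapshots of four chunks each end up as five stored chunks, which is the same effect that keeps a week of daily snapshots well under 7× the size of one.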
How I use it
The datastore holds about a year of dedup-collapsed snapshots across every guest. Scheduled backups run as two jobs:
- Job 1, 02:00 daily, snapshot mode, for every guest except the media LXC.
- Job 2, 08:00 daily, stop mode, for the media LXC only.
The split is the result of a backup incident I'd rather not repeat. The media LXC is the only multi-disk LXC in the cluster (rootfs plus a /data mount). vzdump's snapshot mode for multi-disk LXCs is "suspend → snapshot each disk → resume," and the resume step has a known cgroup-v2 freezer race that hangs the container indefinitely. Single-disk LXCs don't trigger it. Stop mode bypasses the snapshot entirely — quick shutdown, read from the now-stopped volumes, restart — at the cost of a few minutes of downtime, which is why job 2 runs at 08:00 (lowest media-viewing window).
A weekly verify job re-checks every chunk's SHA-256 on Saturday evenings. Snapshots verified in the last 30 days are skipped; everything else gets re-read. Typical run is about five minutes.
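The skip rule for the verify job boils down to one comparison. A minimal sketch, assuming each snapshot carries a last-verified timestamp (or none at all); the function names and the snapshot mapping are hypothetical, not PBS's API:

```python
from datetime import datetime, timedelta

REVERIFY_AFTER = timedelta(days=30)  # matches the 30-day window described above


def needs_verify(last_verified, now, max_age=REVERIFY_AFTER):
    """True if the snapshot was never verified or its last verification is too old."""
    return last_verified is None or now - last_verified > max_age


def due_for_verify(snapshots, now):
    """Return the names of snapshots the weekly job would re-read.

    `snapshots` maps a snapshot name to its last-verified datetime (or None).
    """
    return sorted(name for name, ts in snapshots.items() if needs_verify(ts, now))
```

With mostly fresh verification state, only new and long-unverified snapshots land in the list, which is consistent with the run staying short.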
Setup notes
- Host: a privileged LXC on the laptop node (privileged + nesting needed for loopback access to the external NVMe).
- Datastore: ~500 GB external NVMe in a USB-C enclosure, bind-mounted into the PBS LXC. Currently around 17% used at 5.5× deduplication.
- Reverse proxy: yes, but PBS must be served over HTTPS by the proxy, not plain HTTP. PBS's frontend uses `Secure` cookies that silently fail outside a secure context. Symptom of getting it wrong: the static UI loads but every authenticated API call returns empty (sidebar visible, all data widgets blank). Lost half an hour to that one.
- Monitoring: a `pbs-exporter` running in the observability stack polls PBS's API with a read-only token and feeds metrics into Grafana. The most important alert that came out of this: a critical-severity rule that fires if any live guest's last snapshot is older than 30 hours. Would have caught the previous backup-hang incident within hours instead of next morning.
- Update cadence: manual.
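The stale-backup rule from the monitoring note is simple enough to sketch in Python. This is not the actual Grafana rule or the exporter's code; `stale_guests` and its inputs are hypothetical, and treating a live guest with no snapshot at all as stale is my assumption.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=30)  # threshold from the alert described above


def stale_guests(last_snapshot, live_guests, now):
    """Return live guests whose newest snapshot is missing or older than STALE_AFTER.

    `last_snapshot` maps a guest ID to the datetime of its newest snapshot.
    """
    stale = []
    for guest in sorted(live_guests):
        ts = last_snapshot.get(guest)
        # Assumption: a live guest with no snapshot at all counts as stale.
        if ts is None or now - ts > STALE_AFTER:
            stale.append(guest)
    return stale
```

With a daily schedule, a 30-hour threshold leaves a few hours of slack for a slow run while still firing well before a second backup is missed.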
Runbook
- Healthy looks like: every scheduled job shows a recent successful run, Grafana's PBS dashboard shows fresh snapshot timestamps per guest, datastore usage is well under 80%.
- A scheduled backup is running but no progress: vzdump is wedged on a guest, probably the multi-disk LXC race. Identify the hung `lxc-info` process, kill it, clear any stale `vzdump-backup-*.lck` files, restart `pvestatd`. Worst case, reboot the affected node.
- PBS web UI loads but data widgets stay empty: served over plain HTTP through the proxy. Force SSL on the proxy host and use HTTPS. (See above. This one is universal: any cookie-heavy app needs HTTPS even on a "trusted" network.)
- Backup failures are silent: PBS does not notify on failure by default. Either configure a notification target, or rely on the Grafana `PBSStaleBackup` alert as the safety net.
- Restore to a non-existent VMID: PBS won't restore over a running guest. To validate a backup, restore to a different VMID, boot it, sanity-check, then destroy. Quarterly is the right cadence.
- Where logs live: per-job task logs in the PBS UI; `journalctl` on the PBS LXC for the daemon.