The day my Proxmox backups silently broke

Backups are comforting in the same way smoke alarms are comforting. They exist mostly in the background, they become part of the house, and then one night they remind you that "installed" and "protecting you" are not the same verb.

My Proxmox Backup Server setup looked healthy. The datastore had space. Deduplication was doing its quiet little miracle. Standard containers were backing up in seconds. The weekly verify job had run recently. The dashboard had the kind of green that makes you close the tab and go make coffee.

Then the scheduled backup job hit the one container in the cluster that was not like the others.

It started normally. A few guests finished. Logs advanced. Then the task reached a media container with more than one virtual disk attached and printed the last line anyone wants to see during a backup window:

INFO: suspend vm to make snapshot

And then nothing.

No obvious crash. No dramatic failure. No clean error message with a cute stack trace and a breadcrumb to paste into a search box. Just a task that said it was running, a container that had stopped behaving like a container, and a Proxmox UI that slowly began to look haunted.

This is the post I wish I had open before that morning.

The architecture that made the bug possible

Most of my Linux containers are boring in the best possible way: one root disk, one scheduled snapshot backup, off to Proxmox Backup Server, done. Snapshot mode is a great fit for that shape. The guest keeps running, Proxmox snapshots the volume, PBS receives the chunks, and deduplication makes repeat runs cheap.

The media container was different. It had a root disk and an extra mounted volume for application data. That design was convenient when I built it. It kept the service layout tidy and left room for growth.

It also changed the backup mechanics.

For a single-disk LXC, Proxmox can usually take the storage snapshot without freezing the whole guest for long. For a multi-disk LXC, the backup path has to coordinate multiple volumes. The simplified version looks like this:

1. Freeze or suspend the container.
2. Snapshot disk one.
3. Snapshot disk two.
4. Thaw or resume the container.
5. Read the snapshots and upload to PBS.
6. Remove the temporary snapshots.

The dangerous step is number four.

On my setup, the container did not reliably thaw. The backup process reached the suspend step, got stuck around the cgroup-v2 freezer path, and left the guest in a state where normal management commands could hang. That is worse than a failed backup. A failed backup is noisy. A wedged backup can make the rest of the node feel sick while still pretending the job is merely "running."
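If you want to confirm a freeze without going through management tools that may themselves hang, the kernel exposes the state directly. What follows is a minimal sketch, assuming cgroup v2 and assuming the container's cgroup lives under /sys/fs/cgroup/lxc/<guest-id>. That path is my assumption about the layout, so check it on your own node. The function names are made up for illustration.

```shell
# Reads the contents of a cgroup v2 "cgroup.events" file on stdin and
# succeeds (exit status 0) if the cgroup reports itself as frozen.
cgroup_is_frozen() {
    grep -qx 'frozen 1'
}

# Hypothetical wrapper: checks one container by guest ID.
# ASSUMPTION: the cgroup path below matches your Proxmox VE layout.
ct_is_frozen() {
    vmid=$1
    events="/sys/fs/cgroup/lxc/${vmid}/cgroup.events"
    [ -r "$events" ] && cgroup_is_frozen < "$events"
}

# Example use on the node:
#   ct_is_frozen <guest-id> && echo "still frozen"
```

A container that still reports frozen long after the backup log stopped moving matches the wedge described here.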

I do not mind sharp edges. I do mind sharp edges wearing a sweater.

The symptom was not "PBS is broken"

The easy wrong conclusion would have been "Proxmox Backup Server broke my backups." It did not.

PBS was doing its job. The datastore was fine. Chunk verification was fine. The failure happened before the backup stream became a PBS problem. The guest could not safely pass through the snapshot procedure.

That distinction matters because it changes the fix. Reinstalling PBS, moving the datastore, tuning retention, or poking at garbage collection would have been ritual, not repair. The actual failing boundary was between Proxmox VE, LXC, the storage snapshot path, and one container's disk layout.

Once I stopped blaming the backup target, the incident got much simpler.

One guest had a special shape.

One scheduled job treated it like all the others.

One backup mode was wrong for that shape.
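Finding guests with that special shape can be scripted instead of remembered. This is a rough sketch that counts storage-backed volumes in the text that pct config prints. The helper names are hypothetical, and it deliberately ignores details such as whether a mount point is flagged for inclusion in backups, so treat the result as a hint, not a verdict.

```shell
# Reads `pct config <guest-id>` output on stdin and prints the number of
# storage-backed volumes (rootfs plus mpN entries).
count_ct_volumes() {
    awk -F': *' '
        /^(rootfs|mp[0-9]+):/ {
            # A value starting with "/" is a bind or device mount,
            # not a storage volume, so it is not counted.
            if ($2 !~ /^\//) n++
        }
        END { print n + 0 }
    '
}

# Hypothetical helper: list containers on this node with more than one volume.
# Skips the header line of `pct list` and takes the first column as the ID.
list_multi_volume_cts() {
    for id in $(pct list | awk 'NR > 1 { print $1 }'); do
        n=$(pct config "$id" | count_ct_volumes)
        if [ "$n" -gt 1 ]; then
            echo "$id has $n volumes"
        fi
    done
}
```

Anything this reports is a candidate for its own backup job, or at least for a closer look.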

Recovery was about unwedging the node first

The first job in an incident like this is not elegance. It is getting the control plane back.

The backup task had stopped making progress. Some Proxmox status calls were hanging because they were trying to inspect the wedged container. The UI showed uncertainty around guests that were otherwise fine, which is exactly the kind of secondary symptom that wastes time if you chase it.

The recovery sequence was:

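# Find status calls stuck behind the wedged guest, then kill them.
ps aux | grep '[l]xc-info'
kill -9 <hung-status-process>
# Inspect the backup locks; remove them only once the old job is confirmed dead.
ls -la /var/lock/pve-manager/
rm /var/lock/pve-manager/vzdump-backup-*.lck
# Clear the guest lock, restart the status daemon, and confirm the guest reports a state.
pct unlock <guest-id>
systemctl restart pvestatd
pct status <guest-id>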

Those commands are intentionally generic here. The important part is the order.

First, unblock the status poller if it is stuck behind the bad guest. Then clear stale backup lock files only after you understand that the old job is dead. Then clear the guest lock. Then restart the Proxmox status daemon so the UI can stop reporting last known fear as current truth.
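"Understand that the old job is dead" deserves a concrete check. A process in uninterruptible sleep (state D) generally will not react to kill -9 until it leaves that state, which is one reason a reboot can end up being the honest lever. The sketch below only classifies what ps reports. The function name is mine, and matching on "vzdump" and "lxc-info" in the command line is an assumption about how those processes show up on your node.

```shell
# Reads `ps -eo pid,stat,args` output on stdin. For each process whose
# command line mentions vzdump or lxc-info, prints: PID, state, and a note.
# The [v] and [l] brackets keep this awk process from matching itself.
backup_related_procs() {
    awk 'NR > 1 && /[v]zdump|[l]xc-info/ {
        note = ($2 ~ /^D/) ? "uninterruptible" : "-"
        print $1, $2, note
    }'
}

# Example use on the node:
#   ps -eo pid,stat,args | backup_related_procs
```

If anything is listed, the old job is not fully gone yet, and removing lock files is premature.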

In my case, the cleanest final recovery was a node reboot. That is not my favorite lever, but it was the honest one. After the host came back, the containers returned, the media stack restarted, and the backup incident was contained to one bad job design.

The goal was not to prove I could avoid a reboot. The goal was to leave the system in a state I trusted.

The fix was boring, which is how you know it was probably right

I split the backup policy into two jobs.

The standard job still runs in snapshot mode for ordinary single-disk guests. Those backups are fast and consistent enough for their workloads, and PBS handles them beautifully.

The special media container moved to its own backup job in stop mode. During that window, Proxmox shuts the guest down cleanly, backs up the volumes from a stopped state, and starts it again afterward.

That sounds less fancy because it is less fancy. It is also the correct trade.
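For reference, here is roughly what the split looks like when expressed as vzdump invocations. This is a sketch, not my exact job definitions: the guest ID and storage name are placeholders, and the same split can be made with two scheduled jobs in the Proxmox UI. It defaults to a dry run that only prints the commands.

```shell
# Placeholder values for illustration only.
MEDIA_CT=101
PBS_STORAGE=pbs-main

# Prints the command in dry-run mode (the default); runs it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Standard job: every guest on this node except the media container, snapshot mode.
backup_standard_guests() {
    run vzdump --all 1 --exclude "$MEDIA_CT" --mode snapshot --storage "$PBS_STORAGE"
}

# Special job: only the media container, stop mode.
backup_media_guest() {
    run vzdump "$MEDIA_CT" --mode stop --storage "$PBS_STORAGE"
}
```

Keeping the two jobs on separate schedules also means a problem with the special guest cannot stall the backups of everything queued behind it.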

Snapshot mode optimizes for uptime. Stop mode optimizes for mechanical sympathy. For a multi-disk LXC that can wedge during freezer coordination, uptime during the backup window is not a real feature. It is an IOU written by a subsystem that has already shown you it might not pay.

The downtime is small, scheduled, and understandable. The old behavior was unscheduled, confusing, and capable of making the host look unhealthy.

I will take the honest two-minute interruption.

The second-order bug

There was one more wrinkle. The stop-mode backup restarted the media container cleanly, but one of the mounted paths inside the media stack did not always come back happy. Some services saw a stale mount after boot and needed a small refresh step before they could read their library paths again.

That kind of bug is easy to miss because the backup itself now succeeds. The schedule is green. PBS has a new snapshot. But the application layer can still be degraded.

The durable fix was to add a boot-time refresh service inside the guest. After the container starts, the helper checks and remounts the dependent path before the rest of the stack settles into the day.
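The helper does not need to be clever. A minimal sketch of the idea, assuming the dependent path has an /etc/fstab entry inside the guest so that a bare mount of the path works, and assuming a stale mount shows up as a hung or failing directory listing. The function names are illustrative, and how you wire it into boot (a oneshot systemd unit is one option) is up to you.

```shell
# Succeeds if the path is a mount point and a directory listing returns
# within five seconds. A stale mount often hangs or errors on the listing.
path_is_healthy() {
    mountpoint -q "$1" && timeout 5 ls "$1" >/dev/null 2>&1
}

# Checks one path and remounts it if it looks missing or stale.
# ASSUMPTION: the path is defined in /etc/fstab inside the guest.
refresh_mount() {
    path=$1
    if path_is_healthy "$path"; then
        echo "ok: $path"
        return 0
    fi
    echo "refreshing: $path"
    umount -l "$path" 2>/dev/null || true   # lazy unmount; ignore "not mounted"
    mount "$path"
}

# Example use inside the guest:
#   refresh_mount /path/to/library
```

Whatever form the helper takes, the services that depend on the path should start after it, not alongside it.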

That was the moment the incident stopped being a backup fix and became an operating procedure.

A backup strategy is not just "can I create a snapshot?" It is "does the service return to a known-good state after the backup method I chose?"

What I monitor now

I added monitoring around the failure shape, not just around the components.

Datastore space matters. Exporter uptime matters. Verify jobs matter. But this incident was about stale backups for currently existing guests. The useful alert is not "PBS exists." The useful alert is "a guest that should be protected has not produced a fresh snapshot inside the expected window."

That check catches the silent version of this failure.

It also avoids a common homelab trap: alerting on old snapshots for retired guests. If a deleted test container still has snapshots aging out under retention, that should not wake anyone up. The alert needs to join backup freshness against the current inventory, or it will train you to ignore it.
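The join itself is small. Here is a sketch under stated assumptions: one file lists the guest IDs that currently exist, one per line, and another lists snapshots as "guest-id epoch-seconds" pairs. Both file formats and the function name are mine. How you produce them is up to you, for example from pvesh get /cluster/resources on the Proxmox side and a snapshot listing on the PBS side.

```shell
# Prints every guest ID from the inventory whose newest snapshot is older
# than max_age seconds, or which has no snapshot at all.
# Snapshots for guests missing from the inventory are ignored.
#   $1 inventory file   (one guest ID per line)
#   $2 snapshots file   ("guest-id epoch-seconds" per line)
#   $3 current time     (epoch seconds)
#   $4 max age          (seconds)
stale_guests() {
    inventory=$1 snapshots=$2 now=$3 max_age=$4
    awk -v now="$now" -v max_age="$max_age" '
        # First file: remember the newest snapshot time per guest.
        FILENAME == ARGV[1] {
            if (NF >= 2 && (!($1 in latest) || $2 + 0 > latest[$1]))
                latest[$1] = $2 + 0
            next
        }
        # Second file: report current guests that are missing or stale.
        NF && (!($1 in latest) || now - latest[$1] > max_age) { print $1 }
    ' "$snapshots" "$inventory"
}

# Example use, alerting on anything older than 26 hours:
#   stale_guests inventory.txt snapshots.txt "$(date +%s)" 93600
```

Because the inventory drives the output, a retired guest with aging snapshots never appears, and a brand-new guest with no backup at all does.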

Ignoring backup alerts is how backup systems turn into theatre.

Lessons I am keeping

The first lesson is that backup mode is architecture. It is not a dropdown you set once and forget. If one guest has a different disk topology, it may need a different backup contract.

The second lesson is that "green" needs a definition. A dashboard that says the datastore is healthy is not the same as proof that every important guest got a recent, restorable snapshot.

The third lesson is to write the recovery notes while the incident is still warm. I documented the hung process, the stale lock files, the unlock path, the status daemon restart, and the final reboot. Not because I enjoy incident paperwork, but because future me deserves better than reconstructing a bad morning from shell history.

The last lesson is the one homelabs keep teaching in different costumes: reliability is mostly boring specificity.

One weird container got one weird backup job.

The standard path stayed standard.

The monitoring now watches the thing that failed.

That is not a grand redesign. It is better than that. It is a small repair that fits the actual shape of the system.

And the next time a backup gets stuck after "suspend vm to make snapshot," I will spend less time being surprised and more time being useful.