No-Downtime Proxmox Upgrades: Cluster, Kernel, and Ceph

A Proxmox cluster exists so that no single node is a single point of failure. That same property is what lets you upgrade it without an outage — drain a node, upgrade it, bring it back, repeat. The trick is doing it in an order that never loses quorum and never leaves Ceph trying to heal data that is about to come right back.

Done carelessly, an upgrade is how you turn routine patching into a 2 a.m. recovery. Done in order, VMs never notice.

The Pre-Flight Check

Proxmox ships a checker for major-version upgrades. Run it on every node and resolve everything it flags before touching a package:

Terminal window
# major version upgrade readiness (name tracks the version pair)
pve8to9 --full # run on each node; fix all warnings/failures first

It catches the things that actually break upgrades: a node out of quorum, Ceph not HEALTH_OK, a too-old corosync, insufficient free space, or repositories pointing at the wrong release. Do not start until it is clean cluster-wide.

Confirm the cluster is healthy and has quorum to spare:

Terminal window
pvecm status # Quorate: Yes, and note the expected vote count
ha-manager status # all services started, none in error/recovery
ceph -s # HEALTH_OK before you begin (if running Ceph)
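Those checks can be turned into a gate that refuses to continue when something is off. This is a minimal sketch, not a Proxmox-provided tool: the function names are made up for illustration, and it assumes `pvecm status` prints a `Quorate:` line and `ceph health` prints the health code first.

```shell
# Succeeds if `pvecm status` output on stdin reports the cluster as quorate.
is_quorate() {
    grep -Eq '^[[:space:]]*Quorate:[[:space:]]+Yes'
}

# Succeeds if `ceph health` output on stdin starts with HEALTH_OK.
ceph_is_ok() {
    grep -q '^HEALTH_OK'
}

# Usage on a cluster node (stop at the first failed check):
#   pvecm status | is_quorate || { echo "no quorum, stop" >&2; exit 1; }
#   ceph health  | ceph_is_ok || { echo "Ceph not HEALTH_OK, stop" >&2; exit 1; }
```

Both helpers read from stdin, so they can be tried against saved output before being trusted on a live cluster.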

The Rolling Procedure

Upgrade one node at a time, never more. With a 3+ node cluster you always keep quorum because only one vote is ever missing.
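The arithmetic behind that rule is small enough to sketch. Quorum needs a strict majority of the expected votes, so with one vote per node (the default, and an assumption here) the threshold is floor(n/2) + 1. The helper names are illustrative only.

```shell
# Votes required for quorum in an n-node cluster, assuming one vote per node.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# How many nodes can be offline at once while the remainder stays quorate.
max_nodes_down() {
    echo $(( $1 - $1 / 2 - 1 ))
}

# Usage:
#   quorum_needed 3     # votes needed on a 3-node cluster
#   max_nodes_down 3    # nodes that may be down at the same time
```

For three nodes this gives a threshold of 2 and a tolerance of 1, which is why a second node going down freezes the cluster. A 4-node cluster still tolerates only one missing node, because its threshold rises to 3.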

1. Drain the node

Move every running VM off the node with live migration; this is the no-downtime part. Containers cannot live-migrate and are moved with a short restart instead. With HA configured, you can let HA do it, but explicit is clearer:

Terminal window
# Live-migrate each VM off this node to a target
qm migrate 101 node2 --online
qm migrate 102 node3 --online
# containers cannot live-migrate; restart mode stops, moves, and restarts them
pct migrate 201 node2 --restart
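Typing one migrate command per VM gets tedious on a busy node. The sketch below builds the list from `qm list` and only prints the commands, so the plan can be reviewed before anything moves. It assumes the VMID is the first column and the status the third, as in the default `qm list` table. The function names are hypothetical, and sending every VM to a single target is a simplification.

```shell
# Read `qm list` output on stdin and print the VMIDs whose status is "running".
running_vmids() {
    awk 'NR > 1 && $3 == "running" { print $1 }'
}

# Print (do not run) one live-migration command per running VM.
# $1 is the target node name.
print_drain_plan() {
    target=$1
    running_vmids | while read -r vmid; do
        echo "qm migrate $vmid $target --online"
    done
}

# Usage on the node being drained:
#   qm list | print_drain_plan node2
```

Printing rather than executing is deliberate: it leaves room to spread VMs across several targets by hand before running anything.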

Or drain via HA, which respects HA groups and rules:

Terminal window
ha-manager crm-command node-maintenance enable node1
# HA relocates services off node1; wait until it's empty

Confirm the node is empty before proceeding:

Terminal window
qm list ; pct list # should show nothing running on this node

2. Upgrade the node

Now it carries no workload, so a reboot costs nothing:

Terminal window
apt update
apt dist-upgrade # full upgrade; for a major version, run only after switching the repositories to the new release
# reboot if the kernel or other core packages were updated
reboot

3. Rejoin and verify before moving on

After reboot, confirm it is back in quorum and healthy before draining the next node:

Terminal window
pvecm status # node back, Quorate: Yes
ceph -s # if running Ceph: all PGs active+clean (HEALTH_WARN for the noout flag alone is expected while it is set)
ha-manager crm-command node-maintenance disable node1

Only then drain and upgrade node 2. Patience here is the whole method — two nodes down at once on a 3-node cluster is a lost quorum and a frozen cluster.
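Waiting is easier to enforce when a script does it. The following is a sketch of a generic retry helper plus a quorum probe. Both names are invented for this example, and the probe assumes `pvecm status` prints a `Quorate:` line.

```shell
# Retry a command until it succeeds.
# $1 = maximum attempts, $2 = seconds between attempts, rest = command to run.
wait_until() {
    tries=$1
    pause=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$pause"
        fi
    done
    return 1
}

# True when the local node reports quorum.
node_quorate() {
    pvecm status | grep -Eq '^[[:space:]]*Quorate:[[:space:]]+Yes'
}

# Usage after the reboot: up to 60 checks, 10 seconds apart, then give up.
#   wait_until 60 10 node_quorate || echo "node did not rejoin, do not continue" >&2
```

The helper gives up after a bounded number of attempts, so a node that never rejoins stops the procedure instead of hanging it.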

Ceph: Order Matters

If the cluster runs Ceph, the daemons upgrade in a strict order, and you must stop Ceph from rebalancing while a node is briefly down.

Set noout so OSDs going down for a reboot do not trigger a full data rebalance that you will only have to undo minutes later:

Terminal window
ceph osd set noout
# ... do the node upgrade/reboot ...
ceph osd unset noout # after all nodes done
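Because forgetting the flag is the expensive mistake, it can be worth checking for it just before each reboot. This sketch assumes `ceph osd dump` prints a line of the form `flags <comma-separated list>`. The function name is illustrative.

```shell
# Succeeds if the `flags` line of `ceph osd dump` (on stdin) includes noout.
noout_is_set() {
    grep -Eq '^flags ([^ ]*,)?noout(,|$)'
}

# Usage before rebooting a node that carries OSDs:
#   ceph osd dump | noout_is_set || { echo "noout is not set" >&2; exit 1; }
```

The same check, inverted, is a useful last step once every node is done, to confirm the flag was actually removed.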

The daemon upgrade order is fixed — monitors, then managers, then OSDs, then MDS/gateways:

Terminal window
# 1. Monitors (one at a time, check quorum between each)
ceph mon stat
# 2. Managers
ceph mgr stat
# 3. OSDs: restart per node; with noout set the status stays HEALTH_WARN, so gate on PG state instead
ceph -s # wait for all PGs active+clean between OSD nodes

Restarting OSDs out of order, or without noout, sends Ceph into a rebalance storm that hammers your disks and risks data movement during a window when redundancy is already reduced. One node’s OSDs at a time, noout set, active+clean confirmed between each.
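The "active+clean confirmed" gate can be checked mechanically. This is a sketch that assumes the plain-text form of `ceph pg stat`, i.e. `N pgs: <count> <state>, ...; <usage>`. The function name is hypothetical. It is deliberately strict: PGs that are scrubbing (for example `active+clean+scrubbing`) count as not yet clean.

```shell
# Read `ceph pg stat` output on stdin; succeed only if every PG is exactly
# active+clean. Empty or unparseable input counts as failure.
all_pgs_clean() {
    awk '
        NR == 1 {
            total = $1                      # leading PG count
            sub(/^[0-9]+ pgs: /, "")        # drop the "N pgs: " prefix
            sub(/;.*/, "")                  # drop the usage summary after ";"
            clean = 0
            n = split($0, states, ", ")
            for (i = 1; i <= n; i++) {
                split(states[i], kv, " ")   # kv[1] = count, kv[2] = state name
                if (kv[2] == "active+clean") clean = kv[1]
            }
            ok = (total > 0 && clean == total)
        }
        END { exit !ok }
    '
}

# Usage between OSD nodes:
#   ceph pg stat | all_pgs_clean || echo "PGs still recovering, wait" >&2
```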

Corosync: Do Not Break the Heartbeat

Quorum rides on corosync. Two rules during upgrades:

  • Never upgrade corosync on multiple nodes simultaneously — a version mismatch mid-cluster can drop membership. The pre-flight check flags incompatible jumps.
  • Keep the corosync network healthy. If corosync shares a link with VM/migration traffic, a migration storm during the upgrade can starve the heartbeat and cause a spurious fence. A dedicated (or QoS-protected) corosync link is the standing recommendation, and it matters most during the upgrade churn.

Check membership and the corosync log while nodes cycle:

Terminal window
corosync-quorumtool -s # members, quorum, and that all nodes agree
journalctl -u corosync -n 50 # watch for retransmits/membership flaps

When a Migration Refuses to Move

The drain step assumes every VM live-migrates cleanly. Some will not, and finding out mid-upgrade with the node half-drained is the worst time. The usual blockers:

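  • Local resources. A passed-through PCI device or a CD-ROM mounted from a local ISO blocks live migration outright. A disk on local storage can still move live with qm migrate --with-local-disks, but the whole disk is copied over the network, so the drain takes far longer. Without that flag, qm migrate fails fast and tells you which resource is in the way.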
  • CPU type mismatch. A VM pinned to host CPU type may refuse to land on a node with a different microarchitecture. Use a named model (e.g. x86-64-v2-AES) on mixed hardware so VMs migrate across the whole cluster.

Check before you drain, not during:

Terminal window
# Ask the precondition API what would block VM 101 from leaving node1 (moves nothing)
pvesh get /nodes/node1/qemu/101/migrate
# returns local_disks, local_resources, allowed_nodes, not_allowed_nodes
# Inspect what a VM holds that might pin it
qm config 101 | grep -E 'hostpci|^ide|^sata|machine|cpu'
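The grep above shows raw config lines; a small filter can label them by the kind of blocker. This is a sketch under the assumption that `qm config` prints one `key: value` pair per line. The function name and labels are invented here. It does not detect disks on local storage, which the precondition API reports more reliably.

```shell
# Read `qm config <vmid>` output on stdin and print one labelled line per
# setting that commonly pins a VM to its node.
migration_pins() {
    awk '
        /^hostpci[0-9]+:/ { print "pci-passthrough: " $0 }
        /^(ide|sata|scsi)[0-9]+:.*media=cdrom/ && $2 !~ /^none/ {
            print "cdrom-attached: " $0     # only a problem if the ISO is on local storage
        }
        /^cpu: *(cputype=)?host/ { print "cpu-host: " $0 }
    '
}

# Usage:
#   qm config 101 | migration_pins
```

No output means none of these three patterns matched, not that the VM is guaranteed to migrate.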

For a VM that genuinely cannot live-migrate (a GPU passthrough host, say), the honest options are: accept a brief offline migration in a planned window, or upgrade that node last and shut the VM down for its reboot only. Do not let one un-migratable VM tempt you into rebooting a node that still hosts it live.

Terminal window
# Offline migration for a VM that can't move online — it is briefly down
qm migrate 105 node2 # no --online: stop, move, start on target

A Failure Mid-Upgrade: Recovering Quorum

Suppose node1 is drained and rebooting when node2 unexpectedly hard-fails. On a 3-node cluster you now have one node up out of three: quorum is lost, and the surviving node freezes the cluster filesystem (/etc/pve goes read-only) to protect against split-brain. VMs already running keep running, but you cannot start, stop, or migrate anything. The exception is a node with active HA services: its watchdog fences (reboots) it once quorum is gone, taking its guests down too.

Terminal window
pvecm status
# Expected: Quorate: No, with Total votes below the quorum threshold
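To see how far from quorum the survivors are, the vote counts can be pulled out of the same output. This sketch assumes `pvecm status` prints `Total votes:` and `Quorum:` lines in its votequorum section. The function name is made up for illustration.

```shell
# Read `pvecm status` on stdin and print how many votes are missing for
# quorum (0 means the vote count is sufficient). Fails if fields are absent.
votes_missing() {
    awk -F': *' '
        /^Total votes:/ { total = $2 + 0 }
        /^Quorum:/      { need  = $2 + 0 }
        END {
            if (need == 0) exit 1           # fields not found: refuse to guess
            short = need - total
            print (short > 0 ? short : 0)
        }
    '
}

# Usage:
#   pvecm status | votes_missing
```

A result of 1 on a 3-node cluster means bringing back a single node is enough, which is the recovery path to prefer.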

The correct fix is to get a node back, not to fight the safety mechanism. Bringing node1 back from its reboot restores 2/3 votes and quorum returns on its own. Only if a node is genuinely gone for the duration do you lower the expected votes — and this is a deliberate, documented step, not a reflex:

Terminal window
# ONLY when a node is confirmed down for an extended period and you must
# operate the survivors. Temporarily reduce expected votes.
pvecm expected 1
# Restore the real value once the cluster is whole again.

Lowering expected votes while a node is merely rebooting is how you create the split-brain that quorum was preventing. The discipline is the same as the upgrade itself: one node out at a time, and if a second leaves unexpectedly, your job is to restore it, not to convince corosync the missing nodes do not count.

The Order, Condensed

  1. pveXtoY --full on every node — fix everything.
  2. ceph osd set noout (if Ceph).
  3. Per node, one at a time: drain (live-migrate) → apt dist-upgrade → reboot → verify quorum + all PGs active+clean → next.
  4. Ceph daemons in order: mon → mgr → osd → mds, active+clean between OSD nodes.
  5. ceph osd unset noout.
  6. Final check: pvecm status, ceph -s, ha-manager status all green.

The discipline is boring and that is the point. The clusters that go down during upgrades are the ones where someone got impatient and rebooted two nodes “to save time.” On a quorum-based system, saving time by parallelizing the one thing you must serialize is how you turn a clean rolling upgrade into a frozen cluster and a recovery you did not plan for.