High Availability: HA Groups, Fencing Mindset, and Failure Testing

High availability sounds like a feature you enable. Click “HA,” and VMs automatically restart when a node fails. Magic.

It’s not magic. It’s fencing, quorum, shared storage, and very specific failure handling. Get any of these wrong and HA either doesn’t work, or worse — causes split-brain where VMs run on multiple nodes simultaneously, corrupting data.

HA without testing is just a checkbox. A checkbox that might destroy your data when you actually need it.

HA Prerequisites

Before enabling HA, you need:

1. Quorum

# Check cluster status
pvecm status
# Need quorum for HA decisions
# 2 nodes = no node can fail without losing quorum
# 3 nodes = 1 node can fail

Two-node clusters need a QDevice for HA to work reliably.
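
The majority math behind these numbers is worth internalizing. A small sketch (hypothetical helper functions, not PVE tools) that applies corosync's majority rule:

```shell
#!/bin/sh
# Majority rule used by corosync: quorum = floor(total_votes / 2) + 1
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}
# How many nodes can fail while the rest stay quorate
can_lose() {
    echo $(( $1 - ($1 / 2 + 1) ))
}
for nodes in 2 3 5; do
    echo "$nodes nodes: need $(quorum_needed "$nodes") votes, can lose $(can_lose "$nodes")"
done
# Prints:
# 2 nodes: need 2 votes, can lose 0
# 3 nodes: need 2 votes, can lose 1
# 5 nodes: need 3 votes, can lose 2
```

This is why two-node clusters are a trap: losing either node drops you below the 2-vote majority.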

2. Shared Storage

HA VMs must be on storage accessible from all nodes:

# Check shared storage
pvesm status
# Valid for HA:
# - Ceph (RBD)
# - NFS
# - iSCSI
# - GlusterFS
# NOT valid:
# - local
# - local-lvm
# - local-zfs (ZFS replication is a partial workaround, not shared storage)

If storage isn’t shared, VM can’t start on another node.

3. Fencing Capability

Fencing ensures a failed node is truly dead before starting VMs elsewhere. Without fencing, you risk:

Node 1: Appears dead (network issue)
Node 2: Starts VM copy
Node 1: Actually alive, VM still running
Result: Two VMs, same disk, corruption

Fencing (The Critical Part)

What Fencing Does

Fencing forces a failed node to stop before HA restarts VMs:

  1. Node detected as failed
  2. HA manager tries to fence (kill) the node
  3. Only after successful fence → start VMs on other node
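
The ordering is the whole point: a VM is only restarted after the fence is confirmed. A simplified sketch of that rule, with simulated stand-ins (fence_node here pretends fencing only works for pve2; these are not real PVE calls):

```shell
#!/bin/sh
# Simulated fence: pretend only pve2 has working fencing
fence_node() {
    [ "$1" = "pve2" ]
}
# Fence-before-start rule: never start the VM unless the fence succeeded
recover_vm() {
    node=$1; vmid=$2
    if fence_node "$node"; then
        echo "VM $vmid: fence of $node confirmed, starting on survivor"
    else
        echo "VM $vmid: fence of $node FAILED, refusing to start (split-brain risk)"
    fi
}
recover_vm pve2 100   # fence works -> VM restarts elsewhere
recover_vm pve9 101   # fence fails -> VM stays down, data stays safe
```

Note the failure mode: if fencing can't be confirmed, the safe answer is to do nothing. A VM that stays down is recoverable; a disk written by two VMs is not.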

Fencing Methods

Hardware fencing (recommended):

  • IPMI/iLO/DRAC power off
  • PDU power cut
  • SBD (Storage-Based Death)

Software fencing:

  • Watchdog timer (self-fence)
  • SSH fence (tell node to shutdown)

Configuring Watchdog Fencing

Most common in homelab. Node kills itself if it loses quorum:

# Enable hardware watchdog
echo "softdog" >> /etc/modules
# Load module
modprobe softdog
# Verify
ls /dev/watchdog

Proxmox HA uses watchdog automatically. If node loses quorum and can’t reach cluster, watchdog triggers reboot.

IPMI Fencing (Production)

For reliable fencing, use IPMI:

# Install fence agents
apt install fence-agents
# Test IPMI fencing manually
ipmitool -H 10.0.0.200 -U admin -P password power status
ipmitool -H 10.0.0.200 -U admin -P password power off

Configure in /etc/pve/ha/fence.cfg:

# Fence configuration
# Not directly supported in PVE GUI, but can use with custom scripts

Storage-Based Fencing (SBD)

Nodes write heartbeats to shared storage. Missing heartbeat = fence:

# Create SBD device on shared storage
sbd -d /dev/sdb create
# Or create with explicit timeouts:
# -1 = watchdog timeout (s), -4 = msgwait (s, should be at least 2x watchdog)
sbd -d /dev/sdb -1 60 -4 120 create

Enabling HA for VMs

Add VM to HA

# Enable HA for VM 100
ha-manager add vm:100
# With specific group
ha-manager add vm:100 --group production
# Check HA status
ha-manager status

Via Web UI: Datacenter → HA → Add → Select VM

HA States

State      Meaning
started    HA will ensure VM is running
stopped    HA will ensure VM is stopped
disabled   HA ignores this VM
ignored    Temporarily ignore (e.g. during manual migration)

HA Groups

Groups define which nodes can run HA VMs:

# Create group preferring pve1 and pve2
ha-manager groupadd production --nodes pve1,pve2
# Add VM to group
ha-manager set vm:100 --group production
# Or create with node priorities (higher number = preferred)
ha-manager groupadd production --nodes pve1:3,pve2:2,pve3:1

With these priorities, VMs prefer pve1, fail over to pve2, and use pve3 only as a last resort.

Restricted Groups

Only allow VMs on specific nodes:

# Create restricted group
ha-manager groupadd gpu-nodes --nodes pve2,pve3 --restricted
# VMs in this group can ONLY run on pve2 or pve3
ha-manager set vm:200 --group gpu-nodes

Useful for VMs needing specific hardware (GPU, special storage).

HA Manager Behavior

Node Failure Sequence

1. Node stops responding to cluster heartbeats
2. Other nodes detect failure (after timeout)
3. Quorum check: Do remaining nodes have majority?
4. If quorate:
a. Attempt to fence failed node
b. Wait for fence confirmation
c. Start VMs on surviving nodes
5. If not quorate:
a. Cluster freezes
b. No HA actions (prevents split-brain)

Failover Timing

Detection timeout: 30 seconds (default)
Fence attempt: Variable (IPMI: seconds, watchdog: 60s)
VM startup: 10-60 seconds
Total failover time: 1-3 minutes typical

For faster failover, you can tune detection timeouts, but beware false positives: an aggressive timeout fences healthy nodes during brief network hiccups.
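
Those components add up to a rough worst-case budget (numbers are the defaults cited above; adjust for your environment):

```shell
#!/bin/sh
# Back-of-envelope failover budget using the defaults cited above
DETECT=30     # cluster detection timeout (s)
FENCE=60      # watchdog self-fence window (s); IPMI is usually seconds
VM_START=45   # midpoint of the 10-60 s VM startup range
total=$(( DETECT + FENCE + VM_START ))
printf 'Estimated failover: %ss (%sm %ss)\n' "$total" $(( total / 60 )) $(( total % 60 ))
# Prints: Estimated failover: 135s (2m 15s)
```

If your application can't tolerate a couple of minutes of downtime, HA restart is the wrong tool; you need application-level clustering on top.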

Resource Migration

When node comes back online, VMs don’t automatically migrate back:

# VMs stay on the failover node until:
# 1. Manual migration
# 2. Next failure
# 3. Maintenance mode + recovery
# To migrate back manually (via the HA stack)
ha-manager migrate vm:100 pve1

This is intentional. Automatic “failback” risks unnecessary disruption.

Maintenance Mode

Before working on a node, use maintenance mode:

# Request maintenance (HA migrates VMs away)
ha-manager crm-command node-maintenance enable pve1
# Check status
ha-manager status
# Wait for migrations to complete
# Do maintenance work
# Disable maintenance
ha-manager crm-command node-maintenance disable pve1

This gracefully moves VMs, unlike a failure which is disruptive.
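
To know when the drain is actually done, count the HA resources still reported on the node. A hedged helper: it parses the `service vm:ID (node, state)` lines that `ha-manager status` prints (verify the exact format against your PVE version); the sample input here is made up for illustration:

```shell
#!/bin/sh
# Count HA resources that ha-manager status still reports on a node
resources_on_node() {
    grep -c "($1," || true
}
# Made-up sample of ha-manager status output
sample='quorum OK
master pve1 (active, Mon Jan  8 10:00:00 2025)
service vm:100 (pve1, started)
service vm:101 (pve2, started)'
echo "$sample" | resources_on_node pve1
# Prints: 1  (vm:100; the master line has "(active," so it does not match)
```

On a live node you would pipe the real command instead: `ha-manager status | resources_on_node pve1`, looping until it prints 0.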

Manual VM Migration

For HA VMs, use:

# Request HA to migrate
ha-manager migrate vm:100 pve2
# Or set VM to ignored temporarily
ha-manager set vm:100 --state ignored
qm migrate 100 pve2 --online
ha-manager set vm:100 --state started

Don’t just qm migrate an HA VM — HA manager might fight you.

Testing HA (Critical)

Test 1: Simulated Node Failure

# On node to "fail"
systemctl stop pve-cluster corosync
# Watch from another node
ha-manager status
# VMs should migrate to other nodes
# After 1-2 minutes, check VMs are running elsewhere
# Restore node
systemctl start corosync pve-cluster

Test 2: Hard Power Off

Warning: These commands immediately crash the node without graceful shutdown.

# Physical power button or:
echo b > /proc/sysrq-trigger # Immediate reboot (no sync)
# Or IPMI (preferred for remote testing):
ipmitool -H 10.0.0.200 -U admin -P password chassis power off
# This tests actual fencing behavior

Test 3: Network Partition

# On node, drop cluster traffic
iptables -A INPUT -p udp --dport 5405:5412 -j DROP
iptables -A OUTPUT -p udp --dport 5405:5412 -j DROP
# Node should fence itself (watchdog) or be fenced (IPMI)
# VMs should migrate
# Restore
iptables -F

Test 4: Storage Failure

# If using NFS, unmount it
umount -l /mnt/nfs-storage
# HA behavior depends on configuration
# VMs using that storage should fail
# Other VMs should continue
# Document what happens!

Document Test Results

HA Test Report - 2025-01-08
Test: Node power off (pve2)
Method: IPMI power off
Expected: VMs 100, 101 migrate to pve1 or pve3
Timeline:
- 00:00 Power off pve2
- 00:32 Cluster detects failure
- 00:45 Fence confirmed
- 01:15 VM 100 started on pve1
- 01:28 VM 101 started on pve3
Total failover: 1 minute 28 seconds
Result: PASS
Issues: None
Tested by: Admin
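
A timeline like the one above is easy to capture with a tiny logger (a homegrown sketch, not a PVE feature):

```shell
#!/bin/sh
# Elapsed-seconds event logger for failover tests
T0=$(date +%s)
mark() {
    echo "$(( $(date +%s) - T0 ))s  $*"
}
# Usage during a test:
mark "power off pve2"
# ... watch ha-manager status, then:
mark "VM 100 started on pve1"
```

Run it in a spare terminal and call `mark` at each observed event; the output drops straight into the report.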

Common HA Problems

"No quorum" — Nothing Happens

# Check quorum
pvecm status | grep Quorate
# If "Quorate: No", cluster can't make decisions
# Need majority of nodes online

Fix: Add more nodes, add QDevice, or manually set expected votes (dangerous).

VMs Won’t Start After Failover

# Check HA manager logs
journalctl -u pve-ha-lrm -f
# Common causes:
# - Shared storage not available
# - Resource constraints (RAM, CPU)
# - Start dependencies

Split-Brain Detected

If somehow VMs ran on multiple nodes:

# IMMEDIATELY stop VMs on one node
qm stop 100 --skiplock
# Check for disk corruption
# Restore from backup if needed

This is catastrophic. Prevent with proper fencing.
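
Detection can be scripted: gather `qm list` output from every node (e.g. over SSH) into `node vmid status` lines, then flag any VMID running twice. The input format here is an assumption for illustration:

```shell
#!/bin/sh
# Flag any VMID reported "running" by more than one node
find_duplicates() {
    awk '$3 == "running" { seen[$2]++ } END { for (id in seen) if (seen[id] > 1) print id }'
}
# Made-up sample: VM 100 running on two nodes at once
printf 'pve1 100 running\npve2 100 running\npve2 101 running\n' | find_duplicates
# Prints: 100
```

Running a check like this from cron after every failover test catches the disaster while it's still fixable.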

HA Service Stuck

# Restart HA services
systemctl restart pve-ha-crm
systemctl restart pve-ha-lrm
# Check status
ha-manager status

HA Architecture

Minimum Viable HA

3 nodes minimum (for quorum)
Shared storage (NFS, Ceph, iSCSI)
Fencing (watchdog at minimum)

Production HA

3+ nodes
Redundant network (bonding)
Dedicated cluster network
Ceph or enterprise SAN
Hardware fencing (IPMI)
UPS with monitoring

HA Network Topology

┌───────────────────────────────────┐
│ Cluster Network │
│ (Corosync, fencing, HA) │
└───────────┬───────────┬───────────┘
│ │
┌───────────┴───┐ ┌───┴───────────┐
│ pve1 │ │ pve2 │
│ (node 1) │ │ (node 2) │
└───────┬───────┘ └───────┬───────┘
│ │
┌───────┴───────────────────┴───────┐
│ Storage Network │
│ (Ceph, iSCSI, NFS) │
└───────────────────────────────────┘

Separating the cluster and storage networks prevents storage load or failures from disrupting corosync, and therefore from triggering spurious HA decisions.

The Lesson

HA without tests is just a checkbox.

Enabling HA takes 30 seconds. Testing it takes hours. But that testing is what determines whether HA works when you need it.

The checkbox says “HA enabled.” The test proves:

  • Fencing actually works
  • VMs actually migrate
  • Storage is actually shared
  • Recovery time meets requirements

Every HA setup has edge cases. The node that takes 5 minutes to fence. The VM that won’t start because of resource constraints. The storage path that fails under load.

You find these in testing, or you find them in production. Testing is cheaper.

Schedule regular HA tests. Document what happens. Fix what’s broken. That’s how you turn a checkbox into actual high availability.