Snapshots vs Backups vs Replication: What Saved Me and What Didn't

I’ve lost data three times in production. Each time taught me something different about what “protected” actually means.

First time: the snapshot lived on the same disk that failed, so it died with the disk. Second time: a backup existed, but the retention policy had pruned the version I needed. Third time: replication was running, but it replicated the corruption.

Snapshots, backups, and replication are different tools solving different problems. Using the wrong one for your failure scenario means learning the hard way.

The Three Protection Layers

| Feature | Snapshot | Backup | Replication |
| --- | --- | --- | --- |
| Location | Same storage | Different storage | Different node |
| Speed | Instant | Minutes to hours | Continuous |
| Protects against | Human error | Hardware failure | Site failure |
| Point-in-time | Yes | Yes | Near-real-time |
| Survives disk failure | No | Yes | Depends |
| Survives site failure | No | If off-site | If different site |

Each layer protects against different failures. You need all three.

Snapshots

A snapshot captures VM state at a point in time — disk and optionally RAM.

How Snapshots Work

ZFS/LVM snapshots are copy-on-write:

Before snapshot:
Disk blocks: [A][B][C][D][E]
After snapshot:
Current: [A][B][C][D][E]
Snapshot: → points to same blocks
After modification (block C changed):
Current: [A][B][C'][D][E]
Snapshot: [A][B][C][D][E] (old C preserved)

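Snapshots are instant because nothing is copied when they are created. A snapshot only starts to consume space as blocks change, because the original version of each changed block has to be kept.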
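A toy model makes the block diagram concrete. This is a sketch in Python, not how ZFS or LVM are implemented; `Volume` and its methods are made-up names. It writes new data to a fresh block the way ZFS does, whereas classic LVM snapshots copy the old block aside first. The visible result is the same either way.

```python
class Volume:
    """Toy copy-on-write volume: logical blocks point at stored data by id."""

    def __init__(self, data):
        self.store = {}       # block id -> data physically held on disk
        self.next_id = 0
        self.current = [self._alloc(d) for d in data]
        self.snapshots = {}   # snapshot name -> list of block ids

    def _alloc(self, data):
        block_id = self.next_id
        self.store[block_id] = data
        self.next_id += 1
        return block_id

    def snapshot(self, name):
        # Only the block references are recorded; no data is copied.
        self.snapshots[name] = list(self.current)

    def write(self, index, data):
        # New data lands in a fresh block; the old block stays for the snapshot.
        self.current[index] = self._alloc(data)

    def read(self, name=None):
        ids = self.current if name is None else self.snapshots[name]
        return [self.store[i] for i in ids]


vol = Volume(["A", "B", "C", "D", "E"])
vol.snapshot("before-upgrade")
blocks_after_snapshot = len(vol.store)   # still 5: the snapshot cost nothing

vol.write(2, "C'")
print(vol.read())                   # current view, with C'
print(vol.read("before-upgrade"))   # snapshot view, with the old C
print(len(vol.store))               # 6 blocks stored: the original 5 plus C'
```

Note that everything lives in one `store`. Lose that, and the current view and every snapshot go together.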

Creating Snapshots

# VM snapshot (disk + RAM state; --vmstate 1 is what includes the RAM)
qm snapshot 100 before-upgrade --vmstate 1 --description "Before kernel upgrade"
# List snapshots
qm listsnapshot 100
# Rollback
qm rollback 100 before-upgrade
# Delete snapshot
qm delsnapshot 100 before-upgrade

What Snapshots Are Good For

  • Before risky changes: Upgrade, config change, experimental work
  • Quick rollback: “Oops, that broke it” → 30-second recovery
  • Testing: Try something, snapshot, try variations, rollback

What Snapshots Don’t Protect Against

Failure scenario: Disk dies
Snapshots: Also dead (same disk)
Result: Total loss

Failure scenario: Storage controller fails
Snapshots: Also dead (same storage)
Result: Total loss

Failure scenario: Ransomware encrypts VM
Snapshots: Might survive if attacker doesn't find them
Result: Maybe recoverable, maybe not

Failure scenario: Accidental snapshot deletion
Snapshots: Gone
Result: No protection

Rule: Snapshots are convenience, not protection.

Backups

Backups copy data to separate storage.

Backup to PBS

# Full backup to PBS
vzdump 100 --storage pbs-store --mode snapshot --compress zstd
# Later runs are incremental: only data changed since the last backup is uploaded
# PBS does this automatically, no extra flag needed

What Backups Protect Against

Failure scenario: Primary storage dies
Backups on PBS: Safe
Result: Restore from backup

Failure scenario: Host fails completely
Backups on PBS: Safe (different hardware)
Result: Restore to new host

Failure scenario: Accidental VM deletion
Backups: Safe (separate system)
Result: Restore deleted VM

Failure scenario: Ransomware encrypts VM
Backups: Safe if not mounted/accessible to VM
Result: Restore clean version

Backup Limitations

Failure scenario: Backup storage also fails
Result: Both copies lost

Failure scenario: Retention pruned the backup you need
Result: Can't restore that point in time

Failure scenario: Site-wide disaster (fire, flood)
On-site backups: Also destroyed
Result: Total loss without off-site copy
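The retention failure is easy to reproduce on paper. Here's a rough sketch, assuming one backup per day and a simplified keep-the-newest-N-days rule; `prune_keep_daily` is a hypothetical helper, not the PBS prune implementation, and the dates are made up.

```python
from datetime import date, timedelta


def prune_keep_daily(backup_days, keep_daily):
    """Return the backup dates that survive a keep-the-newest-N-days policy."""
    return sorted(set(backup_days), reverse=True)[:keep_daily]


today = date(2025, 1, 8)
backups = [today - timedelta(days=n) for n in range(30)]   # one per day, 30 days

kept = prune_keep_daily(backups, keep_daily=7)
needed = today - timedelta(days=10)   # the version you turn out to need

print(min(kept))        # oldest restore point still available
print(needed in kept)   # False: that day was pruned
```

The backup job ran fine every night. The policy, not the hardware, decided the version you needed was disposable.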

RPO: Recovery Point Objective

How much data can you lose?

Daily backups at 1 AM:
- Failure at 11 PM = 22 hours of data loss
- RPO = 24 hours

Hourly backups:
- Maximum 1 hour of data loss
- RPO = 1 hour

Real-time replication:
- Seconds of data loss
- RPO ≈ 0

Match backup frequency to acceptable data loss.
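The arithmetic behind those numbers, as a small sketch. The dates are invented for illustration, and it ignores how long the backup itself takes to run.

```python
from datetime import datetime, timedelta


def data_loss(last_backup, failure):
    """Everything written between the last good backup and the failure is lost."""
    return failure - last_backup


def worst_case_loss(backup_interval):
    """Worst case: the failure hits just before the next backup would have run."""
    return backup_interval


last_backup = datetime(2025, 1, 8, 1, 0)   # daily backup at 1 AM
failure = datetime(2025, 1, 8, 23, 0)      # failure at 11 PM the same day

print(data_loss(last_backup, failure))         # 22:00:00
print(worst_case_loss(timedelta(hours=24)))    # 1 day, 0:00:00
```

The RPO you can promise is the worst case, not the loss you happened to get on a given day.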

RTO: Recovery Time Objective

How fast must you recover?

Full VM restore from PBS:
- 100GB VM ≈ 10-30 minutes
- RTO ≈ 30 minutes

Restore from off-site:
- Download time + restore time
- RTO = hours

Rebuild from scratch + restore data:
- RTO = hours to days

Match recovery method to acceptable downtime.
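A back-of-the-envelope estimate shows where those restore times come from. The throughput figures below are assumptions picked to bracket the 10-30 minute range, not measurements; plug in what your own storage and uplink actually sustain.

```python
def transfer_minutes(size_gb, throughput_mb_s):
    """Minutes to move size_gb gigabytes at a sustained throughput in MB/s."""
    return size_gb * 1000 / throughput_mb_s / 60


# Local PBS restore of a 100 GB VM at two assumed sustained rates
local_fast = transfer_minutes(100, 170)   # roughly 10 minutes
local_slow = transfer_minutes(100, 60)    # roughly 28 minutes

# Off-site: download over an assumed 100 Mbit/s link (12.5 MB/s), then restore
offsite = transfer_minutes(100, 12.5) + local_slow

print(round(local_fast), round(local_slow), round(offsite))
```

This only covers moving the bytes. Noticing the failure, deciding to restore, and verifying the VM afterwards all count toward RTO too.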

Replication

Replication continuously copies data to another location.

Proxmox Replication

Built-in ZFS replication between cluster nodes:

# Create replication job
pvesr create-local-job 100-0 pve2 --schedule '*/15' # Every 15 min
# Check replication status
pvesr status
# List jobs
pvesr list

How Replication Works

Node 1 (primary) Node 2 (replica)
┌──────────────┐ ┌──────────────┐
│ VM 100 │ │ VM 100 │
│ (active) │──────────►│ (standby) │
│ │ ZFS send │ │
└──────────────┘ └──────────────┘
Every 15 minutes: incremental sync

What Replication Protects Against

Failure scenario: Node 1 hardware failure
Replica on Node 2: Ready to start
Result: Activate replica, minimal downtime

Failure scenario: Storage failure on Node 1
Replica on Node 2: Has recent copy
Result: Start replica (up to 15 minutes of data loss)

What Replication Does NOT Protect Against

Failure scenario: VM data corruption (application bug)
Replication: Replicates the corruption to Node 2
Result: Both copies corrupted

Failure scenario: Ransomware encrypts VM
Replication: Replicates encrypted data
Result: Both copies encrypted

Failure scenario: Accidental VM deletion
Replication: Deletion replicates
Result: Both copies deleted

Failure scenario: Cluster-wide issue
Replication: Both nodes affected
Result: No protection

Rule: Replication protects against hardware failure, not data corruption.
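The difference comes down to history: a replica holds only the latest state, while backups hold a series of past states. A minimal sketch of that idea, with made-up names (`primary`, `replica`, `backups`) standing in for the real systems:

```python
primary = {"orders": "ok"}
replica = {}
backups = []   # backups keep history; the replica keeps only "now"


def replicate():
    # A sync makes the replica match the primary, whatever state that is.
    replica.clear()
    replica.update(primary)


def backup(label):
    # A backup stores an independent copy of the state at that moment.
    backups.append((label, dict(primary)))


backup("monday")
replicate()

primary["orders"] = "corrupted"   # application bug
replicate()                       # the next sync faithfully copies it
backup("tuesday")

clean = [label for label, state in backups if state["orders"] == "ok"]
print(replica["orders"])   # corrupted
print(clean)               # ['monday']
```

Replication did exactly what it was asked to do. The only clean copy left is the one that was frozen before the bug hit.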

The Three-Layer Strategy

For critical VMs, use all three:

Layer 1: Snapshots
- Before changes
- Quick rollback
- Same-disk convenience

Layer 2: Backups (PBS)
- Daily/hourly
- Different storage
- Historical retention

Layer 3: Replication
- Near-real-time
- Different node
- Fast failover

Example Configuration

# VM 100: Critical web application
# Layer 1: Manual snapshots before changes
qm snapshot 100 pre-upgrade
# Layer 2: Hourly backups to PBS, 30-day retention
# Backup job: hourly to pbs-store
# Retention: keep-hourly=24,keep-daily=30
# Layer 3: 15-minute replication to second node
pvesr create-local-job 100-0 pve2 --schedule '*/15'

Recovery scenarios:

| Scenario | Recovery Method | Data Loss |
| --- | --- | --- |
| Bad config change | Rollback snapshot | 0 |
| Host hardware failure | Start replica | Up to 15 min |
| Storage failure | Restore from PBS | Up to 1 hour |
| Data corruption | Restore from PBS (earlier point) | Variable |
| Site disaster | Restore from off-site PBS | Up to 24 hours |

Real Failure Scenarios

Scenario 1: Disk Failure

Situation: ZFS pool loses a disk in mirror
Snapshots: Still available (pool degraded but working)
Replication: Working
Backups: Working
Action: Replace disk, resilver, no VM downtime

Scenario 2: Complete Storage Loss

Situation: Storage controller failure, pool unimportable
Snapshots: Lost
Replication: Available on other node
Action: Start replica, up to 15 minutes of data loss

Scenario 3: Database Corruption

Situation: App bug corrupts database on Tuesday
Discovered: Thursday
Replication: Has corrupted data
Recent backups: Have corrupted data
Older backup from Monday: Clean
Action: Restore Monday backup, replay transaction logs if possible
Lesson: Longer backup retention matters
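To put a number on "longer", here is a sketch of the same timeline. It assumes nightly backups at 11 PM and invents the exact dates and times; `latest_clean_backup` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def latest_clean_backup(backup_times, corrupted_at):
    """Newest backup taken before the corruption happened, or None."""
    clean = [t for t in backup_times if t < corrupted_at]
    return max(clean) if clean else None


# Nightly backups on Monday, Tuesday and Wednesday (6-8 January 2025)
backups = [datetime(2025, 1, day, 23, 0) for day in (6, 7, 8)]
corrupted_at = datetime(2025, 1, 7, 14, 0)    # Tuesday afternoon
discovered_at = datetime(2025, 1, 9, 10, 0)   # Thursday morning

restore_point = latest_clean_backup(backups, corrupted_at)
age_at_discovery = discovered_at - restore_point

print(restore_point)      # Monday night's backup
print(age_at_discovery)   # how old the clean backup is when you finally need it
```

Retention has to outlast the time it takes you to notice a problem, not just the time between backups. If corruption can go unseen for a week, a three-day retention window leaves nothing clean to restore.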

Scenario 4: Ransomware

Situation: Ransomware encrypts VM on Friday night
Replication: Encrypted copy on second node
Snapshots: Might be encrypted (if attacker accessed)
PBS backups: Clean (PBS not mounted inside VM)
Action: Restore from PBS backup before infection
Lesson: Backups that are isolated from the VM survive ransomware

Calculating Your Strategy

For Each VM, Answer:

  1. RPO: How much data loss is acceptable?

    • Minutes → Replication + frequent backups
    • Hours → Hourly backups
    • Days → Daily backups
  2. RTO: How fast must it recover?

    • Minutes → Replication + HA
    • Hours → Local PBS restore
    • Days → Off-site restore okay
  3. Retention: How far back might you need?

    • Days → Short retention
    • Months → Longer retention
    • Compliance → Years (archive separately)
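The first two questions can be written down as a lookup. This is only a sketch: the cut-offs (under an hour counts as "minutes", under a day counts as "hours") are my assumption about where to draw the lines, and the function names are hypothetical.

```python
def protection_for_rpo(rpo_minutes):
    """Pick a protection layer from the acceptable data loss."""
    if rpo_minutes < 60:
        return "replication + frequent backups"
    if rpo_minutes < 24 * 60:
        return "hourly backups"
    return "daily backups"


def recovery_for_rto(rto_minutes):
    """Pick a recovery path from the acceptable downtime."""
    if rto_minutes < 60:
        return "replication + HA"
    if rto_minutes < 24 * 60:
        return "local PBS restore"
    return "off-site restore"


# A VM that can lose 15 minutes of data and be down for 30 minutes
print(protection_for_rpo(15), "/", recovery_for_rto(30))

# A VM that can lose an hour of data and be down for 4 hours
print(protection_for_rpo(60), "/", recovery_for_rto(4 * 60))
```

Answering per VM matters because the two answers are independent: a VM can tolerate a day of downtime and still be unable to lose more than a few minutes of data.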

Example: Different VM Classes

Class A: Critical (database, ERP)
- RPO: 15 minutes
- RTO: 30 minutes
- Retention: 90 days
Strategy: Replication (15 min) + Hourly PBS + Monthly off-site

Class B: Important (web servers, apps)
- RPO: 1 hour
- RTO: 4 hours
- Retention: 30 days
Strategy: Hourly PBS backup

Class C: Development (test VMs)
- RPO: 24 hours
- RTO: Next business day
- Retention: 7 days
Strategy: Daily PBS backup

Class D: Ephemeral (CI runners)
- RPO: N/A (rebuild from config)
- RTO: Minutes (just recreate)
- Retention: None
Strategy: No backup (infrastructure as code)
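Once the classes are written down as data, it is cheap to check that each strategy actually meets its own RPO. A sketch, with a made-up dictionary layout and a hypothetical `rpo_gaps` helper; intervals are in minutes.

```python
CLASSES = {
    "A": {"rpo_min": 15, "intervals_min": {"replication": 15, "pbs": 60}},
    "B": {"rpo_min": 60, "intervals_min": {"pbs": 60}},
    "C": {"rpo_min": 24 * 60, "intervals_min": {"pbs": 24 * 60}},
    "D": {"rpo_min": None, "intervals_min": {}},   # rebuilt from config
}


def rpo_gaps(classes):
    """Return the classes whose tightest protection interval exceeds their RPO."""
    gaps = []
    for name, spec in classes.items():
        if spec["rpo_min"] is None:
            continue   # nothing to protect, nothing to check
        tightest = min(spec["intervals_min"].values(), default=None)
        if tightest is None or tightest > spec["rpo_min"]:
            gaps.append(name)
    return gaps


print(rpo_gaps(CLASSES))   # []

# Drop replication from Class A and hourly PBS alone no longer meets 15 minutes
without_replication = dict(CLASSES, A={"rpo_min": 15, "intervals_min": {"pbs": 60}})
print(rpo_gaps(without_replication))   # ['A']
```

One caveat the check does not capture: replication only meets the RPO for hardware failure. For corruption, the effective RPO is still the backup interval.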

Testing Your Strategy

Monthly Tests

# 1. Snapshot rollback test
qm snapshot 100 test-snap
# Make a change
qm rollback 100 test-snap
# Verify rollback worked
# 2. Backup restore test
qmrestore pbs-store:backup/vm/100/... 900
qm start 900
# Verify VM works
# Stop the test VM first; qm destroy refuses to remove a running VM
qm stop 900
qm destroy 900
# 3. Replication failover test
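# Stop source VM
qm stop 100
# Move it to the replica node; with replication configured, migration reuses the replicated disks
# (pve2 is the target node from the replication job above)
qm migrate 100 pve2
# Start it on pve2 and verify it works
# Fail back to primary by migrating in the other direction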

Document Results

Test Date: 2025-01-08
Tested by: Admin
Snapshot rollback: PASS (30 seconds)
PBS restore (100GB VM): PASS (12 minutes)
Replication failover: PASS (2 minutes)
Issues found: None
Next test: 2025-02-08
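If the results are kept in a structured form, two checks become automatic: did anything fail or take longer than the RTO, and is the next test overdue. A sketch using the numbers from the record above; the dictionary layout and helper names are hypothetical.

```python
from datetime import date

record = {
    "test_date": date(2025, 1, 8),
    "next_test": date(2025, 2, 8),
    # test name -> (status, measured recovery time in seconds)
    "results": {
        "snapshot rollback": ("PASS", 30),
        "pbs restore 100GB": ("PASS", 12 * 60),
        "replication failover": ("PASS", 2 * 60),
    },
}


def is_overdue(record, today):
    """True once the scheduled next test date has passed."""
    return today > record["next_test"]


def problems(record, rto_seconds):
    """Tests that failed or took longer than the recovery time target."""
    return [
        name
        for name, (status, seconds) in record["results"].items()
        if status != "PASS" or seconds > rto_seconds
    ]


print(problems(record, rto_seconds=30 * 60))   # []
print(problems(record, rto_seconds=10 * 60))   # ['pbs restore 100GB']
print(is_overdue(record, date(2025, 3, 1)))    # True
```

A PASS that takes longer than the RTO is still a finding. Recording the measured times is what lets you see that.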

The Lesson

Replication is not a replacement for PBS. It’s a different layer.

Each protection layer handles different failures:

  • Snapshots: Undo mistakes (same disk)
  • Backups: Recover from hardware failure (different storage)
  • Replication: Fast failover (different node)
  • Off-site: Survive site disasters (different location)

The failure you’ll have is the one you didn’t plan for. If you only have replication, you’ll face data corruption. If you only have daily backups, you’ll have the failure at 11 PM. If you only have on-site backups, you’ll have the site disaster.

Layer your protection. Test your recovery. Know exactly what each layer protects against and what it doesn’t.