I’ve lost data three times in production. Each time taught me something different about what “protected” actually means.
First time: snapshot on same disk that failed. Snapshot died with the disk. Second time: backup existed, but retention policy had pruned the version I needed. Third time: replication was running, but it replicated the corruption.
Snapshots, backups, and replication are different tools solving different problems. Using the wrong one for your failure scenario means learning the hard way.
The Three Protection Layers
| Feature | Snapshot | Backup | Replication |
|---|---|---|---|
| Location | Same storage | Different storage | Different node |
| Speed | Instant | Minutes-hours | Continuous |
| Protection | Human error | Hardware failure | Site failure |
| Point-in-time | Yes | Yes | Near-real-time |
| Survives disk failure | No | Yes | Depends |
| Survives site failure | No | If off-site | If different site |
Each layer protects against different failures. You need all three.
Snapshots
A snapshot captures VM state at a point in time — disk and optionally RAM.
How Snapshots Work
ZFS/LVM snapshots are copy-on-write:
```
Before snapshot:
  Disk blocks: [A][B][C][D][E]

After snapshot:
  Current:  [A][B][C][D][E]
  Snapshot: → points to same blocks

After modification (block C changed):
  Current:  [A][B][C'][D][E]
  Snapshot: [A][B][C][D][E]   (old C preserved)
```

Snapshots are instant because nothing is copied initially. Only changed blocks are preserved.
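The preservation of old blocks can be sketched with a toy model. Everything here is illustrative: real ZFS/LVM copy-on-write happens at the block layer, not in shell variables, but the space accounting works the same way.

```shell
#!/usr/bin/env bash
# Toy copy-on-write model (illustrative only). A snapshot records
# nothing up front; an old block is preserved only the first time
# that block is overwritten after the snapshot.

declare -A current=( [A]=a1 [B]=b1 [C]=c1 )   # live blocks
declare -A snapshot=()                        # preserved old blocks
snap_used=0                                   # space consumed by the snapshot

write_block() {   # write_block <block> <new-content>
  local block=$1 new=$2
  if [[ -z ${snapshot[$block]+set} ]]; then
    snapshot[$block]=${current[$block]}       # first overwrite: keep the old copy
    snap_used=$((snap_used + 1))
  fi
  current[$block]=$new
}

write_block C c2   # only now does the snapshot start consuming space
echo "snapshot uses ${snap_used} block(s); preserved C=${snapshot[C]}"
```

Note that until the write, the snapshot costs essentially nothing; its size grows with the churn rate of the VM, which is why long-lived snapshots on busy VMs quietly eat disk.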
Creating Snapshots
```
# VM snapshot (disk + state)
qm snapshot 100 before-upgrade --description "Before kernel upgrade"

# List snapshots
qm listsnapshot 100

# Rollback
qm rollback 100 before-upgrade

# Delete snapshot
qm delsnapshot 100 before-upgrade
```

What Snapshots Are Good For
- Before risky changes: Upgrade, config change, experimental work
- Quick rollback: “Oops, that broke it” → 30-second recovery
- Testing: Try something, snapshot, try variations, rollback
What Snapshots Don’t Protect Against
Failure scenario: Disk dies
Snapshots: Also dead (same disk)
Result: Total loss

Failure scenario: Storage controller fails
Snapshots: Also dead (same storage)
Result: Total loss

Failure scenario: Ransomware encrypts VM
Snapshots: Might survive if attacker doesn't find them
Result: Maybe recoverable, maybe not

Failure scenario: Accidental snapshot deletion
Snapshots: Gone
Result: No protection

Rule: Snapshots are convenience, not protection.
Backups
Backups copy data to separate storage.
Backup to PBS
```
# Full backup to PBS
vzdump 100 --storage pbs-store --mode snapshot --compress zstd

# Incremental (only changed since last backup)
# PBS does this automatically
```

What Backups Protect Against
Failure scenario: Primary storage dies
Backups on PBS: Safe
Result: Restore from backup

Failure scenario: Host fails completely
Backups on PBS: Safe (different hardware)
Result: Restore to new host

Failure scenario: Accidental VM deletion
Backups: Safe (separate system)
Result: Restore deleted VM

Failure scenario: Ransomware encrypts VM
Backups: Safe if not mounted/accessible to VM
Result: Restore clean version

Backup Limitations
Failure scenario: Backup storage also fails
Result: Both copies lost

Failure scenario: Retention pruned the backup you need
Result: Can't restore that point in time

Failure scenario: Site-wide disaster (fire, flood)
On-site backups: Also destroyed
Result: Total loss without off-site copy

RPO: Recovery Point Objective
How much data can you lose?
Daily backups at 1 AM:
- Failure at 11 PM = 22 hours of data loss
- RPO = 24 hours

Hourly backups:
- Maximum 1 hour of data loss
- RPO = 1 hour

Real-time replication:
- Seconds of data loss
- RPO ≈ 0

Match backup frequency to acceptable data loss.
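The worst case is simple clock arithmetic. A quick sketch, using the same numbers as the daily-backup example:

```shell
#!/usr/bin/env bash
# Worst-case data loss for a fixed backup schedule.
# Assumption: the failure lands just before the next backup would run.

backup_hour=1      # daily backup at 1 AM
failure_hour=23    # failure at 11 PM

# Hours of work since the last completed backup
loss_hours=$(( (failure_hour - backup_hour + 24) % 24 ))
echo "worst-case loss: ${loss_hours}h (RPO for a daily schedule: 24h)"
```

The `% 24` handles failures that happen before the day's backup has run; the principle generalizes to any interval.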
RTO: Recovery Time Objective
How fast must you recover?
Full VM restore from PBS:
- 100GB VM ≈ 10-30 minutes
- RTO ≈ 30 minutes

Restore from off-site:
- Download time + restore time
- RTO = hours

Rebuild from scratch + restore data:
- RTO = hours to days

Match recovery method to acceptable downtime.
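A rough RTO estimate is just size divided by restore throughput. The 100 MB/s figure below is an assumption for illustration; measure your own PBS restore speed, since network, datastore disks, and verification overhead all move the number.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope restore time: VM size / restore throughput.
# 100 MB/s sustained is an assumed figure, not a PBS guarantee.

vm_size_gb=100
throughput_mb_s=100

restore_seconds=$(( vm_size_gb * 1024 / throughput_mb_s ))
restore_minutes=$(( restore_seconds / 60 ))
echo "~${restore_minutes} min to restore ${vm_size_gb}GB"
```

At those assumed numbers a 100GB VM lands around 17 minutes, consistent with the 10-30 minute range above.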
Replication
Replication continuously copies data to another location.
Proxmox Replication
Built-in ZFS replication between cluster nodes:
```
# Create replication job (every 15 minutes)
pvesr create-local-job 100-0 pve2 --schedule '*/15'

# Check replication status
pvesr status

# List jobs
pvesr list
```

How Replication Works
```
Node 1 (primary)          Node 2 (replica)
┌──────────────┐          ┌──────────────┐
│ VM 100       │          │ VM 100       │
│ (active)     │─────────►│ (standby)    │
│              │ ZFS send │              │
└──────────────┘          └──────────────┘

Every 15 minutes: incremental sync
```

What Replication Protects Against
Failure scenario: Node 1 hardware failure
Replica on Node 2: Ready to start
Result: Activate replica, minimal downtime

Failure scenario: Storage failure on Node 1
Replica on Node 2: Has recent copy
Result: Start replica (with potential 15-min data loss)

What Replication Does NOT Protect Against
Failure scenario: VM data corruption (application bug)
Replication: Replicates the corruption to Node 2
Result: Both copies corrupted

Failure scenario: Ransomware encrypts VM
Replication: Replicates encrypted data
Result: Both copies encrypted

Failure scenario: Accidental VM deletion
Replication: Deletion replicates
Result: Both copies deleted

Failure scenario: Cluster-wide issue
Replication: Both nodes affected
Result: No protection

Rule: Replication protects against hardware failure, not data corruption.
The Three-Layer Strategy
For critical VMs, use all three:
Layer 1: Snapshots
- Before changes
- Quick rollback
- Same-disk convenience

Layer 2: Backups (PBS)
- Daily/hourly
- Different storage
- Historical retention

Layer 3: Replication
- Near-real-time
- Different node
- Fast failover

Example Configuration
```
# VM 100: Critical web application

# Layer 1: Manual snapshots before changes
qm snapshot 100 pre-upgrade

# Layer 2: Hourly backups to PBS, 30-day retention
# Backup job: hourly to pbs-store
# Retention: keep-hourly=24,keep-daily=30

# Layer 3: 15-minute replication to second node
pvesr create-local-job 100-0 pve2 --schedule '*/15'
```

Recovery scenarios:
| Scenario | Recovery Method | Data Loss |
|---|---|---|
| Bad config change | Rollback snapshot | 0 |
| Host hardware failure | Start replica | Up to 15 min |
| Storage failure | Restore from PBS | Up to 1 hour |
| Data corruption | Restore from PBS (earlier point) | Variable |
| Site disaster | Restore from off-site PBS | Up to 24 hours |
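It's worth sanity-checking what a retention policy like `keep-hourly=24,keep-daily=30` actually keeps. The count below is deliberately simplified: real PBS pruning won't double-count a backup that satisfies both rules, so treat it as an upper bound.

```shell
#!/usr/bin/env bash
# Restore points retained by keep-hourly=24,keep-daily=30.
# Simplified: ignores overlap between the hourly and daily buckets,
# so the real number is at most this.

keep_hourly=24
keep_daily=30

max_points=$(( keep_hourly + keep_daily ))
oldest_days=$keep_daily

echo "up to ${max_points} restore points, oldest ~${oldest_days} days back"
```

That "oldest ~30 days" figure is the one that matters for the corruption scenario: if the bug went unnoticed longer than your daily retention, no local backup is clean.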
Real Failure Scenarios
Scenario 1: Disk Failure
Situation: ZFS pool loses a disk in mirror
Snapshots: Still available (pool degraded but working)
Replication: Working
Backups: Working

Action: Replace disk, resilver, no VM downtime

Scenario 2: Complete Storage Loss

Situation: Storage controller failure, pool unimportable
Snapshots: Lost
Replication: Available on other node

Action: Start replica, 15 minutes data loss

Scenario 3: Database Corruption

Situation: App bug corrupts database on Tuesday
Discovered: Thursday
Replication: Has corrupted data
Recent backups: Have corrupted data
Older backup from Monday: Clean

Action: Restore Monday backup, replay transaction logs if possible
Lesson: Longer backup retention matters

Scenario 4: Ransomware

Situation: Ransomware encrypts VM on Friday night
Replication: Encrypted copy on second node
Snapshots: Might be encrypted (if attacker accessed the host)
PBS backups: Clean (PBS not mounted inside VM)

Action: Restore from PBS backup before infection
Lesson: Air-gapped backups survive ransomware

Calculating Your Strategy
For Each VM, Answer:
RPO: How much data loss is acceptable?
- Minutes → Replication + frequent backups
- Hours → Hourly backups
- Days → Daily backups

RTO: How fast must it recover?
- Minutes → Replication + HA
- Hours → Local PBS restore
- Days → Off-site restore okay

Retention: How far back might you need?
- Days → Short retention
- Months → Longer retention
- Compliance → Years (archive separately)
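Those answers can be folded into a rough triage helper. The thresholds below just encode the tiers above in minutes; they're illustrative, not prescriptive.

```shell
#!/usr/bin/env bash
# Rough RPO-to-strategy triage, encoding the tiers above.
# Thresholds (in minutes) are illustrative assumptions.

strategy_for_rpo() {   # strategy_for_rpo <acceptable-loss-minutes>
  local rpo=$1
  if   (( rpo < 60 ));   then echo "replication + frequent backups"
  elif (( rpo < 1440 )); then echo "hourly backups"
  else                        echo "daily backups"
  fi
}

strategy_for_rpo 15     # a critical database
strategy_for_rpo 1440   # a dev VM
```

Run the same exercise for RTO and retention and you end up with the service classes below.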
Example: Different VM Classes
Class A: Critical (database, ERP)
- RPO: 15 minutes
- RTO: 30 minutes
- Retention: 90 days
Strategy: Replication (15 min) + hourly PBS + monthly off-site

Class B: Important (web servers, apps)
- RPO: 1 hour
- RTO: 4 hours
- Retention: 30 days
Strategy: Hourly PBS backup

Class C: Development (test VMs)
- RPO: 24 hours
- RTO: Next business day
- Retention: 7 days
Strategy: Daily PBS backup

Class D: Ephemeral (CI runners)
- RPO: N/A (rebuild from config)
- RTO: Minutes (just recreate)
- Retention: None
Strategy: No backup (infrastructure as code)

Testing Your Strategy
Monthly Tests
```
# 1. Snapshot rollback test
qm snapshot 100 test-snap
# Make a change
qm rollback 100 test-snap
# Verify rollback worked

# 2. Backup restore test
qmrestore pbs-store:backup/vm/100/... 900
qm start 900
# Verify VM works
qm destroy 900

# 3. Replication failover test
# Stop source VM
qm stop 100
# Start replica on other node
# Verify it works
# Fail back to primary
```

Document Results
```
Test date: 2025-01-08
Tested by: Admin

Snapshot rollback:       PASS (30 seconds)
PBS restore (100GB VM):  PASS (12 minutes)
Replication failover:    PASS (2 minutes)

Issues found: None
Next test: 2025-02-08
```

The Lesson
Replication is not a replacement for PBS. It’s a different layer.
Each protection layer handles different failures:
- Snapshots: Undo mistakes (same disk)
- Backups: Recover from hardware failure (different storage)
- Replication: Fast failover (different node)
- Off-site: Survive site disasters (different location)
The failure you’ll have is the one you didn’t plan for. If you only have replication, you’ll face data corruption. If you only have daily backups, you’ll have the failure at 11 PM. If you only have on-site backups, you’ll have the site disaster.
Layer your protection. Test your recovery. Know exactly what each layer protects against and what it doesn’t.