I’ve lost data three times in production. Each time taught me something different about what “protected” actually means.
First time: snapshot on same disk that failed. Snapshot died with the disk. Second time: backup existed, but retention policy had pruned the version I needed. Third time: replication was running, but it replicated the corruption.
Snapshots, backups, and replication are different tools solving different problems. Using the wrong one for your failure scenario means learning the hard way.
The Three Protection Layers
| Feature | Snapshot | Backup | Replication |
|---|---|---|---|
| Location | Same storage | Different storage | Different node |
| Speed | Instant | Minutes-hours | Continuous |
| Protection | Human error | Hardware failure | Site failure |
| Point-in-time | Yes | Yes | Near-real-time |
| Survives disk failure | No | Yes | Depends |
| Survives site failure | No | If off-site | If different site |
Each layer protects against different failures. You need all three.
Snapshots
A snapshot captures VM state at a point in time — disk and optionally RAM.
How Snapshots Work
ZFS/LVM snapshots are copy-on-write:
```
Before snapshot:
  Disk blocks: [A][B][C][D][E]

After snapshot:
  Current:  [A][B][C][D][E]
  Snapshot: → points to same blocks

After modification (block C changed):
  Current:  [A][B][C'][D][E]
  Snapshot: [A][B][C][D][E]   (old C preserved)
```

Snapshots are instant because nothing is copied initially. Only changed blocks are preserved.
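The preservation of old blocks can be sketched with a toy model. Everything here is illustrative: real ZFS/LVM copy-on-write happens at the block layer, not in shell variables, but the space accounting works the same way.

```shell
#!/usr/bin/env bash
# Toy copy-on-write model (illustrative only). A snapshot records
# nothing up front; an old block is preserved only the first time
# that block is overwritten after the snapshot.

declare -A current=( [A]=a1 [B]=b1 [C]=c1 )   # live blocks
declare -A snapshot=()                        # preserved old blocks
snap_used=0                                   # space consumed by the snapshot

write_block() {   # write_block <block> <new-content>
  local block=$1 new=$2
  if [[ -z ${snapshot[$block]+set} ]]; then
    snapshot[$block]=${current[$block]}       # first overwrite: keep the old copy
    snap_used=$((snap_used + 1))
  fi
  current[$block]=$new
}

write_block C c2   # only now does the snapshot start consuming space
echo "snapshot uses ${snap_used} block(s); preserved C=${snapshot[C]}"
```

Note that until the write, the snapshot costs essentially nothing; its size grows with the churn rate of the VM, which is why long-lived snapshots on busy VMs quietly eat disk.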
Creating Snapshots
```
# VM snapshot (disk + state)
qm snapshot 100 before-upgrade --description "Before kernel upgrade"

# List snapshots
qm listsnapshot 100

# Rollback
qm rollback 100 before-upgrade

# Delete snapshot
qm delsnapshot 100 before-upgrade
```

What Snapshots Are Good For
- Before risky changes: Upgrade, config change, experimental work
- Quick rollback: “Oops, that broke it” → 30-second recovery
- Testing: Try something, snapshot, try variations, rollback
What Snapshots Don’t Protect Against
Failure scenario: Disk dies
Snapshots: Also dead (same disk)
Result: Total loss

Failure scenario: Storage controller fails
Snapshots: Also dead (same storage)
Result: Total loss

Failure scenario: Ransomware encrypts VM
Snapshots: Might survive if attacker doesn't find them
Result: Maybe recoverable, maybe not

Failure scenario: Accidental snapshot deletion
Snapshots: Gone
Result: No protection

Rule: Snapshots are convenience, not protection.
Backups
Backups copy data to separate storage.
Backup to PBS
```
# Full backup to PBS
vzdump 100 --storage pbs-store --mode snapshot --compress zstd

# Incremental (only changed since last backup)
# PBS does this automatically
```

What Backups Protect Against
Failure scenario: Primary storage dies
Backups on PBS: Safe
Result: Restore from backup

Failure scenario: Host fails completely
Backups on PBS: Safe (different hardware)
Result: Restore to new host

Failure scenario: Accidental VM deletion
Backups: Safe (separate system)
Result: Restore deleted VM

Failure scenario: Ransomware encrypts VM
Backups: Safe if not mounted/accessible to VM
Result: Restore clean version

Backup Limitations
Failure scenario: Backup storage also fails
Result: Both copies lost

Failure scenario: Retention pruned the backup you need
Result: Can't restore that point in time

Failure scenario: Site-wide disaster (fire, flood)
On-site backups: Also destroyed
Result: Total loss without off-site copy

RPO: Recovery Point Objective
How much data can you lose?
Daily backups at 1 AM:
- Failure at 11 PM = 22 hours of data loss
- RPO = 24 hours

Hourly backups:
- Maximum 1 hour of data loss
- RPO = 1 hour

Real-time replication:
- Seconds of data loss
- RPO ≈ 0

Match backup frequency to acceptable data loss.
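The worst case is simple clock arithmetic. A quick sketch, using the same numbers as the daily-backup example:

```shell
#!/usr/bin/env bash
# Worst-case data loss for a fixed backup schedule.
# Assumption: the failure lands just before the next backup would run.

backup_hour=1      # daily backup at 1 AM
failure_hour=23    # failure at 11 PM

# Hours of work since the last completed backup
loss_hours=$(( (failure_hour - backup_hour + 24) % 24 ))
echo "worst-case loss: ${loss_hours}h (RPO for a daily schedule: 24h)"
```

The `% 24` handles failures that happen before the day's backup has run; the principle generalizes to any interval.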
RTO: Recovery Time Objective
How fast must you recover?
Full VM restore from PBS:
- 100GB VM ≈ 10-30 minutes
- RTO ≈ 30 minutes

Restore from off-site:
- Download time + restore time
- RTO = hours

Rebuild from scratch + restore data:
- RTO = hours to days

Match recovery method to acceptable downtime.
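A rough RTO estimate is just size divided by restore throughput. The 100 MB/s figure below is an assumption for illustration; measure your own PBS restore speed, since network, datastore disks, and verification overhead all move the number.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope restore time: VM size / restore throughput.
# 100 MB/s sustained is an assumed figure, not a PBS guarantee.

vm_size_gb=100
throughput_mb_s=100

restore_seconds=$(( vm_size_gb * 1024 / throughput_mb_s ))
restore_minutes=$(( restore_seconds / 60 ))
echo "~${restore_minutes} min to restore ${vm_size_gb}GB"
```

At those assumed numbers a 100GB VM lands around 17 minutes, consistent with the 10-30 minute range above.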
Replication
Replication continuously copies data to another location.
Proxmox Replication
Built-in ZFS replication between cluster nodes:
```
# Create replication job (every 15 minutes)
pvesr create-local-job 100-0 pve2 --schedule '*/15'

# Check replication status
pvesr status

# List jobs
pvesr list
```

How Replication Works
```
Node 1 (primary)          Node 2 (replica)
┌──────────────┐          ┌──────────────┐
│ VM 100       │          │ VM 100       │
│ (active)     │─────────►│ (standby)    │
│              │ ZFS send │              │
└──────────────┘          └──────────────┘

Every 15 minutes: incremental sync
```

What Replication Protects Against
Failure scenario: Node 1 hardware failure
Replica on Node 2: Ready to start
Result: Activate replica, minimal downtime

Failure scenario: Storage failure on Node 1
Replica on Node 2: Has recent copy
Result: Start replica (with potential 15-min data loss)

What Replication Does NOT Protect Against
Failure scenario: VM data corruption (application bug)
Replication: Replicates the corruption to Node 2
Result: Both copies corrupted

Failure scenario: Ransomware encrypts VM
Replication: Replicates encrypted data
Result: Both copies encrypted

Failure scenario: Accidental VM deletion
Replication: Deletion replicates
Result: Both copies deleted

Failure scenario: Cluster-wide issue
Replication: Both nodes affected
Result: No protection

Rule: Replication protects against hardware failure, not data corruption.
The Three-Layer Strategy
For critical VMs, use all three:
Layer 1: Snapshots
- Before changes
- Quick rollback
- Same-disk convenience

Layer 2: Backups (PBS)
- Daily/hourly
- Different storage
- Historical retention

Layer 3: Replication
- Near-real-time
- Different node
- Fast failover

Example Configuration
```
# VM 100: Critical web application

# Layer 1: Manual snapshots before changes
qm snapshot 100 pre-upgrade

# Layer 2: Hourly backups to PBS, 30-day retention
# Backup job: hourly to pbs-store
# Retention: keep-hourly=24,keep-daily=30

# Layer 3: 15-minute replication to second node
pvesr create-local-job 100-0 pve2 --schedule '*/15'
```

Recovery scenarios:
| Scenario | Recovery Method | Data Loss |
|---|---|---|
| Bad config change | Rollback snapshot | 0 |
| Host hardware failure | Start replica | Up to 15 min |
| Storage failure | Restore from PBS | Up to 1 hour |
| Data corruption | Restore from PBS (earlier point) | Variable |
| Site disaster | Restore from off-site PBS | Up to 24 hours |
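It's worth sanity-checking what a retention policy like `keep-hourly=24,keep-daily=30` actually keeps. The count below is deliberately simplified: real PBS pruning won't double-count a backup that satisfies both rules, so treat it as an upper bound.

```shell
#!/usr/bin/env bash
# Restore points retained by keep-hourly=24,keep-daily=30.
# Simplified: ignores overlap between the hourly and daily buckets,
# so the real number is at most this.

keep_hourly=24
keep_daily=30

max_points=$(( keep_hourly + keep_daily ))
oldest_days=$keep_daily

echo "up to ${max_points} restore points, oldest ~${oldest_days} days back"
```

That "oldest ~30 days" figure is the one that matters for the corruption scenario: if the bug went unnoticed longer than your daily retention, no local backup is clean.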
Real Failure Scenarios
Scenario 1: Disk Failure
Situation: ZFS pool loses a disk in mirror
Snapshots: Still available (pool degraded but working)
Replication: Working
Backups: Working

Action: Replace disk, resilver, no VM downtime

Scenario 2: Complete Storage Loss

Situation: Storage controller failure, pool unimportable
Snapshots: Lost
Replication: Available on other node

Action: Start replica, 15 minutes data loss

Scenario 3: Database Corruption

Situation: App bug corrupts database on Tuesday
Discovered: Thursday
Replication: Has corrupted data
Recent backups: Have corrupted data
Older backup from Monday: Clean

Action: Restore Monday backup, replay transaction logs if possible
Lesson: Longer backup retention matters

Scenario 4: Ransomware

Situation: Ransomware encrypts VM on Friday night
Replication: Encrypted copy on second node
Snapshots: Might be encrypted (if attacker accessed the host)
PBS backups: Clean (PBS not mounted inside VM)

Action: Restore from PBS backup before infection
Lesson: Air-gapped backups survive ransomware

Calculating Your Strategy
For Each VM, Answer:
RPO: How much data loss is acceptable?
- Minutes → Replication + frequent backups
- Hours → Hourly backups
- Days → Daily backups

RTO: How fast must it recover?
- Minutes → Replication + HA
- Hours → Local PBS restore
- Days → Off-site restore okay

Retention: How far back might you need?
- Days → Short retention
- Months → Longer retention
- Compliance → Years (archive separately)
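Those answers can be folded into a rough triage helper. The thresholds below just encode the tiers above in minutes; they're illustrative, not prescriptive.

```shell
#!/usr/bin/env bash
# Rough RPO-to-strategy triage, encoding the tiers above.
# Thresholds (in minutes) are illustrative assumptions.

strategy_for_rpo() {   # strategy_for_rpo <acceptable-loss-minutes>
  local rpo=$1
  if   (( rpo < 60 ));   then echo "replication + frequent backups"
  elif (( rpo < 1440 )); then echo "hourly backups"
  else                        echo "daily backups"
  fi
}

strategy_for_rpo 15     # a critical database
strategy_for_rpo 1440   # a dev VM
```

Run the same exercise for RTO and retention and you end up with the service classes below.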
Example: Different VM Classes
Class A: Critical (database, ERP)
- RPO: 15 minutes
- RTO: 30 minutes
- Retention: 90 days
Strategy: Replication (15 min) + hourly PBS + monthly off-site

Class B: Important (web servers, apps)
- RPO: 1 hour
- RTO: 4 hours
- Retention: 30 days
Strategy: Hourly PBS backup

Class C: Development (test VMs)
- RPO: 24 hours
- RTO: Next business day
- Retention: 7 days
Strategy: Daily PBS backup

Class D: Ephemeral (CI runners)
- RPO: N/A (rebuild from config)
- RTO: Minutes (just recreate)
- Retention: None
Strategy: No backup (infrastructure as code)

Testing Your Strategy
Monthly Tests
```
# 1. Snapshot rollback test
qm snapshot 100 test-snap
# Make a change
qm rollback 100 test-snap
# Verify rollback worked

# 2. Backup restore test
qmrestore pbs-store:backup/vm/100/... 900
qm start 900
# Verify VM works
qm destroy 900

# 3. Replication failover test
# Stop source VM
qm stop 100
# Start replica on other node
# Verify it works
# Fail back to primary
```

Document Results
```
Test date: 2025-01-08
Tested by: Admin

Snapshot rollback:       PASS (30 seconds)
PBS restore (100GB VM):  PASS (12 minutes)
Replication failover:    PASS (2 minutes)

Issues found: None
Next test: 2025-02-08
```

The Lesson
Replication is not a replacement for PBS. It’s a different layer.
Each protection layer handles different failures:
- Snapshots: Undo mistakes (same disk)
- Backups: Recover from hardware failure (different storage)
- Replication: Fast failover (different node)
- Off-site: Survive site disasters (different location)
The failure you’ll have is the one you didn’t plan for. If you only have replication, you’ll face data corruption. If you only have daily backups, you’ll have the failure at 11 PM. If you only have on-site backups, you’ll have the site disaster.
Layer your protection. Test your recovery. Know exactly what each layer protects against and what it doesn’t.