Ceph is incredible technology. Distributed, self-healing storage that scales horizontally. No single point of failure. Built into Proxmox with a nice UI. Sounds perfect.
It’s not perfect. Ceph has costs: hardware (more nodes, more disks, more network), complexity (distributed systems are hard), and operational overhead (recovery can saturate your network). These costs are worth it for the right use case. For the wrong use case, Ceph is pain with no benefit.
This is an honest guide: when Ceph makes sense, what it really requires, and what to expect.
When Ceph Makes Sense
Good fit:
- 3+ nodes with dedicated storage networks
- Need for truly shared storage (HA, live migration)
- Can accept Ceph’s resource overhead
- Want to scale storage by adding nodes
- No external SAN/NAS available
Bad fit:
- Single node (Ceph needs 3+ for reliability)
- Tight hardware budget (Ceph needs resources)
- Simple backup/restore is sufficient
- Already have enterprise SAN
- Can’t dedicate network bandwidth
Minimum Requirements (Real Minimums)
Nodes
Minimum: 3 nodes
Ceph uses replication (default 3 copies). With 2 nodes, one failure = data at risk.
- 3 nodes: Can lose 1 node
- 4 nodes: Can lose 1 node (more capacity)
- 5 nodes: Can lose 2 nodes

CPU and RAM
Per OSD (disk):
- 1 CPU core minimum
- 2GB RAM minimum (4GB recommended)
Example: 3 nodes × 4 OSDs each = 12 OSDs
Minimum: 12 cores, 24GB RAM just for Ceph
Recommended: 24 cores, 48GB RAM
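The same rules of thumb as a throwaway shell calculation (the per-OSD numbers are the minimums and recommendations above):

```
# Ceph-only CPU/RAM floor for a given cluster shape
nodes=3; osds_per_node=4
osds=$(( nodes * osds_per_node ))
echo "OSDs total:    $osds"
echo "Min cores:     $osds"            # 1 core per OSD
echo "Min RAM (GB):  $(( osds * 2 ))"  # 2GB per OSD
echo "Rec cores:     $(( osds * 2 ))"  # 2 cores per OSD
echo "Rec RAM (GB):  $(( osds * 4 ))"  # 4GB per OSD
```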
Plus RAM for VMs, Proxmox, monitors...

Network
Minimum: 10GbE dedicated
1GbE works for testing but not production. Recovery after disk failure saturates the network:
1TB disk fails, recovery speed (back-of-envelope math below):
- 1GbE: ~3 hours (if nothing else uses the network)
- 10GbE: ~15 minutes
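The figures fall straight out of link speed; a sketch assuming the link is fully dedicated to recovery:

```
# 1TB ≈ 1,000,000 MB over a saturated link
echo "1GbE:  $(( 1000000 / 125 / 60 )) minutes"    # ~125 MB/s usable
echo "10GbE: $(( 1000000 / 1250 / 60 )) minutes"   # ~1250 MB/s usable
```

Real recovery is slower: Ceph throttles itself and the link also carries client I/O, hence the ~3 hour figure.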
During recovery, performance is degraded.

Recommended: Separate networks
```
┌─────────────────────────────────────┐
│ Public Network                      │
│ (client access, VM traffic)         │
│ 10.0.0.0/24                         │
└─────────────┬───────────────────────┘
              │
        ┌─────┼─────┐
        │     │     │
      pve1  pve2  pve3
        │     │     │
        └─────┼─────┘
              │
┌─────────────┴───────────────────────┐
│ Cluster Network                     │
│ (OSD replication, recovery)         │
│ 10.10.0.0/24                        │
└─────────────────────────────────────┘
```

Cluster network handles heavy replication traffic. Public network serves VMs.
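On the Proxmox side this is just a second NIC per node. A minimal ifupdown sketch; the interface name enp2s0, the address, and the jumbo-frame MTU are assumptions to adapt:

```
# /etc/network/interfaces excerpt on pve1 (illustrative)
auto enp2s0
iface enp2s0 inet static
    address 10.10.0.1/24
    mtu 9000   # optional: jumbo frames for replication traffic
```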
Storage
SSDs strongly recommended
HDDs work but:
- Recovery is slow (hours to days)
- Random I/O performance is poor
- Write latency affects all VMs
Flash for metadata
If using HDDs, use SSDs for:
- WAL (Write-Ahead Log)
- DB (RocksDB metadata)
HDD OSD with SSD metadata:
- Performance: 10x better than HDD-only
- Complexity: Higher, more failure modes

Installing Ceph on Proxmox
Initialize Ceph
From any node (installs Ceph packages):
```
pveceph install
```

Or via Web UI: Node → Ceph → Install
Create Ceph Cluster
```
# Initialize on first node
# --network sets the Ceph public network
# Without --cluster-network, replication shares it (not recommended)
pveceph init --network 10.0.0.0/24 --cluster-network 10.10.0.0/24
```

Create Monitors
Each node needs a monitor for quorum:
```
# On each node
pveceph mon create

# Verify
ceph mon stat
```

Need at least 3 monitors for quorum (one per node in a 3-node cluster).
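To see exactly which monitors form the quorum:

```
# Show quorum membership and the current leader
ceph quorum_status --format json-pretty
```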
Create Manager Daemons
```
# On each node (2+ recommended)
pveceph mgr create

# Verify
ceph mgr stat
```

Create OSDs
Each disk becomes an OSD:
```
# List disks already in use by Ceph OSDs
ceph-volume lvm list

# Create OSD on /dev/sdb
pveceph osd create /dev/sdb

# With separate WAL/DB device (SSD for HDD OSDs)
pveceph osd create /dev/sdb --wal_dev /dev/nvme0n1 --db_dev /dev/nvme0n1
```

Via Web UI: Node → Ceph → OSD → Create OSD
Create Pool
Pools contain data with specific replication rules:
```
# Create pool with size 3 (3 replicas), min_size 2
pveceph pool create vmpool --size 3 --min_size 2 --pg_num 128

# Add as Proxmox storage
pvesm add rbd ceph-pool --pool vmpool --content images,rootdir
```
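A quick sanity check that both layers see the new storage:

```
# Pool settings as Ceph sees them
ceph osd pool ls detail

# Storage status as Proxmox sees it
pvesm status
```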
Ceph Configuration

Understanding PGs (Placement Groups)
PGs distribute data across OSDs. Too few = uneven distribution. Too many = overhead.
Rule of thumb: Total PGs = (OSDs × 100) / replica count
12 OSDs, 3 replicas: (12 × 100) / 3 = 400 PGs total

Divide among pools based on expected size, rounding each pool to a power of two.
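On recent Ceph releases the PG autoscaler can do this math for you; a sketch using the vmpool created above:

```
# Let Ceph size PGs automatically per pool
ceph mgr module enable pg_autoscaler
ceph osd pool set vmpool pg_autoscale_mode on

# Review what the autoscaler wants to change
ceph osd pool autoscale-status
```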
Pool Configuration

```
# Check pool settings
ceph osd pool get vmpool all

# Adjust replication
ceph osd pool set vmpool size 3
ceph osd pool set vmpool min_size 2

# Enable compression (optional)
ceph osd pool set vmpool compression_mode aggressive
```

CRUSH Rules
CRUSH determines data placement. Default spreads across hosts:
```
# View CRUSH map
ceph osd crush tree

# Default data placement: 1 replica per host
# Protects against single host failure
```

For single-node testing (NOT production):
```
# Allow replicas on same host (DANGEROUS)
ceph osd crush rule create-replicated single_host default osd
ceph osd pool set vmpool crush_rule single_host
```
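To confirm which CRUSH rule a pool actually uses:

```
# List rules, then check the pool's assignment
ceph osd crush rule ls
ceph osd pool get vmpool crush_rule
```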
Monitoring Ceph Health

Basic Status
```
# Overall health
ceph status

# Should show:
# health: HEALTH_OK

# Detailed health
ceph health detail
```

OSD Status
```
# OSD tree
ceph osd tree

# OSD stats
ceph osd stat

# Per-OSD performance stats
ceph osd perf
```

Pool Usage
```
# Pool stats
ceph df

# Detailed pool info
rados df
```

Dashboard
Enable Ceph dashboard:
```
# Dashboard is included with Proxmox
# Access via: https://<node>:8006 → Node → Ceph → Status
```
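Beyond the Proxmox UI, the upstream Ceph mgr dashboard can also be enabled; a minimal sketch, assuming the ceph-mgr-dashboard package is installed (it is a separate package on Proxmox):

```
# Enable the upstream Ceph dashboard module
ceph mgr module enable dashboard

# List mgr service endpoints (shows the dashboard URL once configured)
ceph mgr services
```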
What to Expect: Performance

Write Performance
- Single SSD OSD: ~50-100 MB/s per OSD
- Single NVMe OSD: ~200-500 MB/s per OSD
- Aggregate: Scales with OSDs
- Latency: 1-5ms (SSD), 5-20ms (HDD)

Read Performance
Reads come from the primary OSD and scale with the number of OSDs. Caching helps repeated reads.

Real-World VM Performance
Random 4K IOPS (single VM):
- Ceph SSD: 5,000-20,000 IOPS
- Local SSD: 50,000-100,000 IOPS

Latency matters more than throughput for VMs, and Ceph adds a network round trip to every I/O. Ceph won’t match local NVMe performance. It provides redundancy and shared access, not speed. Benchmark your own cluster rather than trusting these ranges; a sketch follows.
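A minimal way to measure, assuming the vmpool created earlier (rados bench needs --no-cleanup on the write pass so the rand pass has objects to read; fio runs inside a guest):

```
# Raw RADOS throughput against the pool
rados bench -p vmpool 10 write --no-cleanup
rados bench -p vmpool 10 rand
rados -p vmpool cleanup

# Random 4K read IOPS from inside a VM (hypothetical test file)
fio --name=rand4k --filename=/tmp/fio.test --size=1G --bs=4k \
    --rw=randread --ioengine=libaio --iodepth=32 --direct=1 --runtime=30
```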
Recovery Behavior
When an OSD Fails
1. Ceph detects OSD down (10-30 seconds)
2. Marks OSD out (default: after 10 minutes; tunable, see below)
3. Begins recovery (re-replicating data)
4. Recovery uses cluster network bandwidth
5. Performance degraded until complete
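The down-to-out delay is a standard Ceph knob, and for planned maintenance you can suppress out-marking entirely:

```
# Check/adjust how long a down OSD waits before being marked out (seconds)
ceph config get mon mon_osd_down_out_interval
ceph config set mon mon_osd_down_out_interval 600

# During planned maintenance: don't mark OSDs out at all
ceph osd set noout
# ...do the work, then re-enable
ceph osd unset noout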
Recovery Impact

1TB OSD failure:
- Data to re-replicate: 1TB
- 10GbE network: ~15 minutes
- 1GbE network: ~3 hours
- During recovery: Degraded performance

Tuning Recovery
```
# Limit recovery bandwidth (default is aggressive)
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep 0.1

# Check recovery status
ceph status
# Should show recovery progress
```

Balance: fast recovery vs. production performance impact.
Common Problems
HEALTH_WARN: Too Few PGs
```
# Increase PGs
ceph osd pool set vmpool pg_num 256
ceph osd pool set vmpool pgp_num 256
```

HEALTH_WARN: OSDs Near Full
```
# Check usage
ceph osd df

# Options:
# 1. Add more OSDs
# 2. Delete data
# 3. Rebalance (if uneven; sketch below)
```

Ceph stops writes at 95% full. Plan capacity.
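If OSDs are unevenly filled rather than the cluster being genuinely full, reweighting moves data off the hottest disks; always dry-run first:

```
# Dry-run: show what reweighting would change
ceph osd test-reweight-by-utilization

# Apply: lower weights on overfull OSDs so data migrates off them
ceph osd reweight-by-utilization
```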
Slow Requests
```
# Check for slow ops
ceph daemon osd.0 ops

# Common causes:
# - HDD latency
# - Network congestion
# - Undersized cluster
```

Clock Skew
```
# Monitors are sensitive to time
# Check NTP
timedatectl status

# Fix: Ensure NTP is working on all nodes
```
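The monitors can also report skew directly, which is quicker than checking every node by hand:

```
# Clock skew as measured between monitors
ceph time-sync-status
```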
Ceph vs Alternatives

Ceph vs Local ZFS
| Aspect | Ceph | Local ZFS |
|---|---|---|
| Redundancy | Across nodes | Within node (mirror/RAIDZ) |
| Shared storage | Yes | No (without replication) |
| Performance | Network-bound | Local disk speed |
| Complexity | High | Low |
| Node failure | VMs continue | VMs stop |
Choose local ZFS if you don’t need shared storage.
Ceph vs NFS
| Aspect | Ceph | NFS |
|---|---|---|
| Redundancy | Built-in | Requires HA NFS |
| Performance | Parallel access | Single server bottleneck |
| Complexity | High | Low |
| Scaling | Add nodes | Limited |
Choose NFS for simpler setups with existing NAS.
Ceph vs iSCSI SAN
| Aspect | Ceph | SAN |
|---|---|---|
| Cost | Hardware only | Hardware + licensing |
| Scaling | Add nodes | Add shelves/licenses |
| Complexity | Self-managed | Vendor support |
| Performance | Good | Often better |
Choose SAN if budget allows and you want vendor support.
Sizing Example
Small Production Cluster
3 nodes:
- 32GB RAM each (16GB Ceph, 16GB VMs)
- 4-core CPU each
- 4× 1TB SSD per node (12 OSDs total)
- 10GbE cluster network
- 10GbE public network
Usable storage: ~4TB (12TB raw ÷ 3 replicas)
VM capacity: ~20-40 VMs depending on size

Medium Production Cluster
5 nodes:
- 128GB RAM each
- 16-core CPU each
- 8× 2TB NVMe per node (40 OSDs total)
- 25GbE cluster network
- 10GbE public network
Usable storage: ~26TB (80TB raw ÷ 3 replicas)
VM capacity: ~100-200 VMs

The Lesson
Ceph is great when you accept its costs: hardware, network, and operational complexity.
Ceph provides:
- True shared storage
- Self-healing
- Horizontal scaling
- No single point of failure
Ceph costs:
- 3+ nodes minimum
- Significant RAM (2-4GB per OSD)
- 10GbE+ network (dedicated)
- Operational knowledge
- Recovery impacts performance
For a 3-node homelab with 10GbE networking, Ceph is a solid choice. For a single node, Ceph is pointless complexity. For a budget cluster with 1GbE, Ceph will frustrate you.
Match the tool to the problem. Ceph solves “I need shared, redundant storage across multiple nodes.” If that’s not your problem, Ceph isn’t your solution.