Ceph on Proxmox: Honest Guide (When It's Worth It, When It's Pain)

Ceph is incredible technology. Distributed, self-healing storage that scales horizontally. No single point of failure. Built into Proxmox with a nice UI. Sounds perfect.

It’s not perfect. Ceph has costs: hardware (more nodes, more disks, more network), complexity (distributed systems are hard), and operational overhead (recovery can saturate your network). These costs are worth it for the right use case. For the wrong use case, Ceph is pain with no benefit.

This is an honest guide: when Ceph makes sense, what it really requires, and what to expect.

When Ceph Makes Sense

Good fit:

  • 3+ nodes with dedicated storage networks
  • Need for truly shared storage (HA, live migration)
  • Can accept Ceph’s resource overhead
  • Want to scale storage by adding nodes
  • No external SAN/NAS available

Bad fit:

  • Single node (Ceph needs 3+ for reliability)
  • Tight hardware budget (Ceph needs resources)
  • Simple backup/restore is sufficient
  • Already have enterprise SAN
  • Can’t dedicate network bandwidth

Minimum Requirements (Real Minimums)

Nodes

Minimum: 3 nodes

Ceph uses replication (default: 3 copies, placed on 3 different nodes). With 2 nodes you cannot hold 3 copies, so a single failure puts data at risk.

3 nodes: Can lose 1 node
4 nodes: Can lose 1 node (plus extra capacity and room to rebalance)
5 nodes: Can lose 2 nodes (at least one replica survives; affected PGs pause below min_size until recovery)

CPU and RAM

Per OSD (disk):

  • 1 CPU core minimum
  • 2GB RAM minimum (4GB recommended)

Example: 3 nodes × 4 OSDs each = 12 OSDs
Minimum: 12 cores, 24GB RAM just for Ceph
Recommended: 24 cores, 48GB RAM
Plus RAM for VMs, Proxmox, monitors...
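
The per-OSD RAM figure comes from BlueStore's osd_memory_target, which defaults to 4 GiB in recent releases. Once the cluster is up you can inspect or adjust it; a quick sketch (lowering it trades cache hit rate for RAM):

Terminal window
# Show the per-OSD memory target (bytes)
ceph config get osd osd_memory_target
# Lower it on RAM-constrained nodes (example: 2 GiB)
ceph config set osd osd_memory_target 2147483648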

Network

Minimum: 10GbE dedicated

1GbE works for testing but not production. Recovery after a disk failure can saturate the link:

1TB disk fails, recovery time:
- 1GbE: ~3 hours (1TB ≈ 8,000 Gb, so ~2.2 hours at line rate, and only if nothing else uses the network)
- 10GbE: ~15 minutes
During recovery, performance is degraded

Recommended: Separate networks

┌─────────────────────────────────────┐
│ Public Network                      │
│ (client access, VM traffic)         │
│ 10.0.0.0/24                         │
└─────────────┬───────────────────────┘
        ┌─────┼─────┐
        │     │     │
       pve1  pve2  pve3
        │     │     │
        └─────┼─────┘
┌─────────────┴───────────────────────┐
│ Cluster Network                     │
│ (OSD replication, recovery)         │
│ 10.10.0.0/24                        │
└─────────────────────────────────────┘

Cluster network handles heavy replication traffic. Public network serves VMs.
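
On the Proxmox side this is just two interfaces in /etc/network/interfaces. A minimal sketch for one node; the NIC names, addresses, and the optional jumbo-frame MTU are illustrative assumptions:

# /etc/network/interfaces (excerpt; NIC names are examples)
auto enp1s0f0
iface enp1s0f0 inet static
    address 10.0.0.1/24
# public network: client access, VM traffic

auto enp1s0f1
iface enp1s0f1 inet static
    address 10.10.0.1/24
    mtu 9000
# cluster network: OSD replication and recovery

Ceph itself learns about the two subnets during pveceph init (see below).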

Storage

SSDs strongly recommended

HDDs work but:

  • Recovery is slow (hours to days)
  • Random I/O performance is poor
  • Write latency affects all VMs

Flash for metadata

If using HDDs, use SSDs for:

  • WAL (Write-Ahead Log)
  • DB (RocksDB metadata)

HDD OSDs with SSD WAL/DB:
Performance: up to ~10x better than HDD-only for small writes
Complexity: Higher; the shared SSD becomes a single point of failure for every OSD using it
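
To confirm where an OSD's DB and WAL actually landed after creation, the OSD's metadata lists its backing devices. A quick check, assuming an osd.0 exists:

Terminal window
# Show which devices back osd.0, including the BlueFS DB/WAL
ceph osd metadata 0 | grep -i -E 'devices|bluefs'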

Installing Ceph on Proxmox

Install Ceph

On each node that will run Ceph (installs the Ceph packages):

Terminal window
pveceph install

Or via Web UI: Node → Ceph → Install

Create Ceph Cluster

Terminal window
# Initialize on first node
pveceph init --network 10.0.0.0/24 --cluster-network 10.10.0.0/24
# --network is the Ceph public network; --cluster-network carries replication
# Default: replication shares the public network (not recommended)

Create Monitors

Each node needs a monitor for quorum:

Terminal window
# On each node
pveceph mon create
# Verify
ceph mon stat

Quorum needs a majority of monitors, so run an odd number: 3 monitors (one per node in a 3-node cluster) tolerate the loss of one.
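
To see the monitor roster and who is currently in quorum:

Terminal window
# List all monitors in the cluster map
ceph mon dump
# Show current quorum membership
ceph quorum_status -f json-pretty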

Create Manager Daemons

Terminal window
# On each node (2+ recommended)
pveceph mgr create
# Verify
ceph mgr stat

Create OSDs

Each disk becomes an OSD:

Terminal window
# List disks and whether Ceph can use them
ceph-volume inventory
# Create OSD on /dev/sdb
pveceph osd create /dev/sdb
# With separate WAL/DB device (SSD for HDD OSDs)
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1 --wal_dev /dev/nvme0n1
# (if only --db_dev is given, the WAL lives on the DB device)

Via Web UI: Node → Ceph → OSD → Create OSD

Create Pool

Pools contain data with specific replication rules:

Terminal window
# Create pool with size 3 (3 replicas), min_size 2
pveceph pool create vmpool --size 3 --min_size 2 --pg_num 128
# Add as Proxmox storage
pvesm add rbd ceph-pool --pool vmpool --content images,rootdir
# Or do both in one step: pveceph pool create vmpool --add_storages

Ceph Configuration

Understanding PGs (Placement Groups)

PGs distribute data across OSDs. Too few = uneven distribution. Too many = per-OSD memory and peering overhead.

Rule of thumb:
Total PGs = (OSDs × 100) / replica count
12 OSDs, 3 replicas:
(12 × 100) / 3 = 400 PGs total across all pools
Divide among pools by expected size, rounding each pool to a power of two (hence pg_num 128 above)
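
Recent Ceph releases also ship a pg_autoscaler that can manage pg_num for you; the rule of thumb remains useful as a sanity check. To see what the autoscaler would do:

Terminal window
# Show per-pool PG targets and ratios
ceph osd pool autoscale-status
# Let Ceph manage a pool's pg_num automatically
ceph osd pool set vmpool pg_autoscale_mode on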

Pool Configuration

Terminal window
# Check pool settings
ceph osd pool get vmpool all
# Adjust replication
ceph osd pool set vmpool size 3
ceph osd pool set vmpool min_size 2
# Enable compression (optional)
ceph osd pool set vmpool compression_mode aggressive

CRUSH Rules

CRUSH determines data placement. Default spreads across hosts:

Terminal window
# View CRUSH map
ceph osd crush tree
# Data placement: 1 replica per host
# Protects against single host failure

For single-node testing (NOT production):

Terminal window
# Allow replicas on same host (DANGEROUS)
ceph osd crush rule create-replicated single_host default osd
ceph osd pool set vmpool crush_rule single_host

Monitoring Ceph Health

Basic Status

Terminal window
# Overall health
ceph status
# Should show:
# health: HEALTH_OK
# Detailed health
ceph health detail

OSD Status

Terminal window
# OSD tree
ceph osd tree
# OSD stats
ceph osd stat
# Per-OSD latency statistics
ceph osd perf

Pool Usage

Terminal window
# Pool stats
ceph df
# Detailed pool info
rados df

Dashboard

The Proxmox web UI includes a Ceph panel out of the box: https://<node>:8006 → Node → Ceph → Status. The upstream Ceph mgr dashboard is a separate, optional component (package ceph-mgr-dashboard, enabled with ceph mgr module enable dashboard).

What to Expect: Performance

Write Performance

Single SSD OSD: ~50-100 MB/s
Single NVMe OSD: ~200-500 MB/s
Aggregate: scales with OSD count
Latency: 1-5ms (SSD), 5-20ms (HDD)

Read Performance

Reads are served by each object's primary OSD, so aggregate throughput scales with OSD count
Caching helps repeated reads

Real-World VM Performance

Random 4K IOPS (single VM):
- Ceph SSD: 5,000-20,000 IOPS
- Local SSD: 50,000-100,000 IOPS
Latency matters more than throughput for VMs
Ceph adds network round-trip to every I/O

Ceph won’t match local NVMe performance. It provides redundancy and shared access, not speed.
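
Rather than trusting these estimates, measure from inside a test VM with fio. A sketch assuming a throwaway scratch disk at /dev/vdb (the run writes to it destructively, so never point it at a disk with data):

Terminal window
# Random 4K writes for 60s at queue depth 32; reports IOPS and latency
fio --name=ceph-test --filename=/dev/vdb --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based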

Recovery Behavior

When an OSD Fails

1. Ceph detects the OSD is down (10-30 seconds)
2. Marks the OSD out after mon_osd_down_out_interval (default: 10 minutes)
3. Begins recovery (re-replicating data to the remaining OSDs)
4. Recovery uses cluster network bandwidth
5. Performance is degraded until recovery completes
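
For planned maintenance (reboots, upgrades) you usually don't want this cycle to fire at all. Setting the noout flag tells Ceph not to mark OSDs out while the node is down:

Terminal window
# Before taking a node down
ceph osd set noout
# ...perform maintenance, bring the node back...
# Restore normal behavior
ceph osd unset noout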

Recovery Impact

1TB OSD failure:
- Data to re-replicate: 1TB
- 10GbE network: ~15 minutes
- 1GbE network: ~3 hours
- During recovery: Degraded performance

Tuning Recovery

Terminal window
# Throttle recovery to reduce the impact on client I/O
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep 0.1
# Check recovery status
ceph status
# Should show recovery progress

Balance: Fast recovery vs. production performance impact.
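
The same knobs work in the other direction: during a quiet window, raise the limits to finish recovery sooner, then revert them:

Terminal window
# Speed up recovery during off-hours (revert afterwards)
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 4
ceph config set osd osd_recovery_sleep 0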

Common Problems

HEALTH_WARN: Too Few PGs

Terminal window
# Increase PGs (use powers of two)
ceph osd pool set vmpool pg_num 256
ceph osd pool set vmpool pgp_num 256
# Recent releases adjust pgp_num automatically when pg_num changes

HEALTH_WARN: OSDs Near Full

Terminal window
# Check usage
ceph osd df
# Options:
# 1. Add more OSDs
# 2. Delete data
# 3. Rebalance (if uneven)

Ceph stops writes at 95% full. Plan capacity.
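
If the warning stems from uneven distribution rather than genuine lack of space, the balancer module can even out PG placement:

Terminal window
# Enable the balancer (upmap mode requires Luminous+ clients)
ceph balancer on
ceph balancer mode upmap
ceph balancer status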

Slow Requests

Terminal window
# Check for slow ops (run on the node hosting osd.0)
ceph daemon osd.0 ops
# Common causes:
# - HDD latency
# - Network congestion
# - Undersized cluster

Clock Skew

Terminal window
# Monitors are sensitive to time
# Check NTP
timedatectl status
# Fix: Ensure NTP is working on all nodes
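
Current Proxmox releases use chrony for timekeeping; if a node has drifted, make sure the daemon is installed and running (a minimal fix, assuming Debian-based Proxmox):

Terminal window
# Install and enable chrony if it's missing
apt install chrony
systemctl enable --now chrony
# Verify time sources are syncing
chronyc sources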

Ceph vs Alternatives

Ceph vs Local ZFS

Aspect           Ceph            Local ZFS
Redundancy       Across nodes    Within node (mirror/RAIDZ)
Shared storage   Yes             No (without replication)
Performance      Network-bound   Local disk speed
Complexity       High            Low
Node failure     VMs continue    VMs stop

Choose local ZFS if you don’t need shared storage.

Ceph vs NFS

Aspect        Ceph              NFS
Redundancy    Built-in          Requires HA NFS
Performance   Parallel access   Single server bottleneck
Complexity    High              Low
Scaling       Add nodes         Limited

Choose NFS for simpler setups with existing NAS.

Ceph vs iSCSI SAN

Aspect        Ceph            SAN
Cost          Hardware only   Hardware + licensing
Scaling       Add nodes       Add shelves/licenses
Complexity    Self-managed    Vendor support
Performance   Good            Often better

Choose SAN if budget allows and you want vendor support.

Sizing Example

Small Production Cluster

3 nodes:
- 32GB RAM each (16GB Ceph, 16GB VMs)
- 4-core CPU each
- 4× 1TB SSD per node (12 OSDs total)
- 10GbE cluster network
- 10GbE public network
Usable storage: ~4TB (12TB raw ÷ 3 replicas; plan to stay below ~85% used for rebalance headroom)
VM capacity: ~20-40 VMs depending on size

Medium Production Cluster

5 nodes:
- 128GB RAM each
- 16-core CPU each
- 8× 2TB NVMe per node (40 OSDs total)
- 25GbE cluster network
- 10GbE public network
Usable storage: ~26TB (80TB raw ÷ 3 replicas)
VM capacity: ~100-200 VMs

The Lesson

Ceph is great when you accept its costs: hardware, network, and operational complexity.

Ceph provides:

  • True shared storage
  • Self-healing
  • Horizontal scaling
  • No single point of failure

Ceph costs:

  • 3+ nodes minimum
  • Significant RAM (2-4GB per OSD)
  • 10GbE+ network (dedicated)
  • Operational knowledge
  • Recovery impacts performance

For a 3-node homelab with 10GbE networking, Ceph is a solid choice. For a single node, Ceph is pointless complexity. For a budget cluster with 1GbE, Ceph will frustrate you.

Match the tool to the problem. Ceph solves “I need shared, redundant storage across multiple nodes.” If that’s not your problem, Ceph isn’t your solution.