Ceph on Proxmox: Honest Guide (When It's Worth It, When It's Pain)

Ceph is incredible technology. Distributed, self-healing storage that scales horizontally. No single point of failure. Built into Proxmox with a nice UI. Sounds perfect.

It’s not perfect. Ceph has costs: hardware (more nodes, more disks, more network), complexity (distributed systems are hard), and operational overhead (recovery can saturate your network). These costs are worth it for the right use case. For the wrong use case, Ceph is pain with no benefit.

This is an honest guide: when Ceph makes sense, what it really requires, and what to expect.

When Ceph Makes Sense

Good fit:

  • 3+ nodes with dedicated storage networks
  • Need for truly shared storage (HA, live migration)
  • Can accept Ceph’s resource overhead
  • Want to scale storage by adding nodes
  • No external SAN/NAS available

Bad fit:

  • Single node (Ceph needs 3+ for reliability)
  • Tight hardware budget (Ceph needs resources)
  • Simple backup/restore is sufficient
  • Already have enterprise SAN
  • Can’t dedicate network bandwidth

Minimum Requirements (Real Minimums)

Nodes

Minimum: 3 nodes

Ceph uses replication (default: 3 copies, placed on 3 different nodes). With 2 nodes you cannot hold 3 copies, so a single failure puts data at risk.

3 nodes: Can lose 1 node
4 nodes: Can lose 1 node (plus extra capacity and room to rebalance)
5 nodes: Can lose 2 nodes (at least one replica survives; affected PGs pause below min_size until recovery)

CPU and RAM

Per OSD (disk):

  • 1 CPU core minimum
  • 2GB RAM minimum (4GB recommended)

Example: 3 nodes × 4 OSDs each = 12 OSDs
Minimum: 12 cores, 24GB RAM just for Ceph
Recommended: 24 cores, 48GB RAM
Plus RAM for VMs, Proxmox, monitors...
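
The per-OSD RAM figure comes from BlueStore's osd_memory_target, which defaults to 4 GiB in recent releases. Once the cluster is up you can inspect or adjust it; a quick sketch (lowering it trades cache hit rate for RAM):

Terminal window
# Show the per-OSD memory target (bytes)
ceph config get osd osd_memory_target
# Lower it on RAM-constrained nodes (example: 2 GiB)
ceph config set osd osd_memory_target 2147483648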

Network

Minimum: 10GbE dedicated

1GbE works for testing but not production. Recovery after a disk failure can saturate the link:

1TB disk fails, recovery time:
- 1GbE: ~3 hours (1TB ≈ 8,000 Gb, so ~2.2 hours at line rate, and only if nothing else uses the network)
- 10GbE: ~15 minutes
During recovery, performance is degraded

Recommended: Separate networks

┌─────────────────────────────────────┐
│ Public Network                      │
│ (client access, VM traffic)         │
│ 10.0.0.0/24                         │
└─────────────┬───────────────────────┘
        ┌─────┼─────┐
        │     │     │
       pve1  pve2  pve3
        │     │     │
        └─────┼─────┘
┌─────────────┴───────────────────────┐
│ Cluster Network                     │
│ (OSD replication, recovery)         │
│ 10.10.0.0/24                        │
└─────────────────────────────────────┘

Cluster network handles heavy replication traffic. Public network serves VMs.
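
On the Proxmox side this is just two interfaces in /etc/network/interfaces. A minimal sketch for one node; the NIC names, addresses, and the optional jumbo-frame MTU are illustrative assumptions:

# /etc/network/interfaces (excerpt; NIC names are examples)
auto enp1s0f0
iface enp1s0f0 inet static
    address 10.0.0.1/24
# public network: client access, VM traffic

auto enp1s0f1
iface enp1s0f1 inet static
    address 10.10.0.1/24
    mtu 9000
# cluster network: OSD replication and recovery

Ceph itself learns about the two subnets during pveceph init (see below).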

Storage

SSDs strongly recommended

HDDs work but:

  • Recovery is slow (hours to days)
  • Random I/O performance is poor
  • Write latency affects all VMs

Flash for metadata

If using HDDs, use SSDs for:

  • WAL (Write-Ahead Log)
  • DB (RocksDB metadata)

HDD OSDs with SSD WAL/DB:
Performance: up to ~10x better than HDD-only for small writes
Complexity: Higher; the shared SSD becomes a single point of failure for every OSD using it
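
To confirm where an OSD's DB and WAL actually landed after creation, the OSD's metadata lists its backing devices. A quick check, assuming an osd.0 exists:

Terminal window
# Show which devices back osd.0, including the BlueFS DB/WAL
ceph osd metadata 0 | grep -i -E 'devices|bluefs'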

Installing Ceph on Proxmox

Install Ceph

On each node that will run Ceph (installs the Ceph packages):

Terminal window
pveceph install

Or via Web UI: Node → Ceph → Install

Create Ceph Cluster

Terminal window
# Initialize on first node
pveceph init --network 10.0.0.0/24 --cluster-network 10.10.0.0/24
# --network is the Ceph public network; --cluster-network carries replication
# Default: replication shares the public network (not recommended)

Create Monitors

Each node needs a monitor for quorum:

Terminal window
# On each node
pveceph mon create
# Verify
ceph mon stat

Quorum needs a majority of monitors, so run an odd number: 3 monitors (one per node in a 3-node cluster) tolerate the loss of one.
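
To see the monitor roster and who is currently in quorum:

Terminal window
# List all monitors in the cluster map
ceph mon dump
# Show current quorum membership
ceph quorum_status -f json-pretty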

Create Manager Daemons

Terminal window
# On each node (2+ recommended)
pveceph mgr create
# Verify
ceph mgr stat

Create OSDs

Each disk becomes an OSD:

Terminal window
# List disks and whether Ceph can use them
ceph-volume inventory
# Create OSD on /dev/sdb
pveceph osd create /dev/sdb
# With separate WAL/DB device (SSD for HDD OSDs)
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1 --wal_dev /dev/nvme0n1
# (if only --db_dev is given, the WAL lives on the DB device)

Via Web UI: Node → Ceph → OSD → Create OSD

Create Pool

Pools contain data with specific replication rules:

Terminal window
# Create pool with size 3 (3 replicas), min_size 2
pveceph pool create vmpool --size 3 --min_size 2 --pg_num 128
# Add as Proxmox storage
pvesm add rbd ceph-pool --pool vmpool --content images,rootdir
# Or do both in one step: pveceph pool create vmpool --add_storages

Ceph Configuration

Understanding PGs (Placement Groups)

PGs distribute data across OSDs. Too few = uneven distribution. Too many = per-OSD memory and peering overhead.

Rule of thumb:
Total PGs = (OSDs × 100) / replica count
12 OSDs, 3 replicas:
(12 × 100) / 3 = 400 PGs total across all pools
Divide among pools by expected size, rounding each pool to a power of two (hence pg_num 128 above)
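
Recent Ceph releases also ship a pg_autoscaler that can manage pg_num for you; the rule of thumb remains useful as a sanity check. To see what the autoscaler would do:

Terminal window
# Show per-pool PG targets and ratios
ceph osd pool autoscale-status
# Let Ceph manage a pool's pg_num automatically
ceph osd pool set vmpool pg_autoscale_mode on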

Pool Configuration

Terminal window
# Check pool settings
ceph osd pool get vmpool all
# Adjust replication
ceph osd pool set vmpool size 3
ceph osd pool set vmpool min_size 2
# Enable compression (optional)
ceph osd pool set vmpool compression_mode aggressive

CRUSH Rules

CRUSH determines data placement. Default spreads across hosts:

Terminal window
# View CRUSH map
ceph osd crush tree
# Data placement: 1 replica per host
# Protects against single host failure

For single-node testing (NOT production):

Terminal window
# Allow replicas on same host (DANGEROUS)
ceph osd crush rule create-replicated single_host default osd
ceph osd pool set vmpool crush_rule single_host

Monitoring Ceph Health

Basic Status

Terminal window
# Overall health
ceph status
# Should show:
# health: HEALTH_OK
# Detailed health
ceph health detail

OSD Status

Terminal window
# OSD tree
ceph osd tree
# OSD stats
ceph osd stat
# Per-OSD latency statistics
ceph osd perf

Pool Usage

Terminal window
# Pool stats
ceph df
# Detailed pool info
rados df

Dashboard

The Proxmox web UI includes a Ceph panel out of the box: https://<node>:8006 → Node → Ceph → Status. The upstream Ceph mgr dashboard is a separate, optional component (package ceph-mgr-dashboard, enabled with ceph mgr module enable dashboard).

What to Expect: Performance

Write Performance

Single SSD OSD: ~50-100 MB/s
Single NVMe OSD: ~200-500 MB/s
Aggregate: scales with OSD count
Latency: 1-5ms (SSD), 5-20ms (HDD)

Read Performance

Reads are served by each object's primary OSD, so aggregate throughput scales with OSD count
Caching helps repeated reads

Real-World VM Performance

Random 4K IOPS (single VM):
- Ceph SSD: 5,000-20,000 IOPS
- Local SSD: 50,000-100,000 IOPS
Latency matters more than throughput for VMs
Ceph adds network round-trip to every I/O

Ceph won’t match local NVMe performance. It provides redundancy and shared access, not speed.
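
Rather than trusting these estimates, measure from inside a test VM with fio. A sketch assuming a throwaway scratch disk at /dev/vdb (the run writes to it destructively, so never point it at a disk with data):

Terminal window
# Random 4K writes for 60s at queue depth 32; reports IOPS and latency
fio --name=ceph-test --filename=/dev/vdb --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based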

Recovery Behavior

When an OSD Fails

1. Ceph detects the OSD is down (10-30 seconds)
2. Marks the OSD out after mon_osd_down_out_interval (default: 10 minutes)
3. Begins recovery (re-replicating data to the remaining OSDs)
4. Recovery uses cluster network bandwidth
5. Performance is degraded until recovery completes
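
For planned maintenance (reboots, upgrades) you usually don't want this cycle to fire at all. Setting the noout flag tells Ceph not to mark OSDs out while the node is down:

Terminal window
# Before taking a node down
ceph osd set noout
# ...perform maintenance, bring the node back...
# Restore normal behavior
ceph osd unset noout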

Recovery Impact

1TB OSD failure:
- Data to re-replicate: 1TB
- 10GbE network: ~15 minutes
- 1GbE network: ~3 hours
- During recovery: Degraded performance

Tuning Recovery

Terminal window
# Throttle recovery to reduce the impact on client I/O
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep 0.1
# Check recovery status
ceph status
# Should show recovery progress

Balance: Fast recovery vs. production performance impact.
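
The same knobs work in the other direction: during a quiet window, raise the limits to finish recovery sooner, then revert them:

Terminal window
# Speed up recovery during off-hours (revert afterwards)
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 4
ceph config set osd osd_recovery_sleep 0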

Common Problems

HEALTH_WARN: Too Few PGs

Terminal window
# Increase PGs (use powers of two)
ceph osd pool set vmpool pg_num 256
ceph osd pool set vmpool pgp_num 256
# Recent releases adjust pgp_num automatically when pg_num changes

HEALTH_WARN: OSDs Near Full

Terminal window
# Check usage
ceph osd df
# Options:
# 1. Add more OSDs
# 2. Delete data
# 3. Rebalance (if uneven)

Ceph stops writes at 95% full. Plan capacity.
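
If the warning stems from uneven distribution rather than genuine lack of space, the balancer module can even out PG placement:

Terminal window
# Enable the balancer (upmap mode requires Luminous+ clients)
ceph balancer on
ceph balancer mode upmap
ceph balancer status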

Slow Requests

Terminal window
# Check for slow ops (run on the node hosting osd.0)
ceph daemon osd.0 ops
# Common causes:
# - HDD latency
# - Network congestion
# - Undersized cluster

Clock Skew

Terminal window
# Monitors are sensitive to time
# Check NTP
timedatectl status
# Fix: Ensure NTP is working on all nodes
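
Current Proxmox releases use chrony for timekeeping; if a node has drifted, make sure the daemon is installed and running (a minimal fix, assuming Debian-based Proxmox):

Terminal window
# Install and enable chrony if it's missing
apt install chrony
systemctl enable --now chrony
# Verify time sources are syncing
chronyc sources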

Ceph vs Alternatives

Ceph vs Local ZFS

Aspect           Ceph            Local ZFS
Redundancy       Across nodes    Within node (mirror/RAIDZ)
Shared storage   Yes             No (without replication)
Performance      Network-bound   Local disk speed
Complexity       High            Low
Node failure     VMs continue    VMs stop

Choose local ZFS if you don’t need shared storage.

Ceph vs NFS

Aspect        Ceph              NFS
Redundancy    Built-in          Requires HA NFS
Performance   Parallel access   Single server bottleneck
Complexity    High              Low
Scaling       Add nodes         Limited

Choose NFS for simpler setups with existing NAS.

Ceph vs iSCSI SAN

Aspect        Ceph            SAN
Cost          Hardware only   Hardware + licensing
Scaling       Add nodes       Add shelves/licenses
Complexity    Self-managed    Vendor support
Performance   Good            Often better

Choose SAN if budget allows and you want vendor support.

Sizing Example

Small Production Cluster

3 nodes:
- 32GB RAM each (16GB Ceph, 16GB VMs)
- 4-core CPU each
- 4× 1TB SSD per node (12 OSDs total)
- 10GbE cluster network
- 10GbE public network
Usable storage: ~4TB (12TB raw ÷ 3 replicas; plan to stay below ~85% used for rebalance headroom)
VM capacity: ~20-40 VMs depending on size

Medium Production Cluster

5 nodes:
- 128GB RAM each
- 16-core CPU each
- 8× 2TB NVMe per node (40 OSDs total)
- 25GbE cluster network
- 10GbE public network
Usable storage: ~26TB (80TB raw ÷ 3 replicas)
VM capacity: ~100-200 VMs

The Lesson

Ceph is great when you accept its costs: hardware, network, and operational complexity.

Ceph provides:

  • True shared storage
  • Self-healing
  • Horizontal scaling
  • No single point of failure

Ceph costs:

  • 3+ nodes minimum
  • Significant RAM (2-4GB per OSD)
  • 10GbE+ network (dedicated)
  • Operational knowledge
  • Recovery impacts performance

For a 3-node homelab with 10GbE networking, Ceph is a solid choice. For a single node, Ceph is pointless complexity. For a budget cluster with 1GbE, Ceph will frustrate you.

Match the tool to the problem. Ceph solves “I need shared, redundant storage across multiple nodes.” If that’s not your problem, Ceph isn’t your solution.