The Proxmox web UI shows current state. It doesn’t show trends. It doesn’t show “disk was filling up for weeks before it failed.” It doesn’t wake you up at 3 AM when something is about to break.
Observability means knowing what’s happening before users tell you. Metrics show trends. Logs show context. Alerts notify you before failures become outages.
You can’t manage what you can’t see. And the Proxmox UI isn’t enough to see.
## What to Monitor

### Host Metrics
| Metric | Why | Alert Threshold |
|---|---|---|
| CPU usage | Overloaded host | >90% for 5 min |
| Memory usage | OOM risk | >85% |
| Load average | System stress | >cores×2 |
| Disk I/O | Storage bottleneck | Latency >50ms |
| Network I/O | Bandwidth saturation | >80% capacity |
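The host thresholds above translate into PromQL roughly like this (a sketch assuming default `node_exporter` metric names; adjust label matchers to your hosts):

```promql
# CPU usage >90% for 5 min
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90

# Memory usage >85%
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85

# Load average above cores×2 (counting idle-mode series gives the core count)
node_load5 > 2 * count by(instance) (node_cpu_seconds_total{mode="idle"})
```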
### Storage Metrics
| Metric | Why | Alert Threshold |
|---|---|---|
| Disk space | Running out | >80% |
| ZFS pool health | Data integrity | Any non-ONLINE |
| ZFS ARC hit rate | Cache efficiency | Below 80% |
| Ceph health | Cluster state | Any non-HEALTH_OK |
| SMART status | Disk failure prediction | Any warning |
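Before the exporters are wired up, the same signals can be spot-checked by hand (assuming ZFS, Ceph, and smartmontools are installed on the node):

```bash
zpool status -x        # prints "all pools are healthy" when every pool is ONLINE
arcstat 1 5            # ARC hit rate, sampled five times (ships with zfsutils)
ceph health            # HEALTH_OK, HEALTH_WARN, or HEALTH_ERR
smartctl -H /dev/sda   # overall SMART self-assessment per disk
```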
### VM Metrics
| Metric | Why | Alert Threshold |
|---|---|---|
| VM count | Capacity planning | Depends |
| Running vs stopped | Unexpected states | Any unexpected |
| CPU steal time | Overcommit | >10% |
| Balloon memory | Memory pressure | Significant deflation |
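CPU steal time is easy to watch once `node_exporter` runs inside the guests (a sketch; metric and label names assume default node_exporter output):

```promql
# Per-VM steal time as a percentage; sustained >10% suggests host overcommit
avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
```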
## Prometheus + Grafana Setup
The standard stack: Prometheus scrapes metrics, Grafana visualizes.
### Install on Separate VM
Don’t monitor Proxmox from Proxmox. If the host dies, monitoring dies.
```bash
# On monitoring VM - install Prometheus
apt update
apt install -y prometheus prometheus-node-exporter

# Add Grafana repository
apt install -y apt-transport-https software-properties-common
wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" | tee /etc/apt/sources.list.d/grafana.list
apt update
apt install -y grafana

systemctl enable --now prometheus prometheus-node-exporter grafana-server
```

### Proxmox PVE Exporter
Prometheus exporter specifically for Proxmox:
```bash
# Install
pip install prometheus-pve-exporter

# Create config
cat > /etc/pve-exporter.yml << 'EOF'
default:
  user: monitoring@pve
  token_name: prometheus
  token_value: "xxxx-xxxx-xxxx"
  verify_ssl: false
EOF
```
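The `monitoring@pve` user and API token referenced in that config have to exist first. Something like the following on any cluster node should work (the role and token names here are illustrative; check `pveum` help for your PVE version):

```bash
# Create a read-only monitoring user and grant it audit rights
pveum user add monitoring@pve --comment "Prometheus metrics"
pveum acl modify / --users monitoring@pve --roles PVEAuditor

# Create an API token; paste the printed value into /etc/pve-exporter.yml
pveum user token add monitoring@pve prometheus --privsep 0
```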
```bash
# Create systemd service
cat > /etc/systemd/system/pve-exporter.service << 'EOF'
[Unit]
Description=Prometheus PVE Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/pve_exporter /etc/pve-exporter.yml
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now pve-exporter
```

### Node Exporter on Proxmox Hosts
Install on each Proxmox node:
```bash
apt install prometheus-node-exporter
systemctl enable --now prometheus-node-exporter
```

### Prometheus Configuration
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus self
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporters on Proxmox hosts
  - job_name: 'proxmox-nodes'
    static_configs:
      - targets:
          - 'pve1:9100'
          - 'pve2:9100'
          - 'pve3:9100'

  # PVE exporter
  - job_name: 'proxmox-pve'
    static_configs:
      - targets:
          - 'localhost:9221'
    metrics_path: /pve
    params:
      module: [default]
```

### Grafana Dashboards
Import community dashboards:
- Proxmox VE: Dashboard ID 10347
- Node Exporter Full: Dashboard ID 1860
- ZFS: Dashboard ID 11337
- Ceph: Dashboard ID 2842
Or create custom dashboards for your specific needs.
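If you rebuild the monitoring VM often, the Prometheus datasource can be provisioned as a file instead of clicked together in the UI (standard Grafana provisioning format; the URL assumes Prometheus runs on the same VM):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```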
## ZFS Monitoring

### ZFS Exporter
```bash
# Install
pip install prometheus-zfs-exporter

# Run
zfs_exporter --port 9134
```

### Key ZFS Metrics
```promql
# Pool capacity
zfs_pool_allocated_bytes / zfs_pool_size_bytes * 100

# ARC hit rate
rate(zfs_arc_hits_total[5m]) / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m])) * 100

# Scrub errors
zfs_pool_scrub_errors_total

# Pool state (1 = ONLINE)
zfs_pool_health == 1
```

### ZFS Alerts
```yaml
groups:
  - name: zfs
    rules:
      - alert: ZFSPoolDegraded
        expr: zfs_pool_health != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is degraded"

      - alert: ZFSPoolSpaceLow
        expr: (zfs_pool_allocated_bytes / zfs_pool_size_bytes) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is {{ $value }}% full"

      - alert: ZFSScrubErrors
        expr: zfs_pool_scrub_errors_total > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} has scrub errors"
```

## Ceph Monitoring
### Built-in Ceph Metrics
Ceph exposes Prometheus metrics natively:
```bash
# On Ceph manager node
ceph mgr module enable prometheus

# Metrics at
# http://ceph-mgr:9283/metrics
```

### Prometheus Config for Ceph
```yaml
# Add to prometheus.yml
- job_name: 'ceph'
  static_configs:
    - targets:
        - 'pve1:9283'  # Ceph manager
```

### Ceph Alerts
```yaml
groups:
  - name: ceph
    rules:
      - alert: CephHealthWarning
        expr: ceph_health_status == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph cluster health is WARN"

      - alert: CephHealthCritical
        expr: ceph_health_status == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ceph cluster health is CRITICAL"

      - alert: CephOSDDown
        expr: ceph_osd_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ceph OSD {{ $labels.osd }} is down"

      - alert: CephPGsUnclean
        expr: ceph_pg_total - ceph_pg_active_clean > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph has {{ $value }} unclean PGs"
```

## SMART Monitoring
Predict disk failures before they happen.

### Install smartmontools
```bash
# On each Proxmox node
apt install smartmontools

# Enable SMART on disks
smartctl --smart=on /dev/sda
```

### Prometheus SMART Exporter
```bash
# Install
pip install prometheus-smart-exporter

# Run
smart_exporter --port 9110
```

### SMART Alerts
```yaml
groups:
  - name: smart
    rules:
      - alert: DiskSMARTWarning
        expr: smart_device_health != 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Disk {{ $labels.device }} SMART health warning"

      - alert: DiskReallocationCount
        expr: smart_raw_value{attribute_name="Reallocated_Sector_Ct"} > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Disk {{ $labels.device }} has reallocated sectors"
```

## Log Aggregation
### Loki for Logs
```yaml
# docker-compose.yml for Loki
version: "3"
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki-data:
```

### Promtail on Proxmox Nodes
```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: proxmox
    static_configs:
      - targets:
          - localhost
        labels:
          job: proxmox
          host: pve1
          __path__: /var/log/*.log

  - job_name: pve-cluster
    static_configs:
      - targets:
          - localhost
        labels:
          job: pve-cluster
          host: pve1
          __path__: /var/log/pve/tasks/*
```

### Key Logs to Monitor
```bash
# Proxmox-specific logs
/var/log/pveproxy.log        # Web UI access
/var/log/pve/tasks/          # Task logs
/var/log/pve-firewall.log    # Firewall logs

# System logs
/var/log/syslog              # General system
/var/log/auth.log            # Authentication
/var/log/kern.log            # Kernel messages

# Ceph logs (if using)
/var/log/ceph/*.log
```

## Alerting Rules Summary
### Critical Alerts (Page Immediately)
```yaml
groups:
  - name: critical
    rules:
      - alert: HostDown
        expr: up{job="proxmox-nodes"} == 0
        for: 1m
        labels:
          severity: critical

      - alert: StorageCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 1m
        labels:
          severity: critical

      - alert: MemoryExhausted
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 5
        for: 1m
        labels:
          severity: critical

      - alert: ZFSPoolFailed
        expr: zfs_pool_health == 2  # DEGRADED or worse
        for: 1m
        labels:
          severity: critical
```

### Warning Alerts (Check Soon)
```yaml
groups:
  - name: warnings
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 15m
        labels:
          severity: warning

      - alert: StorageWarning
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 20
        for: 5m
        labels:
          severity: warning

      - alert: BackupFailed
        expr: pve_storage_backup_last_success_time < (time() - 86400)
        for: 1h
        labels:
          severity: warning
```

## Alertmanager Configuration
Route alerts appropriately:
```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      repeat_interval: 1h

    - match:
        severity: warning
      receiver: 'slack'
      repeat_interval: 4h

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
```

## Dashboard Overview
My Grafana home dashboard shows:
```
Row 1: Cluster Overview
├── Total nodes (up/down)
├── Total VMs (running/stopped)
├── Cluster storage usage
└── Active alerts

Row 2: Per-Node Resources
├── CPU usage per node
├── Memory usage per node
├── Network I/O per node
└── Disk I/O per node

Row 3: Storage Health
├── ZFS pool status
├── Ceph health (if using)
├── Storage capacity trends
└── SMART warnings

Row 4: Backups
├── Last backup time
├── Backup success rate
├── Backup storage usage
└── Restore test status
```

## The Lesson
You can’t manage what you can’t see.
The Proxmox UI shows now. Observability shows:
- What happened (logs)
- How things are trending (metrics)
- What’s about to break (alerts)
The investment in monitoring pays off when:
- Disk fills up → you knew 2 weeks ago
- Host overloaded → you saw the trend
- Ceph degraded → alerted immediately
- Backup failed → notified same day
Without monitoring, you find out when users complain. With monitoring, you find out before users notice. That’s the difference between reactive and proactive operations.