Observability: Metrics, Logs, Alerts — What I Monitor on Proxmox

The Proxmox web UI shows current state. It doesn’t show trends. It doesn’t show “disk was filling up for weeks before it failed.” It doesn’t wake you up at 3 AM when something is about to break.

Observability means knowing what’s happening before users tell you. Metrics show trends. Logs show context. Alerts notify you before failures become outages.

You can’t manage what you can’t see, and the Proxmox UI alone doesn’t let you see enough.

What to Monitor

Host Metrics

| Metric       | Why                  | Alert Threshold |
| ------------ | -------------------- | --------------- |
| CPU usage    | Overloaded host      | >90% for 5 min  |
| Memory usage | OOM risk             | >85%            |
| Load average | System stress        | >cores × 2      |
| Disk I/O     | Storage bottleneck   | Latency >50 ms  |
| Network I/O  | Bandwidth saturation | >80% capacity   |
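These thresholds translate to PromQL roughly as follows — sketches against standard node-exporter series; adjust label matchers to your environment:

```promql
# CPU >90% for 5 min (inverted idle time)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90

# Memory >85% used
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85

# Load average above cores × 2 (one idle series exists per core)
node_load5 > 2 * count by (instance) (node_cpu_seconds_total{mode="idle"})

# Average disk read latency >50 ms, per device
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) > 0.05
```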

Storage Metrics

| Metric           | Why                     | Alert Threshold      |
| ---------------- | ----------------------- | -------------------- |
| Disk space       | Running out             | >80% full            |
| ZFS pool health  | Data integrity          | Any non-ONLINE state |
| ZFS ARC hit rate | Cache efficiency        | Below 80%            |
| Ceph health      | Cluster state           | Any non-HEALTH_OK    |
| SMART status     | Disk failure prediction | Any warning          |

VM Metrics

| Metric             | Why                | Alert Threshold       |
| ------------------ | ------------------ | --------------------- |
| VM count           | Capacity planning  | Depends               |
| Running vs stopped | Unexpected states  | Any unexpected state  |
| CPU steal time     | Overcommitted host | >10%                  |
| Balloon memory     | Memory pressure    | Significant deflation |
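Steal time in particular is worth alerting on. A rule sketch, assuming node exporter runs inside the guests (the metric is a standard node-exporter series; the alert name is my own):

```yaml
- alert: HighCPUSteal
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.instance }} CPU steal above 10% - host may be overcommitted"
```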

Prometheus + Grafana Setup

The standard stack: Prometheus scrapes and stores metrics; Grafana visualizes them.

Install on Separate VM

Don’t monitor Proxmox from Proxmox. If the host dies, monitoring dies.

# On monitoring VM - install Prometheus
apt update
apt install -y prometheus prometheus-node-exporter
# Add Grafana repository
apt install -y apt-transport-https software-properties-common
wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" | tee /etc/apt/sources.list.d/grafana.list
apt update
apt install -y grafana
systemctl enable --now prometheus prometheus-node-exporter grafana-server

Proxmox PVE Exporter

Prometheus exporter specifically for Proxmox:

# Install
pip install prometheus-pve-exporter
# Create config
cat > /etc/pve-exporter.yml << 'EOF'
default:
  user: monitoring@pve
  token_name: prometheus
  token_value: "xxxx-xxxx-xxxx"
  verify_ssl: false
EOF
# Create systemd service
cat > /etc/systemd/system/pve-exporter.service << 'EOF'
[Unit]
Description=Prometheus PVE Exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/pve_exporter /etc/pve-exporter.yml
Restart=always
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now pve-exporter
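The token referenced in the config has to exist on the Proxmox side. One way to create a read-only user and token — run on any PVE node; exact `pveum` flags can vary slightly between PVE versions:

```
# Read-only monitoring user plus an API token for the exporter
pveum user add monitoring@pve
pveum acl modify / --roles PVEAuditor --users monitoring@pve
pveum user token add monitoring@pve prometheus --privsep 0
# Paste the printed token value into /etc/pve-exporter.yml
```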

Node Exporter on Proxmox Hosts

Install on each Proxmox node:

apt install prometheus-node-exporter
systemctl enable --now prometheus-node-exporter

Prometheus Configuration

/etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporters on Proxmox hosts
  - job_name: 'proxmox-nodes'
    static_configs:
      - targets:
          - 'pve1:9100'
          - 'pve2:9100'
          - 'pve3:9100'

  # PVE exporter
  - job_name: 'proxmox-pve'
    static_configs:
      - targets:
          - 'localhost:9221'
    metrics_path: /pve
    params:
      module: [default]
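It's worth validating the file before reloading; `promtool` ships with the prometheus package:

```
promtool check config /etc/prometheus/prometheus.yml
systemctl reload prometheus
```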

Grafana Dashboards

Import community dashboards:

  • Proxmox VE: Dashboard ID 10347
  • Node Exporter Full: Dashboard ID 1860
  • ZFS: Dashboard ID 11337
  • Ceph: Dashboard ID 2842

Or create custom dashboards for your specific needs.

ZFS Monitoring

ZFS Exporter

# Install
pip install prometheus-zfs-exporter
# Run
zfs_exporter --port 9134

Key ZFS Metrics

# Prometheus queries
# Pool capacity
zfs_pool_allocated_bytes / zfs_pool_size_bytes * 100
# ARC hit rate
rate(zfs_arc_hits_total[5m]) / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m])) * 100
# Scrub errors
zfs_pool_scrub_errors_total
# Pool state (1 = ONLINE)
zfs_pool_health == 1
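The ARC hit-rate expression is long enough that a recording rule keeps dashboards and alerts consistent. A sketch — the recorded name `zfs:arc_hit_rate:5m` is my own, following the Prometheus `level:metric:operations` naming convention:

```yaml
groups:
  - name: zfs-recording
    rules:
      - record: zfs:arc_hit_rate:5m
        expr: >
          rate(zfs_arc_hits_total[5m])
          / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m])) * 100
```

An alert can then simply test `zfs:arc_hit_rate:5m < 80`.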

ZFS Alerts

/etc/prometheus/rules/zfs.yml
groups:
  - name: zfs
    rules:
      - alert: ZFSPoolDegraded
        expr: zfs_pool_health != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is degraded"
      - alert: ZFSPoolSpaceLow
        expr: (zfs_pool_allocated_bytes / zfs_pool_size_bytes) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is {{ $value }}% full"
      - alert: ZFSScrubErrors
        expr: zfs_pool_scrub_errors_total > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} has scrub errors"

Ceph Monitoring

Built-in Ceph Metrics

Ceph exposes Prometheus metrics natively:

# On Ceph manager node
ceph mgr module enable prometheus
# Metrics at
# http://ceph-mgr:9283/metrics

Prometheus Config for Ceph

# Add to prometheus.yml
- job_name: 'ceph'
  static_configs:
    - targets:
        - 'pve1:9283' # Ceph manager

Ceph Alerts

/etc/prometheus/rules/ceph.yml
groups:
  - name: ceph
    rules:
      - alert: CephHealthWarning
        expr: ceph_health_status == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph cluster health is HEALTH_WARN"
      - alert: CephHealthCritical
        expr: ceph_health_status == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ceph cluster health is HEALTH_ERR"
      - alert: CephOSDDown
        expr: ceph_osd_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ceph OSD {{ $labels.osd }} is down"
      - alert: CephPGsUnclean
        expr: ceph_pg_total - ceph_pg_clean > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph has {{ $value }} unclean PGs"

SMART Monitoring

Predict disk failures before they happen:

Install smartmontools

# On each Proxmox node
apt install smartmontools
# Enable SMART on disks
smartctl --smart=on /dev/sda
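smartd itself can mail you when attributes degrade, independent of the Prometheus pipeline — a useful second channel. A minimal /etc/smartd.conf sketch (the address is a placeholder; restart smartd after editing):

```
# Monitor all detected disks and all SMART attributes, mail on problems,
# and run a short self-test nightly at 02:00
DEVICESCAN -a -m admin@example.com -s (S/../.././02)
```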

Prometheus SMART Exporter

# Install
pip install prometheus-smart-exporter
# Run
smart_exporter --port 9110

SMART Alerts

/etc/prometheus/rules/smart.yml
groups:
  - name: smart
    rules:
      - alert: DiskSMARTWarning
        expr: smart_device_health != 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Disk {{ $labels.device }} SMART health warning"
      - alert: DiskReallocationCount
        expr: smart_raw_value{attribute_name="Reallocated_Sector_Ct"} > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Disk {{ $labels.device }} has reallocated sectors"

Log Aggregation

Loki for Logs

# docker-compose.yml for Loki
version: "3"

services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki-data:

Promtail on Proxmox Nodes

/etc/promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: proxmox
    static_configs:
      - targets:
          - localhost
        labels:
          job: proxmox
          host: pve1
          __path__: /var/log/*.log
  - job_name: pve-cluster
    static_configs:
      - targets:
          - localhost
        labels:
          job: pve-cluster
          host: pve1
          __path__: /var/log/pve/tasks/*

Key Logs to Monitor

# Proxmox-specific logs
/var/log/pveproxy.log # Web UI access
/var/log/pve/tasks/ # Task logs
/var/log/pve-firewall.log # Firewall logs
# System logs
/var/log/syslog # General system
/var/log/auth.log # Authentication
/var/log/kern.log # Kernel messages
# Ceph logs (if using)
/var/log/ceph/*.log
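With Promtail shipping these, LogQL queries in Grafana Explore pull context quickly. A few sketches against the labels defined above (the `filename` label is added automatically by Promtail):

```
# Errors across all shipped logs
{job="proxmox"} |= "error"

# Failed SSH logins on pve1
{job="proxmox", host="pve1", filename="/var/log/auth.log"} |= "Failed password"

# Recent PVE task logs
{job="pve-cluster", host="pve1"}
```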

Alerting Rules Summary

Critical Alerts (Page Immediately)

groups:
  - name: critical
    rules:
      - alert: HostDown
        expr: up{job="proxmox-nodes"} == 0
        for: 1m
        labels:
          severity: critical
      - alert: StorageCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 1m
        labels:
          severity: critical
      - alert: MemoryExhausted
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 5
        for: 1m
        labels:
          severity: critical
      - alert: ZFSPoolFailed
        expr: zfs_pool_health >= 2 # DEGRADED or worse (1 = ONLINE)
        for: 1m
        labels:
          severity: critical

Warning Alerts (Check Soon)

groups:
  - name: warnings
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 15m
        labels:
          severity: warning
      - alert: StorageWarning
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 20
        for: 5m
        labels:
          severity: warning
      - alert: BackupFailed
        expr: pve_storage_backup_last_success_time < (time() - 86400)
        for: 1h
        labels:
          severity: warning

Alertmanager Configuration

Route alerts appropriately:

/etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack'
      repeat_interval: 4h

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
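amtool, shipped with Alertmanager, validates the config and shows which receiver a given alert would reach — a quick sanity check:

```
amtool check-config /etc/alertmanager/alertmanager.yml
# Which receiver handles a critical alert?
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical
```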

Dashboard Overview

My Grafana home dashboard shows:

Row 1: Cluster Overview
├── Total nodes (up/down)
├── Total VMs (running/stopped)
├── Cluster storage usage
└── Active alerts
Row 2: Per-Node Resources
├── CPU usage per node
├── Memory usage per node
├── Network I/O per node
└── Disk I/O per node
Row 3: Storage Health
├── ZFS pool status
├── Ceph health (if using)
├── Storage capacity trends
└── SMART warnings
Row 4: Backups
├── Last backup time
├── Backup success rate
├── Backup storage usage
└── Restore test status

The Lesson

You can’t manage what you can’t see.

The Proxmox UI shows now. Observability shows:

  • What happened (logs)
  • How things are trending (metrics)
  • What’s about to break (alerts)

The investment in monitoring pays off when:

  • Disk fills up → you knew 2 weeks ago
  • Host overloaded → you saw the trend
  • Ceph degraded → alerted immediately
  • Backup failed → notified same day

Without monitoring, you find out when users complain. With monitoring, you find out before users notice. That’s the difference between reactive and proactive operations.