Observability: Metrics, Logs, Alerts — What I Monitor on Proxmox

The Proxmox web UI shows current state. It doesn’t show trends. It doesn’t show “disk was filling up for weeks before it failed.” It doesn’t wake you up at 3 AM when something is about to break.

Observability means knowing what’s happening before users tell you. Metrics show trends. Logs show context. Alerts notify you before failures become outages.

You can’t manage what you can’t see, and the Proxmox UI alone doesn’t let you see enough.

What to Monitor

Host Metrics

| Metric       | Why                  | Alert Threshold |
| ------------ | -------------------- | --------------- |
| CPU usage    | Overloaded host      | >90% for 5 min  |
| Memory usage | OOM risk             | >85%            |
| Load average | System stress        | >cores × 2      |
| Disk I/O     | Storage bottleneck   | Latency >50 ms  |
| Network I/O  | Bandwidth saturation | >80% capacity   |
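These thresholds translate to PromQL roughly as follows — sketches against standard node-exporter series; adjust label matchers to your environment:

```promql
# CPU >90% for 5 min (inverted idle time)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90

# Memory >85% used
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85

# Load average above cores × 2 (one idle series exists per core)
node_load5 > 2 * count by (instance) (node_cpu_seconds_total{mode="idle"})

# Average disk read latency >50 ms, per device
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) > 0.05
```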

Storage Metrics

| Metric           | Why                     | Alert Threshold      |
| ---------------- | ----------------------- | -------------------- |
| Disk space       | Running out             | >80% full            |
| ZFS pool health  | Data integrity          | Any non-ONLINE state |
| ZFS ARC hit rate | Cache efficiency        | Below 80%            |
| Ceph health      | Cluster state           | Any non-HEALTH_OK    |
| SMART status     | Disk failure prediction | Any warning          |

VM Metrics

| Metric             | Why                | Alert Threshold       |
| ------------------ | ------------------ | --------------------- |
| VM count           | Capacity planning  | Depends               |
| Running vs stopped | Unexpected states  | Any unexpected state  |
| CPU steal time     | Overcommitted host | >10%                  |
| Balloon memory     | Memory pressure    | Significant deflation |
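Steal time in particular is worth alerting on. A rule sketch, assuming node exporter runs inside the guests (the metric is a standard node-exporter series; the alert name is my own):

```yaml
- alert: HighCPUSteal
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.instance }} CPU steal above 10% - host may be overcommitted"
```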

Prometheus + Grafana Setup

The standard stack: Prometheus scrapes and stores metrics; Grafana visualizes them.

Install on Separate VM

Don’t monitor Proxmox from Proxmox. If the host dies, monitoring dies.

# On monitoring VM - install Prometheus
apt update
apt install -y prometheus prometheus-node-exporter
# Add Grafana repository
apt install -y apt-transport-https software-properties-common
wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" | tee /etc/apt/sources.list.d/grafana.list
apt update
apt install -y grafana
systemctl enable --now prometheus prometheus-node-exporter grafana-server

Proxmox PVE Exporter

Prometheus exporter specifically for Proxmox:

# Install
pip install prometheus-pve-exporter
# Create config
cat > /etc/pve-exporter.yml << 'EOF'
default:
  user: monitoring@pve
  token_name: prometheus
  token_value: "xxxx-xxxx-xxxx"
  verify_ssl: false
EOF
# Create systemd service
cat > /etc/systemd/system/pve-exporter.service << 'EOF'
[Unit]
Description=Prometheus PVE Exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/pve_exporter /etc/pve-exporter.yml
Restart=always
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now pve-exporter
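The token referenced in the config has to exist on the Proxmox side. One way to create a read-only user and token — run on any PVE node; exact `pveum` flags can vary slightly between PVE versions:

```
# Read-only monitoring user plus an API token for the exporter
pveum user add monitoring@pve
pveum acl modify / --roles PVEAuditor --users monitoring@pve
pveum user token add monitoring@pve prometheus --privsep 0
# Paste the printed token value into /etc/pve-exporter.yml
```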

Node Exporter on Proxmox Hosts

Install on each Proxmox node:

apt install prometheus-node-exporter
systemctl enable --now prometheus-node-exporter

Prometheus Configuration

/etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporters on Proxmox hosts
  - job_name: 'proxmox-nodes'
    static_configs:
      - targets:
          - 'pve1:9100'
          - 'pve2:9100'
          - 'pve3:9100'

  # PVE exporter
  - job_name: 'proxmox-pve'
    static_configs:
      - targets:
          - 'localhost:9221'
    metrics_path: /pve
    params:
      module: [default]
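It's worth validating the file before reloading; `promtool` ships with the prometheus package:

```
promtool check config /etc/prometheus/prometheus.yml
systemctl reload prometheus
```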

Grafana Dashboards

Import community dashboards:

  • Proxmox VE: Dashboard ID 10347
  • Node Exporter Full: Dashboard ID 1860
  • ZFS: Dashboard ID 11337
  • Ceph: Dashboard ID 2842

Or create custom dashboards for your specific needs.

ZFS Monitoring

ZFS Exporter

# Install
pip install prometheus-zfs-exporter
# Run
zfs_exporter --port 9134

Key ZFS Metrics

# Prometheus queries
# Pool capacity
zfs_pool_allocated_bytes / zfs_pool_size_bytes * 100
# ARC hit rate
rate(zfs_arc_hits_total[5m]) / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m])) * 100
# Scrub errors
zfs_pool_scrub_errors_total
# Pool state (1 = ONLINE)
zfs_pool_health == 1
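The ARC hit-rate expression is long enough that a recording rule keeps dashboards and alerts consistent. A sketch — the recorded name `zfs:arc_hit_rate:5m` is my own, following the Prometheus `level:metric:operations` naming convention:

```yaml
groups:
  - name: zfs-recording
    rules:
      - record: zfs:arc_hit_rate:5m
        expr: >
          rate(zfs_arc_hits_total[5m])
          / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m])) * 100
```

An alert can then simply test `zfs:arc_hit_rate:5m < 80`.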

ZFS Alerts

/etc/prometheus/rules/zfs.yml
groups:
  - name: zfs
    rules:
      - alert: ZFSPoolDegraded
        expr: zfs_pool_health != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is degraded"
      - alert: ZFSPoolSpaceLow
        expr: (zfs_pool_allocated_bytes / zfs_pool_size_bytes) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ZFS pool {{ $labels.pool }} is {{ $value }}% full"
      - alert: ZFSScrubErrors
        expr: zfs_pool_scrub_errors_total > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "ZFS pool {{ $labels.pool }} has scrub errors"

Ceph Monitoring

Built-in Ceph Metrics

Ceph exposes Prometheus metrics natively:

# On Ceph manager node
ceph mgr module enable prometheus
# Metrics at
# http://ceph-mgr:9283/metrics

Prometheus Config for Ceph

# Add to prometheus.yml
- job_name: 'ceph'
  static_configs:
    - targets:
        - 'pve1:9283' # Ceph manager

Ceph Alerts

/etc/prometheus/rules/ceph.yml
groups:
  - name: ceph
    rules:
      - alert: CephHealthWarning
        expr: ceph_health_status == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph cluster health is HEALTH_WARN"
      - alert: CephHealthCritical
        expr: ceph_health_status == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ceph cluster health is HEALTH_ERR"
      - alert: CephOSDDown
        expr: ceph_osd_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ceph OSD {{ $labels.osd }} is down"
      - alert: CephPGsUnclean
        expr: ceph_pg_total - ceph_pg_clean > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph has {{ $value }} unclean PGs"

SMART Monitoring

Predict disk failures before they happen:

Install smartmontools

# On each Proxmox node
apt install smartmontools
# Enable SMART on disks
smartctl --smart=on /dev/sda
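smartd itself can mail you when attributes degrade, independent of the Prometheus pipeline — a useful second channel. A minimal /etc/smartd.conf sketch (the address is a placeholder; restart smartd after editing):

```
# Monitor all detected disks and all SMART attributes, mail on problems,
# and run a short self-test nightly at 02:00
DEVICESCAN -a -m admin@example.com -s (S/../.././02)
```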

Prometheus SMART Exporter

# Install
pip install prometheus-smart-exporter
# Run
smart_exporter --port 9110

SMART Alerts

/etc/prometheus/rules/smart.yml
groups:
  - name: smart
    rules:
      - alert: DiskSMARTWarning
        expr: smart_device_health != 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Disk {{ $labels.device }} SMART health warning"
      - alert: DiskReallocationCount
        expr: smart_raw_value{attribute_name="Reallocated_Sector_Ct"} > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Disk {{ $labels.device }} has reallocated sectors"

Log Aggregation

Loki for Logs

# docker-compose.yml for Loki
version: "3"

services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki-data:

Promtail on Proxmox Nodes

/etc/promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: proxmox
    static_configs:
      - targets:
          - localhost
        labels:
          job: proxmox
          host: pve1
          __path__: /var/log/*.log
  - job_name: pve-cluster
    static_configs:
      - targets:
          - localhost
        labels:
          job: pve-cluster
          host: pve1
          __path__: /var/log/pve/tasks/*

Key Logs to Monitor

# Proxmox-specific logs
/var/log/pveproxy.log # Web UI access
/var/log/pve/tasks/ # Task logs
/var/log/pve-firewall.log # Firewall logs
# System logs
/var/log/syslog # General system
/var/log/auth.log # Authentication
/var/log/kern.log # Kernel messages
# Ceph logs (if using)
/var/log/ceph/*.log
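With Promtail shipping these, LogQL queries in Grafana Explore pull context quickly. A few sketches against the labels defined above (the `filename` label is added automatically by Promtail):

```
# Errors across all shipped logs
{job="proxmox"} |= "error"

# Failed SSH logins on pve1
{job="proxmox", host="pve1", filename="/var/log/auth.log"} |= "Failed password"

# Recent PVE task logs
{job="pve-cluster", host="pve1"}
```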

Alerting Rules Summary

Critical Alerts (Page Immediately)

groups:
  - name: critical
    rules:
      - alert: HostDown
        expr: up{job="proxmox-nodes"} == 0
        for: 1m
        labels:
          severity: critical
      - alert: StorageCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 1m
        labels:
          severity: critical
      - alert: MemoryExhausted
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 5
        for: 1m
        labels:
          severity: critical
      - alert: ZFSPoolFailed
        expr: zfs_pool_health >= 2 # DEGRADED or worse (1 = ONLINE)
        for: 1m
        labels:
          severity: critical

Warning Alerts (Check Soon)

groups:
  - name: warnings
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 15m
        labels:
          severity: warning
      - alert: StorageWarning
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 20
        for: 5m
        labels:
          severity: warning
      - alert: BackupFailed
        expr: pve_storage_backup_last_success_time < (time() - 86400)
        for: 1h
        labels:
          severity: warning

Alertmanager Configuration

Route alerts appropriately:

/etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack'
      repeat_interval: 4h

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
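amtool, shipped with Alertmanager, validates the config and shows which receiver a given alert would reach — a quick sanity check:

```
amtool check-config /etc/alertmanager/alertmanager.yml
# Which receiver handles a critical alert?
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical
```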

Dashboard Overview

My Grafana home dashboard shows:

Row 1: Cluster Overview
├── Total nodes (up/down)
├── Total VMs (running/stopped)
├── Cluster storage usage
└── Active alerts
Row 2: Per-Node Resources
├── CPU usage per node
├── Memory usage per node
├── Network I/O per node
└── Disk I/O per node
Row 3: Storage Health
├── ZFS pool status
├── Ceph health (if using)
├── Storage capacity trends
└── SMART warnings
Row 4: Backups
├── Last backup time
├── Backup success rate
├── Backup storage usage
└── Restore test status

The Lesson

You can’t manage what you can’t see.

The Proxmox UI shows now. Observability shows:

  • What happened (logs)
  • How things are trending (metrics)
  • What’s about to break (alerts)

The investment in monitoring pays off when:

  • Disk fills up → you knew 2 weeks ago
  • Host overloaded → you saw the trend
  • Ceph degraded → alerted immediately
  • Backup failed → notified same day

Without monitoring, you find out when users complain. With monitoring, you find out before users notice. That’s the difference between reactive and proactive operations.