Performance Clinic: CPU Pinning, Hugepages, VirtIO, and Storage Tuning

Performance tuning is seductive. Forums are full of “enable this setting for 20% more speed.” Most of it is cargo culting — copying settings without understanding why.

Real performance optimization follows a process: measure, identify bottleneck, address bottleneck, measure again. Tweaking random settings without measuring is just superstition.

Optimization starts with measurement, not with tweaks.

Measure First

Before changing anything, understand your current performance.

Host Metrics

Terminal window
# Overall system performance
htop
# CPU usage per core
mpstat -P ALL 1
# Memory usage
free -h
vmstat 1
# Disk I/O
iostat -xz 1
# Network
iftop -i vmbr0

VM Performance

Terminal window
# Inside VM: Check for virtualization overhead
# CPU steal time (other VMs taking your CPU)
top # Look at %st column
# Disk latency
iostat -x 1
# From host: VM-specific metrics
qm monitor 100
info cpus
info block
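
If you prefer scripting these checks to typing into the monitor, the Proxmox API exposes the same counters. A minimal sketch, assuming VM 100 runs on the local node and the node name matches the hostname:

Terminal window
# Snapshot of live CPU, memory, and disk I/O counters for VM 100
pvesh get /nodes/$(hostname)/qemu/100/status/current --output-format json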

Benchmark Tools

Terminal window
# CPU benchmark
apt install sysbench
sysbench cpu run
# Disk benchmark
apt install fio
# Random 4K (database-like)
fio --name=rand --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=30 --group_reporting
# Sequential (large file)
fio --name=seq --ioengine=libaio --iodepth=1 --rw=read --bs=1m --direct=1 --size=4G --numjobs=1 --runtime=30 --group_reporting
# Network benchmark (between VMs)
apt install iperf3
# Server: iperf3 -s
# Client: iperf3 -c <server-ip>

VirtIO Drivers

VirtIO is paravirtualized I/O. Instead of emulating real hardware, the VM knows it’s virtualized and uses optimized drivers.

Performance Impact

| Device  | Emulated            | VirtIO                    |
|---------|---------------------|---------------------------|
| Network | E1000: ~1 Gbps      | virtio-net: 10+ Gbps      |
| Disk    | IDE: slow, high CPU | virtio-blk: fast, low CPU |
| Display | VGA: basic          | virtio-gpu: better        |
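
To confirm a Linux guest actually landed on VirtIO devices rather than emulated fallbacks, a quick check from inside the VM:

Terminal window
# Inside the VM: list VirtIO PCI devices and loaded driver modules
lspci | grep -i virtio
lsmod | grep virtio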

Configuring VirtIO

Terminal window
# Disk: Use virtio-scsi controller
qm set 100 --scsihw virtio-scsi-pci
qm set 100 --scsi0 local-zfs:vm-100-disk-0
# Network: Use virtio
qm set 100 --net0 virtio,bridge=vmbr0
# Display: Use virtio (Linux VMs)
qm set 100 --vga virtio

Windows VirtIO Drivers

Windows doesn’t include VirtIO drivers. Install them:

  1. Download ISO from Fedora: https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/
  2. Attach ISO to VM (one way to do this from the CLI is shown after this list)
  3. During Windows install: Load driver from ISO
  4. After install: Run virtio-win-gt-x64.msi for guest tools
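
For step 2, attaching the driver ISO from the CLI might look like the following. This is a sketch: the storage name (local) and the ISO filename are assumptions, so adjust them to wherever you uploaded the image.

Terminal window
# Attach the VirtIO driver ISO as a CD-ROM on a second IDE slot
qm set 100 --ide2 local:iso/virtio-win.iso,media=cdrom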

Storage Cache Modes

Cache mode affects performance vs. data safety:

| Mode         | Speed   | Safety    | Use Case                 |
|--------------|---------|-----------|--------------------------|
| none         | Fast    | Safe      | Production (default)     |
| writeback    | Fastest | Less safe | Benchmarks, non-critical |
| writethrough | Slower  | Safest    | Critical data            |
| directsync   | Slowest | Safest    | Maximum safety           |

Configure Cache

Terminal window
# No cache (recommended for production)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none
# Writeback (faster, less safe)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback
# With ZFS, cache=none is usually best
# ZFS has its own caching (ARC)

When to Use Writeback

Only with:

  • Battery-backed write cache (enterprise storage)
  • Non-critical VMs (dev, test)
  • Understanding that power loss = potential data loss

IO Threads

By default, all VM disk I/O goes through one QEMU thread. With IO threads, each disk gets its own thread.

Enable IO Threads

Terminal window
# IO threads on SCSI disks require the single controller variant
qm set 100 --scsihw virtio-scsi-single
# Enable iothread for disk
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=1
# For multiple disks, each can have its own thread
qm set 100 --scsi1 local-zfs:vm-100-disk-1,iothread=1

When IO Threads Help

  • Multiple disks per VM
  • High IOPS workloads
  • VMs with concurrent disk access
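
To verify the setting took effect, inspect the QEMU command line Proxmox generates; an iothread object should appear alongside the disk:

Terminal window
# Show the generated QEMU command line for VM 100
qm showcmd 100 --pretty | grep -i iothread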

CPU Configuration

CPU Type

Terminal window
# Host passthrough (best performance, limits migration)
qm set 100 --cpu host
# Specific type (allows migration between similar CPUs)
qm set 100 --cpu kvm64
# With flags (enable specific features)
qm set 100 --cpu host,flags=+aes

host gives best performance but limits live migration to identical CPUs.
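
A quick way to see what the guest actually gets: compare the CPU model and flags inside the VM before and after changing the type. A sketch for a Linux guest:

Terminal window
# Inside the VM: CPU model and whether AES-NI is exposed
lscpu | grep "Model name"
grep -c aes /proc/cpuinfo # non-zero means the aes flag is visible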

NUMA Awareness

NUMA (Non-Uniform Memory Access) matters on multi-socket systems. Memory attached to one socket is faster for CPUs on that socket.

Terminal window
# Check host NUMA topology
numactl --hardware
# Example output:
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 0 size: 32768 MB
# node 1 size: 32768 MB

Configure NUMA for VMs

Terminal window
# Enable NUMA for VM
qm set 100 --numa 1
# Pin VM to specific NUMA node
qm set 100 --numa0 cpus=0-3,memory=8192
# For large VMs spanning nodes
qm set 100 --numa0 cpus=0-3,memory=4096
qm set 100 --numa1 cpus=8-11,memory=4096
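
After starting the VM, check where its memory actually landed; it should sit on the node(s) you configured. A sketch using the PID file Proxmox writes for each running VM:

Terminal window
# Per-NUMA-node memory breakdown of VM 100's QEMU process
numastat -p $(cat /var/run/qemu-server/100.pid)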

CPU Pinning

Dedicate specific CPUs to a VM (reduces context switching):

Terminal window
# Pin VM to CPUs 0-3
qm set 100 --affinity 0-3
# Or via NUMA config
qm set 100 --numa0 cpus=0-3,memory=8192

Caution: Over-pinning leaves other VMs fighting for remaining CPUs.
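
To confirm the pinning, ask the kernel which CPUs the VM's process may run on, again via the Proxmox PID file:

Terminal window
# Show the CPU affinity of VM 100's QEMU process
taskset -cp $(cat /var/run/qemu-server/100.pid)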

Hugepages

Normal memory pages are 4KB. Hugepages (2MB or 1GB) reduce TLB misses for memory-intensive workloads.

Enable Hugepages

Terminal window
# Reserve hugepages on host
echo 4096 > /proc/sys/vm/nr_hugepages # 4096 × 2MB = 8GB
# Make persistent
echo "vm.nr_hugepages = 4096" >> /etc/sysctl.conf
# Verify
grep Huge /proc/meminfo

Configure VM for Hugepages

Terminal window
# Enable hugepages for VM
qm set 100 --hugepages 2
# Values: 2 (2MB pages), 1024 (1GB pages), any (auto)

When Hugepages Help

  • Large VMs (8GB+ RAM)
  • Memory-intensive workloads (databases)
  • Many VMs with significant memory
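
To confirm a VM is actually backed by hugepages, watch the free-page counter drop when it starts:

Terminal window
# HugePages_Free should drop by roughly VM RAM / 2MB after start
grep -E "HugePages_(Total|Free)" /proc/meminfo
qm start 100
grep -E "HugePages_(Total|Free)" /proc/meminfo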

Memory Ballooning

The balloon driver lets the host reclaim unused memory from a running VM.

Terminal window
# Enable ballooning
qm set 100 --balloon 2048 # Minimum memory
qm set 100 --memory 8192 # Maximum memory
# VM starts with 8GB, can shrink to 2GB if host needs RAM

Ballooning Trade-offs

  • Pro: Better memory utilization across VMs
  • Con: Performance impact when balloon inflates
  • Con: Swap inside VM if balloon too aggressive
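
You can watch the balloon from the host with the same QEMU monitor used earlier:

Terminal window
# Current balloon size vs. configured maximum
qm monitor 100
info balloon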

For latency-sensitive VMs, disable ballooning:

Terminal window
qm set 100 --balloon 0

Network Performance

Multiqueue

Enable multiple queues for virtio-net:

Terminal window
# Enable multiqueue (match to VM vCPUs, max 8)
qm set 100 --net0 virtio,bridge=vmbr0,queues=4

Inside VM:

Terminal window
# Set queues on the interface (match the queues= value above)
ethtool -L eth0 combined 4
# Verify current and maximum queue counts
ethtool -l eth0

Vhost-net

Vhost-net offloads virtio-net packet processing to the host kernel (usually enabled by default):

Terminal window
# Verify vhost-net is loaded
lsmod | grep vhost_net
# If not loaded
modprobe vhost_net

Storage Performance

ZFS Tuning

Terminal window
# Check ARC size
arc_summary | grep "ARC size"
# Increase ARC max (if you have RAM to spare)
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf # 8GB
update-initramfs -u # needed when the root filesystem is ZFS
# Transaction group commit interval in seconds (5 is the default);
# lower values flush more often (lower latency), higher values batch writes

LVM-thin Tuning

Terminal window
# Check thin pool status
lvs -o+data_percent
# Zeroing (disable for SSD, faster provisioning)
lvchange --zero n pve/data

Ceph Tuning

Terminal window
# Check pool settings
ceph osd pool get vmpool all
# Increase pg_num if needed
ceph osd pool set vmpool pg_num 256
# Adjust recovery (if impacting production)
ceph config set osd osd_recovery_max_active 1

Common Bottlenecks

CPU Bottleneck

Symptoms: High CPU usage, steal time in VMs

Terminal window
# Check host CPU
mpstat -P ALL 1
# Check VM steal time
top # %st column
# Solutions:
# - Reduce VM count
# - Pin VMs to specific CPUs
# - Upgrade host CPU

Memory Bottleneck

Symptoms: Swapping, OOM, balloon activity

Terminal window
# Check host memory
free -h
grep -E "MemTotal|MemFree|Buffers|Cached|SwapTotal|SwapFree" /proc/meminfo
# Check ZFS ARC (consuming RAM)
arc_summary | head -20
# Solutions:
# - Reduce ZFS ARC max
# - Reduce VM memory
# - Add more host RAM

Storage Bottleneck

Symptoms: High I/O wait, slow disk operations

Terminal window
# Check disk latency
iostat -x 1
# Look for:
# - await > 10ms (spinning disk) or > 1ms (SSD)
# - %util > 80%
# Solutions:
# - Move to faster storage
# - Enable IO threads
# - Reduce concurrent I/O (fewer VMs)

Network Bottleneck

Symptoms: Low throughput, high latency

Terminal window
# Check interface utilization
iftop -i vmbr0
# Check for errors
ip -s link show vmbr0
# Solutions:
# - Enable virtio multiqueue
# - Bond multiple NICs
# - Upgrade to faster network

Performance Testing Workflow

  1. Baseline: Measure current performance
  2. Identify: Find the bottleneck (CPU, RAM, disk, network)
  3. Change: Make ONE change
  4. Measure: Test the same workload
  5. Compare: Did it improve?
  6. Iterate: Repeat until satisfied

Never change multiple things at once. You won’t know what helped.
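
As a concrete sketch of that loop, here is the random-read fio job from earlier made repeatable, with jq (assumed installed) pulling out one comparable number:

Terminal window
# 1. Baseline inside the VM; save machine-readable results
fio --name=rand --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=30 --group_reporting --output-format=json --output=before.json
# 3. Apply ONE change on the host (e.g. iothread=1) and restart the VM
# 4. Re-run the identical fio command, writing after.json this time
# 5. Compare read IOPS across the two runs
jq '.jobs[0].read.iops' before.json after.json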

The Lesson

Optimization starts with measurement, not with tweaks.

Random performance settings from forums:

  • Might help your workload
  • Might hurt your workload
  • Might do nothing
  • You won’t know which without measuring

The process:

  1. Measure baseline
  2. Identify bottleneck
  3. Research solutions for THAT bottleneck
  4. Apply change
  5. Measure again
  6. Keep or revert

Performance tuning isn’t about knowing magic settings. It’s about understanding your workload, measuring it, and systematically removing bottlenecks.

The best optimization is often avoiding the problem: use VirtIO, use SSDs, have enough RAM. The tweaks come after the fundamentals are right.