Performance Clinic: CPU Pinning, Hugepages, VirtIO, and Storage Tuning

Performance tuning is seductive. Forums are full of “enable this setting for 20% more speed.” Most of it is cargo culting — copying settings without understanding why.

Real performance optimization follows a process: measure, identify bottleneck, address bottleneck, measure again. Tweaking random settings without measuring is just superstition.

Optimization starts with measurement, not with tweaks.

Measure First

Before changing anything, understand your current performance.

Host Metrics

Terminal window
# Overall system performance
htop
# CPU usage per core
mpstat -P ALL 1
# Memory usage
free -h
vmstat 1
# Disk I/O
iostat -xz 1
# Network
iftop -i vmbr0

VM Performance

Terminal window
# Inside VM: Check for virtualization overhead
# CPU steal time (other VMs taking your CPU)
top # Look at %st column
# Disk latency
iostat -x 1
# From host: VM-specific metrics
qm monitor 100
info cpus
info block
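
If you prefer scripting these checks to typing into the monitor, the Proxmox API exposes the same counters. A minimal sketch, assuming VM 100 runs on the local node and the node name matches the hostname:

Terminal window
# Snapshot of live CPU, memory, and disk I/O counters for VM 100
pvesh get /nodes/$(hostname)/qemu/100/status/current --output-format json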

Benchmark Tools

Terminal window
# CPU benchmark
apt install sysbench
sysbench cpu run
# Disk benchmark
apt install fio
# Random 4K (database-like)
fio --name=rand --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=30 --group_reporting
# Sequential (large file)
fio --name=seq --ioengine=libaio --iodepth=1 --rw=read --bs=1m --direct=1 --size=4G --numjobs=1 --runtime=30 --group_reporting
# Network benchmark (between VMs)
apt install iperf3
# Server: iperf3 -s
# Client: iperf3 -c <server-ip>

VirtIO Drivers

VirtIO is paravirtualized I/O. Instead of emulating real hardware, the VM knows it’s virtualized and uses optimized drivers.

Performance Impact

| Device  | Emulated            | VirtIO                    |
|---------|---------------------|---------------------------|
| Network | E1000: ~1 Gbps      | virtio-net: 10+ Gbps      |
| Disk    | IDE: slow, high CPU | virtio-blk: fast, low CPU |
| Display | VGA: basic          | virtio-gpu: better        |
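
To confirm a Linux guest actually landed on VirtIO devices rather than emulated fallbacks, a quick check from inside the VM:

Terminal window
# Inside the VM: list VirtIO PCI devices and loaded driver modules
lspci | grep -i virtio
lsmod | grep virtio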

Configuring VirtIO

Terminal window
# Disk: Use virtio-scsi controller
qm set 100 --scsihw virtio-scsi-pci
qm set 100 --scsi0 local-zfs:vm-100-disk-0
# Network: Use virtio
qm set 100 --net0 virtio,bridge=vmbr0
# Display: Use virtio (Linux VMs)
qm set 100 --vga virtio

Windows VirtIO Drivers

Windows doesn’t include VirtIO drivers. Install them:

  1. Download ISO from Fedora: https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/
  2. Attach ISO to VM (one way to do this from the CLI is shown after this list)
  3. During Windows install: Load driver from ISO
  4. After install: Run virtio-win-gt-x64.msi for guest tools
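
For step 2, attaching the driver ISO from the CLI might look like the following. This is a sketch: the storage name (local) and the ISO filename are assumptions, so adjust them to wherever you uploaded the image.

Terminal window
# Attach the VirtIO driver ISO as a CD-ROM on a second IDE slot
qm set 100 --ide2 local:iso/virtio-win.iso,media=cdrom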

Storage Cache Modes

Cache mode affects performance vs. data safety:

| Mode         | Speed   | Safety    | Use Case                 |
|--------------|---------|-----------|--------------------------|
| none         | Fast    | Safe      | Production (default)     |
| writeback    | Fastest | Less safe | Benchmarks, non-critical |
| writethrough | Slower  | Safest    | Critical data            |
| directsync   | Slowest | Safest    | Maximum safety           |

Configure Cache

Terminal window
# No cache (recommended for production)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none
# Writeback (faster, less safe)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback
# With ZFS, cache=none is usually best
# ZFS has its own caching (ARC)

When to Use Writeback

Only with:

  • Battery-backed write cache (enterprise storage)
  • Non-critical VMs (dev, test)
  • Understanding that power loss = potential data loss

IO Threads

By default, all VM disk I/O goes through one QEMU thread. With IO threads, each disk gets its own thread.

Enable IO Threads

Terminal window
# IO threads on SCSI disks require the single controller variant
qm set 100 --scsihw virtio-scsi-single
# Enable iothread for disk
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=1
# For multiple disks, each can have its own thread
qm set 100 --scsi1 local-zfs:vm-100-disk-1,iothread=1

When IO Threads Help

  • Multiple disks per VM
  • High IOPS workloads
  • VMs with concurrent disk access
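
To verify the setting took effect, inspect the QEMU command line Proxmox generates; an iothread object should appear alongside the disk:

Terminal window
# Show the generated QEMU command line for VM 100
qm showcmd 100 --pretty | grep -i iothread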

CPU Configuration

CPU Type

Terminal window
# Host passthrough (best performance, limits migration)
qm set 100 --cpu host
# Specific type (allows migration between similar CPUs)
qm set 100 --cpu kvm64
# With flags (enable specific features)
qm set 100 --cpu host,flags=+aes

host gives best performance but limits live migration to identical CPUs.
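
A quick way to see what the guest actually gets: compare the CPU model and flags inside the VM before and after changing the type. A sketch for a Linux guest:

Terminal window
# Inside the VM: CPU model and whether AES-NI is exposed
lscpu | grep "Model name"
grep -c aes /proc/cpuinfo # non-zero means the aes flag is visible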

NUMA Awareness

NUMA (Non-Uniform Memory Access) matters on multi-socket systems. Memory attached to one socket is faster for CPUs on that socket.

Terminal window
# Check host NUMA topology
numactl --hardware
# Example output:
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 0 size: 32768 MB
# node 1 size: 32768 MB

Configure NUMA for VMs

Terminal window
# Enable NUMA for VM
qm set 100 --numa 1
# Pin VM to specific NUMA node
qm set 100 --numa0 cpus=0-3,memory=8192
# For large VMs spanning nodes
qm set 100 --numa0 cpus=0-3,memory=4096
qm set 100 --numa1 cpus=8-11,memory=4096
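
After starting the VM, check where its memory actually landed; it should sit on the node(s) you configured. A sketch using the PID file Proxmox writes for each running VM:

Terminal window
# Per-NUMA-node memory breakdown of VM 100's QEMU process
numastat -p $(cat /var/run/qemu-server/100.pid)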

CPU Pinning

Dedicate specific CPUs to a VM (reduces context switching):

Terminal window
# Pin VM to CPUs 0-3
qm set 100 --affinity 0-3
# Or via NUMA config
qm set 100 --numa0 cpus=0-3,memory=8192

Caution: Over-pinning leaves other VMs fighting for remaining CPUs.
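
To confirm the pinning, ask the kernel which CPUs the VM's process may run on, again via the Proxmox PID file:

Terminal window
# Show the CPU affinity of VM 100's QEMU process
taskset -cp $(cat /var/run/qemu-server/100.pid)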

Hugepages

Normal memory pages are 4KB. Hugepages (2MB or 1GB) reduce TLB misses for memory-intensive workloads.

Enable Hugepages

Terminal window
# Reserve hugepages on host
echo 4096 > /proc/sys/vm/nr_hugepages # 4096 × 2MB = 8GB
# Make persistent
echo "vm.nr_hugepages = 4096" >> /etc/sysctl.conf
# Verify
grep Huge /proc/meminfo

Configure VM for Hugepages

Terminal window
# Enable hugepages for VM
qm set 100 --hugepages 2
# Values: 2 (2MB pages), 1024 (1GB pages), any (auto)

When Hugepages Help

  • Large VMs (8GB+ RAM)
  • Memory-intensive workloads (databases)
  • Many VMs with significant memory
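
To confirm a VM is actually backed by hugepages, watch the free-page counter drop when it starts:

Terminal window
# HugePages_Free should drop by roughly VM RAM / 2MB after start
grep -E "HugePages_(Total|Free)" /proc/meminfo
qm start 100
grep -E "HugePages_(Total|Free)" /proc/meminfo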

Memory Ballooning

The balloon driver lets the host reclaim unused memory from a running VM.

Terminal window
# Enable ballooning
qm set 100 --balloon 2048 # Minimum memory
qm set 100 --memory 8192 # Maximum memory
# VM starts with 8GB, can shrink to 2GB if host needs RAM

Ballooning Trade-offs

  • Pro: Better memory utilization across VMs
  • Con: Performance impact when balloon inflates
  • Con: Swap inside VM if balloon too aggressive
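
You can watch the balloon from the host with the same QEMU monitor used earlier:

Terminal window
# Current balloon size vs. configured maximum
qm monitor 100
info balloon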

For latency-sensitive VMs, disable ballooning:

Terminal window
qm set 100 --balloon 0

Network Performance

Multiqueue

Enable multiple queues for virtio-net:

Terminal window
# Enable multiqueue (match to VM vCPUs, max 8)
qm set 100 --net0 virtio,bridge=vmbr0,queues=4

Inside VM:

Terminal window
# Set queues on the interface (match the queues= value above)
ethtool -L eth0 combined 4
# Verify current and maximum queue counts
ethtool -l eth0

Vhost-net

Vhost-net offloads virtio-net packet processing to the host kernel (usually enabled by default):

Terminal window
# Verify vhost-net is loaded
lsmod | grep vhost_net
# If not loaded
modprobe vhost_net

Storage Performance

ZFS Tuning

Terminal window
# Check ARC size
arc_summary | grep "ARC size"
# Increase ARC max (if you have RAM to spare)
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf # 8GB
update-initramfs -u # needed when the root filesystem is ZFS
# Transaction group commit interval in seconds (5 is the default);
# lower values flush more often (lower latency), higher values batch writes

LVM-thin Tuning

Terminal window
# Check thin pool status
lvs -o+data_percent
# Zeroing (disable for SSD, faster provisioning)
lvchange --zero n pve/data

Ceph Tuning

Terminal window
# Check pool settings
ceph osd pool get vmpool all
# Increase pg_num if needed
ceph osd pool set vmpool pg_num 256
# Adjust recovery (if impacting production)
ceph config set osd osd_recovery_max_active 1

Common Bottlenecks

CPU Bottleneck

Symptoms: High CPU usage, steal time in VMs

Terminal window
# Check host CPU
mpstat -P ALL 1
# Check VM steal time
top # %st column
# Solutions:
# - Reduce VM count
# - Pin VMs to specific CPUs
# - Upgrade host CPU

Memory Bottleneck

Symptoms: Swapping, OOM, balloon activity

Terminal window
# Check host memory
free -h
grep -E "MemTotal|MemFree|Buffers|Cached|SwapTotal|SwapFree" /proc/meminfo
# Check ZFS ARC (consuming RAM)
arc_summary | head -20
# Solutions:
# - Reduce ZFS ARC max
# - Reduce VM memory
# - Add more host RAM

Storage Bottleneck

Symptoms: High I/O wait, slow disk operations

Terminal window
# Check disk latency
iostat -x 1
# Look for:
# - await > 10ms (spinning disk) or > 1ms (SSD)
# - %util > 80%
# Solutions:
# - Move to faster storage
# - Enable IO threads
# - Reduce concurrent I/O (fewer VMs)

Network Bottleneck

Symptoms: Low throughput, high latency

Terminal window
# Check interface utilization
iftop -i vmbr0
# Check for errors
ip -s link show vmbr0
# Solutions:
# - Enable virtio multiqueue
# - Bond multiple NICs
# - Upgrade to faster network

Performance Testing Workflow

  1. Baseline: Measure current performance
  2. Identify: Find the bottleneck (CPU, RAM, disk, network)
  3. Change: Make ONE change
  4. Measure: Test the same workload
  5. Compare: Did it improve?
  6. Iterate: Repeat until satisfied

Never change multiple things at once. You won’t know what helped.
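
As a concrete sketch of that loop, here is the random-read fio job from earlier made repeatable, with jq (assumed installed) pulling out one comparable number:

Terminal window
# 1. Baseline inside the VM; save machine-readable results
fio --name=rand --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=30 --group_reporting --output-format=json --output=before.json
# 3. Apply ONE change on the host (e.g. iothread=1) and restart the VM
# 4. Re-run the identical fio command, writing after.json this time
# 5. Compare read IOPS across the two runs
jq '.jobs[0].read.iops' before.json after.json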

The Lesson

Optimization starts with measurement, not with tweaks.

Random performance settings from forums:

  • Might help your workload
  • Might hurt your workload
  • Might do nothing
  • You won’t know which without measuring

The process:

  1. Measure baseline
  2. Identify bottleneck
  3. Research solutions for THAT bottleneck
  4. Apply change
  5. Measure again
  6. Keep or revert

Performance tuning isn’t about knowing magic settings. It’s about understanding your workload, measuring it, and systematically removing bottlenecks.

The best optimization is often avoiding the problem: use VirtIO, use SSDs, have enough RAM. The tweaks come after the fundamentals are right.