Performance tuning is seductive. Forums are full of “enable this setting for 20% more speed.” Most of it is cargo culting — copying settings without understanding why.
Real performance optimization follows a process: measure, identify bottleneck, address bottleneck, measure again. Tweaking random settings without measuring is just superstition.
Optimization starts with measurement, not with tweaks.
## Measure First
Before changing anything, understand your current performance.
### Host Metrics
```bash
# Overall system performance
htop

# CPU usage per core
mpstat -P ALL 1

# Memory usage
free -h
vmstat 1

# Disk I/O
iostat -xz 1

# Network
iftop -i vmbr0
```

### VM Performance
```bash
# Inside VM: Check for virtualization overhead
# CPU steal time (other VMs taking your CPU)
top  # Look at %st column

# Disk latency
iostat -x 1

# From host: VM-specific metrics
qm monitor 100
# At the monitor prompt:
info cpus
info block
```

### Benchmark Tools
```bash
# CPU benchmark
apt install sysbench
sysbench cpu run

# Disk benchmark
apt install fio

# Random 4K (database-like)
fio --name=rand --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=30 --group_reporting

# Sequential (large file)
fio --name=seq --ioengine=libaio --iodepth=1 --rw=read --bs=1m --direct=1 --size=4G --numjobs=1 --runtime=30 --group_reporting

# Network benchmark (between VMs)
apt install iperf3
# Server: iperf3 -s
# Client: iperf3 -c <server-ip>
```

## VirtIO Drivers
VirtIO is paravirtualized I/O. Instead of emulating real hardware, the VM knows it’s virtualized and uses optimized drivers.
### Performance Impact
| Device | Emulated | VirtIO |
|---|---|---|
| Network | E1000: ~1 Gbps | virtio-net: 10+ Gbps |
| Disk | IDE: slow, high CPU | virtio-blk: fast, low CPU |
| Display | VGA: basic | virtio-gpu: better |
### Configuring VirtIO
```bash
# Disk: Use virtio-scsi controller
qm set 100 --scsihw virtio-scsi-pci
qm set 100 --scsi0 local-zfs:vm-100-disk-0

# Network: Use virtio
qm set 100 --net0 virtio,bridge=vmbr0

# Display: Use virtio (Linux VMs)
qm set 100 --vga virtio
```

### Windows VirtIO Drivers
Windows doesn’t include VirtIO drivers. Install them:
- Download ISO from Fedora: https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/
- Attach ISO to VM
- During Windows install: Load driver from ISO
- After install: Run `virtio-win-gt-x64.msi` for guest tools
## Storage Cache Modes
Cache mode affects performance vs. data safety:
| Mode | Speed | Safety | Use Case |
|---|---|---|---|
| none | Fast | Safe | Production (default) |
| writeback | Fastest | Less safe | Benchmarks, non-critical |
| writethrough | Slower | Safe | Critical data |
| directsync | Slowest | Safest | Maximum safety |
### Configure Cache
```bash
# No cache (recommended for production)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none

# Writeback (faster, less safe)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback

# With ZFS, cache=none is usually best
# ZFS has its own caching (ARC)
```

### When to Use Writeback
Only with:
- Battery-backed write cache (enterprise storage)
- Non-critical VMs (dev, test)
- Understanding that power loss = potential data loss
## IO Threads
By default, all VM disk I/O goes through one QEMU thread. With IO threads, each disk gets its own thread.
### Enable IO Threads
```bash
# Enable iothread for disk
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=1

# For multiple disks, each can have its own thread
qm set 100 --scsi1 local-zfs:vm-100-disk-1,iothread=1
```

### When IO Threads Help
- Multiple disks per VM
- High IOPS workloads
- VMs with concurrent disk access
## CPU Configuration
### CPU Type
```bash
# Host passthrough (best performance, limits migration)
qm set 100 --cpu host

# Specific type (allows migration between similar CPUs)
qm set 100 --cpu kvm64

# With flags (enable specific features)
qm set 100 --cpu host,flags=+aes
```

`host` gives the best performance but limits live migration to hosts with identical CPUs.
### NUMA Awareness
NUMA (Non-Uniform Memory Access) matters on multi-socket systems. Memory attached to one socket is faster for CPUs on that socket.
```bash
# Check host NUMA topology
numactl --hardware

# Example output:
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 0 size: 32768 MB
# node 1 size: 32768 MB
```

### Configure NUMA for VMs
```bash
# Enable NUMA for VM
qm set 100 --numa 1

# Pin VM to specific NUMA node
qm set 100 --numa0 cpus=0-3,memory=8192

# For large VMs spanning nodes
qm set 100 --numa0 cpus=0-3,memory=4096
qm set 100 --numa1 cpus=8-11,memory=4096
```

### CPU Pinning
Dedicate specific CPUs to a VM (reduces context switching):
```bash
# Pin VM to CPUs 0-3
qm set 100 --affinity 0-3

# Or via NUMA config
qm set 100 --numa0 cpus=0-3,memory=8192
```

Caution: Over-pinning leaves other VMs fighting for the remaining CPUs.
## Hugepages
Normal memory pages are 4KB. Hugepages (2MB or 1GB) reduce TLB misses for memory-intensive workloads.
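The arithmetic behind this is easy to sketch. For a hypothetical 8 GiB guest, the number of page mappings needed to cover its RAM drops by a factor of 512 with 2 MiB pages:

```bash
# Mappings needed to cover 8 GiB of guest RAM (illustrative arithmetic)
ram=$((8 * 1024 * 1024 * 1024))                       # 8 GiB in bytes
echo "4 KiB pages:     $((ram / 4096)) mappings"      # 2097152
echo "2 MiB hugepages: $((ram / 2097152)) mappings"   # 4096
```

Fewer mappings means the TLB covers a much larger fraction of the working set, which is where the gain for memory-hungry workloads comes from.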
### Enable Hugepages
```bash
# Reserve hugepages on host
echo 4096 > /proc/sys/vm/nr_hugepages  # 4096 × 2MB = 8GB

# Make persistent
echo "vm.nr_hugepages = 4096" >> /etc/sysctl.conf

# Verify
grep Huge /proc/meminfo
```

### Configure VM for Hugepages
```bash
# Enable hugepages for VM
qm set 100 --hugepages 2

# Values: 2 (2MB pages), 1024 (1GB pages), any (auto)
```

### When Hugepages Help
- Large VMs (8GB+ RAM)
- Memory-intensive workloads (databases)
- Many VMs with significant memory
## Memory Ballooning
The balloon driver lets the host reclaim unused VM memory.
```bash
# Enable ballooning
qm set 100 --balloon 2048  # Minimum memory
qm set 100 --memory 8192   # Maximum memory

# VM starts with 8GB, can shrink to 2GB if host needs RAM
```

### Ballooning Trade-offs
- Pro: Better memory utilization across VMs
- Con: Performance impact when balloon inflates
- Con: Swap inside VM if balloon too aggressive
For latency-sensitive VMs, disable ballooning:
```bash
qm set 100 --balloon 0
```

## Network Performance
### Multiqueue
Enable multiple queues for virtio-net:
```bash
# Enable multiqueue (match to VM vCPUs, max 8)
qm set 100 --net0 virtio,bridge=vmbr0,queues=4
```

Inside VM:

```bash
# Set queues on interface
ethtool -L eth0 combined 4
```

### Vhost-net
Offload network processing to kernel (usually enabled by default):
```bash
# Verify vhost-net is loaded
lsmod | grep vhost_net

# If not loaded
modprobe vhost_net
```

## Storage Performance
### ZFS Tuning
```bash
# Check ARC size
arc_summary | grep "ARC size"

# Increase ARC max (if you have RAM)
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf  # 8GB

# For SSDs, adjust transaction group timing
# (faster sync, lower latency)
echo 5 > /sys/module/zfs/parameters/zfs_txg_timeout
```

### LVM-thin Tuning
```bash
# Check thin pool status
lvs -o+data_percent

# Zeroing (disable for SSD, faster provisioning)
lvchange --zero n pve/data
```

### Ceph Tuning
```bash
# Check pool settings
ceph osd pool get vmpool all

# Increase pg_num if needed
ceph osd pool set vmpool pg_num 256

# Adjust recovery (if impacting production)
ceph config set osd osd_recovery_max_active 1
```

## Common Bottlenecks
### CPU Bottleneck
Symptoms: High CPU usage, steal time in VMs
```bash
# Check host CPU
mpstat -P ALL 1

# Check VM steal time
top  # %st column

# Solutions:
# - Reduce VM count
# - Pin VMs to specific CPUs
# - Upgrade host CPU
```

### Memory Bottleneck
Symptoms: Swapping, OOM, balloon activity
```bash
# Check host memory
free -h
cat /proc/meminfo | grep -E "MemTotal|MemFree|Buffers|Cached|SwapTotal|SwapFree"

# Check ZFS ARC (consuming RAM)
arc_summary | head -20

# Solutions:
# - Reduce ZFS ARC max
# - Reduce VM memory
# - Add more host RAM
```

### Storage Bottleneck
Symptoms: High I/O wait, slow disk operations
```bash
# Check disk latency
iostat -x 1

# Look for:
# - await > 10ms (spinning disk) or > 1ms (SSD)
# - %util > 80%

# Solutions:
# - Move to faster storage
# - Enable IO threads
# - Reduce concurrent I/O (fewer VMs)
```

### Network Bottleneck
Symptoms: Low throughput, high latency
```bash
# Check interface utilization
iftop -i vmbr0

# Check for errors
ip -s link show vmbr0

# Solutions:
# - Enable virtio multiqueue
# - Bond multiple NICs
# - Upgrade to faster network
```

## Performance Testing Workflow
1. Baseline: Measure current performance
2. Identify: Find the bottleneck (CPU, RAM, disk, network)
3. Change: Make ONE change
4. Measure: Test the same workload
5. Compare: Did it improve?
6. Iterate: Repeat until satisfied
Never change multiple things at once. You won’t know what helped.
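The "compare" step deserves an actual number, not a gut feeling. A minimal sketch, assuming you captured a single metric (say, fio IOPS) before and after the change; the values below are placeholders, not real results:

```bash
# Compare a baseline measurement against a post-change measurement.
# 12000 and 15300 are placeholder IOPS numbers for illustration.
baseline=12000
after=15300
awk -v b="$baseline" -v a="$after" \
    'BEGIN { printf "change: %+.1f%%\n", (a - b) / b * 100 }'
# prints: change: +27.5%
```

If the number didn't move meaningfully, revert the change before trying the next one.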
## The Lesson
Optimization starts with measurement, not with tweaks.
Random performance settings from forums:
- Might help your workload
- Might hurt your workload
- Might do nothing
- You won’t know which without measuring
The process:

1. Measure baseline
2. Identify bottleneck
3. Research solutions for THAT bottleneck
4. Apply change
5. Measure again
6. Keep or revert
Performance tuning isn’t about knowing magic settings. It’s about understanding your workload, measuring it, and systematically removing bottlenecks.
The best optimization is often avoiding the problem: use VirtIO, use SSDs, have enough RAM. The tweaks come after the fundamentals are right.