<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Berik Ashimov&apos;s Blog</title><description>Thoughts on infrastructure, reliability, and engineering from a Senior IT Engineer with 15+ years of experience.</description><link>https://ashimov.com/</link><language>en-us</language><copyright>Copyright 2026 Berik Ashimov</copyright><managingEditor>berik@ashimov.com (Berik Ashimov)</managingEditor><webMaster>berik@ashimov.com (Berik Ashimov)</webMaster><ttl>60</ttl><item><title>NX-OS Spine/Leaf Operations: vPC, Port-Channels, and Pre-Production Checks</title><link>https://ashimov.com/posts/cisco-nexus-spineleaf/</link><guid isPermaLink="true">https://ashimov.com/posts/cisco-nexus-spineleaf/</guid><description>Operate Nexus spine/leaf fabrics without surprises. Covers vPC operational checks, port-channel hygiene, OSPF/BGP underlay verification, and failure drills before go-live.</description><pubDate>Tue, 17 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Fabric is up. vPC formed. Port-channels bundled. Then a link fails, and traffic blackholes. Or a leaf reboots, and half the servers lose connectivity. Or ECMP doesn&apos;t balance as expected.&lt;/p&gt;
&lt;p&gt;Spine/leaf sounds simple — until failure scenarios reveal configuration gaps. The time to discover these is before production, not during an outage.&lt;/p&gt;
&lt;h2&gt;vPC as an Operational Object&lt;/h2&gt;
&lt;h3&gt;What vPC Actually Is&lt;/h3&gt;
&lt;p&gt;vPC (Virtual Port Channel) makes two Nexus switches appear as one to downstream devices:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;        [Spine 1]     [Spine 2]
           │              │
     ┌─────┴──────────────┴─────┐
     │                          │
  [Leaf 1]──vPC Peer Link──[Leaf 2]
     │                          │
     └──────────┬───────────────┘
                │
           [Server]
           (port-channel)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server sees one port-channel to one &quot;switch.&quot; In reality, half the links go to Leaf 1, half to Leaf 2.&lt;/p&gt;
&lt;h3&gt;Critical vPC Components&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! vPC Domain configuration
vpc domain 1
  peer-switch
  peer-keepalive destination 10.0.0.2 source 10.0.0.1
  peer-gateway
  layer3 peer-router
  auto-recovery
  delay restore 120
  delay restore interface-vlan 60
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key elements:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Failure Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Peer-link&lt;/td&gt;
&lt;td&gt;Sync MAC tables, forward orphan traffic&lt;/td&gt;
&lt;td&gt;vPC suspends on secondary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peer-keepalive&lt;/td&gt;
&lt;td&gt;Detect peer failure&lt;/td&gt;
&lt;td&gt;Split-brain if both fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peer-gateway&lt;/td&gt;
&lt;td&gt;Route packets addressed to the peer&apos;s router MAC locally&lt;/td&gt;
&lt;td&gt;Traffic blackhole&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-recovery&lt;/td&gt;
&lt;td&gt;Re-enable vPC after split-brain&lt;/td&gt;
&lt;td&gt;Manual intervention needed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;vPC Health Checks&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Overall vPC status
show vpc

# Expected output:
# vPC domain id         : 1
# Peer status           : peer adjacency formed ok
# vPC keep-alive status : peer is alive
# Configuration consistency status : success
# Per-vlan consistency status     : success
# Type-2 consistency status       : success
# vPC role              : primary

# Peer-link status
show vpc peer-link

# vPC consistency check
show vpc consistency-parameters global
show vpc consistency-parameters interface port-channel 10
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Consistency Check Failures&lt;/h3&gt;
&lt;p&gt;vPC requires certain configs to match on both peers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check what&apos;s inconsistent
show vpc consistency-parameters global

# Type 1 (must match or vPC won&apos;t form):
# - STP mode, VLAN state, port-type
# - vPC domain settings

# Type 2 (warning, vPC still works):
# - VLAN configurations
# - IGMP snooping settings
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix pattern&lt;/strong&gt;: Compare configs side by side:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On both switches
show run vpc
show run interface port-channel X

# Look for differences in:
# - allowed VLANs
# - switchport mode
# - STP settings
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Port-Channel Hygiene&lt;/h2&gt;
&lt;h3&gt;LACP Configuration&lt;/h3&gt;
&lt;p&gt;Always use LACP, never static:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Server-facing port-channel (vPC)
interface port-channel 10
  description Server-Cluster-01
  switchport mode trunk
  switchport trunk allowed vlan 100-110
  vpc 10

interface Ethernet1/1
  description Server-Cluster-01-Link1
  switchport mode trunk
  switchport trunk allowed vlan 100-110
  channel-group 10 mode active

! LACP must be active on both ends
! &quot;mode active&quot; = initiate LACP
! &quot;mode passive&quot; = respond only (avoid)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Allowed VLANs&lt;/h3&gt;
&lt;p&gt;Only allow VLANs that should traverse the link:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! WRONG: Allow all VLANs
interface port-channel 10
  switchport trunk allowed vlan all

! RIGHT: Explicit VLAN list
interface port-channel 10
  switchport trunk allowed vlan 100-110,200
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Why it matters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Broadcast domains stay contained&lt;/li&gt;
&lt;li&gt;STP topology is cleaner&lt;/li&gt;
&lt;li&gt;Troubleshooting is easier&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Native VLAN&lt;/h3&gt;
&lt;p&gt;Match native VLAN on both ends to avoid untagged traffic issues:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Set explicit native VLAN
interface port-channel 10
  switchport trunk native vlan 999

! Verify
show interface port-channel 10 trunk
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MTU Configuration&lt;/h3&gt;
&lt;p&gt;Jumbo frames require consistent MTU end-to-end:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! System MTU (affects all L2 interfaces)
system jumbomtu 9216

! Per-interface MTU (L3)
interface Ethernet1/1
  mtu 9216

! Verify
show interface port-channel 10 | include MTU

! Test end-to-end
ping 10.0.1.100 df-bit packet-size 9000
&lt;/code&gt;&lt;/pre&gt;
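&lt;p&gt;As a rough sanity check, the header math behind the jumbo ping above can be sketched in Python. Figures are illustrative; whether &lt;code&gt;packet-size&lt;/code&gt; counts the ICMP payload or the full IP packet varies by platform, so verify on yours:&lt;/p&gt;

```python
# Header math for the jumbo-frame ping test (illustrative figures only).
# Assumption: "packet-size 9000" sets the ICMP payload; IPv4 and ICMP
# headers ride on top, and the result must fit the 9216-byte interface MTU.
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8
MTU = 9216

def ip_packet_size(icmp_payload):
    """Total IPv4 packet size for an ICMP echo with the given payload."""
    return icmp_payload + ICMP_HEADER + IP_HEADER

total = ip_packet_size(9000)
print(f"IP packet: {total} bytes, headroom under MTU {MTU}: {MTU - total} bytes")
# Negative headroom means the df-bit ping fails with "fragmentation needed".
```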
&lt;h3&gt;Port-Channel Verification&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Status overview
show port-channel summary

# Expected output:
# 10     Po10(SU)   Eth      LACP     Eth1/1(P)   Eth1/2(P)
# SU = Layer2, Up
# P = member is up and bundled

# Detailed status
show port-channel database interface port-channel 10

# LACP counters
show lacp counters interface port-channel 10

# Member interface status
show lacp neighbor interface port-channel 10
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Underlay Routing Sanity&lt;/h2&gt;
&lt;h3&gt;OSPF Underlay Checks&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify all adjacencies are FULL
show ip ospf neighbors

# Expected: All neighbors in FULL state
# FULL/DR, FULL/BDR, FULL/DROTHER

# Check for stuck adjacencies
show ip ospf neighbors | include INIT|2WAY|EXSTART

# Verify routes are learned
show ip route ospf

# Check OSPF database (LSA counts should match across the fabric)
show ip ospf database
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP Underlay Checks&lt;/h3&gt;
&lt;p&gt;For eBGP spine/leaf:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# All neighbors established
show bgp ipv4 unicast summary

# Expected: State = Established, or showing prefix count
# Neighbor        V    AS    MsgRcvd  MsgSent   State/PfxRcd
# 10.0.1.1        4    65001 1234     1234      10

# Check for routes from all spines
show bgp ipv4 unicast

# Verify ECMP
show ip route 10.0.2.0/24

# Should show multiple next-hops if ECMP working
# via 10.0.1.1, Eth1/49, via 10.0.1.2, Eth1/50
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;ECMP Behavior&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check maximum ECMP paths
show running-config | include maximum-paths

! Configure if needed
router bgp 65001
  address-family ipv4 unicast
    maximum-paths 4
    maximum-paths ibgp 4

# Verify load balancing
show ip load-sharing

# Test ECMP path selection
show routing hash 10.0.1.100 10.0.2.100 ip
&lt;/code&gt;&lt;/pre&gt;
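&lt;p&gt;Conceptually, ECMP path selection is a deterministic hash over flow fields: one flow always rides one path, while many flows spread across all of them. A minimal Python sketch of the idea (the real NX-OS hash uses hardware-specific inputs and the configured load-sharing fields):&lt;/p&gt;

```python
import hashlib

# Illustrative sketch of ECMP flow hashing: hash the flow fields, then pick
# one next-hop from the equal-cost set. Per-flow, not per-packet, so packets
# of a single flow never reorder across paths.
NEXT_HOPS = ["10.0.1.1 (Eth1/49)", "10.0.1.2 (Eth1/50)"]

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return NEXT_HOPS[digest % len(NEXT_HOPS)]

# The same 5-tuple always lands on the same next-hop:
a = ecmp_pick("10.0.1.100", "10.0.2.100", 49152, 443)
b = ecmp_pick("10.0.1.100", "10.0.2.100", 49152, 443)
print(a == b)  # True
```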
&lt;h3&gt;Timer Alignment&lt;/h3&gt;
&lt;p&gt;Fast convergence requires aggressive timers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! BGP timers
router bgp 65001
  neighbor 10.0.1.1 timers 3 9
  neighbor 10.0.1.1 bfd

! OSPF timers
interface Ethernet1/49
  ip ospf hello-interval 1
  ip ospf dead-interval 3
  ip ospf bfd

! BFD configuration
feature bfd
bfd interval 250 min_rx 250 multiplier 3
&lt;/code&gt;&lt;/pre&gt;
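&lt;p&gt;The detection budget these timers buy can be computed directly: BFD declares a neighbor down after (interval times multiplier), well ahead of the protocol hold and dead timers it protects:&lt;/p&gt;

```python
# Failure-detection times implied by the timers configured above.
def detection_ms(interval_ms, multiplier):
    return interval_ms * multiplier

bfd = detection_ms(250, 3)    # bfd interval 250 min_rx 250 multiplier 3
ospf_dead = 3 * 1000          # ip ospf dead-interval 3
bgp_hold = 9 * 1000           # neighbor timers 3 9 (holdtime 9s)
print(f"BFD: {bfd} ms, OSPF dead: {ospf_dead} ms, BGP hold: {bgp_hold} ms")
```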
&lt;h2&gt;Failure Drills&lt;/h2&gt;
&lt;h3&gt;What to Test Before Go-Live&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Scenario&lt;/th&gt;
&lt;th&gt;Expected Behavior&lt;/th&gt;
&lt;th&gt;Verify&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single uplink fails&lt;/td&gt;
&lt;td&gt;Traffic shifts to other uplinks&lt;/td&gt;
&lt;td&gt;&lt;code&gt;show port-channel summary&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vPC member fails&lt;/td&gt;
&lt;td&gt;vPC still operational&lt;/td&gt;
&lt;td&gt;&lt;code&gt;show vpc brief&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peer-link fails&lt;/td&gt;
&lt;td&gt;Secondary suspends vPCs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;show vpc&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Leaf fails&lt;/td&gt;
&lt;td&gt;Servers failover to peer&lt;/td&gt;
&lt;td&gt;Ping from server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spine fails&lt;/td&gt;
&lt;td&gt;ECMP removes path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;show ip route&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Drill 1: Single Uplink Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On leaf, shut one uplink
interface Ethernet1/49
  shutdown

# Verify:
# 1. Port-channel stays up (degraded)
show port-channel summary

# 2. Routing adjusts
show ip route

# 3. Traffic still flows
ping &amp;lt;destination&amp;gt;

# Restore
no shutdown
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Drill 2: vPC Member Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Shut one member of server port-channel
interface Ethernet1/1
  shutdown

# Verify:
# 1. vPC stays up
show vpc brief

# 2. Server still has connectivity (via peer)
# Test from server

# 3. Traffic flows through peer-link if needed
show interface port-channel &amp;lt;peer-link&amp;gt; counters

# Restore
no shutdown
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Drill 3: Peer-Link Failure&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Caution&lt;/strong&gt;: This is disruptive. Schedule maintenance window.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Simulate peer-link failure
interface port-channel 1  # peer-link
  shutdown

# Expected on secondary:
# - vPCs suspend
# - Peer-keepalive maintains split-brain prevention

show vpc
# Role should show: secondary, operational secondary

# Restore immediately
no shutdown
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Drill 4: Leaf Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Simulate complete leaf failure (reload)
reload

# On peer leaf, verify:
show vpc orphan-ports
show vpc

# Servers should failover to surviving leaf
# vPC ports on surviving leaf stay up
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Drill 5: Spine Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On spine, shut all downlinks
interface Ethernet1/1-48
  shutdown

# On leaves, verify:
# 1. OSPF/BGP removes routes via failed spine
show ip route

# 2. ECMP still works via remaining spine(s)
show ip route &amp;lt;destination&amp;gt;

# 3. Traffic flows
ping &amp;lt;destination&amp;gt; source &amp;lt;loopback&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Pre-Production Checklist&lt;/h2&gt;
&lt;h3&gt;vPC Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[ ] Peer-link is port-channel (not single link)
[ ] Peer-keepalive uses dedicated link/VRF
[ ] Consistency checks pass (show vpc consistency-parameters global)
[ ] Auto-recovery is configured
[ ] Delay restore timers appropriate for environment
[ ] peer-gateway enabled
[ ] layer3 peer-router enabled (if routing on vPC VLANs)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Port-Channel Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[ ] LACP mode active (not passive or on)
[ ] Allowed VLANs explicitly configured
[ ] Native VLAN matches both ends
[ ] MTU consistent end-to-end
[ ] Spanning-tree port type configured (edge for servers)
[ ] BPDU guard enabled on edge ports
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Underlay Routing Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[ ] All adjacencies FULL/Established
[ ] Routes learned from all spines
[ ] ECMP working (multiple next-hops)
[ ] BFD enabled for fast failure detection
[ ] Timers aligned (hello/dead intervals)
[ ] Loopback addresses reachable from all leaves
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Failure Testing Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[ ] Single uplink failure tested
[ ] vPC member failure tested
[ ] Peer-link failure tested (with maintenance window)
[ ] Leaf failure simulated
[ ] Spine failure simulated
[ ] Convergence time documented
[ ] Alerts verified during failures
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring Commands Summary&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# vPC status
show vpc
show vpc brief
show vpc peer-link
show vpc consistency-parameters global
show vpc orphan-ports

# Port-channel status
show port-channel summary
show port-channel database
show lacp counters
show lacp neighbor

# Routing status
show ip ospf neighbors
show bgp ipv4 unicast summary
show ip route
show ip route summary

# Interface status
show interface status
show interface trunk
show interface counters errors

# Spanning tree
show spanning-tree summary
show spanning-tree vlan &amp;lt;id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;Spine/leaf operations require:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;vPC hygiene&lt;/strong&gt; — consistency checks, proper peer-link/keepalive, recovery settings&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port-channel discipline&lt;/strong&gt; — LACP active, explicit VLANs, matching MTU&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Underlay verification&lt;/strong&gt; — all adjacencies up, ECMP working, fast timers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Failure drills&lt;/strong&gt; — test every failure scenario before go-live&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The fabric that &quot;just works&quot; in the lab will surprise you in production when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A link fails and traffic asymmetry begins&lt;/li&gt;
&lt;li&gt;A leaf reboots and half the vPCs suspend&lt;/li&gt;
&lt;li&gt;ECMP doesn&apos;t balance because of misconfigured maximum-paths&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test failures before they test you. Document expected behavior. Verify convergence times. A spine/leaf fabric is only as reliable as your preparation.&lt;/p&gt;
</content:encoded><category>cisco</category><category>networking</category><category>ha</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Cisco IOS-XE Edge Baseline: AAA, SSH, ACL, Logging, and IP SLA</title><link>https://ashimov.com/posts/cisco-iosxe-baseline/</link><guid isPermaLink="true">https://ashimov.com/posts/cisco-iosxe-baseline/</guid><description>Build a production-ready IOS-XE edge router. Covers secure management, IP SLA tracking for real failover, logging configuration, and common mistakes that break production.</description><pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Internet works — sometimes. SSH is open to the world. No logs. NTP not configured, so timestamps are meaningless. When an incident happens, investigation is impossible.&lt;/p&gt;
&lt;p&gt;This is the typical state of edge routers. Nobody configures them properly from day one, and technical debt accumulates until a breach forces action.&lt;/p&gt;
&lt;p&gt;Here&apos;s the baseline every IOS-XE edge router needs.&lt;/p&gt;
&lt;h2&gt;Secure Management Plane&lt;/h2&gt;
&lt;h3&gt;AAA Configuration&lt;/h3&gt;
&lt;p&gt;Always configure AAA, even for local authentication:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Enable AAA
aaa new-model

! Local authentication fallback
aaa authentication login default local
aaa authentication login CONSOLE local
aaa authorization console
aaa authorization exec default local

! Create local admin user with privilege 15
username admin privilege 15 algorithm-type scrypt secret &amp;lt;strong-password&amp;gt;

! Alternative: TACACS+ with local fallback (uses the named group below)
aaa authentication login default group TACACS-SERVERS local
aaa authorization exec default group TACACS-SERVERS local
aaa accounting exec default start-stop group TACACS-SERVERS

tacacs server PRIMARY
 address ipv4 10.0.1.100
 key 7 &amp;lt;encrypted-key&amp;gt;
 timeout 3
tacacs server SECONDARY
 address ipv4 10.0.1.101
 key 7 &amp;lt;encrypted-key&amp;gt;
 timeout 3

aaa group server tacacs+ TACACS-SERVERS
 server name PRIMARY
 server name SECONDARY
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;SSH Hardening&lt;/h3&gt;
&lt;p&gt;Disable telnet. Configure SSH properly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Generate RSA key (2048 minimum, 4096 preferred)
crypto key generate rsa label SSH-KEY modulus 4096

! SSH version 2 only
ip ssh version 2
ip ssh time-out 60
ip ssh authentication-retries 3

! Disable weak algorithms
ip ssh server algorithm encryption aes256-ctr aes192-ctr aes128-ctr
ip ssh server algorithm mac hmac-sha2-256 hmac-sha2-512

! VTY configuration
line vty 0 15
 login authentication default
 transport input ssh
 transport output ssh
 exec-timeout 15 0
 logging synchronous
 access-class VTY-ACCESS in

! Console
line con 0
 login authentication CONSOLE
 exec-timeout 15 0
 logging synchronous
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VTY Access Control&lt;/h3&gt;
&lt;p&gt;Restrict who can SSH:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! ACL for management access
ip access-list extended VTY-ACCESS
 10 permit tcp 10.0.0.0 0.255.255.255 any eq 22 log
 20 permit tcp 192.168.1.0 0.0.0.255 any eq 22 log
 30 deny ip any any log

! Apply to VTY lines
line vty 0 15
 access-class VTY-ACCESS in
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;SNMPv3&lt;/h3&gt;
&lt;p&gt;If you need SNMP, use v3 with authentication and encryption:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Disable SNMP v1/v2c
no snmp-server community public
no snmp-server community private

! SNMPv3 configuration
snmp-server group MONITORING v3 priv
snmp-server user monitor MONITORING v3 auth sha &amp;lt;auth-password&amp;gt; priv aes 256 &amp;lt;priv-password&amp;gt;

! Restrict SNMP source
snmp-server host 10.0.1.50 version 3 priv monitor

! ACL to restrict SNMP (apply to interface or use control-plane)
ip access-list extended SNMP-ACCESS
 permit udp host 10.0.1.50 any eq snmp
 deny udp any any eq snmp log
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;IP SLA for Real Failover&lt;/h2&gt;
&lt;h3&gt;The Problem with Link State&lt;/h3&gt;
&lt;p&gt;Interface up ≠ Internet works. Your uplink can be &quot;up&quot; while the ISP has internal issues.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Interface is UP
Router#show ip interface brief
GigabitEthernet0/0/0     203.0.113.2    YES NVRAM  up      up

! But Internet is unreachable
Router#ping 8.8.8.8
.....
Success rate is 0 percent (0/5)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;IP SLA Configuration&lt;/h3&gt;
&lt;p&gt;Track actual reachability, not just link state:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! ICMP echo to reliable target
ip sla 1
 icmp-echo 8.8.8.8 source-interface GigabitEthernet0/0/0
 frequency 10
 threshold 1000
 timeout 2000
ip sla schedule 1 life forever start-time now

ip sla 2
 icmp-echo 1.1.1.1 source-interface GigabitEthernet0/0/0
 frequency 10
 threshold 1000
 timeout 2000
ip sla schedule 2 life forever start-time now

! Track SLA results
track 1 ip sla 1 reachability
track 2 ip sla 2 reachability

! Track both (require both to be up)
track 10 list boolean and
 object 1
 object 2

! Or track either (failover if both fail)
track 20 list boolean or
 object 1
 object 2
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Static Route with Tracking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! Primary default route (tracked)
ip route 0.0.0.0 0.0.0.0 GigabitEthernet0/0/0 203.0.113.1 10 track 10

! Backup default route (higher administrative distance, activates when primary fails)
ip route 0.0.0.0 0.0.0.0 GigabitEthernet0/0/1 198.51.100.1 20

! Note: track 10 (boolean AND) pulls the primary route if EITHER target
! fails. Track against track 20 (boolean OR) to fail over only when
! both targets are unreachable.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When track 10 goes down, primary route is removed, backup takes over.&lt;/p&gt;
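&lt;p&gt;The selection logic can be sketched as: among static candidates whose track object is up, the route with the lowest administrative distance is installed. A hypothetical Python model (names are illustrative, not IOS internals):&lt;/p&gt;

```python
# Floating-static selection sketch: a tracked route is a candidate only
# while its track object is up; the lowest administrative distance wins.
routes = [
    {"next_hop": "203.0.113.1", "ad": 10, "track_up": True},   # primary
    {"next_hop": "198.51.100.1", "ad": 20, "track_up": True},  # backup
]

def installed_route(candidates):
    usable = [r for r in candidates if r["track_up"]]
    return min(usable, key=lambda r: r["ad"]) if usable else None

print(installed_route(routes)["next_hop"])   # 203.0.113.1
routes[0]["track_up"] = False                # track object goes down
print(installed_route(routes)["next_hop"])   # 198.51.100.1
```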
&lt;h3&gt;Verify SLA Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Router#show ip sla statistics
IPSLAs Latest Operation Statistics

IPSLA operation id: 1
        Latest RTT: 12 milliseconds
Latest operation start time: 10:30:15 UTC Thu Mar 13 2026
Latest operation return code: OK
Number of successes: 1000
Number of failures: 2

Router#show track
Track 1
  IP SLA 1 reachability
  Reachability is Up
    1 change, last change 00:15:32
  Latest operation return code: OK
  Latest RTT (millisecs) 12

Track 10
  List boolean and
  Boolean AND is Up
    1 change, last change 00:15:32
  object 1 Up
  object 2 Up
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Logging Configuration&lt;/h2&gt;
&lt;h3&gt;Timestamps and Buffered Logging&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! Enable timestamps on all logs
service timestamps debug datetime msec localtime show-timezone
service timestamps log datetime msec localtime show-timezone

! Buffered logging (local storage)
logging buffered 1000000 informational

! Console logging (limit to critical)
logging console critical

! Monitor logging (terminal)
logging monitor informational
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Syslog Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! Remote syslog servers
logging host 10.0.1.50 transport udp port 514
logging host 10.0.1.51 transport tcp port 514

! Logging source interface
logging source-interface Loopback0

! Facility
logging facility local6

! Log level
logging trap informational
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Archive Logging&lt;/h3&gt;
&lt;p&gt;Capture configuration changes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Archive configuration
archive
 log config
  logging enable
  logging size 500
  notify syslog contenttype plaintext
  hidekeys

! View config changes
Router#show archive log config all
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;NTP Configuration&lt;/h3&gt;
&lt;p&gt;Without accurate time, logs are useless for incident response:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! NTP servers
ntp server 10.0.1.10 prefer
ntp server 10.0.1.11

! Or public NTP (less ideal)
ntp server 0.pool.ntp.org
ntp server 1.pool.ntp.org

! NTP authentication (optional but recommended)
ntp authenticate
ntp authentication-key 1 md5 &amp;lt;key&amp;gt;
ntp trusted-key 1
ntp server 10.0.1.10 key 1

! Timezone
clock timezone UTC 0
! or
clock timezone EST -5
clock summer-time EDT recurring

! Verify
Router#show ntp status
Clock is synchronized, stratum 3, reference is 10.0.1.10
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;h3&gt;Mistake 1: ACL Applied Wrong Direction&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! WRONG: ACL blocking return traffic
interface GigabitEthernet0/0/0
 ip access-group INBOUND in   ! This blocks return traffic!

! The ACL:
ip access-list extended INBOUND
 permit tcp any any eq 80
 deny ip any any

! Traffic flows: Inside → Outside (port 80)
! Return traffic: Outside → Inside (source port 80, dest port random)
! ACL blocks the return because dest port isn&apos;t 80!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Use reflexive ACLs or proper stateful inspection:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Better approach: permit return traffic for established sessions
ip access-list extended INBOUND
 permit tcp any any established   ! return traffic (ACK or RST set)
 permit tcp any any eq 80         ! new inbound web sessions, if hosted here
 deny ip any any log
&lt;/code&gt;&lt;/pre&gt;
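&lt;p&gt;What &lt;code&gt;established&lt;/code&gt; actually matches is any TCP segment with the ACK or RST bit set, so replies pass while bare SYNs (new inbound connections) do not. It is stateless, so flags can be forged; a sketch of the check:&lt;/p&gt;

```python
# Stateless "established" match: ACK or RST set means the segment claims
# to belong to an existing session; a bare SYN is a new connection attempt.
def matches_established(tcp_flags):
    return "ACK" in tcp_flags or "RST" in tcp_flags

print(matches_established({"SYN"}))           # False: new inbound connection
print(matches_established({"SYN", "ACK"}))    # True: reply to our SYN
print(matches_established({"ACK"}))           # True: mid-session traffic
```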
&lt;h3&gt;Mistake 2: SLA Checks Wrong Target&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! WRONG: Checking ISP&apos;s gateway only
ip sla 1
 icmp-echo 203.0.113.1   ! ISP gateway
 source-interface GigabitEthernet0/0/0

! ISP gateway is up, but their upstream is down
! Your router thinks everything is fine
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Check destinations beyond ISP&apos;s network:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Better: Check real Internet destinations
ip sla 1
 icmp-echo 8.8.8.8       ! Google DNS
ip sla 2
 icmp-echo 1.1.1.1       ! Cloudflare DNS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Mistake 3: NAT + ACL Ordering&lt;/h3&gt;
&lt;p&gt;NAT changes addresses. ACL evaluation order matters:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Inbound traffic:
! 1. ACL on interface (original destination IP)
! 2. NAT translation (changes destination)
! 3. Routing (uses NAT&apos;d address)

! Outbound traffic:
! 1. Routing decision
! 2. NAT translation (changes source)
! 3. ACL on interface (NAT&apos;d source IP!)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Understand where your ACL is evaluated:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Inbound ACL - matches ORIGINAL destination
interface GigabitEthernet0/0/0
 ip nat outside
 ip access-group OUTSIDE-IN in

! ACL matches the public IP, before NAT translation
ip access-list extended OUTSIDE-IN
 permit tcp any host 203.0.113.10 eq 443  ! Public IP
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Mistake 4: Forgetting logging on deny&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;! WRONG: Silent drops
ip access-list extended BLOCK-BAD
 deny ip 10.0.0.0 0.255.255.255 any
 permit ip any any

! Can&apos;t tell what&apos;s being blocked
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Always log deny actions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! Better: Log blocked traffic
ip access-list extended BLOCK-BAD
 deny ip 10.0.0.0 0.255.255.255 any log
 permit ip any any
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Complete Edge Baseline&lt;/h2&gt;
&lt;p&gt;Putting it all together:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;! === MANAGEMENT ===
aaa new-model
aaa authentication login default local
aaa authorization exec default local

username admin privilege 15 algorithm-type scrypt secret &amp;lt;password&amp;gt;

ip ssh version 2
ip ssh time-out 60
crypto key generate rsa label SSH-KEY modulus 4096

line vty 0 15
 login authentication default
 transport input ssh
 exec-timeout 15 0
 access-class VTY-ACCESS in

ip access-list extended VTY-ACCESS
 permit tcp 10.0.0.0 0.255.255.255 any eq 22 log
 deny ip any any log

! === TIME ===
clock timezone UTC 0
ntp server 10.0.1.10 prefer
ntp server 10.0.1.11

! === LOGGING ===
service timestamps debug datetime msec localtime show-timezone
service timestamps log datetime msec localtime show-timezone
logging buffered 1000000 informational
logging host 10.0.1.50
logging source-interface Loopback0
logging trap informational

archive
 log config
  logging enable

! === IP SLA ===
ip sla 1
 icmp-echo 8.8.8.8 source-interface GigabitEthernet0/0/0
 frequency 10
ip sla schedule 1 life forever start-time now

ip sla 2
 icmp-echo 1.1.1.1 source-interface GigabitEthernet0/0/0
 frequency 10
ip sla schedule 2 life forever start-time now

track 10 list boolean and
 object 1
 object 2

! === ROUTING ===
ip route 0.0.0.0 0.0.0.0 GigabitEthernet0/0/0 203.0.113.1 10 track 10
ip route 0.0.0.0 0.0.0.0 GigabitEthernet0/0/1 198.51.100.1 20
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Verification Commands&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;! AAA status
show aaa sessions
show aaa servers

! SSH sessions
show ssh
show ip ssh

! SLA status
show ip sla statistics
show ip sla configuration
show track
show track brief

! Logging
show logging
show archive log config all

! NTP
show ntp status
show ntp associations

! ACL hits
show access-lists
show ip access-lists VTY-ACCESS
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;Every edge router needs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Secure management&lt;/strong&gt; — AAA, SSH-only, ACL on VTY&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real failover&lt;/strong&gt; — IP SLA tracking actual reachability, not just link state&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proper logging&lt;/strong&gt; — timestamps, buffered, syslog, NTP&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configuration auditing&lt;/strong&gt; — archive log config&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Common mistakes that break production:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ACL in wrong direction (blocks return traffic)&lt;/li&gt;
&lt;li&gt;SLA checking ISP gateway instead of Internet destinations&lt;/li&gt;
&lt;li&gt;NAT/ACL ordering confusion&lt;/li&gt;
&lt;li&gt;Silent denies without logging&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This baseline isn&apos;t optional — it&apos;s the minimum for production. Configure it on day one, not after an incident forces you to investigate with no logs and wrong timestamps.&lt;/p&gt;
</content:encoded><category>cisco</category><category>networking</category><category>security</category><category>ha</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Junos Routing Policy That Scales: Policy-Statement Patterns and Safe Defaults</title><link>https://ashimov.com/posts/juniper-routing-policy/</link><guid isPermaLink="true">https://ashimov.com/posts/juniper-routing-policy/</guid><description>Design maintainable Junos routing policies. Covers policy-statement structure, community naming, prefix-lists, and safe defaults that prevent routing disasters.</description><pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Config grew over years. Multiple engineers added policies. Nobody documented communities. New engineer joins, asks &quot;what does community 65000:999 mean?&quot; — silence. &quot;Don&apos;t touch it — it works.&quot;&lt;/p&gt;
&lt;p&gt;This is how routing policies become unmaintainable. Junos provides powerful policy tools, but power without structure creates chaos.&lt;/p&gt;
&lt;h2&gt;Junos Policy Mental Model&lt;/h2&gt;
&lt;h3&gt;Terms Evaluate Top to Bottom&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;policy-statement EXAMPLE {
    term FIRST {
        from { ... }
        then accept;
    }
    term SECOND {
        from { ... }
        then reject;
    }
    term DEFAULT {
        then reject;  # Explicit default
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First matching term wins. If FIRST matches, SECOND never evaluates. No terminating action in any term? The route falls through to the protocol&apos;s default policy, which for BGP import means &lt;strong&gt;accept&lt;/strong&gt; — the silent danger.&lt;/p&gt;
&lt;h3&gt;The Implicit Accept Problem&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Dangerous policy - implicit accept at end
policy-statement FILTER-ROUTES {
    term BLOCK-BOGONS {
        from {
            route-filter 10.0.0.0/8 orlonger;
        }
        then reject;
    }
    # No default term = everything else ACCEPTED
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Routes you didn&apos;t explicitly handle get accepted. Always add an explicit default term:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Safe policy - explicit default
policy-statement FILTER-ROUTES {
    term BLOCK-BOGONS {
        from {
            route-filter 10.0.0.0/8 orlonger;
        }
        then reject;
    }
    term DEFAULT-DENY {
        then reject;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Accept vs Next Policy vs Next Term&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;then accept;       # Terminating: accept route, stop the ENTIRE policy chain
then reject;       # Terminating: reject route, stop the ENTIRE policy chain
then next policy;  # Continue to the next policy in the chain
then next term;    # Continue to the next term in THIS policy
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Multiple policies can be chained:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols bgp group PEERS import [ POLICY-1 POLICY-2 POLICY-3 ]
# Evaluates POLICY-1, then POLICY-2, then POLICY-3
# First explicit accept/reject wins
&lt;/code&gt;&lt;/pre&gt;
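&lt;p&gt;The evaluation model above can be sketched in Python: terms run top to bottom, the first terminating action ends the whole chain, and an unmatched route falls through to the protocol&apos;s default policy (data structures here are illustrative, not a Junos API):&lt;/p&gt;

```python
# Policy-chain evaluation sketch. Each policy is a list of terms, each term
# a (predicate, action) pair. accept/reject terminate the whole chain;
# anything unmatched falls through to the protocol default (BGP: accept).
def evaluate(chain, route, default_action="accept"):
    for policy in chain:
        for match, action in policy:
            if match(route) and action in ("accept", "reject"):
                return action           # terminating: stop the entire chain
    return default_action               # protocol default policy

FILTER = [(lambda r: r.startswith("10."), "reject")]
DEFAULT_DENY = [(lambda r: True, "reject")]

print(evaluate([FILTER], "10.1.0.0/16"))                   # reject
print(evaluate([FILTER], "203.0.113.0/24"))                # accept (fell through!)
print(evaluate([FILTER, DEFAULT_DENY], "203.0.113.0/24"))  # reject
```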
&lt;h2&gt;Policy Building Blocks&lt;/h2&gt;
&lt;h3&gt;Prefix Lists&lt;/h3&gt;
&lt;p&gt;Named lists of prefixes. Reusable across policies.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Define prefix-list
set policy-options prefix-list INTERNAL-NETWORKS 10.0.0.0/8
set policy-options prefix-list INTERNAL-NETWORKS 172.16.0.0/12
set policy-options prefix-list INTERNAL-NETWORKS 192.168.0.0/16

# Use in policy
set policy-options policy-statement ALLOW-INTERNAL term MATCH from prefix-list INTERNAL-NETWORKS
set policy-options policy-statement ALLOW-INTERNAL term MATCH then accept
set policy-options policy-statement ALLOW-INTERNAL term DEFAULT then reject
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Route Filters&lt;/h3&gt;
&lt;p&gt;More granular than prefix-lists. Match exact prefixes or ranges.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Exact match
route-filter 10.0.0.0/24 exact;

# Match this and longer (subnets)
route-filter 10.0.0.0/16 orlonger;

# Match longer only (not the /16 itself)
route-filter 10.0.0.0/16 longer;

# Match range
route-filter 10.0.0.0/16 prefix-length-range /24-/28;

# Match up to a length
route-filter 10.0.0.0/8 upto /24;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Practical example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;policy-statement CUSTOMER-ROUTES {
    term ACCEPT-ALLOCATED {
        from {
            route-filter 203.0.113.0/24 orlonger;  # Customer&apos;s allocation
            route-filter 198.51.100.0/24 orlonger; # Second allocation
        }
        then accept;
    }
    term REJECT-REST {
        then reject;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;AS Path Filters&lt;/h3&gt;
&lt;p&gt;Match routes by AS path patterns.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Define AS path regexes (Junos anchors them implicitly to the full path)
set policy-options as-path ORIGIN-65001 &quot;.* 65001&quot;
set policy-options as-path DIRECT-PEER &quot;65001&quot;
set policy-options as-path TRANSIT &quot;.* 65001 .*&quot;

# Use in policy
policy-statement PREFER-DIRECT {
    term DIRECT {
        from as-path DIRECT-PEER;
        then {
            local-preference 150;
            accept;
        }
    }
    term TRANSIT {
        from as-path TRANSIT;
        then {
            local-preference 100;
            accept;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;AS path regex patterns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Patterns are implicitly anchored — the expression must match the entire AS path (no &lt;code&gt;^&lt;/code&gt;/&lt;code&gt;$&lt;/code&gt; anchors as in IOS)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.&lt;/code&gt; — any single AS&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.*&lt;/code&gt; — zero or more ASes&lt;/li&gt;
&lt;li&gt;&lt;code&gt;[0-9]+&lt;/code&gt; — one or more digits (any AS number)&lt;/li&gt;
&lt;/ul&gt;
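&lt;p&gt;Because the pattern must match the whole path, these definitions translate naturally into anchored regexes over a space-separated AS list. The Python sketch below is an approximation for intuition, not Junos&apos;s actual matcher:&lt;/p&gt;

```python
import re

# Hand-translated equivalents of the as-path definitions above.
# Junos patterns are implicitly anchored, hence the \A...\Z here;
# each ".*" becomes "zero or more AS tokens".
PATTERNS = {
    "ORIGIN-65001": r"\A(\d+ )*65001\Z",         # ".* 65001" - path ends in 65001
    "DIRECT-PEER":  r"\A65001\Z",                # path is exactly 65001
    "TRANSIT":      r"\A(\d+ )*65001( \d+)*\Z",  # ".* 65001 .*" - 65001 anywhere
}

def matches(name, as_path):
    """as_path is a space-separated string, e.g. '65010 65001'."""
    return re.match(PATTERNS[name], as_path) is not None

print(matches("DIRECT-PEER", "65001"))         # True
print(matches("DIRECT-PEER", "65010 65001"))   # False
print(matches("ORIGIN-65001", "65010 65001"))  # True
```

&lt;p&gt;Note that with both &lt;code&gt;.*&lt;/code&gt; parts empty, TRANSIT also matches the bare path &lt;code&gt;65001&lt;/code&gt; — which is why the DIRECT term is evaluated first in the policy above.&lt;/p&gt;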
&lt;h3&gt;Communities&lt;/h3&gt;
&lt;p&gt;Tags attached to routes. The glue for policy communication.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Define communities
set policy-options community CUSTOMER-ROUTES members 65000:100
set policy-options community NO-EXPORT members no-export
set policy-options community BLACKHOLE members 65000:666

# Match community
policy-statement CUSTOMER-IMPORT {
    term TAGGED {
        from community CUSTOMER-ROUTES;
        then accept;
    }
}

# Set community
policy-statement TAG-OUTBOUND {
    term ADD-TAG {
        then {
            community add CUSTOMER-ROUTES;
            accept;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Community Design That Scales&lt;/h2&gt;
&lt;h3&gt;Naming Convention&lt;/h3&gt;
&lt;p&gt;Without documentation, &lt;code&gt;65000:100&lt;/code&gt; means nothing. Create a system:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Pattern: ASN:TYPE+VALUE
# Types:
#   1xx = Origin (where route came from)
#   2xx = Region
#   3xx = Customer type
#   4xx = Traffic engineering
#   666 = Blackhole

# Examples:
set policy-options community ORIGIN-CUSTOMER members 65000:100
set policy-options community ORIGIN-PEER members 65000:101
set policy-options community ORIGIN-TRANSIT members 65000:102

set policy-options community REGION-US-EAST members 65000:201
set policy-options community REGION-US-WEST members 65000:202
set policy-options community REGION-EU members 65000:203

set policy-options community TYPE-ENTERPRISE members 65000:301
set policy-options community TYPE-RESIDENTIAL members 65000:302

set policy-options community TE-BACKUP-ONLY members 65000:401
set policy-options community TE-PRIMARY members 65000:402

set policy-options community BLACKHOLE members 65000:666
&lt;/code&gt;&lt;/pre&gt;
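&lt;p&gt;A scheme like this is mechanical enough to decode in tooling. A small helper — hypothetical, for log enrichment or audits, with names and ranges matching the convention above — might look like:&lt;/p&gt;

```python
# Decodes values from the ASN:TYPE+VALUE scheme above.
# The type ranges here mirror the convention; adjust to your own plan.
TYPE_RANGES = {1: "origin", 2: "region", 3: "customer-type", 4: "traffic-engineering"}

def decode_community(community):
    asn, value = community.split(":")
    value = int(value)
    if value == 666:
        return "blackhole"
    kind = TYPE_RANGES.get(value // 100, "unknown")
    return f"{kind}:{value % 100}"

print(decode_community("65000:201"))  # region:1  (REGION-US-EAST)
print(decode_community("65000:666"))  # blackhole
```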
&lt;h3&gt;Document in Config&lt;/h3&gt;
&lt;p&gt;Junos supports annotations (comments) on most configuration objects. Use them — &lt;code&gt;annotate&lt;/code&gt; operates on statements at the current hierarchy level:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set policy-options community ORIGIN-CUSTOMER members 65000:100
set policy-options community BLACKHOLE members 65000:666

# Annotate from the parent hierarchy level
edit policy-options
annotate community ORIGIN-CUSTOMER &quot;Routes learned from direct customers&quot;
annotate community BLACKHOLE &quot;Trigger RTBH - null route this prefix&quot;
top
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Community Groups&lt;/h3&gt;
&lt;p&gt;Group related communities for easier matching:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Define community group
set policy-options community ALL-ORIGINS members &quot;65000:10[0-9]&quot;
set policy-options community ALL-REGIONS members &quot;65000:2[0-9][0-9]&quot;

# Match any origin community
policy-statement CHECK-ORIGIN {
    term HAS-ORIGIN {
        from community ALL-ORIGINS;
        then accept;
    }
    term MISSING-ORIGIN {
        then reject;  # Reject routes without origin tag
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Safe Defaults&lt;/h2&gt;
&lt;h3&gt;Bogon Filtering&lt;/h3&gt;
&lt;p&gt;Always filter RFC1918, documentation prefixes, and other bogons:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Bogon prefix-list (base prefixes)
set policy-options prefix-list BOGONS 0.0.0.0/8
set policy-options prefix-list BOGONS 10.0.0.0/8
set policy-options prefix-list BOGONS 100.64.0.0/10
set policy-options prefix-list BOGONS 127.0.0.0/8
set policy-options prefix-list BOGONS 169.254.0.0/16
set policy-options prefix-list BOGONS 172.16.0.0/12
set policy-options prefix-list BOGONS 192.0.0.0/24
set policy-options prefix-list BOGONS 192.0.2.0/24
set policy-options prefix-list BOGONS 192.168.0.0/16
set policy-options prefix-list BOGONS 198.18.0.0/15
set policy-options prefix-list BOGONS 198.51.100.0/24
set policy-options prefix-list BOGONS 203.0.113.0/24
set policy-options prefix-list BOGONS 224.0.0.0/4
set policy-options prefix-list BOGONS 240.0.0.0/4

# Apply with orlonger match (catches subnets too)
policy-statement REJECT-BOGONS {
    term BOGONS {
        from {
            prefix-list-filter BOGONS orlonger;
        }
        then reject;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Max Prefix Protection&lt;/h3&gt;
&lt;p&gt;Limit prefixes accepted from peers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Limit with teardown
set protocols bgp group CUSTOMERS neighbor 192.0.2.1 family inet unicast prefix-limit maximum 1000
set protocols bgp group CUSTOMERS neighbor 192.0.2.1 family inet unicast prefix-limit teardown 80 idle-timeout 30

# Teardown at 80% (800 prefixes), wait 30 minutes before retry
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Reject Unless Explicitly Allowed&lt;/h3&gt;
&lt;p&gt;Default-deny at BGP group level:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Import policy for customer
policy-statement CUSTOMER-192-0-2-1-IMPORT {
    term ACCEPT-ANNOUNCED {
        from {
            prefix-list CUSTOMER-192-0-2-1-PREFIXES;
        }
        then {
            community add ORIGIN-CUSTOMER;
            accept;
        }
    }
    term REJECT-ALL {
        then reject;
    }
}

# Customer&apos;s allowed prefixes
set policy-options prefix-list CUSTOMER-192-0-2-1-PREFIXES 198.51.100.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;AS Path Sanity&lt;/h3&gt;
&lt;p&gt;Reject private ASNs and your own ASN from external peers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Private ASN range (64512-65534)
set policy-options as-path PRIVATE-ASN &quot;.* (6451[2-9]|645[2-9][0-9]|64[6-9][0-9]{2}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-4]) .*&quot;

# Simpler (slightly looser) alternative - also matches 64500-64511
set policy-options as-path PRIVATE-ASN-SIMPLE &quot;.* (64[5-9][0-9]{2}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-4]) .*&quot;

# Own ASN in path (shouldn&apos;t happen from external)
set policy-options as-path OWN-ASN &quot;.* 65000 .*&quot;

policy-statement SANITY-CHECK {
    term REJECT-PRIVATE-ASN {
        from as-path PRIVATE-ASN;
        then reject;
    }
    term REJECT-OWN-ASN {
        from as-path OWN-ASN;
        then reject;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: AS path regex for full private range coverage is complex. Many operators maintain external prefix/AS-path lists (e.g., from Team Cymru or RIPE) rather than hand-crafted regex.&lt;/p&gt;
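&lt;p&gt;When you do hand-craft one, brute-force the boundaries — off-by-one bugs in these alternations are easy to ship. The alternation below is one candidate intended to cover the 16-bit private range exactly; checking every value takes milliseconds:&lt;/p&gt;

```python
import re

# Brute-force check that a private-ASN alternation covers exactly 64512-65534.
# This alternation is a candidate; swap in whatever your policy actually uses.
PRIVATE_ASN = re.compile(
    r"\A(6451[2-9]|645[2-9][0-9]|64[6-9][0-9]{2}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-4])\Z"
)

matched = [asn for asn in range(60000, 70000) if PRIVATE_ASN.match(str(asn))]
assert matched == list(range(64512, 65535)), "alternation has boundary bugs"
print("covers exactly 64512-65534")
```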
&lt;h2&gt;Policy Structure Patterns&lt;/h2&gt;
&lt;h3&gt;Layered Import Policy&lt;/h3&gt;
&lt;p&gt;Build policies in layers for maintainability:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Layer 1: Sanity checks (apply to all)
policy-statement IMPORT-SANITY {
    term REJECT-BOGONS { from { prefix-list-filter BOGONS orlonger; } then reject; }
    term REJECT-TOO-LONG { from route-filter 0.0.0.0/0 prefix-length-range /25-/32; then reject; }
    term REJECT-DEFAULT { from route-filter 0.0.0.0/0 exact; then reject; }
    term CONTINUE { then next policy; }
}

# Layer 2: Peer-specific acceptance
policy-statement IMPORT-PEER-65001 {
    term ACCEPT-PREFIXES {
        from prefix-list PEER-65001-PREFIXES;
        then {
            community add ORIGIN-PEER;
            local-preference 100;
            accept;
        }
    }
    term REJECT-REST { then reject; }
}

# Apply both
set protocols bgp group PEERS neighbor 192.0.2.1 import [ IMPORT-SANITY IMPORT-PEER-65001 ]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Export Policy Template&lt;/h3&gt;
&lt;p&gt;Consistent export structure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;policy-statement EXPORT-TO-PEERS {
    term EXPORT-CUSTOMERS {
        from community ORIGIN-CUSTOMER;
        then {
            community delete ALL-INTERNAL;  # Strip internal communities
            accept;
        }
    }
    term EXPORT-OWN {
        from {
            protocol direct;
            prefix-list OWN-PREFIXES;
        }
        then accept;
    }
    term REJECT-REST {
        then reject;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Debugging Policies&lt;/h2&gt;
&lt;h3&gt;Test Policy Match&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Test which policy term matches a route
test policy POLICY-NAME 192.0.2.0/24

# Output shows:
# Route 192.0.2.0/24
#   Term: ACCEPT-CUSTOMERS
#   Action: accept
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Received vs Active&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# What peer is sending (before import policy)
show route receive-protocol bgp 192.0.2.1

# What we accepted (after import policy)
show route protocol bgp neighbor 192.0.2.1

# Compare to find filtered routes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Hidden Routes&lt;/h3&gt;
&lt;p&gt;Routes filtered by policy become &quot;hidden&quot;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Show hidden routes
show route hidden

# Why is it hidden?
show route 192.0.2.0/24 hidden extensive
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Policy Decision Flow&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│                  Import Policy Chain                      │
├─────────────────────────────────────────────────────────┤
│                                                           │
│  Policy 1          Policy 2          Policy 3            │
│  ┌─────────┐      ┌─────────┐      ┌─────────┐          │
│  │ Term A  │─no──→│ Term A  │─no──→│ Term A  │          │
│  └────┬────┘      └────┬────┘      └────┬────┘          │
│       │yes             │yes             │yes             │
│       ↓                ↓                ↓                │
│  [accept/reject]  [accept/reject]  [accept/reject]      │
│                                                           │
│  If no match in any term of any policy:                  │
│  → IMPLICIT ACCEPT (danger!)                             │
│                                                           │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;Junos routing policy is powerful but requires discipline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Always explicit default&lt;/strong&gt; — never rely on implicit accept&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Name everything meaningfully&lt;/strong&gt; — communities, prefix-lists, policies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document in config&lt;/strong&gt; — use &lt;code&gt;annotate&lt;/code&gt; liberally&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Layer your policies&lt;/strong&gt; — sanity checks separate from peer-specific logic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test before commit&lt;/strong&gt; — use &lt;code&gt;test policy&lt;/code&gt; command&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A well-structured policy config is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Readable by new engineers&lt;/li&gt;
&lt;li&gt;Modifiable without fear&lt;/li&gt;
&lt;li&gt;Auditable for compliance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The goal isn&apos;t clever regex — it&apos;s maintainability. If you can&apos;t explain what &lt;code&gt;65000:247&lt;/code&gt; means without checking documentation, your community scheme needs work.&lt;/p&gt;
&lt;p&gt;Policy-statement is an engineering system. Treat it like code: structured, documented, tested.&lt;/p&gt;
</content:encoded><category>juniper</category><category>bgp</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Junos SRX Security Policies in Real Life: Why Traffic Doesn&apos;t Match</title><link>https://ashimov.com/posts/juniper-srx-policies/</link><guid isPermaLink="true">https://ashimov.com/posts/juniper-srx-policies/</guid><description>Debug SRX policy issues when traffic flows wrong or NAT fails. Covers zone chain, policy hit counters, flow trace, and the top 5 reasons policies never match.</description><pubDate>Fri, 06 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Everything configured. Policy created. Commit complete. Traffic doesn&apos;t flow. Or flows to the wrong place. Or NAT doesn&apos;t apply. Welcome to SRX troubleshooting.&lt;/p&gt;
&lt;p&gt;SRX security processing has a specific order. Understanding that order — and knowing where to look when things break — separates frustrating hours from quick fixes.&lt;/p&gt;
&lt;h2&gt;How SRX Actually Processes Traffic&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Packet arrives
    ↓
Interface → Security Zone lookup
    ↓
Route lookup (egress interface/zone)
    ↓
Security Policy evaluation (from-zone → to-zone)
    ↓
NAT processing (if matched)
    ↓
Forward packet
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The critical insight: &lt;strong&gt;zone is determined by interface, policy is matched by zone pair&lt;/strong&gt;. If your zones are wrong, your policy will never match.&lt;/p&gt;
&lt;h3&gt;The Zone Chain&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Interface  │ ──→ │    Zone     │ ──→ │   Policy    │
│   ge-0/0/1  │     │   trust     │     │trust→untrust│
└─────────────┘     └─────────────┘     └─────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Traffic from ge-0/0/1 (trust zone) going to ge-0/0/0 (untrust zone) needs a policy from trust to untrust. Simple concept, endless misconfigurations.&lt;/p&gt;
&lt;h2&gt;Top 5 Reasons &quot;Policy Not Hit&quot;&lt;/h2&gt;
&lt;h3&gt;1. Zone Mismatch&lt;/h3&gt;
&lt;p&gt;The most common issue. Traffic enters through one interface but policy expects a different zone.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check interface zone assignment
show security zones

# Output:
# Security zone: trust
#   Interfaces bound: 1
#     ge-0/0/1.0
# Security zone: untrust
#   Interfaces bound: 1
#     ge-0/0/0.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix pattern&lt;/strong&gt;: Verify source and destination zones match your policy exactly.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# See which zone traffic is hitting
show security flow session source-prefix 192.168.1.0/24

# Check policy for that zone pair
show security policies from-zone trust to-zone untrust
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Address Book Scope&lt;/h3&gt;
&lt;p&gt;Address book entries are zone-scoped by default. Address defined in zone A is invisible to policies involving zone B.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Wrong: Address in wrong zone&apos;s address book
set security zones security-zone untrust address-book address internal-server 10.0.1.100/32

# Policy from trust→untrust can&apos;t see this address!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix pattern&lt;/strong&gt;: Use global address book for addresses used across zones.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Correct: Global address book
set security address-book global address internal-server 10.0.1.100/32
set security address-book global address-set servers address internal-server

# Now usable in any policy
set security policies from-zone trust to-zone untrust policy allow-server match destination-address internal-server
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Application/ALG Issues&lt;/h3&gt;
&lt;p&gt;Application identification can prevent matches. ALG (Application Layer Gateway) can modify packets unexpectedly.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Policy expects specific application
set security policies from-zone trust to-zone untrust policy web-traffic match application junos-http

# But traffic is HTTPS on non-standard port - doesn&apos;t match junos-http
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix pattern&lt;/strong&gt;: Use application sets or &lt;code&gt;any&lt;/code&gt; for troubleshooting, then narrow down.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Debug: What application is SRX seeing?
show security flow session extensive

# Look for &quot;Application:&quot; field
# Session ID: 12345, Application: junos-https, ...

# For non-standard ports
set applications application custom-https protocol tcp destination-port 8443
set security policies from-zone trust to-zone untrust policy web-traffic match application custom-https
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Routing Instance Issues&lt;/h3&gt;
&lt;p&gt;Traffic in a routing instance may not hit policies as expected. VRF leaking adds complexity.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Traffic in VRF but policy in main instance
show route table CUSTOMER-VRF 10.0.0.0/24

# Check session table for routing-instance
show security flow session routing-instance CUSTOMER-VRF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix pattern&lt;/strong&gt;: Ensure policy exists for the routing instance context.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Policy must reference correct zones
# Zones are global, but routing affects egress interface selection
show security zones security-zone trust interfaces
# Verify interface is in expected routing instance
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5. NAT Order Confusion&lt;/h3&gt;
&lt;p&gt;Source NAT happens after policy match. Destination NAT happens before. This ordering causes endless confusion.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Inbound traffic:
  1. Destination NAT (change dest IP)
  2. Route lookup (with new dest)
  3. Policy match (with original source, NAT&apos;d destination)
  4. Source NAT (if configured)

Outbound traffic:
  1. Route lookup
  2. Policy match (original IPs)
  3. Source NAT
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Fix pattern&lt;/strong&gt;: For destination NAT, policy must match the &lt;strong&gt;post-NAT&lt;/strong&gt; destination.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Destination NAT: public 203.0.113.10 → private 10.0.1.100
set security nat destination pool server-pool address 10.0.1.100/32
set security nat destination rule-set inbound rule to-server match destination-address 203.0.113.10/32
set security nat destination rule-set inbound rule to-server then destination-nat pool server-pool

# Policy must match the NAT&apos;d destination (10.0.1.100), not public IP!
set security policies from-zone untrust to-zone trust policy allow-server match destination-address 10.0.1.100/32
&lt;/code&gt;&lt;/pre&gt;
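&lt;p&gt;The ordering is easy to sanity-check with a toy model — plain Python, nothing SRX-specific: apply destination NAT first, then match the policy against the rewritten address.&lt;/p&gt;

```python
# Toy model (plain Python, not SRX code) of inbound processing order:
# destination NAT rewrites first, so the policy sees the post-NAT address.
def process_inbound(pkt, dnat_map, policies):
    pkt = dict(pkt, dst=dnat_map.get(pkt["dst"], pkt["dst"]))  # 1. DNAT
    # 2. route lookup omitted
    for pol in policies:                                       # 3. policy match
        if pol["dst"] == pkt["dst"]:
            return pol["action"]
    return "deny"                                              # no match

pkt = {"src": "198.51.100.7", "dst": "203.0.113.10"}
dnat = {"203.0.113.10": "10.0.1.100"}

# A policy written against the public IP never matches:
print(process_inbound(pkt, dnat, [{"dst": "203.0.113.10", "action": "permit"}]))  # deny
# A policy written against the post-NAT IP does:
print(process_inbound(pkt, dnat, [{"dst": "10.0.1.100", "action": "permit"}]))    # permit
```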
&lt;h2&gt;Debug Workflow&lt;/h2&gt;
&lt;h3&gt;Step 1: Session Table&lt;/h3&gt;
&lt;p&gt;First, check if session exists:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show security flow session

# Filter by IP
show security flow session source-prefix 192.168.1.100/32
show security flow session destination-prefix 10.0.1.100/32

# Extensive output shows policy that matched
show security flow session extensive
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Session exists? Check which policy matched. No session? Traffic isn&apos;t flowing or policy is denying.&lt;/p&gt;
&lt;h3&gt;Step 2: Policy Hit Counters&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all policies with hit counts
show security policies hit-count

# Output:
# Logical system: root-logical-system
# Index   From zone        To zone          Name             Policy count
# 1       trust            untrust          allow-web        12543
# 2       trust            untrust          allow-dns        3421
# 3       trust            untrust          deny-all         89

# Zero hits on your policy? It&apos;s not matching.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clear and retest:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Clear counters
clear security policies hit-count

# Generate test traffic, then check again
show security policies hit-count
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Flow Trace&lt;/h3&gt;
&lt;p&gt;The nuclear option. Enables detailed logging of packet processing.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable traceoptions
set security flow traceoptions file flow-trace
set security flow traceoptions file size 10m
set security flow traceoptions flag basic-datapath
set security flow traceoptions packet-filter my-filter source-prefix 192.168.1.100/32
commit

# Generate traffic, then view trace
show log flow-trace

# CRITICAL: Disable when done (performance impact!)
delete security flow traceoptions
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Reading Flow Trace Output&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;flow_process_pkt: received packet on ge-0/0/1.0, src 192.168.1.100, dst 8.8.8.8
  flow_first_policy_search: policy search from zone trust to zone untrust
  flow_first_policy_search: policy found: allow-web
  flow_first_routing_lookup: dst 8.8.8.8, ifl 72 (ge-0/0/0.0)
  flow_first_install_session: session installed successfully
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key lines to watch:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;policy search from zone X to zone Y&lt;/code&gt; — confirms zone pair&lt;/li&gt;
&lt;li&gt;&lt;code&gt;policy found: &amp;lt;name&amp;gt;&lt;/code&gt; — which policy matched&lt;/li&gt;
&lt;li&gt;&lt;code&gt;policy search: no policy match&lt;/code&gt; — the dreaded no-match&lt;/li&gt;
&lt;/ul&gt;
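&lt;p&gt;Trace files get long fast. A hypothetical &lt;code&gt;summarize&lt;/code&gt; helper — line formats assumed to match the sample above; adjust the patterns to your Junos version — pulls out just those key lines:&lt;/p&gt;

```python
import re

# Extract the interesting events from a flow trace (formats assumed
# to match the sample output above).
SAMPLE = """flow_process_pkt: received packet on ge-0/0/1.0, src 192.168.1.100, dst 8.8.8.8
  flow_first_policy_search: policy search from zone trust to zone untrust
  flow_first_policy_search: policy found: allow-web
  flow_first_install_session: session installed successfully"""

def summarize(trace):
    zones = re.search(r"from zone (\S+) to zone (\S+)", trace)
    policy = re.search(r"policy found: (\S+)", trace)
    return {
        "zone_pair": zones.groups() if zones else None,
        "policy": policy.group(1) if policy else "NO MATCH",
    }

print(summarize(SAMPLE))  # {'zone_pair': ('trust', 'untrust'), 'policy': 'allow-web'}
```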
&lt;h2&gt;Common Fix Patterns&lt;/h2&gt;
&lt;h3&gt;Pattern 1: &quot;Any&quot; Debugging&lt;/h3&gt;
&lt;p&gt;When nothing works, start with wide-open policy to confirm traffic flow:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Temporary debug policy (REMOVE AFTER!)
set security policies from-zone trust to-zone untrust policy DEBUG-ALLOW match source-address any
set security policies from-zone trust to-zone untrust policy DEBUG-ALLOW match destination-address any
set security policies from-zone trust to-zone untrust policy DEBUG-ALLOW match application any
set security policies from-zone trust to-zone untrust policy DEBUG-ALLOW then permit
insert security policies from-zone trust to-zone untrust policy DEBUG-ALLOW before policy &amp;lt;first-policy&amp;gt;
commit

# Traffic works? Problem is in policy specifics.
# Traffic fails? Problem is zones, routing, or NAT.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern 2: Explicit Logging&lt;/h3&gt;
&lt;p&gt;Enable session logs to see what&apos;s happening:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Log at session init and close
set security policies from-zone trust to-zone untrust policy allow-web then log session-init
set security policies from-zone trust to-zone untrust policy allow-web then log session-close
commit

# View logs
show log messages | match RT_FLOW
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern 3: Policy Order Audit&lt;/h3&gt;
&lt;p&gt;Policies evaluate top-to-bottom. First match wins.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# See policy order
show security policies from-zone trust to-zone untrust

# Check if broad policy is shadowing specific one
# Policy &quot;deny-all&quot; at position 3 will never be reached if &quot;permit-any&quot; is at position 2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reorder if needed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Move policy
insert security policies from-zone trust to-zone untrust policy specific-rule before policy broad-rule
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern 4: Global Policy Check&lt;/h3&gt;
&lt;p&gt;Don&apos;t forget global policies — they apply across all zone pairs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show security policies global

# Global policy might be permitting/denying before zone policy is evaluated
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;NAT Debugging&lt;/h2&gt;
&lt;h3&gt;Source NAT Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check NAT rule hit counts
show security nat source rule all

# Verify pool has addresses
show security nat source pool all

# Check if NAT is actually translating
show security flow session extensive | match &quot;NAT&quot;
# Look for: &quot;In: ... NAT: ... Out: ...&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Destination NAT Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check destination NAT rules
show security nat destination rule all

# Remember: policy must match POST-NAT destination!
# If NAT changes 203.0.113.10 → 10.0.1.100
# Policy destination-address must be 10.0.1.100
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Verification Commands Summary&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# Zone verification
show security zones
show interfaces terse | match &quot;ge-&quot;

# Policy verification
show security policies from-zone X to-zone Y
show security policies hit-count
show security policies detail

# Session verification
show security flow session
show security flow session extensive
show security flow session count

# NAT verification
show security nat source rule all
show security nat destination rule all
show security nat source summary
show security nat destination summary

# Address book verification
show security address-book global
show security zones security-zone &amp;lt;zone&amp;gt; address-book

# Flow trace (use carefully)
show log flow-trace
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;SRX policy debugging follows a predictable pattern:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Verify zones first&lt;/strong&gt; — most issues are zone mismatches&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Check session table&lt;/strong&gt; — does traffic create a session?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Look at hit counters&lt;/strong&gt; — which policy is matching (or not)?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use flow trace last&lt;/strong&gt; — performance impact, use surgically&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The processing order matters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Destination NAT → Routing → Policy → Source NAT&lt;/li&gt;
&lt;li&gt;Policy matches on post-DNAT, pre-SNAT addresses&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When stuck, simplify: broad &quot;any&quot; policy to confirm flow, then narrow down. Log everything. Read the trace output carefully.&lt;/p&gt;
&lt;p&gt;SRX is powerful but unforgiving. Understanding the flow means faster troubleshooting.&lt;/p&gt;
</content:encoded><category>juniper</category><category>firewall</category><category>security</category><category>troubleshooting</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>NAT Logging: Session Tracking for CGNAT and Compliance</title><link>https://ashimov.com/posts/vyos-nat-logging/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-nat-logging/</guid><description>Implement NAT session logging on VyOS. Covers connection tracking logs, log analysis, compliance requirements, and why NAT logs are essential for troubleshooting and legal requirements.</description><pubDate>Tue, 03 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;An abuse complaint arrives: &quot;IP 203.0.113.50 attacked our server at 14:32:15 UTC.&quot; But 203.0.113.50 is your NAT gateway, used by 500 subscribers. Who was it?&lt;/p&gt;
&lt;p&gt;Without NAT logging, you don&apos;t know. With NAT logging, you can trace 203.0.113.50:32768 at 14:32:15 back to internal IP 10.0.15.42 — subscriber John Smith. Now you can respond to the complaint.&lt;/p&gt;
&lt;p&gt;NAT logs are essential for troubleshooting and legal requirements.&lt;/p&gt;
&lt;h2&gt;Why Log NAT&lt;/h2&gt;
&lt;h3&gt;Abuse Handling&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Complaint: &quot;Attack from 203.0.113.50:32768 to victim:80 at 14:32:15&quot;

Without logs: &quot;We use NAT, could be anyone&quot;
With logs: &quot;That was 10.0.15.42 (subscriber #12345)&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Legal Requirements&lt;/h3&gt;
&lt;p&gt;Many jurisdictions require ISPs to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Log NAT translations&lt;/li&gt;
&lt;li&gt;Retain logs for specified period&lt;/li&gt;
&lt;li&gt;Provide data upon legal request&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Troubleshooting&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;User: &quot;My connection keeps dropping&quot;
NAT logs show: Frequent connection resets, port exhaustion
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Connection Tracking Logging&lt;/h2&gt;
&lt;h3&gt;Enable Conntrack Logging&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Log new connections
set system conntrack log new

# Log connection updates (optional, verbose)
set system conntrack log update

# Log connection destroy (useful for session duration)
set system conntrack log destroy

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Log Format&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Example log entries:
[NEW] tcp 6 120 SYN_SENT src=10.0.15.42 dst=93.184.216.34 sport=45678 dport=80 [UNREPLIED] src=93.184.216.34 dst=203.0.113.50 sport=80 dport=32768
[UPDATE] tcp 6 60 SYN_RECV src=10.0.15.42 dst=93.184.216.34 sport=45678 dport=80 src=93.184.216.34 dst=203.0.113.50 sport=80 dport=32768
[DESTROY] tcp 6 src=10.0.15.42 dst=93.184.216.34 sport=45678 dport=80 src=93.184.216.34 dst=203.0.113.50 sport=80 dport=32768
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Fields&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[NEW] - Connection start
[DESTROY] - Connection end

src=10.0.15.42   - Original source (internal)
dst=93.184.216.34 - Original destination
sport=45678      - Original source port

src=93.184.216.34 - Reply source
dst=203.0.113.50  - Reply destination (your NAT IP)
dport=32768       - NAT&apos;d port (what abuse reports show)
&lt;/code&gt;&lt;/pre&gt;
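&lt;p&gt;These fields are regular enough to parse mechanically. A minimal, hypothetical &lt;code&gt;parse_conntrack&lt;/code&gt; helper — field positions assumed per the sample lines above — pulls out exactly the tuple an abuse response needs:&lt;/p&gt;

```python
import re

# Conntrack prints two tuples per line: original direction, then reply.
# The first src= is the internal host; the reply dst= is the NAT IP.
LINE = ("[NEW] tcp 6 120 SYN_SENT src=10.0.15.42 dst=93.184.216.34 "
        "sport=45678 dport=80 [UNREPLIED] src=93.184.216.34 "
        "dst=203.0.113.50 sport=80 dport=32768")

def parse_conntrack(line):
    pairs = re.findall(r"(src|dst|sport|dport)=(\S+)", line)
    orig, reply = dict(pairs[:4]), dict(pairs[4:8])
    return {
        "internal_ip": orig["src"],
        "nat_ip": reply["dst"],
        "nat_port": reply["dport"],  # the port an abuse report will cite
    }

print(parse_conntrack(LINE))
```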
&lt;h2&gt;Syslog Configuration&lt;/h2&gt;
&lt;h3&gt;Send to Remote Syslog&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Send logs to remote syslog server
set system syslog host 10.0.0.100 facility kern level debug

# Or specific file locally
set system syslog file nat-log facility kern level debug

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Log Rotation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Configure logrotate for NAT logs
# /etc/logrotate.d/nat-logs

/var/log/messages {
    daily
    rotate 90    # Keep 90 days for compliance
    compress
    delaycompress
    notifempty
    create 640 root adm
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;NAT Session Table&lt;/h2&gt;
&lt;h3&gt;View Current Sessions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all NAT sessions
show nat source translations

# Or via conntrack
sudo conntrack -L

# Filter for specific internal IP
sudo conntrack -L -s 10.0.15.42

# Filter for specific NAT IP (reply destination after source NAT)
sudo conntrack -L -q 203.0.113.50
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Real-Time Monitoring&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Watch new connections
sudo conntrack -E -e NEW

# Watch specific source
sudo conntrack -E -s 10.0.15.42
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Log Analysis&lt;/h2&gt;
&lt;h3&gt;Find Session by External Port&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Complaint: 203.0.113.50:32768 at 14:32:15

# Search logs
grep &quot;dport=32768&quot; /var/log/messages | grep &quot;14:32&quot;

# Output:
# Jan 8 14:32:15 router kernel: [NEW] tcp ... src=10.0.15.42 ... dport=32768
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Find All Sessions for Internal IP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Who was 10.0.15.42 talking to?
grep &quot;src=10.0.15.42&quot; /var/log/messages | grep &quot;\[NEW\]&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Session Statistics&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Connections per internal IP
# Take only the first src= per line; the second src= on each line is the
# reply tuple (remote server), not an internal host
grep &quot;\[NEW\]&quot; /var/log/messages | awk &apos;{for (i = 1; i &amp;lt;= NF; i++) if ($i ~ /^src=/) { print $i; next }}&apos; | sort | uniq -c | sort -rn | head

# Destinations contacted by one internal IP (first dst= per line)
grep &quot;src=10.0.15.42&quot; /var/log/messages | grep &quot;\[NEW\]&quot; | awk &apos;{for (i = 1; i &amp;lt;= NF; i++) if ($i ~ /^dst=/) { print $i; next }}&apos; | sort | uniq -c | sort -rn
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Log Storage and Retention&lt;/h2&gt;
&lt;h3&gt;Requirements&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;GDPR (EU): Not specifically defined, but purpose limitation applies
Various ISP regulations: 6 months to 2 years typical

Calculate storage:
- ~200 bytes per session
- 1000 sessions/second = 200 KB/s = 17 GB/day
- 90 days retention = 1.5 TB
&lt;/code&gt;&lt;/pre&gt;
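&lt;p&gt;The arithmetic above is easy to script so you can re-run it with your own session rate and retention period:&lt;/p&gt;

```shell
# Reproduce the sizing estimate: 200 bytes/session at 1000 sessions/s,
# 90-day retention (values from the text above).
BYTES_PER_SESSION=200
SESSIONS_PER_SEC=1000
RETENTION_DAYS=90

bytes_per_day=$((BYTES_PER_SESSION * SESSIONS_PER_SEC * 86400))
gb_per_day=$((bytes_per_day / 1000000000))
gb_total=$((bytes_per_day * RETENTION_DAYS / 1000000000))

# Roughly 17 GB/day and about 1.5 TB over 90 days
echo "${gb_per_day} GB/day, ${gb_total} GB for ${RETENTION_DAYS} days"
```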
&lt;h3&gt;Efficient Storage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use structured logging
# Compress old logs
# Consider log aggregation (ELK stack, etc.)

# Example: rsyslog to Elasticsearch
# /etc/rsyslog.d/nat-to-elastic.conf
module(load=&quot;omelasticsearch&quot;)

if $programname == &apos;kernel&apos; and $msg contains &apos;conntrack&apos; then {
    action(type=&quot;omelasticsearch&quot;
           server=&quot;elasticsearch.local&quot;
           serverport=&quot;9200&quot;
           template=&quot;nat-log-template&quot;)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;CGNAT Specific Logging&lt;/h2&gt;
&lt;h3&gt;Extended Port Blocks&lt;/h3&gt;
&lt;p&gt;For CGNAT, logging individual connections is expensive. Alternative: log port block assignments.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Instead of logging every connection:
# [NEW] 10.0.15.42:12345 -&amp;gt; 93.184.216.34:80 via 203.0.113.50:32768

# Log port block assignment:
# 10.0.15.42 assigned ports 32768-34815 on 203.0.113.50 at 14:00:00
# 10.0.15.42 released ports 32768-34815 on 203.0.113.50 at 16:00:00

# Abuse lookup: 203.0.113.50:32768 at 14:32:15 was in 10.0.15.42&apos;s block
&lt;/code&gt;&lt;/pre&gt;
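&lt;p&gt;With fixed-size blocks, mapping a reported port back to a block is pure arithmetic. A sketch using the example values above (2048-port blocks starting at 32768); the block index then keys into the assignment log instead of per-connection logs:&lt;/p&gt;

```shell
# Map a reported NAT port to its port block with fixed-size blocks.
PORT=35000       # port from the abuse report
BASE=32768       # first port used for blocks
BLOCK_SIZE=2048  # ports per subscriber block

block=$(( (PORT - BASE) / BLOCK_SIZE ))
block_start=$(( BASE + block * BLOCK_SIZE ))
block_end=$(( block_start + BLOCK_SIZE - 1 ))

echo "port $PORT is in block $block (ports $block_start-$block_end)"
```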
&lt;h3&gt;VyOS CGNAT Logging&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# With deterministic NAT (fixed internal-to-port-block mapping),
# logging is simpler: the mapping is predictable, so only block
# assignments need recording.

# With dynamic NAT, full connection logging is required.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Responding to Abuse Complaints&lt;/h2&gt;
&lt;h3&gt;Process&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Receive complaint with:
   - External IP (your NAT)
   - External port
   - Timestamp (UTC!)
   - Destination IP/port
   - Protocol

2. Convert timestamp to your timezone

3. Search logs:
   grep &quot;dport=&amp;lt;port&amp;gt;&quot; /var/log/nat.log | grep &quot;&amp;lt;timestamp&amp;gt;&quot;

4. Identify internal IP

5. Map internal IP to subscriber (from DHCP logs, etc.)

6. Respond to complaint with internal reference
&lt;/code&gt;&lt;/pre&gt;
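&lt;p&gt;Steps 3 and 4 collapse into one pipeline. A self-contained sketch run against two hypothetical log lines (in production, point the pipeline at your real log file):&lt;/p&gt;

```shell
# Filter by the reported NAT port and minute, then extract the internal IP.
# log_lines stands in for the real log file; the lines are hypothetical.
log_lines() {
    printf 'Jan  8 14:31:02 router kernel: [NEW] tcp 6 src=10.0.15.7 dst=198.51.100.9 sport=40000 dport=443 src=198.51.100.9 dst=203.0.113.50 sport=443 dport=31000\n'
    printf 'Jan  8 14:32:15 router kernel: [NEW] tcp 6 src=10.0.15.42 dst=93.184.216.34 sport=45678 dport=80 src=93.184.216.34 dst=203.0.113.50 sport=80 dport=32768\n'
}

# Complaint: NAT port 32768 at 14:32 -- first src= on the matching line
# is the internal source to cross-reference with DHCP/subscriber records.
internal=$(log_lines | grep 'dport=32768' | grep '14:32' | grep -oE 'src=[0-9.]+' | head -1 | cut -d= -f2)
echo "complaint maps to internal IP: $internal"
```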
&lt;h3&gt;Response Template&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Reference: ABUSE-2025-0108-001
Complaint received: 2025-01-08

External IP: 203.0.113.50
External Port: 32768
Timestamp: 2025-01-08 14:32:15 UTC
Destination: victim.example.com:80

Investigation:
NAT logs show this connection originated from internal IP 10.0.15.42.
This corresponds to subscriber account #12345.

Action taken:
[Your action - warning, suspension, etc.]
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Performance Considerations&lt;/h2&gt;
&lt;h3&gt;Log Volume&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Logging every connection has cost:
# - CPU for logging
# - Disk I/O
# - Storage space

# For high-throughput NAT:
# Consider sampling
# Use dedicated log server
# Aggregate logs
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Log Only What&apos;s Needed&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Log only NEW events (not UPDATE/DESTROY)
set system conntrack log new

# DELETE logs useful for session duration, but doubles volume
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Security of Logs&lt;/h2&gt;
&lt;h3&gt;Access Control&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# NAT logs contain sensitive data
# - Who visited what
# - Privacy implications

# Restrict access
chmod 600 /var/log/nat.log
# Only authorized personnel

# Encrypt at rest
# Consider log integrity (signing)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Retention Policy&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Define retention period based on:
# - Legal requirements
# - Business needs
# - Privacy obligations

# Automatic purge after retention period
find /var/log/nat-archive -mtime +90 -delete
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Log to Remote Server&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Don&apos;t rely on local storage
# Remote syslog or log aggregation
set system syslog host 10.0.0.100 facility kern level debug
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Timestamp Accuracy&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use NTP for accurate timestamps
set system ntp server pool.ntp.org

# Complaints reference specific times
# Accuracy matters for correlation
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Test Log Search&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Before you need it:
# - Generate test traffic
# - Verify logs captured
# - Practice searching
# - Measure search time
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Document Process&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# NAT Log Lookup Procedure
1. Receive complaint
2. Verify timestamp timezone
3. Search command: grep &quot;dport=X&quot; /path/to/logs
4. Identify internal IP
5. Cross-reference with DHCP/subscriber database
6. Document findings
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;NAT logs are essential for troubleshooting and legal requirements.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without NAT logs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Can&apos;t respond to abuse complaints&lt;/li&gt;
&lt;li&gt;Can&apos;t troubleshoot user issues&lt;/li&gt;
&lt;li&gt;May violate legal requirements&lt;/li&gt;
&lt;li&gt;&quot;It could be anyone&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With NAT logs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Trace any connection to internal source&lt;/li&gt;
&lt;li&gt;Respond to abuse with evidence&lt;/li&gt;
&lt;li&gt;Meet compliance requirements&lt;/li&gt;
&lt;li&gt;Debug NAT issues&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The overhead of logging is real but necessary. Plan storage, plan retention, and practice lookup before you need it during an incident.&lt;/p&gt;
&lt;p&gt;Your first abuse complaint is too late to set up logging.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>monitoring</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>FlowSpec: Programmable Filters via BGP</title><link>https://ashimov.com/posts/vyos-flowspec/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-flowspec/</guid><description>Understand BGP FlowSpec for traffic filtering. Covers FlowSpec rules, BGP distribution, rate limiting, and why FlowSpec enables network-wide filtering from a single point.</description><pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Blocking an attacking IP requires logging into every border router, adding firewall rules, and hoping you don&apos;t make a typo. For 10 routers, that&apos;s 10 changes. During an attack, when speed matters most.&lt;/p&gt;
&lt;p&gt;BGP FlowSpec distributes filter rules via BGP. Define a rule once, BGP propagates it to all routers. Routers automatically install the filters. Network-wide blocking from a single point.&lt;/p&gt;
&lt;p&gt;FlowSpec enables network-wide filtering from a single control point.&lt;/p&gt;
&lt;h2&gt;What FlowSpec Does&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Traditional:
[Admin] → SSH → [Router 1] → add rule
        → SSH → [Router 2] → add rule
        → SSH → [Router 3] → add rule
        ... (manual, slow, error-prone)

FlowSpec:
[Admin] → [FlowSpec Controller] → BGP FlowSpec → [All Routers]
                                                  (automatic installation)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FlowSpec Components&lt;/h2&gt;
&lt;h3&gt;NLRI (Network Layer Reachability Information)&lt;/h3&gt;
&lt;p&gt;FlowSpec rules are BGP NLRIs that describe traffic:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Destination&lt;/td&gt;
&lt;td&gt;Destination prefix&lt;/td&gt;
&lt;td&gt;203.0.113.0/24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source&lt;/td&gt;
&lt;td&gt;Source prefix&lt;/td&gt;
&lt;td&gt;198.51.100.0/24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;IP protocol&lt;/td&gt;
&lt;td&gt;TCP, UDP, ICMP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Port&lt;/td&gt;
&lt;td&gt;L4 port&lt;/td&gt;
&lt;td&gt;80, 443, 53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fragment&lt;/td&gt;
&lt;td&gt;Fragmentation flags&lt;/td&gt;
&lt;td&gt;don&apos;t-fragment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Packet length&lt;/td&gt;
&lt;td&gt;Packet size&lt;/td&gt;
&lt;td&gt;100-1500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DSCP&lt;/td&gt;
&lt;td&gt;DiffServ code point&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Actions&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traffic-rate&lt;/td&gt;
&lt;td&gt;Rate limit (0 = drop)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic-action&lt;/td&gt;
&lt;td&gt;Sample, terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redirect&lt;/td&gt;
&lt;td&gt;Send to specific VRF&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic-marking&lt;/td&gt;
&lt;td&gt;Set DSCP&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;FlowSpec on VyOS&lt;/h2&gt;
&lt;p&gt;VyOS FlowSpec support depends on version and FRRouting capabilities.&lt;/p&gt;
&lt;h3&gt;Enable FlowSpec Address Family&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable FlowSpec for IPv4
set protocols bgp address-family ipv4-flowspec

# Configure neighbor for FlowSpec
set protocols bgp neighbor 10.255.0.1 address-family ipv4-flowspec

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Receive FlowSpec Rules&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Accept FlowSpec from upstream
set protocols bgp neighbor 10.255.0.1 address-family ipv4-flowspec

# Interface to apply FlowSpec rules
set protocols bgp address-family ipv4-flowspec local-install interface eth0

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FlowSpec Rule Examples&lt;/h2&gt;
&lt;h3&gt;Block Traffic to Destination&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Block all traffic to 203.0.113.100/32

# FlowSpec NLRI components:
# Destination: 203.0.113.100/32
# Action: traffic-rate 0 (drop)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Block Specific Port&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Block UDP port 53 to destination (DNS amplification)

# FlowSpec NLRI:
# Destination: 203.0.113.0/24
# Protocol: UDP (17)
# Destination port: 53
# Action: traffic-rate 0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rate Limit&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Rate limit ICMP to destination

# FlowSpec NLRI:
# Destination: 203.0.113.100/32
# Protocol: ICMP (1)
# Action: traffic-rate 1000000 (1 Mbps)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Block Source Network&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Block all traffic from attacking network

# FlowSpec NLRI:
# Source: 198.51.100.0/24
# Action: traffic-rate 0
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FlowSpec Controller&lt;/h2&gt;
&lt;h3&gt;ExaBGP for FlowSpec&lt;/h3&gt;
&lt;p&gt;ExaBGP can inject FlowSpec rules:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# exabgp.conf
neighbor 10.255.0.1 {
    router-id 10.255.0.100;
    local-address 10.255.0.100;
    local-as 65001;
    peer-as 65001;

    flow {
        # Block UDP 53 to victim
        route destination 203.0.113.100/32
              protocol udp
              destination-port 53
              rate-limit 0;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Inject Rule via API&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Using ExaBGP API
echo &quot;announce flow route destination 203.0.113.100/32 protocol tcp destination-port 80 rate-limit 0&quot; | socat - /var/run/exabgp.sock
&lt;/code&gt;&lt;/pre&gt;
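&lt;p&gt;A small wrapper makes injections less typo-prone during an attack. This sketch only builds the announce string, following the syntax of the example above; the exact API grammar and socket path depend on your ExaBGP version and setup:&lt;/p&gt;

```shell
# Build an ExaBGP FlowSpec drop command from parameters. Only string
# construction is shown; in production the result would be piped to the
# ExaBGP socket as in the example above.
flowspec_drop() {
    # $1 = destination prefix, $2 = protocol, $3 = destination port
    printf 'announce flow route destination %s protocol %s destination-port %s rate-limit 0\n' "$1" "$2" "$3"
}

cmd=$(flowspec_drop 203.0.113.100/32 udp 53)
echo "$cmd"
```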
&lt;h2&gt;Viewing FlowSpec&lt;/h2&gt;
&lt;h3&gt;Show Received Rules&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show FlowSpec routes
show bgp ipv4 flowspec

# Output:
# Flow     Destination         Protocol  Port    Action
# 1        203.0.113.100/32   UDP       53      rate-limit 0
# 2        203.0.113.0/24     TCP       80-443  rate-limit 1000000
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Installed Rules&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show rules installed on interface
show policy pbr flowspec interface eth0

# Or via iptables
sudo iptables -L -v -n | grep -i flowspec
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FlowSpec Validation&lt;/h2&gt;
&lt;h3&gt;Important Security Measures&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Only accept FlowSpec from trusted sources
# Validate FlowSpec rules don&apos;t affect unintended traffic

# Prefix validation
# FlowSpec destination should match prefixes you announce
# Prevents upstream from filtering traffic you didn&apos;t request
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Validation Mode&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable FlowSpec validation
set protocols bgp address-family ipv4-flowspec validation

# Only accept FlowSpec for your own prefixes

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FlowSpec Use Cases&lt;/h2&gt;
&lt;h3&gt;Use Case 1: DDoS Response&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Attack detected to 203.0.113.100
# Inject FlowSpec rule from controller

# Block all traffic (like RTBH but more granular)
# FlowSpec: destination 203.0.113.100/32, rate-limit 0

# Or block specific attack pattern
# FlowSpec: destination 203.0.113.100/32, protocol UDP, port 53, rate-limit 0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 2: Traffic Scrubbing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Redirect attack traffic to scrubbing center

# FlowSpec: destination 203.0.113.100/32, redirect VRF scrubbing

# Clean traffic sent back via normal routing
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 3: Rate Limiting&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Limit ICMP to all destinations (prevent ping flood amplification)

# FlowSpec: protocol ICMP, rate-limit 1000000
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 4: Network-Wide Policy&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Block entire protocol network-wide

# FlowSpec: protocol 47 (GRE), rate-limit 0
# All routers now block GRE
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FlowSpec vs RTBH&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;RTBH&lt;/th&gt;
&lt;th&gt;FlowSpec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Granularity&lt;/td&gt;
&lt;td&gt;/32 or /24&lt;/td&gt;
&lt;td&gt;5-tuple&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Actions&lt;/td&gt;
&lt;td&gt;Drop only&lt;/td&gt;
&lt;td&gt;Drop, rate-limit, redirect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;More complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;Wide&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;RTBH: Block everything to a destination
FlowSpec: Block specific traffic patterns to a destination
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Limitations&lt;/h2&gt;
&lt;h3&gt;Hardware Support&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;FlowSpec requires:
- BGP implementation supporting FlowSpec
- Data plane capable of implementing rules
- Sufficient TCAM/memory for rules

VyOS (software router):
- FlowSpec implemented via iptables/nftables
- Works but limited by CPU performance
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rule Complexity&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;More rules = More processing
Complex rules = Harder to manage
Watch for:
- Too many concurrent rules
- Overlapping rules
- Stale rules (remove after attack)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Start Simple&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Begin with basic destination blocks
# Add complexity as needed
# Test before production use
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Automate Rule Management&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use FlowSpec controller
# Integrate with detection systems
# Automatic rule addition and removal
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Set Timeouts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# FlowSpec rules should expire
# Don&apos;t leave blocking rules forever
# Implement automatic cleanup
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Document and Alert&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Log all FlowSpec changes
# Alert team when rules added
# Review rules regularly
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;FlowSpec enables network-wide filtering from a single control point.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Traditional filtering:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Manual changes on each router&lt;/li&gt;
&lt;li&gt;Slow during attacks&lt;/li&gt;
&lt;li&gt;Error-prone&lt;/li&gt;
&lt;li&gt;Inconsistent&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;FlowSpec:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Define once, propagate everywhere&lt;/li&gt;
&lt;li&gt;Fast deployment via BGP&lt;/li&gt;
&lt;li&gt;Consistent across network&lt;/li&gt;
&lt;li&gt;Can be automated&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VyOS FlowSpec support varies by version. For production use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Verify feature support&lt;/li&gt;
&lt;li&gt;Test thoroughly&lt;/li&gt;
&lt;li&gt;Have fallback (manual rules, RTBH)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;FlowSpec is powerful but complex. Start with simple rules, build automation, expand as you gain experience.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>networking</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>RTBH: Remote Triggered Blackhole Routing</title><link>https://ashimov.com/posts/vyos-rtbh/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-rtbh/</guid><description>Implement RTBH on VyOS for DDoS mitigation. Covers blackhole routing, BGP communities, triggering procedures, and why RTBH sacrifices the target to save the network.</description><pubDate>Tue, 24 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A massive DDoS attack is saturating your upstream link. Your entire network is affected because one target is receiving gigabits of attack traffic. You can&apos;t filter it — there&apos;s too much. You can&apos;t absorb it — your link is overwhelmed.&lt;/p&gt;
&lt;p&gt;RTBH (Remote Triggered Blackhole) tells your upstream provider: &quot;Drop all traffic to this IP.&quot; The attack traffic is discarded at the upstream, before it reaches your network. Your target is offline, but your network survives.&lt;/p&gt;
&lt;p&gt;RTBH sacrifices the target to save the network.&lt;/p&gt;
&lt;h2&gt;How RTBH Works&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Normal:
Attack ─→ [ISP] ─→ [Your Router] ─→ [Victim Server]
                   │ Link saturated
                   └─→ [Other Servers] (collateral damage)

With RTBH:
Attack ─→ [ISP] ─✕─ (traffic blackholed)

         [Your Router] ─→ [Victim Server] (unreachable, but attack stopped)
                      └─→ [Other Servers] (working normally)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;RTBH Setup&lt;/h2&gt;
&lt;h3&gt;Prerequisites&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;BGP session with upstream provider&lt;/li&gt;
&lt;li&gt;Agreement on blackhole community (e.g., ISP_ASN:666)&lt;/li&gt;
&lt;li&gt;Prefix you can announce (/32 or /24 depending on ISP)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Blackhole Route&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create blackhole next-hop
set protocols static route 192.0.2.1/32 blackhole

# This creates a null route locally
# Packets to 192.0.2.1 are dropped

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure BGP to Announce Blackhole&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define blackhole community (check with your ISP)
# This list matches the community on received routes; the route-map
# below attaches the community to announcements directly
set policy community-list ISP-BLACKHOLE rule 10 regex &quot;65000:666&quot;

# Route map for blackhole announcements
set policy route-map BLACKHOLE-OUT rule 10 action permit
set policy route-map BLACKHOLE-OUT rule 10 match ip address prefix-list BLACKHOLE-PREFIXES
set policy route-map BLACKHOLE-OUT rule 10 set community &quot;65000:666&quot;
set policy route-map BLACKHOLE-OUT rule 10 set origin igp

# Regular announcements
set policy route-map BLACKHOLE-OUT rule 20 action permit

# Apply to BGP neighbor
set protocols bgp neighbor 10.0.0.1 address-family ipv4-unicast route-map export BLACKHOLE-OUT

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Triggering RTBH&lt;/h2&gt;
&lt;h3&gt;Manual Trigger&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add victim IP to blackhole prefix list
set policy prefix-list BLACKHOLE-PREFIXES rule 10 prefix 203.0.113.100/32
set policy prefix-list BLACKHOLE-PREFIXES rule 10 action permit

# Ensure route exists
set protocols static route 203.0.113.100/32 blackhole

commit

# ISP receives announcement with blackhole community
# ISP drops all traffic to 203.0.113.100
&lt;/code&gt;&lt;/pre&gt;
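&lt;p&gt;Generating these commands from a script lets the on-call engineer review them before committing. A sketch that only builds the command list (the rule number is a parameter; a real script should allocate an unused one):&lt;/p&gt;

```shell
# Emit the VyOS commands for a blackhole trigger, given target IP and
# prefix-list rule number.
gen_blackhole() {
    target=$1
    rule=$2
    printf 'set policy prefix-list BLACKHOLE-PREFIXES rule %s prefix %s/32\n' "$rule" "$target"
    printf 'set policy prefix-list BLACKHOLE-PREFIXES rule %s action permit\n' "$rule"
    printf 'set protocols static route %s/32 blackhole\n' "$target"
}

cmds=$(gen_blackhole 203.0.113.100 10)
echo "$cmds"
```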
&lt;h3&gt;Remove Blackhole&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Remove from prefix list
delete policy prefix-list BLACKHOLE-PREFIXES rule 10

# Remove blackhole route
delete protocols static route 203.0.113.100/32

commit

# ISP removes blackhole, traffic flows again
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Trigger Router Architecture&lt;/h2&gt;
&lt;h3&gt;Dedicated Trigger Router&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;                              ┌─────────────────┐
                              │ Trigger Router  │
                              │ (announces /32) │
                              └────────┬────────┘
                                       │ iBGP
┌────────────────────────────────────────────────────────────┐
│                    Your Network                            │
│  [Border1] ════════════════════════════════ [Border2]      │
│      │                                           │         │
│      └───────────────[ISP A]───────────────────┘           │
└────────────────────────────────────────────────────────────┘

Trigger router announces /32 with blackhole community
Border routers learn and propagate to ISP
ISP drops traffic to the /32
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Trigger Router Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Trigger router (separate from border routers)
set protocols bgp system-as 65001
set protocols bgp router-id 10.255.0.100

# iBGP to border routers
set protocols bgp neighbor 10.255.0.1 remote-as 65001
set protocols bgp neighbor 10.255.0.2 remote-as 65001

# Blackhole routes announced via iBGP
# Border routers then announce to ISP with community

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Destination-Based vs Source-Based RTBH&lt;/h2&gt;
&lt;h3&gt;Destination-Based (Common)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Drop traffic TO the victim
# Victim is unreachable but network saved

set protocols static route 203.0.113.100/32 blackhole
# Announce 203.0.113.100/32 with blackhole community
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Source-Based (If ISP Supports)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Drop traffic FROM attacker
# Victim remains reachable
# Requires ISP support for S-RTBH

# Much less common
# Check with your specific ISP
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Automation&lt;/h2&gt;
&lt;h3&gt;Trigger Script&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/trigger-rtbh.sh
# Usage: trigger-rtbh.sh add|remove &amp;lt;target-ip&amp;gt;

ACTION=$1
TARGET=$2

if [ -z &quot;$ACTION&quot; ] || [ -z &quot;$TARGET&quot; ]; then
    echo &quot;Usage: $0 add|remove &amp;lt;target-ip&amp;gt;&quot;
    exit 1
fi

case $ACTION in
    add)
        # Add blackhole route (vtysh changes are not persisted in the
        # VyOS config; use the API for permanent changes)
        vtysh -c &quot;configure terminal&quot; \
              -c &quot;ip route $TARGET/32 blackhole&quot;

        # Add to prefix list
        # (requires API or direct config manipulation)
        echo &quot;Blackhole triggered for $TARGET&quot;
        ;;
    remove)
        vtysh -c &quot;configure terminal&quot; \
              -c &quot;no ip route $TARGET/32 blackhole&quot;
        echo &quot;Blackhole removed for $TARGET&quot;
        ;;
    *)
        echo &quot;Unknown action: $ACTION&quot;
        exit 1
        ;;
esac
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Integration with Monitoring&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/auto-rtbh.sh

# Monitor traffic to critical IPs
# If threshold exceeded, trigger RTBH

THRESHOLD_PPS=1000000  # 1M pps

for ip in $(cat /config/protected-ips.txt); do
    PPS=$(get_pps_to_ip &quot;$ip&quot;)  # Your monitoring tool

    if [ &quot;$PPS&quot; -gt &quot;$THRESHOLD_PPS&quot; ]; then
        /config/scripts/trigger-rtbh.sh add &quot;$ip&quot;
        alert_team &quot;RTBH triggered for $ip (${PPS} pps)&quot;  # Your alerting hook
    fi
done
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ISP Community Reference&lt;/h2&gt;
&lt;h3&gt;Common Blackhole Communities&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Format: ISP_ASN:666 (common convention)

# Check with your specific ISP
# Examples (verify before use):
# Level3:    3356:666 or 3356:9999
# NTT:       2914:666
# Cogent:    174:666
# Your ISP:  Check their BGP community documentation
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Multiple Upstreams&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Different community per upstream
set policy route-map BLACKHOLE-OUT-ISP1 rule 10 set community &quot;65001:666&quot;
set policy route-map BLACKHOLE-OUT-ISP2 rule 10 set community &quot;65002:666&quot;

# Apply to respective neighbors
set protocols bgp neighbor 10.0.0.1 address-family ipv4-unicast route-map export BLACKHOLE-OUT-ISP1
set protocols bgp neighbor 10.0.1.1 address-family ipv4-unicast route-map export BLACKHOLE-OUT-ISP2

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Verification&lt;/h2&gt;
&lt;h3&gt;Check Local Blackhole&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify blackhole route installed
show ip route 203.0.113.100

# Should show (static route, admin distance 1):
# S&amp;gt;* 203.0.113.100/32 [1/0] unreachable (blackhole), 00:05:00
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check BGP Announcement&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify route is being announced
show bgp ipv4 unicast 203.0.113.100/32

# Check communities
show bgp ipv4 unicast 203.0.113.100/32 community

# Should show blackhole community attached
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check with ISP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Look at ISP&apos;s looking glass
# Verify they received the announcement
# Verify they&apos;re applying blackhole

# Most ISPs have looking glass tools
# Check route presence and community
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Risks and Considerations&lt;/h2&gt;
&lt;h3&gt;Victim Becomes Unreachable&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;RTBH drops ALL traffic to victim:
- Attack traffic: dropped ✓
- Legitimate traffic: dropped ✓

Victim is completely offline during RTBH
Only use when alternative is worse (entire network down)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prefix Length Requirements&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Many ISPs only accept /24 or shorter
# /32 announcements may be filtered

# Options:
# 1. ISP accepts /32 with blackhole community (best)
# 2. Announce covering /24 (affects more IPs)
# 3. Use ISP-specific RTBH mechanism
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP Propagation Time&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Trigger RTBH → BGP updates propagate → ISP applies blackhole

Typical: 30 seconds to a few minutes
During this window, the attack still reaches you
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Document Procedure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# RTBH Trigger Procedure

## When to Use
- Attack saturating upstream link
- Collateral damage to other services
- Manual filtering impossible

## Steps
1. Confirm attack target IP
2. Notify team/management
3. Execute trigger script
4. Verify with ISP
5. Monitor network recovery
6. Remove blackhole when attack stops

## Contacts
- ISP NOC: +1-xxx-xxx-xxxx
- Internal: @security-team
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Test Before You Need It&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Test with non-critical IP
# Verify ISP accepts and applies blackhole
# Measure propagation time
# Test removal procedure
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Have Rollback Ready&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Keep removal procedure ready
# Time-limit blackholes (auto-remove)
# Monitor for attack cessation
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Combine with Other Measures&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;RTBH: Nuclear option, saves network
Also use:
- Rate limiting (smaller attacks)
- Upstream scrubbing (sophisticated attacks)
- CDN/DDoS protection (application layer)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;RTBH sacrifices the target to save the network.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When to use RTBH:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Attack larger than your link can handle&lt;/li&gt;
&lt;li&gt;Collateral damage affecting entire network&lt;/li&gt;
&lt;li&gt;No other option available&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When NOT to use RTBH:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Attack is manageable with rate limiting&lt;/li&gt;
&lt;li&gt;Upstream scrubbing is available&lt;/li&gt;
&lt;li&gt;Target availability is critical&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;RTBH is the last resort. It works by making the victim unreachable to everyone — attackers and legitimate users alike. Use it when the alternative (entire network down) is worse.&lt;/p&gt;
&lt;p&gt;Have it configured and tested before you need it. During an attack is not the time to learn RTBH.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>DDoS Mitigation at the Edge: Rate Limiting and Traffic Scrubbing</title><link>https://ashimov.com/posts/vyos-ddos/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-ddos/</guid><description>Implement basic DDoS protection on VyOS edge routers. Covers rate limiting, connection limits, SYN flood protection, and why edge mitigation buys time for upstream solutions.</description><pubDate>Fri, 20 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Your upstream link is 1 Gbps. The attack is 10 Gbps. Your edge router can&apos;t help — the link is already saturated before packets reach you.&lt;/p&gt;
&lt;p&gt;Edge DDoS mitigation isn&apos;t about stopping massive volumetric attacks. It&apos;s about protecting against smaller attacks, reducing collateral damage, and buying time until upstream mitigation kicks in.&lt;/p&gt;
&lt;p&gt;Edge mitigation buys time. It&apos;s not a complete solution, but it&apos;s better than nothing.&lt;/p&gt;
&lt;h2&gt;What Edge Routers Can Do&lt;/h2&gt;
&lt;h3&gt;Effective Against&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Application-layer attacks (HTTP, DNS)&lt;/li&gt;
&lt;li&gt;SYN floods (up to link capacity)&lt;/li&gt;
&lt;li&gt;Slowloris-style attacks&lt;/li&gt;
&lt;li&gt;Amplification from your network&lt;/li&gt;
&lt;li&gt;Small-scale volumetric attacks&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Not Effective Against&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Attacks larger than your upstream link&lt;/li&gt;
&lt;li&gt;Sophisticated distributed attacks&lt;/li&gt;
&lt;li&gt;Attacks that saturate your ISP&apos;s network&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Rate Limiting&lt;/h2&gt;
&lt;h3&gt;Basic Rate Limiting with Firewall&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Rate limit incoming connections per source
set firewall ipv4 name WAN-IN rule 50 action drop
set firewall ipv4 name WAN-IN rule 50 recent count 100
set firewall ipv4 name WAN-IN rule 50 recent time minute
set firewall ipv4 name WAN-IN rule 50 state new
set firewall ipv4 name WAN-IN rule 50 description &quot;Rate limit: 100 new conn/min/source&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rate Limit Specific Services&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Rate limit SSH connections
set firewall ipv4 name WAN-LOCAL rule 100 action drop
set firewall ipv4 name WAN-LOCAL rule 100 destination port 22
set firewall ipv4 name WAN-LOCAL rule 100 protocol tcp
set firewall ipv4 name WAN-LOCAL rule 100 recent count 5
set firewall ipv4 name WAN-LOCAL rule 100 recent time minute
set firewall ipv4 name WAN-LOCAL rule 100 state new
set firewall ipv4 name WAN-LOCAL rule 100 description &quot;SSH: Max 5 new conn/min/source&quot;

# Allow SSH that passes rate limit
set firewall ipv4 name WAN-LOCAL rule 101 action accept
set firewall ipv4 name WAN-LOCAL rule 101 destination port 22
set firewall ipv4 name WAN-LOCAL rule 101 protocol tcp
set firewall ipv4 name WAN-LOCAL rule 101 state new

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rate Limit DNS Queries&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Protect DNS server from amplification abuse
set firewall ipv4 name WAN-IN rule 200 action drop
set firewall ipv4 name WAN-IN rule 200 destination port 53
set firewall ipv4 name WAN-IN rule 200 protocol udp
set firewall ipv4 name WAN-IN rule 200 recent count 50
set firewall ipv4 name WAN-IN rule 200 recent time second
set firewall ipv4 name WAN-IN rule 200 description &quot;DNS: Max 50 queries/sec/source&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Connection Limits&lt;/h2&gt;
&lt;h3&gt;Limit Concurrent Connections&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Limit connections per source IP
set firewall ipv4 name WAN-IN rule 60 action drop
set firewall ipv4 name WAN-IN rule 60 conntrack connection-limit source-mask 32
set firewall ipv4 name WAN-IN rule 60 conntrack connection-limit count 100
set firewall ipv4 name WAN-IN rule 60 state new
set firewall ipv4 name WAN-IN rule 60 description &quot;Max 100 concurrent connections/IP&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Conntrack Table Protection&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Increase conntrack table size
set system conntrack table-size 524288

# Aggressive timeouts during attack
set system conntrack timeout tcp time-wait 30
set system conntrack timeout tcp close 10
set system conntrack timeout udp other 30

commit
&lt;/code&gt;&lt;/pre&gt;
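&lt;p&gt;To know whether the table size is adequate, compare current usage against the maximum. These proc entries are standard Linux:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Current vs maximum conntrack entries
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Sustained usage near the maximum means dropped connections
&lt;/code&gt;&lt;/pre&gt;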
&lt;h2&gt;SYN Flood Protection&lt;/h2&gt;
&lt;h3&gt;SYN Cookies&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable SYN cookies (usually enabled by default)
configure
set system sysctl parameter net.ipv4.tcp_syncookies value 1
commit

# SYN cookies let the kernel answer SYNs statelessly, so a flood cannot exhaust the listen backlog
&lt;/code&gt;&lt;/pre&gt;
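&lt;p&gt;Verification is straightforward with standard Linux tools:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Confirm SYN cookies are active (1 = enabled)
sysctl net.ipv4.tcp_syncookies

# Kernel counters hint at SYN pressure
netstat -s | grep -i syn
&lt;/code&gt;&lt;/pre&gt;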
&lt;h3&gt;SYN Rate Limiting&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Limit SYN packets per source
set firewall ipv4 name WAN-IN rule 70 action drop
set firewall ipv4 name WAN-IN rule 70 protocol tcp
set firewall ipv4 name WAN-IN rule 70 tcp flags syn
set firewall ipv4 name WAN-IN rule 70 recent count 20
set firewall ipv4 name WAN-IN rule 70 recent time second
set firewall ipv4 name WAN-IN rule 70 description &quot;SYN flood protection&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Invalid Packet Dropping&lt;/h2&gt;
&lt;h3&gt;Drop Malformed Packets&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Drop invalid state packets
set firewall ipv4 name WAN-IN rule 1 action drop
set firewall ipv4 name WAN-IN rule 1 state invalid
set firewall ipv4 name WAN-IN rule 1 description &quot;Drop invalid packets&quot;

# Drop fragments (often used in attacks)
set firewall ipv4 name WAN-IN rule 2 action drop
set firewall ipv4 name WAN-IN rule 2 fragment match-frag
set firewall ipv4 name WAN-IN rule 2 description &quot;Drop fragments&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;TCP Flag Validation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Drop XMAS scan
set firewall ipv4 name WAN-IN rule 3 action drop
set firewall ipv4 name WAN-IN rule 3 protocol tcp
set firewall ipv4 name WAN-IN rule 3 tcp flags fin,psh,urg
set firewall ipv4 name WAN-IN rule 3 description &quot;Drop XMAS packets&quot;

# Drop NULL scan
set firewall ipv4 name WAN-IN rule 4 action drop
set firewall ipv4 name WAN-IN rule 4 protocol tcp
set firewall ipv4 name WAN-IN rule 4 tcp flags !syn,!ack,!fin,!rst,!urg,!psh
set firewall ipv4 name WAN-IN rule 4 description &quot;Drop NULL packets&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
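&lt;p&gt;SYN+FIN is another combination that never appears in legitimate traffic. Assuming the same flag-match syntax, a rule like this (rule number 6 is an arbitrary free slot) drops it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Drop SYN+FIN (invalid combination)
set firewall ipv4 name WAN-IN rule 6 action drop
set firewall ipv4 name WAN-IN rule 6 protocol tcp
set firewall ipv4 name WAN-IN rule 6 tcp flags syn,fin
set firewall ipv4 name WAN-IN rule 6 description &quot;Drop SYN+FIN packets&quot;

commit
&lt;/code&gt;&lt;/pre&gt;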
&lt;h2&gt;Source Address Validation&lt;/h2&gt;
&lt;h3&gt;Block Bogons&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Block private and reserved (bogon) source ranges arriving on WAN
set firewall group network-group BOGONS network 10.0.0.0/8
set firewall group network-group BOGONS network 172.16.0.0/12
set firewall group network-group BOGONS network 192.168.0.0/16
set firewall group network-group BOGONS network 127.0.0.0/8
set firewall group network-group BOGONS network 0.0.0.0/8
set firewall group network-group BOGONS network 169.254.0.0/16
set firewall group network-group BOGONS network 100.64.0.0/10

set firewall ipv4 name WAN-IN rule 5 action drop
set firewall ipv4 name WAN-IN rule 5 source group network-group BOGONS
set firewall ipv4 name WAN-IN rule 5 description &quot;Block bogon sources&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;uRPF (Unicast Reverse Path Forwarding)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Strict uRPF: drop if the best route back to the source is not via eth0
set firewall interface eth0 in ipv4 urpf strict

# Loose uRPF: drop only if no route to the source exists at all
# Configure one mode or the other, not both
set firewall interface eth0 in ipv4 urpf loose

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Traffic Shaping (QoS)&lt;/h2&gt;
&lt;h3&gt;Prioritize Legitimate Traffic&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Traffic policy
set traffic-policy shaper WAN-OUT bandwidth 1gbit
set traffic-policy shaper WAN-OUT default bandwidth 50%
set traffic-policy shaper WAN-OUT default ceiling 100%
set traffic-policy shaper WAN-OUT default queue-type fair-queue

# High priority class
set traffic-policy shaper WAN-OUT class 10 bandwidth 30%
set traffic-policy shaper WAN-OUT class 10 ceiling 100%
set traffic-policy shaper WAN-OUT class 10 match SSH ip destination port 22
set traffic-policy shaper WAN-OUT class 10 match ICMP ip protocol icmp

# Apply to interface
set interfaces ethernet eth0 traffic-policy out WAN-OUT

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Emergency Response&lt;/h2&gt;
&lt;h3&gt;Quick Blocks During Attack&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Block specific attacking IP immediately
configure
set firewall ipv4 name WAN-IN rule 10 action drop
set firewall ipv4 name WAN-IN rule 10 source address 203.0.113.100
commit

# Block attacking network
set firewall ipv4 name WAN-IN rule 11 action drop
set firewall ipv4 name WAN-IN rule 11 source address 203.0.113.0/24
commit
&lt;/code&gt;&lt;/pre&gt;
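&lt;p&gt;Adding one rule per attacker gets slow under pressure. An address group, created ahead of time and referenced by a single drop rule, reduces each block to one command (rule number 12 is an assumption; pick any free slot in your ruleset):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# One drop rule referencing a group (create before the attack)
set firewall group address-group BLOCKLIST address 203.0.113.100
set firewall ipv4 name WAN-IN rule 12 action drop
set firewall ipv4 name WAN-IN rule 12 source group address-group BLOCKLIST
commit

# Each later block is a single set plus commit
set firewall group address-group BLOCKLIST address 198.51.100.200
commit
&lt;/code&gt;&lt;/pre&gt;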
&lt;h3&gt;Block by Country (GeoIP)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VyOS doesn&apos;t have native GeoIP
# Use external IP lists

# Download country IP ranges
# Add to firewall group
set firewall group network-group BLOCKED-COUNTRY network x.x.x.x/xx
# ... many entries

set firewall ipv4 name WAN-IN rule 20 action drop
set firewall ipv4 name WAN-IN rule 20 source group network-group BLOCKED-COUNTRY
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring During Attack&lt;/h2&gt;
&lt;h3&gt;Watch Connection States&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Monitor conntrack
watch -n 1 &apos;cat /proc/sys/net/netfilter/nf_conntrack_count&apos;

# Show connections by source
sudo conntrack -L | awk &apos;{print $5}&apos; | cut -d= -f2 | sort | uniq -c | sort -rn | head -20

# Show firewall counters (op commands need the wrapper under watch)
watch -n 1 &apos;/opt/vyatta/bin/vyatta-op-cmd-wrapper show firewall&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Traffic Analysis&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Monitor interface traffic (op commands need the wrapper under watch)
watch -n 1 &apos;/opt/vyatta/bin/vyatta-op-cmd-wrapper show interfaces ethernet eth0&apos;

# Capture attack traffic
sudo tcpdump -i eth0 -c 1000 -w /tmp/attack.pcap

# Quick packet rate estimate
timeout 10 tcpdump -i eth0 -c 10000 2&amp;gt;&amp;amp;1 | tail -1
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What To Do When Overwhelmed&lt;/h2&gt;
&lt;h3&gt;1. Contact Upstream Provider&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Your ISP can:
# - Apply upstream ACLs
# - Activate DDoS scrubbing
# - Null route attacking traffic

# Have their NOC number ready!
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Enable Upstream Blackhole&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Advertise your prefix with blackhole community
# Traffic dropped at ISP, saves your link

# See RTBH article for details
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Use DDoS Protection Service&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Services like Cloudflare, Akamai, AWS Shield:
- Route traffic through their scrubbing centers
- They absorb attack, send clean traffic to you
- Works for attacks much larger than your capacity
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Prepare Before Attack&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Have emergency playbook ready
# Know your upstream NOC contact
# Pre-configure blocking rules (disabled)
# Monitor baseline traffic patterns
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Layer Your Defense&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Layer 1: Upstream ISP (volumetric)
Layer 2: Edge router (smaller attacks)
Layer 3: Application firewall (app-layer)
Layer 4: Application hardening
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Automate Response&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/vbash
# /config/scripts/auto-block.sh
# Add sources holding more than THRESHOLD conntrack entries to a block group
# (BLOCKLIST must be referenced by a drop rule in WAN-IN)
source /opt/vyatta/etc/functions/script-template

THRESHOLD=1000  # connections per source

configure
for ip in $(sudo conntrack -L 2&amp;gt;/dev/null | awk &apos;{print $5}&apos; | cut -d= -f2 | sort | uniq -c | awk -v t=$THRESHOLD &apos;$1 &amp;gt; t {print $2}&apos;); do
    echo &quot;Blocking $ip&quot;
    set firewall group address-group BLOCKLIST address $ip
done
commit
exit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Edge mitigation buys time. It&apos;s not a complete solution.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;What edge routers can do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Rate limit connections&lt;/li&gt;
&lt;li&gt;Drop invalid traffic&lt;/li&gt;
&lt;li&gt;Block known attackers&lt;/li&gt;
&lt;li&gt;Protect specific services&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What edge routers can&apos;t do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Stop attacks larger than your pipe&lt;/li&gt;
&lt;li&gt;Replace upstream scrubbing&lt;/li&gt;
&lt;li&gt;Handle sophisticated multi-vector attacks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Build defense in depth:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Upstream DDoS protection for volumetric&lt;/li&gt;
&lt;li&gt;Edge rate limiting for application-layer&lt;/li&gt;
&lt;li&gt;Application hardening for everything else&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The edge router is one layer. Make it effective, but don&apos;t rely on it alone.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Dynamic Routing Over Tunnels: BGP and OSPF Through Encrypted Links</title><link>https://ashimov.com/posts/vyos-tunnel-routing/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-tunnel-routing/</guid><description>Run routing protocols over VPN tunnels on VyOS. Covers OSPF over GRE/IPsec, BGP over WireGuard, tunnel interface selection, and why routing over tunnels requires careful planning.</description><pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Static routes over VPN tunnels work until you have multiple tunnels, need failover, or manage complex topologies. Then you want routing protocols to handle the complexity.&lt;/p&gt;
&lt;p&gt;Running OSPF or BGP over tunnels adds resilience. If a tunnel goes down, the routing protocol detects it and converges to alternate paths. But tunnels add latency, may not support multicast, and need careful interface selection.&lt;/p&gt;
&lt;p&gt;Routing over tunnels requires careful planning — but it&apos;s worth it for resilient networks.&lt;/p&gt;
&lt;h2&gt;Tunnel Types and Routing Support&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tunnel Type&lt;/th&gt;
&lt;th&gt;OSPF&lt;/th&gt;
&lt;th&gt;BGP&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GRE&lt;/td&gt;
&lt;td&gt;Yes (multicast)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Full support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPsec VTI&lt;/td&gt;
&lt;td&gt;Yes (unicast)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No multicast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WireGuard&lt;/td&gt;
&lt;td&gt;Yes (unicast)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No multicast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenVPN&lt;/td&gt;
&lt;td&gt;Yes (unicast)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No multicast&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Multicast vs Unicast OSPF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# GRE supports multicast (normal OSPF)
set protocols ospf interface tun0 area 0

# IPsec VTI/WireGuard need unicast neighbors
set protocols ospf interface wg0 area 0
set protocols ospf neighbor 10.255.0.2  # Explicit neighbor
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;OSPF Over GRE&lt;/h2&gt;
&lt;h3&gt;GRE Tunnel Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# GRE tunnel
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 10.255.0.1/30
set interfaces tunnel tun0 mtu 1476

commit
&lt;/code&gt;&lt;/pre&gt;
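&lt;p&gt;Before layering OSPF on top, verify the tunnel itself forwards:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Tunnel interface up?
show interfaces tunnel tun0

# Remote tunnel address reachable?
ping 10.255.0.2
&lt;/code&gt;&lt;/pre&gt;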
&lt;h3&gt;OSPF Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF over GRE (multicast works)
set protocols ospf interface tun0 area 0
set protocols ospf interface tun0 network point-to-point
set protocols ospf interface tun0 hello-interval 10
set protocols ospf interface tun0 dead-interval 40

# Advertise tunnel network
set protocols ospf area 0 network 10.255.0.0/30

# Advertise local networks
set protocols ospf area 0 network 192.168.1.0/24

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify OSPF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check neighbors
show ip ospf neighbor

# Should show neighbor via tun0
# Neighbor ID     Pri   State    Dead Time   Address      Interface
# 10.255.0.2      1     Full/-   00:00:32    10.255.0.2   tun0

# Check routes
show ip route ospf
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;OSPF Over IPsec (VTI)&lt;/h2&gt;
&lt;h3&gt;IPsec VTI Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# IPsec VTI (route-based VPN)
set vpn ipsec interface eth0
set vpn ipsec esp-group ESP proposal 1 encryption aes256gcm128
set vpn ipsec esp-group ESP proposal 1 hash sha256
set vpn ipsec ike-group IKE proposal 1 encryption aes256
set vpn ipsec ike-group IKE proposal 1 hash sha256
set vpn ipsec ike-group IKE proposal 1 dh-group 14

set vpn ipsec site-to-site peer 198.51.100.1 authentication mode pre-shared-secret
set vpn ipsec site-to-site peer 198.51.100.1 authentication pre-shared-secret &quot;secret&quot;
set vpn ipsec site-to-site peer 198.51.100.1 connection-type initiate
set vpn ipsec site-to-site peer 198.51.100.1 ike-group IKE
set vpn ipsec site-to-site peer 198.51.100.1 local-address 203.0.113.1
set vpn ipsec site-to-site peer 198.51.100.1 vti bind vti0
set vpn ipsec site-to-site peer 198.51.100.1 vti esp-group ESP

# VTI interface
set interfaces vti vti0 address 10.255.0.1/30
set interfaces vti vti0 mtu 1400

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF Over VTI (No Multicast)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF needs explicit neighbor (no multicast over IPsec)
set protocols ospf interface vti0 area 0
set protocols ospf interface vti0 network point-to-point
set protocols ospf neighbor 10.255.0.2

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;OSPF Over WireGuard&lt;/h2&gt;
&lt;h3&gt;WireGuard Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# WireGuard interface
set interfaces wireguard wg0 address 10.255.0.1/30
set interfaces wireguard wg0 port 51820
set interfaces wireguard wg0 private-key &amp;lt;your-private-key&amp;gt;

# Peer configuration
set interfaces wireguard wg0 peer PEER1 public-key &amp;lt;peer-public-key&amp;gt;
set interfaces wireguard wg0 peer PEER1 allowed-ips 0.0.0.0/0
set interfaces wireguard wg0 peer PEER1 endpoint 198.51.100.1:51820
set interfaces wireguard wg0 peer PEER1 persistent-keepalive 25

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF Over WireGuard&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF with explicit neighbor
set protocols ospf interface wg0 area 0
set protocols ospf interface wg0 network point-to-point
set protocols ospf neighbor 10.255.0.2

# BFD for faster failover
set protocols ospf interface wg0 bfd

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;BGP Over Tunnels&lt;/h2&gt;
&lt;p&gt;BGP is easier: it runs over unicast TCP, so it works over any tunnel type.&lt;/p&gt;
&lt;h3&gt;BGP Over WireGuard&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# WireGuard tunnel (as above)
# ...

# BGP over WireGuard
set protocols bgp system-as 65001
set protocols bgp neighbor 10.255.0.2 remote-as 65002
set protocols bgp neighbor 10.255.0.2 update-source wg0
set protocols bgp neighbor 10.255.0.2 address-family ipv4-unicast

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP Over IPsec VTI&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# IPsec VTI (as above)
# ...

# BGP over VTI
set protocols bgp system-as 65001
set protocols bgp neighbor 10.255.0.2 remote-as 65002
set protocols bgp neighbor 10.255.0.2 update-source vti0
set protocols bgp neighbor 10.255.0.2 address-family ipv4-unicast

# BFD for fast failover
set protocols bgp neighbor 10.255.0.2 bfd

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Multi-Tunnel Design&lt;/h2&gt;
&lt;h3&gt;Hub and Spoke with OSPF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;          [Spoke1]
              │ wg1
        ┌─────┴─────┐
        │           │
       [Hub]        │
        │           │
        └─────┬─────┘
              │ wg2
          [Spoke2]
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Hub configuration
configure

# WireGuard to Spoke1
set interfaces wireguard wg1 address 10.255.1.1/30
set interfaces wireguard wg1 peer SPOKE1 ...

# WireGuard to Spoke2
set interfaces wireguard wg2 address 10.255.2.1/30
set interfaces wireguard wg2 peer SPOKE2 ...

# OSPF on both tunnels
set protocols ospf interface wg1 area 0
set protocols ospf interface wg1 network point-to-point
set protocols ospf neighbor 10.255.1.2

set protocols ospf interface wg2 area 0
set protocols ospf interface wg2 network point-to-point
set protocols ospf neighbor 10.255.2.2

commit
&lt;/code&gt;&lt;/pre&gt;
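&lt;p&gt;Each spoke mirrors one leg of the hub. A sketch for Spoke1, with addresses taken from the diagram:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Spoke1 configuration
configure

# WireGuard back to the hub
set interfaces wireguard wg0 address 10.255.1.2/30
set interfaces wireguard wg0 peer HUB ...

# OSPF toward the hub
set protocols ospf interface wg0 area 0
set protocols ospf interface wg0 network point-to-point
set protocols ospf neighbor 10.255.1.1

commit
&lt;/code&gt;&lt;/pre&gt;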
&lt;h3&gt;Full Mesh with BGP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Each site peers with all others via BGP
# More configuration but better path selection

set protocols bgp neighbor 10.255.0.2 remote-as 65002
set protocols bgp neighbor 10.255.0.3 remote-as 65003
set protocols bgp neighbor 10.255.0.4 remote-as 65004
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Fast Failover&lt;/h2&gt;
&lt;h3&gt;BFD Over Tunnels&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BFD for fast tunnel failure detection
set protocols bfd peer 10.255.0.2 source address 10.255.0.1
set protocols bfd peer 10.255.0.2 interval transmit 300
set protocols bfd peer 10.255.0.2 interval receive 300
set protocols bfd peer 10.255.0.2 interval multiplier 3

# Link BFD to OSPF
set protocols ospf interface wg0 bfd

# Or link to BGP
set protocols bgp neighbor 10.255.0.2 bfd

commit
&lt;/code&gt;&lt;/pre&gt;
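&lt;p&gt;Once committed, check that the sessions actually came up:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# BFD session state (should show &quot;up&quot;)
show bfd peers

# A session stuck in &quot;down&quot; usually means the tunnel is broken
# or UDP 3784/3785 is filtered
&lt;/code&gt;&lt;/pre&gt;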
&lt;h3&gt;Tunnel Keepalives&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# WireGuard persistent keepalive
set interfaces wireguard wg0 peer PEER1 persistent-keepalive 25

# GRE keepalives
set interfaces tunnel tun0 parameters ip keepalive interval 10
set interfaces tunnel tun0 parameters ip keepalive failure-count 3
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Routing Protocol Not Forming&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check tunnel is up
show interfaces wireguard
ping 10.255.0.2

# Check routing protocol
show ip ospf neighbor
show bgp summary

# Check firewall allows protocol traffic
# OSPF: Protocol 89
# BGP: TCP 179
# BFD: UDP 3784/3785
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Routes Not Propagating&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check route advertisement
show ip ospf database
show bgp ipv4 unicast

# Verify network statements
show configuration commands | grep &quot;protocols ospf area&quot;
show configuration commands | grep &quot;protocols bgp&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Asymmetric Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Traffic goes out tunnel, returns via different path

# Ensure consistent costs
set protocols ospf interface wg0 cost 100
set protocols ospf interface wg1 cost 100

# Or use BGP with consistent metrics
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Use Point-to-Point Network Type&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# For tunnel interfaces
set protocols ospf interface wg0 network point-to-point

# Saves DR election overhead
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Enable BFD&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Fast failure detection
set protocols ospf interface wg0 bfd
# Or
set protocols bgp neighbor 10.255.0.2 bfd
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Set Appropriate Costs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Higher cost for slower/less reliable tunnels
set protocols ospf interface wg0 cost 100  # Fast tunnel
set protocols ospf interface wg1 cost 200  # Backup tunnel
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Consider MTU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Ensure routing protocol packets fit
set interfaces wireguard wg0 mtu 1420

# Or enable fragmentation on underlay
&lt;/code&gt;&lt;/pre&gt;
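&lt;p&gt;To confirm the MTU actually fits, use a don&apos;t-fragment ping across the tunnel. With a 1420-byte interface MTU, the largest ICMP payload is 1420 minus 28 bytes of IP and ICMP headers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1420 - 28 = 1392: largest DF ping that should succeed
ping -M do -s 1392 10.255.0.2

# One byte more should fail
ping -M do -s 1393 10.255.0.2
&lt;/code&gt;&lt;/pre&gt;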
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Routing over tunnels requires careful planning — but it&apos;s worth it for resilient networks.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Automatic failover on tunnel failure&lt;/li&gt;
&lt;li&gt;Dynamic path selection&lt;/li&gt;
&lt;li&gt;Consistent with non-tunnel routing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tunnel type affects protocol support&lt;/li&gt;
&lt;li&gt;MTU requires attention&lt;/li&gt;
&lt;li&gt;Convergence time adds to tunnel detection time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Key decisions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Protocol&lt;/strong&gt;: OSPF (simple) or BGP (more control)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Failover&lt;/strong&gt;: BFD required for fast detection&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Topology&lt;/strong&gt;: Hub-spoke vs full mesh&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tunnel type&lt;/strong&gt;: Affects multicast support&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Done right, routing over tunnels gives you a resilient VPN mesh that handles failures automatically. Done wrong, you get mysterious routing issues and slow failover.&lt;/p&gt;
&lt;p&gt;Plan it. Test it. Monitor it.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>ospf</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>VXLAN: Scalable L2 Over L3 Overlay</title><link>https://ashimov.com/posts/vyos-vxlan/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-vxlan/</guid><description>Configure VXLAN on VyOS for datacenter overlays. Covers VXLAN concepts, static and multicast modes, head-end replication, MTU, and why VXLAN enables scalable Layer 2 networks.</description><pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;VLANs scale to 4094. That&apos;s not enough for large datacenters with thousands of tenants. VLAN tags are local to Layer 2 domains. Extending VLANs across L3 boundaries requires complex tricks.&lt;/p&gt;
&lt;p&gt;VXLAN (Virtual Extensible LAN) encapsulates Ethernet frames in UDP. 24-bit VNI supports 16 million segments. Runs over any IP network. Decouples overlay from underlay.&lt;/p&gt;
&lt;p&gt;VXLAN enables scalable Layer 2 networks over any IP infrastructure.&lt;/p&gt;
&lt;h2&gt;VXLAN Concepts&lt;/h2&gt;
&lt;h3&gt;How VXLAN Works&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[Host A]──[VTEP1]═══ IP Network ═══[VTEP2]──[Host B]
           │                          │
    Encapsulate in UDP         Decapsulate
    (add VXLAN header)         (remove header)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VXLAN Header&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Outer Ethernet │ Outer IP │ Outer UDP │ VXLAN │ Inner Ethernet │ Inner IP │ Payload
               │          │  dst 4789 │ VNI   │                │          │
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Terms&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VNI&lt;/td&gt;
&lt;td&gt;VXLAN Network Identifier (24-bit, up to 16M)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VTEP&lt;/td&gt;
&lt;td&gt;VXLAN Tunnel Endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVE&lt;/td&gt;
&lt;td&gt;Network Virtualization Edge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BUM&lt;/td&gt;
&lt;td&gt;Broadcast, Unknown unicast, Multicast&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Basic VXLAN Configuration&lt;/h2&gt;
&lt;h3&gt;Static VXLAN (Point-to-Point)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create VXLAN interface
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan100 source-address 10.0.0.1
set interfaces vxlan vxlan100 remote 10.0.0.2
set interfaces vxlan vxlan100 port 4789

# Bridge VXLAN with local interface
set interfaces bridge br100 member interface vxlan100
set interfaces bridge br100 member interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Remote Side&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Mirror configuration, swap source/remote
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan100 source-address 10.0.0.2
set interfaces vxlan vxlan100 remote 10.0.0.1
set interfaces vxlan vxlan100 port 4789

set interfaces bridge br100 member interface vxlan100
set interfaces bridge br100 member interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Multicast VXLAN&lt;/h2&gt;
&lt;p&gt;For multi-point VXLAN using multicast for BUM traffic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# VXLAN with multicast group
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan100 source-address 10.0.0.1
set interfaces vxlan vxlan100 group 239.1.1.100
set interfaces vxlan vxlan100 port 4789

# Bridge configuration
set interfaces bridge br100 member interface vxlan100
set interfaces bridge br100 member interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Multicast Requirements&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Underlay must support multicast routing
# Enable PIM on underlay interfaces
set protocols pim interface eth0

# Or use static IGMP membership
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Head-End Replication (Unicast Mode)&lt;/h2&gt;
&lt;p&gt;No multicast required — VTEP replicates BUM to all remote VTEPs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# VXLAN with multiple remote VTEPs
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan100 source-address 10.0.0.1
set interfaces vxlan vxlan100 remote 10.0.0.2
set interfaces vxlan vxlan100 remote 10.0.0.3
set interfaces vxlan vxlan100 remote 10.0.0.4
set interfaces vxlan vxlan100 port 4789

# BUM traffic is replicated to all remotes

commit
&lt;/code&gt;&lt;/pre&gt;
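&lt;p&gt;On Linux, the head-end replication list is visible in the forwarding database: each remote VTEP appears as an all-zero MAC entry used for flooding, so you can confirm the flood list directly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# BUM flood destinations, one per configured remote
bridge fdb show dev vxlan100 | grep 00:00:00:00:00:00

# 00:00:00:00:00:00 dev vxlan100 dst 10.0.0.2 self permanent
# 00:00:00:00:00:00 dev vxlan100 dst 10.0.0.3 self permanent
# 00:00:00:00:00:00 dev vxlan100 dst 10.0.0.4 self permanent
&lt;/code&gt;&lt;/pre&gt;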
&lt;h3&gt;Scaling Consideration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Multicast: Efficient BUM delivery, requires multicast underlay
Unicast:   Simple, but BUM traffic multiplied by VTEP count

Small scale (few VTEPs): Unicast fine
Large scale: Multicast or EVPN control plane
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VXLAN with EVPN&lt;/h2&gt;
&lt;p&gt;Best practice for production: EVPN control plane handles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MAC learning (no data plane flooding)&lt;/li&gt;
&lt;li&gt;Remote VTEP discovery (no manual configuration)&lt;/li&gt;
&lt;li&gt;BUM optimization&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;configure

# VXLAN interface
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan100 source-address 10.0.0.1
set interfaces vxlan vxlan100 parameters nolearning

# nolearning: Disable data plane MAC learning (EVPN handles it)

# BGP EVPN configuration
set protocols bgp address-family l2vpn-evpn advertise-all-vni

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;MTU Considerations&lt;/h2&gt;
&lt;h3&gt;VXLAN Overhead&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Outer Ethernet:  14 bytes
Outer IP:        20 bytes
Outer UDP:        8 bytes
VXLAN header:     8 bytes
Total:           50 bytes

Standard 1500 MTU - 50 = 1450 inner MTU
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure MTU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Option 1: Increase underlay MTU
set interfaces ethernet eth0 mtu 1550

# Option 2: Reduce overlay MTU
set interfaces bridge br100 mtu 1450

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Jumbo Frames&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Better option: Use jumbo frames on underlay
set interfaces ethernet eth0 mtu 9000

# VXLAN inner MTU: 9000 - 50 = 8950
# Standard 1500 MTU traffic fits easily
&lt;/code&gt;&lt;/pre&gt;
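&lt;p&gt;Verify the jumbo underlay end to end with a don&apos;t-fragment ping. The largest ICMP payload that fits a 9000-byte MTU is 9000 minus 28 bytes of headers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 9000 - 28 = 8972: should succeed between VTEPs
ping -M do -s 8972 10.0.0.2
&lt;/code&gt;&lt;/pre&gt;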
&lt;h2&gt;VXLAN Gateway&lt;/h2&gt;
&lt;h3&gt;L2 Gateway (Bridging Only)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VXLAN bridges to local VLAN
set interfaces bridge br100 member interface vxlan100
set interfaces bridge br100 member interface eth1.100

# Local VLAN 100 traffic bridged to VXLAN 100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;L3 Gateway (Routing)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add IP to bridge for routing
set interfaces bridge br100 address 192.168.100.1/24

# VMs/hosts in VXLAN use this as gateway
# Router handles inter-VXLAN routing

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VXLAN Routing Between VNIs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Two VXLANs
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan200 vni 200

# Two bridges with IPs
set interfaces bridge br100 member interface vxlan100
set interfaces bridge br100 address 192.168.100.1/24

set interfaces bridge br200 member interface vxlan200
set interfaces bridge br200 address 192.168.200.1/24

# Router routes between 192.168.100.0/24 and 192.168.200.0/24

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing VXLAN State&lt;/h2&gt;
&lt;h3&gt;Check Interface&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show VXLAN interface
show interfaces vxlan

# Show VXLAN details
show interfaces vxlan vxlan100

# Show bridge MAC table
show bridge fdb interface br100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Forwarding Database&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# View learned MACs
bridge fdb show dev vxlan100

# Output:
# aa:bb:cc:dd:ee:ff dev vxlan100 dst 10.0.0.2 self permanent
# 11:22:33:44:55:66 dev vxlan100 master br100
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting VXLAN&lt;/h2&gt;
&lt;h3&gt;Tunnel Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check VXLAN interface is up
show interfaces vxlan vxlan100

# Verify underlay connectivity
ping 10.0.0.2  # Remote VTEP

# Check UDP port 4789 is not filtered
nc -vzu 10.0.0.2 4789

# Capture VXLAN traffic
sudo tcpdump -i eth0 udp port 4789
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;No MAC Learning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check bridge FDB
bridge fdb show dev vxlan100

# If empty, check:
# - ARP traffic flowing
# - VXLAN interface in bridge
# - nolearning not set (unless using EVPN)

# Generate traffic to trigger learning
arping -I br100 192.168.100.100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MTU Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptoms: Small packets work, large fail

# Test with large ping
ping -s 1400 192.168.100.100

# If fails, check MTU
ip link show vxlan100
ip link show br100

# Verify underlay MTU is sufficient
ping -s 1500 -M do 10.0.0.2
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Security Considerations&lt;/h2&gt;
&lt;h3&gt;VXLAN Has No Encryption&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Traffic visible to underlay network:
- Inner Ethernet frames
- All payload content

For sensitive data:
- Encrypt at application layer
- Use IPsec on underlay
- Consider alternative (WireGuard overlay)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Firewall VXLAN Traffic&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Allow VXLAN only from known VTEPs
set firewall ipv4 name UNDERLAY-IN rule 100 action accept
set firewall ipv4 name UNDERLAY-IN rule 100 protocol udp
set firewall ipv4 name UNDERLAY-IN rule 100 destination port 4789
set firewall ipv4 name UNDERLAY-IN rule 100 source group network-group VTEPS
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VXLAN Design Patterns&lt;/h2&gt;
&lt;h3&gt;Datacenter Fabric&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;         [Spine1]     [Spine2]
            │            │
    ────────┼────────────┼────────
    │       │       │    │       │
 [Leaf1] [Leaf2] [Leaf3] ...

Underlay: IP routing (OSPF/BGP) on spine-leaf
Overlay:  VXLAN between leaves
Control:  EVPN for MAC learning
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;DCI (Datacenter Interconnect)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[DC1]══════VXLAN══════[DC2]
       over WAN

Extended L2 between datacenters
Watch out for: Latency, BUM flooding, split-brain
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;VXLAN enables scalable Layer 2 networks over any IP infrastructure.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;VXLAN advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;16 million segments (vs 4K VLANs)&lt;/li&gt;
&lt;li&gt;Works over any IP network&lt;/li&gt;
&lt;li&gt;Decouples overlay from underlay&lt;/li&gt;
&lt;li&gt;Foundation for modern DC fabrics&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VXLAN considerations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;50-byte overhead (needs MTU planning)&lt;/li&gt;
&lt;li&gt;BUM handling (multicast, unicast, or EVPN)&lt;/li&gt;
&lt;li&gt;No encryption (plan accordingly)&lt;/li&gt;
&lt;li&gt;Control plane important at scale&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For small deployments: Static VXLAN with head-end replication works.
For scale: EVPN control plane is the answer.&lt;/p&gt;
&lt;p&gt;VXLAN is infrastructure. The real magic is in the control plane (EVPN) and how you design the overlay-underlay interaction.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>overlay</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>GRE, IPIP, and SIT Tunnels: Simple Point-to-Point Encapsulation</title><link>https://ashimov.com/posts/vyos-gre-tunnels/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-gre-tunnels/</guid><description>Configure GRE, IPIP, and SIT tunnels on VyOS. Covers tunnel types, MTU considerations, keepalives, GRE keys, and why simple tunnels solve simple problems.</description><pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;VPNs like IPsec and WireGuard provide encryption. But sometimes you don&apos;t need encryption — just encapsulation. Connect two private networks over public internet without the complexity of key management.&lt;/p&gt;
&lt;p&gt;GRE, IPIP, and SIT are simple tunneling protocols. They wrap packets inside other packets. No encryption, minimal overhead, easy to set up. Use them when encapsulation is enough and encryption is handled elsewhere (or not needed).&lt;/p&gt;
&lt;p&gt;Simple tunnels solve simple problems.&lt;/p&gt;
&lt;h2&gt;Tunnel Types&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Full Name&lt;/th&gt;
&lt;th&gt;Encapsulates&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GRE&lt;/td&gt;
&lt;td&gt;Generic Routing Encapsulation&lt;/td&gt;
&lt;td&gt;Any protocol&lt;/td&gt;
&lt;td&gt;24 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPIP&lt;/td&gt;
&lt;td&gt;IP-in-IP&lt;/td&gt;
&lt;td&gt;IPv4 only&lt;/td&gt;
&lt;td&gt;20 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SIT&lt;/td&gt;
&lt;td&gt;Simple Internet Transition&lt;/td&gt;
&lt;td&gt;IPv6 in IPv4&lt;/td&gt;
&lt;td&gt;20 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;When to Use Each&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;GRE:   Most flexible, multicast support, routing protocols
IPIP:  Minimal overhead, IPv4 only
SIT:   IPv6 tunneling over IPv4
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;GRE Tunnel Configuration&lt;/h2&gt;
&lt;h3&gt;Basic GRE Tunnel&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create GRE tunnel
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 10.255.0.1/30

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Both Ends Must Match&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Site A (203.0.113.1)
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 10.255.0.1/30

# Site B (198.51.100.1)
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 source-address 198.51.100.1
set interfaces tunnel tun0 remote 203.0.113.1
set interfaces tunnel tun0 address 10.255.0.2/30
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;GRE with Key&lt;/h3&gt;
&lt;p&gt;A GRE key identifies the tunnel, which is useful when running multiple tunnels to the same endpoint:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add GRE key
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 parameters ip key 12345
set interfaces tunnel tun0 address 10.255.0.1/30

commit

# Both ends must use same key
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;GRE Keepalives&lt;/h3&gt;
&lt;p&gt;Detect tunnel failure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable keepalives
set interfaces tunnel tun0 parameters ip keepalive interval 10
set interfaces tunnel tun0 parameters ip keepalive failure-count 3

# Tunnel goes down after ~30 seconds (3 failures x 10 s interval)

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;IPIP Tunnel Configuration&lt;/h2&gt;
&lt;p&gt;Minimal overhead for IPv4-only:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create IPIP tunnel
set interfaces tunnel tun0 encapsulation ipip
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 10.255.0.1/30

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;IPIP vs GRE&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;IPIP: 20 bytes overhead, IPv4 only, no multicast
GRE:  24 bytes overhead, any protocol, multicast support

Use IPIP when:
- Only IPv4 needed
- Minimal overhead matters
- No routing protocols over tunnel

Use GRE when:
- Need multicast (OSPF, etc.)
- Need IPv6 over tunnel
- Need GRE key for identification
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;SIT Tunnel Configuration&lt;/h2&gt;
&lt;p&gt;IPv6 over IPv4 tunneling:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create SIT tunnel (6in4)
set interfaces tunnel tun0 encapsulation sit
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 2001:db8::1/64

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;6in4 Tunnel Example&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Site A: IPv4 203.0.113.1, wants IPv6 2001:db8:a::/48
# Site B: IPv4 198.51.100.1, wants IPv6 2001:db8:b::/48
# Tunnel addresses: 2001:db8:ffff::1/126 and ::2

# Site A
set interfaces tunnel tun0 encapsulation sit
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 2001:db8:ffff::1/126

set protocols static route6 2001:db8:b::/48 interface tun0

# Site B
set interfaces tunnel tun0 encapsulation sit
set interfaces tunnel tun0 source-address 198.51.100.1
set interfaces tunnel tun0 remote 203.0.113.1
set interfaces tunnel tun0 address 2001:db8:ffff::2/126

set protocols static route6 2001:db8:a::/48 interface tun0
&lt;/code&gt;&lt;/pre&gt;
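&lt;p&gt;Once both ends commit, verify the tunnel before trusting routing over it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Ping the far tunnel endpoint over IPv6
ping 2001:db8:ffff::2

# Confirm the static IPv6 route points at the tunnel
show ipv6 route 2001:db8:b::/48
&lt;/code&gt;&lt;/pre&gt;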
&lt;h2&gt;MTU Considerations&lt;/h2&gt;
&lt;h3&gt;Calculate Tunnel MTU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Outer IP header:  20 bytes
GRE header:       4 bytes (+4 for key, +4 for sequence)
Inner packet:     MTU - overhead

Standard Ethernet (1500):
- GRE: 1500 - 24 = 1476 MTU
- IPIP: 1500 - 20 = 1480 MTU
- SIT: 1500 - 20 = 1480 MTU
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Set Tunnel MTU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Set MTU on tunnel interface
set interfaces tunnel tun0 mtu 1476

# Important: Prevents fragmentation issues

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MSS Clamping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Clamp TCP MSS for traffic over the tunnel (VyOS 1.4+ syntax;
# older releases used &quot;set firewall options interface tun0 adjust-mss&quot;)
set interfaces tunnel tun0 ip adjust-mss 1436

# MSS = tunnel MTU - 40 (IP + TCP headers): 1476 - 40 = 1436
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Routing Over Tunnels&lt;/h2&gt;
&lt;h3&gt;Static Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Route remote network via tunnel
set protocols static route 10.2.0.0/16 interface tun0

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Dynamic Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF over GRE (GRE supports multicast)
set protocols ospf interface tun0 area 0

# For IPIP (no multicast), use non-broadcast mode with explicit neighbors
set protocols ospf interface tun0 area 0
set protocols ospf interface tun0 network non-broadcast
set protocols ospf neighbor 10.255.0.2  # Explicit unicast neighbor

commit
&lt;/code&gt;&lt;/pre&gt;
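&lt;p&gt;Confirm the adjacency actually forms over the tunnel; a FULL neighbor state is the real test of multicast (or unicast-neighbor) reachability:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check OSPF neighbors over the tunnel
show ip ospf neighbor

# Verify routes learned via tun0
show ip route ospf
&lt;/code&gt;&lt;/pre&gt;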
&lt;h2&gt;GRE over IPsec&lt;/h2&gt;
&lt;p&gt;GRE for routing + IPsec for encryption:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# IPsec tunnel first
set vpn ipsec interface eth0
set vpn ipsec esp-group ESP-GRE proposal 1 encryption aes256
set vpn ipsec esp-group ESP-GRE proposal 1 hash sha256
set vpn ipsec ike-group IKE-GRE proposal 1 encryption aes256
set vpn ipsec ike-group IKE-GRE proposal 1 hash sha256

set vpn ipsec site-to-site peer 198.51.100.1 authentication mode pre-shared-secret
set vpn ipsec site-to-site peer 198.51.100.1 authentication pre-shared-secret &quot;secret&quot;
set vpn ipsec site-to-site peer 198.51.100.1 ike-group IKE-GRE
set vpn ipsec site-to-site peer 198.51.100.1 local-address 203.0.113.1
set vpn ipsec site-to-site peer 198.51.100.1 tunnel 1 esp-group ESP-GRE
set vpn ipsec site-to-site peer 198.51.100.1 tunnel 1 protocol gre

# GRE inside IPsec
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 source-address 203.0.113.1
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 address 10.255.0.1/30

commit
&lt;/code&gt;&lt;/pre&gt;
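&lt;p&gt;Verify both layers independently: IPsec first, then GRE inside it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# IPsec SA established?
show vpn ipsec sa

# GRE endpoint reachable through the encrypted path?
ping 10.255.0.2

# On the wire, only ESP should be visible (no cleartext GRE)
sudo tcpdump -i eth0 esp
&lt;/code&gt;&lt;/pre&gt;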
&lt;h2&gt;Troubleshooting Tunnels&lt;/h2&gt;
&lt;h3&gt;Tunnel Not Coming Up&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check interface status
show interfaces tunnel

# Verify source address is local
ip addr show | grep 203.0.113.1

# Check remote reachability (outer IP)
ping 198.51.100.1

# Check firewall allows GRE/IPIP
# GRE: Protocol 47
# IPIP: Protocol 4
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Traffic Not Flowing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check routing
show ip route

# Verify routes via tunnel
show ip route 10.2.0.0/16

# Test tunnel connectivity
ping 10.255.0.2  # Tunnel endpoint

# Check MTU
ping -s 1400 -M do 10.255.0.2  # Large packet with DF
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Capture Tunnel Traffic&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture on physical interface (encapsulated)
sudo tcpdump -i eth0 ip proto gre
sudo tcpdump -i eth0 ip proto 4  # IPIP

# Capture on tunnel interface (inner packets)
sudo tcpdump -i tun0
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Security Considerations&lt;/h2&gt;
&lt;h3&gt;GRE/IPIP Have No Encryption&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Traffic visible to anyone on path:
- Inner IP addresses
- Protocol information
- Payload content

For sensitive data:
- Use GRE over IPsec
- Use WireGuard/IPsec instead
- Encrypt at application layer
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Firewall GRE at Ingress&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Only allow GRE from known peer
set firewall ipv4 name WAN-IN rule 100 action accept
set firewall ipv4 name WAN-IN rule 100 protocol gre
set firewall ipv4 name WAN-IN rule 100 source address 198.51.100.1

set firewall ipv4 name WAN-IN rule 999 action drop
&lt;/code&gt;&lt;/pre&gt;
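&lt;p&gt;A named chain does nothing until it is attached. A sketch in VyOS 1.4 firewall syntax (paths may vary by release), jumping to WAN-IN for router-destined traffic arriving on eth0:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hook WAN-IN into the input path for eth0
set firewall ipv4 input filter rule 10 inbound-interface name eth0
set firewall ipv4 input filter rule 10 action jump
set firewall ipv4 input filter rule 10 jump-target WAN-IN
&lt;/code&gt;&lt;/pre&gt;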
&lt;h2&gt;Multiple Tunnels&lt;/h2&gt;
&lt;h3&gt;To Same Remote&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use GRE key to distinguish
set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun0 parameters ip key 1
set interfaces tunnel tun0 address 10.255.0.1/30

set interfaces tunnel tun1 encapsulation gre
set interfaces tunnel tun1 remote 198.51.100.1
set interfaces tunnel tun1 parameters ip key 2
set interfaces tunnel tun1 address 10.255.1.1/30
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;To Different Remotes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Different tunnel interfaces
set interfaces tunnel tun0 remote 198.51.100.1
set interfaces tunnel tun1 remote 198.51.100.2
set interfaces tunnel tun2 remote 198.51.100.3
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Simple tunnels solve simple problems.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Use GRE/IPIP/SIT when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Encapsulation is enough (no encryption needed)&lt;/li&gt;
&lt;li&gt;Running routing protocols over tunnel (GRE)&lt;/li&gt;
&lt;li&gt;IPv6 over IPv4 infrastructure (SIT)&lt;/li&gt;
&lt;li&gt;Minimal overhead matters (IPIP)&lt;/li&gt;
&lt;li&gt;Already have encryption elsewhere&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Don&apos;t use when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data is sensitive (use IPsec/WireGuard)&lt;/li&gt;
&lt;li&gt;Through untrusted networks without encryption&lt;/li&gt;
&lt;li&gt;Need advanced features (VPN, multi-homing)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These protocols are old but not obsolete. They&apos;re tools in the toolkit. Know when to use them and when to reach for something more capable.&lt;/p&gt;
&lt;p&gt;Simple problems deserve simple solutions.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>vpn</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>EVPN: Modern Control Plane for L2 and L3 Services</title><link>https://ashimov.com/posts/vyos-evpn/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-evpn/</guid><description>Understand EVPN architecture and concepts. Covers EVPN route types, MAC/IP learning via BGP, multi-homing, VXLAN integration, and why EVPN is the future of overlay networking.</description><pubDate>Fri, 06 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;VPLS floods unknown unicast. Every PE learns every MAC. Multi-homing is complicated. It works, but it&apos;s showing its age.&lt;/p&gt;
&lt;p&gt;EVPN (Ethernet VPN) fixes these problems. MAC addresses are distributed via BGP, not learned via data plane flooding. Multi-homing is first-class. Both L2 and L3 services use the same control plane. It works with MPLS or VXLAN underneath.&lt;/p&gt;
&lt;p&gt;EVPN is the future of overlay networking.&lt;/p&gt;
&lt;h2&gt;EVPN vs VPLS&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;VPLS&lt;/th&gt;
&lt;th&gt;EVPN&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MAC learning&lt;/td&gt;
&lt;td&gt;Data plane (flooding)&lt;/td&gt;
&lt;td&gt;Control plane (BGP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unknown unicast&lt;/td&gt;
&lt;td&gt;Flood to all PEs&lt;/td&gt;
&lt;td&gt;Only to destination PE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-homing&lt;/td&gt;
&lt;td&gt;Complex (MC-LAG)&lt;/td&gt;
&lt;td&gt;Native (active-active)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IP routing&lt;/td&gt;
&lt;td&gt;Separate (L3VPN)&lt;/td&gt;
&lt;td&gt;Integrated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Better&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;EVPN Concepts&lt;/h2&gt;
&lt;h3&gt;How EVPN Works&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Host A connects to PE1
2. PE1 learns MAC-A locally
3. PE1 advertises MAC-A via BGP EVPN
4. All PEs install MAC-A → PE1
5. Traffic to MAC-A goes directly to PE1
   (no flooding!)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;EVPN Route Types&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Ethernet Auto-Discovery&lt;/td&gt;
&lt;td&gt;Multi-homing, aliasing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;MAC/IP Advertisement&lt;/td&gt;
&lt;td&gt;MAC and IP bindings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Inclusive Multicast&lt;/td&gt;
&lt;td&gt;BUM traffic handling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Ethernet Segment&lt;/td&gt;
&lt;td&gt;Multi-homing ESI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;IP Prefix&lt;/td&gt;
&lt;td&gt;L3 routing (EVPN Type-5)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Type 2: MAC/IP Route&lt;/h3&gt;
&lt;p&gt;The most common route type:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Route Distinguisher: 10.255.0.1:100
MAC Address: aa:bb:cc:dd:ee:ff
IP Address: 192.168.1.10 (optional)
Label: 1001
Next-hop: 10.255.0.1

&quot;MAC aa:bb:cc:dd:ee:ff (with IP 192.168.1.10) is behind PE 10.255.0.1&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Type 5: IP Prefix Route&lt;/h3&gt;
&lt;p&gt;For L3 routing in EVPN:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Route Distinguisher: 10.255.0.1:100
IP Prefix: 10.0.0.0/24
Gateway IP: 0.0.0.0 (optional)
Label: 2001

&quot;Route to 10.0.0.0/24 is behind PE 10.255.0.1&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;EVPN with VXLAN&lt;/h2&gt;
&lt;p&gt;Most common deployment: EVPN control plane + VXLAN data plane&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Host A] ─ [Leaf1/VTEP] ═══ VXLAN ═══ [Leaf2/VTEP] ─ [Host B]
              │     IP underlay      │
           EVPN learns MAC/IP     EVPN learns MAC/IP
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VyOS VXLAN with EVPN&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# VXLAN interface
set interfaces vxlan vxlan100 vni 100
set interfaces vxlan vxlan100 source-address 10.255.0.1
set interfaces vxlan vxlan100 parameters nolearning

# nolearning: Don&apos;t use data plane learning (EVPN handles it)

# Bridge
set interfaces bridge br100 member interface vxlan100
set interfaces bridge br100 member interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP EVPN Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BGP with EVPN address family
set protocols bgp system-as 65000
set protocols bgp parameters router-id 10.255.0.1

# EVPN neighbor
set protocols bgp neighbor 10.255.0.2 remote-as 65000
set protocols bgp neighbor 10.255.0.2 address-family l2vpn-evpn

# Advertise all VNIs
set protocols bgp address-family l2vpn-evpn advertise-all-vni

commit
&lt;/code&gt;&lt;/pre&gt;
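&lt;p&gt;After commit, confirm the session is up and the EVPN address family was actually negotiated:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Session established with the l2vpn-evpn capability?
show bgp l2vpn evpn summary

# Local VNIs known to EVPN
show evpn vni
&lt;/code&gt;&lt;/pre&gt;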
&lt;h2&gt;EVPN Multi-Homing&lt;/h2&gt;
&lt;h3&gt;Ethernet Segment (ES)&lt;/h3&gt;
&lt;p&gt;Multiple PEs connected to same host/switch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                 [PE1] ─┐
[Server/Switch] ═══════┼═══ EVPN
                 [PE2] ─┘

ES (Ethernet Segment) = The dual-homed connection
ESI (ES Identifier) = Unique ID for the ES
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Active-Active Multi-Homing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define Ethernet Segment (both PEs)
set interfaces bonding bond0 member interface eth1
set interfaces bonding bond0 evpn ethernet-segment identifier 00:11:22:33:44:55:66:77:88:99

# ESI is a 10-octet value and must match on both PEs
# Both PEs actively forward
# EVPN handles aliasing and split horizon

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benefits&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Active-active forwarding (both links used)&lt;/li&gt;
&lt;li&gt;Fast failover (no waiting for MAC learning)&lt;/li&gt;
&lt;li&gt;Loop prevention (DF election)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;EVPN Integrated Routing and Bridging (IRB)&lt;/h2&gt;
&lt;p&gt;Same EVPN instance provides L2 and L3:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Bridge for L2
set interfaces bridge br100 member interface vxlan100

# IRB interface for L3
set interfaces bridge br100 address 192.168.100.1/24

# Host in VXLAN 100 can:
# - L2 switch to other hosts in same VXLAN
# - L3 route to other networks via 192.168.100.1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Symmetric vs Asymmetric IRB&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Asymmetric:
- Routing at ingress only
- Frame sent as L2 to egress PE
- Simpler but requires all VNIs everywhere

Symmetric:
- Routing at ingress and egress
- Uses L3 VNI for inter-subnet
- Better for large scale
&lt;/code&gt;&lt;/pre&gt;
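&lt;p&gt;Symmetric IRB needs a dedicated L3 VNI bound to the tenant VRF. A sketch, assuming VRF TENANT-A and L3 VNI 10100 (both names are placeholders; exact syntax varies by VyOS version):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# L3 VNI for inter-subnet routing within the tenant
set vrf name TENANT-A table 100
set vrf name TENANT-A vni 10100

# Place the IRB bridge interface into the VRF
set interfaces bridge br100 vrf TENANT-A

commit
&lt;/code&gt;&lt;/pre&gt;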
&lt;h2&gt;EVPN Route Targets&lt;/h2&gt;
&lt;p&gt;Similar to L3VPN:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# EVPN RT configuration
set vrf name TENANT-A protocols bgp address-family l2vpn-evpn rd 10.255.0.1:100
set vrf name TENANT-A protocols bgp address-family l2vpn-evpn route-target export 65000:100
set vrf name TENANT-A protocols bgp address-family l2vpn-evpn route-target import 65000:100

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing EVPN State&lt;/h2&gt;
&lt;h3&gt;Show EVPN Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all EVPN routes
show bgp l2vpn evpn

# Show specific route types
show bgp l2vpn evpn route type macip
show bgp l2vpn evpn route type multicast
show bgp l2vpn evpn route type prefix

# Show VNI information
show evpn vni
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show MAC Table&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# EVPN-learned MACs
show evpn mac vni 100

# Output:
# VNI    MAC                Type    Interface / Remote VTEP
# 100    aa:bb:cc:dd:ee:ff  local   eth1
# 100    11:22:33:44:55:66  remote  10.255.0.2
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;EVPN Design Patterns&lt;/h2&gt;
&lt;h3&gt;Leaf-Spine with EVPN&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;           [Spine1]   [Spine2]
              │ ╲     ╱ │
              │  ╲   ╱  │
              │   ╲ ╱   │
              │   ╱ ╲   │
              │  ╱   ╲  │
           [Leaf1]   [Leaf2]
              │         │
           [Host A] [Host B]

Underlay: eBGP or an IGP provides VTEP reachability
Overlay:  EVPN leaf-to-leaf via spines (iBGP with spines as
          route reflectors, or eBGP sessions to each spine)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;EVPN-VXLAN Fabric&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Leaf configuration
set protocols bgp neighbor &amp;lt;spine1&amp;gt; remote-as &amp;lt;spine-as&amp;gt;
set protocols bgp neighbor &amp;lt;spine1&amp;gt; address-family l2vpn-evpn

# Spines reflect EVPN routes
# VTEPs on leafs

# Underlay provides IP connectivity
# EVPN provides MAC/IP learning
# VXLAN provides encapsulation
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting EVPN&lt;/h2&gt;
&lt;h3&gt;No EVPN Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check BGP session
show bgp l2vpn evpn summary

# Verify EVPN address family negotiated
show bgp neighbor &amp;lt;ip&amp;gt; | grep -i evpn

# Check local VNI
show evpn vni
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MAC Not Advertised&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check local MAC learning
show bridge fdb

# Check EVPN advertisement
show bgp l2vpn evpn route type macip

# Verify VNI-to-VXLAN mapping
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Traffic Not Flowing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify VXLAN tunnel
ping &amp;lt;remote-vtep&amp;gt;

# Check encapsulation
sudo tcpdump -i eth0 udp port 4789

# Verify MAC in remote VNI
show evpn mac vni 100 mac &amp;lt;mac-address&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VyOS EVPN Status&lt;/h2&gt;
&lt;p&gt;VyOS EVPN support is evolving:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check current version capabilities
show version

# VyOS 1.4+ has improved EVPN support via FRRouting
# Test features in lab before production
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Migration from VPLS to EVPN&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Phase 1: Deploy EVPN parallel to VPLS
Phase 2: Migrate traffic gradually
Phase 3: Decommission VPLS

Key difference:
- VPLS: Flood and learn
- EVPN: Advertise and install

Can run both during migration
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;EVPN is the future of overlay networking.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;EVPN advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Control plane MAC learning (no flooding)&lt;/li&gt;
&lt;li&gt;Native multi-homing support&lt;/li&gt;
&lt;li&gt;Integrated L2 and L3&lt;/li&gt;
&lt;li&gt;Works with MPLS or VXLAN&lt;/li&gt;
&lt;li&gt;Scales better than VPLS&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When to use EVPN:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data center fabrics&lt;/li&gt;
&lt;li&gt;Multi-tenant environments&lt;/li&gt;
&lt;li&gt;Stretched L2 domains&lt;/li&gt;
&lt;li&gt;Any new overlay deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VyOS EVPN support depends on version. For production, verify specific features work. For learning and smaller deployments, VyOS provides a good platform to understand EVPN concepts.&lt;/p&gt;
&lt;p&gt;The concepts here apply regardless of platform. EVPN is the direction the industry is moving — understanding it is essential for modern network engineering.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>networking</category><category>overlay</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>VPLS: Layer 2 VPN Over MPLS</title><link>https://ashimov.com/posts/vyos-vpls/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-vpls/</guid><description>Understand VPLS concepts and configuration. Covers virtual switch model, BGP signaling, pseudowires, MAC learning, and why VPLS provides multipoint L2 connectivity.</description><pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;L3VPN provides routed connectivity — IP packets forwarded between sites. But sometimes you need Layer 2 — Ethernet frames bridged as if sites were on the same switch. Same broadcast domain, same VLAN, same ARP visibility.&lt;/p&gt;
&lt;p&gt;VPLS (Virtual Private LAN Service) provides exactly this. Multiple sites connected via MPLS backbone, appearing as a single Ethernet switch. Frames are encapsulated, labeled, and forwarded. MAC addresses are learned. Broadcast is flooded.&lt;/p&gt;
&lt;p&gt;VPLS provides any-to-any, multipoint Layer 2 connectivity over an MPLS backbone.&lt;/p&gt;
&lt;h2&gt;VPLS Concepts&lt;/h2&gt;
&lt;h3&gt;What VPLS Does&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Without VPLS:
[Site A] ─── separate switches ─── [Site B]
          │                      │
Different Layer 2 domains

With VPLS:
[Site A] ══════ VPLS ══════ [Site B]
          │                │
Same Layer 2 domain (virtual switch)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Components&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;PE (Provider Edge): Participates in VPLS, bridges customer frames
P (Provider): MPLS core, just label switching
CE (Customer Edge): Regular switch/router, no VPLS awareness
VSI (Virtual Switch Instance): Virtual switch on PE

[CE1] ─ [PE1] ═══════════════════ [PE2] ─ [CE2]
         │        MPLS            │
        VSI ←─ pseudowires ─→ VSI
         │     (label-switched)   │
        [PE3]
         │
        [CE3]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VPLS vs L3VPN&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;VPLS (L2VPN)&lt;/th&gt;
&lt;th&gt;L3VPN&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Forwarding&lt;/td&gt;
&lt;td&gt;Bridge (MAC)&lt;/td&gt;
&lt;td&gt;Route (IP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer sees&lt;/td&gt;
&lt;td&gt;Same switch&lt;/td&gt;
&lt;td&gt;Router hop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;Ethernet&lt;/td&gt;
&lt;td&gt;IP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broadcast&lt;/td&gt;
&lt;td&gt;Flooded&lt;/td&gt;
&lt;td&gt;Terminated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MAC learning&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;VPLS Signaling&lt;/h2&gt;
&lt;h3&gt;BGP-Based VPLS (RFC 4761)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Control plane: BGP
# PE routers exchange VPLS membership via BGP
# Auto-discovery of other PEs in same VPLS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;LDP-Based VPLS (RFC 4762)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Control plane: Targeted LDP
# Pseudowires signaled via LDP
# Requires explicit configuration of remote PEs
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VPLS on VyOS&lt;/h2&gt;
&lt;p&gt;VyOS VPLS support depends on version. Basic pseudowire configuration:&lt;/p&gt;
&lt;h3&gt;Pseudowire Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# pseudo-ethernet creates a MACVLAN-style sub-interface, not an
# MPLS pseudowire; it is useful only as an attachment circuit
set interfaces pseudo-ethernet peth0 link eth1
set interfaces pseudo-ethernet peth0 mode private

# Note: VyOS does not ship a complete VPLS implementation;
# for L2 extension over IP, use the L2TPv3 alternative

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;L2TPv3 for L2VPN (Alternative)&lt;/h3&gt;
&lt;p&gt;VyOS supports L2TPv3 which can provide similar L2 connectivity:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# L2TPv3 tunnel
set interfaces l2tpv3 l2tpeth0 source-address 10.0.0.1
set interfaces l2tpv3 l2tpeth0 remote 10.0.0.2
set interfaces l2tpv3 l2tpeth0 tunnel-id 100
set interfaces l2tpv3 l2tpeth0 peer-tunnel-id 100
set interfaces l2tpv3 l2tpeth0 session-id 100
set interfaces l2tpv3 l2tpeth0 peer-session-id 100
set interfaces l2tpv3 l2tpeth0 encapsulation udp
set interfaces l2tpv3 l2tpeth0 source-port 5000
set interfaces l2tpv3 l2tpeth0 destination-port 5000

# Bridge with local interface
set interfaces bridge br0 member interface l2tpeth0
set interfaces bridge br0 member interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
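&lt;p&gt;Verify the L2 path end to end: tunnel interface up, bridge forwarding, and MACs from the remote site appearing locally.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Tunnel interface and bridge status
show interfaces l2tpv3
show bridge

# MACs learned from the remote site should appear in the FDB
show bridge fdb
&lt;/code&gt;&lt;/pre&gt;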
&lt;h2&gt;VPLS Design Considerations&lt;/h2&gt;
&lt;h3&gt;Full Mesh Pseudowires&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;N PEs requires N×(N-1)/2 pseudowires

3 PEs: 3 pseudowires
5 PEs: 10 pseudowires
10 PEs: 45 pseudowires
20 PEs: 190 pseudowires

Scales poorly!
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Hierarchical VPLS (H-VPLS)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Solution: Hub-spoke at access layer

[CE]─[MTU]────[PE-aggregation]═══full-mesh═══[PE-aggregation]────[MTU]─[CE]

MTU = Multi-Tenant Unit (spoke)
PE = Hub, full mesh only between PEs
Reduces pseudowire count
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MAC Address Learning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Frame arrives at PE1 from CE1:
- PE1 learns: MAC-A is behind local interface
- PE1 floods to PE2, PE3 (VPLS peers)
- PE2 learns: MAC-A is behind PE1 (pseudowire)

Frame from CE2 to MAC-A:
- PE2 knows MAC-A → pseudowire to PE1
- PE1 knows MAC-A → local interface
- Frame delivered
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Unknown Unicast Flooding&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Unknown destination MAC:
- PE floods to all pseudowires
- Like a regular switch with unknown MAC
- All PEs receive, only one delivers

Broadcast/Multicast:
- Always flooded to all pseudowires
- Bandwidth consideration!
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VPLS Challenges&lt;/h2&gt;
&lt;h3&gt;Split Horizon&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Frame received from pseudowire:
- Don&apos;t send back to pseudowires
- Only send to local interfaces

Prevents loops in full-mesh VPLS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MAC Table Scaling&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Every PE learns MACs from all sites:
- 10 sites × 1000 MACs = 10,000 MACs per PE
- Large deployments can exhaust MAC table

Solution:
- MAC address limits per VPLS
- MAC aging timers
- H-VPLS to contain scope
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Spanning Tree&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Customer running STP across VPLS:
- VPLS is loop-free (split horizon)
- STP BPDUs still traverse
- Can cause suboptimal paths

Options:
- Disable STP (VPLS handles loops)
- Tunnel STP (let customer manage)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;When to Use VPLS&lt;/h2&gt;
&lt;h3&gt;Good Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Extending VLANs across WAN&lt;/li&gt;
&lt;li&gt;Data center interconnect (legacy)&lt;/li&gt;
&lt;li&gt;Customers requiring L2 adjacency&lt;/li&gt;
&lt;li&gt;Migration scenarios&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Not Ideal For&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;New greenfield deployments (use EVPN)&lt;/li&gt;
&lt;li&gt;Very large scale (MAC table limits)&lt;/li&gt;
&lt;li&gt;Multi-homing requirements (EVPN better)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Modern Alternative: EVPN&lt;/h2&gt;
&lt;p&gt;EVPN provides similar L2 connectivity with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Better MAC learning (BGP-based)&lt;/li&gt;
&lt;li&gt;Active-active multi-homing&lt;/li&gt;
&lt;li&gt;Better scalability&lt;/li&gt;
&lt;li&gt;Integrated L2 and L3&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;VPLS → EVPN migration is common trend
New deployments should consider EVPN first
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting VPLS&lt;/h2&gt;
&lt;h3&gt;Pseudowire Not Up&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check MPLS connectivity
show mpls ldp neighbor
ping &amp;lt;remote-PE-loopback&amp;gt;

# Check pseudowire status
# (command depends on implementation)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MAC Not Learned&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check MAC table
show bridge fdb

# Verify VLAN tagging matches
show interfaces ethernet eth1

# Capture traffic
sudo tcpdump -i eth1 ether host &amp;lt;mac-address&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Flooding Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Monitor pseudowire traffic
# Excessive flooding = possible MAC learning issue

# Check split horizon
# Packets from pseudowire shouldn&apos;t go back to pseudowires
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;VPLS provides multipoint Layer 2 connectivity over MPLS.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;VPLS enables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multiple sites on same Layer 2&lt;/li&gt;
&lt;li&gt;Transparent LAN extension&lt;/li&gt;
&lt;li&gt;Bridge instead of route&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Considerations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pseudowire full-mesh scales poorly&lt;/li&gt;
&lt;li&gt;MAC learning at every PE&lt;/li&gt;
&lt;li&gt;Broadcast/unknown flooded everywhere&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VyOS VPLS support is limited. For L2 extension:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Check current VyOS version features&lt;/li&gt;
&lt;li&gt;Consider L2TPv3 as alternative&lt;/li&gt;
&lt;li&gt;Evaluate EVPN for new deployments&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VPLS served the industry well, but EVPN is the modern evolution. Understand VPLS concepts — they transfer to EVPN — but default to EVPN for new projects.&lt;/p&gt;
</content:encoded><category>vyos</category><category>mpls</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>L3VPN: MPLS VPN for Multi-Site Connectivity</title><link>https://ashimov.com/posts/vyos-l3vpn/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-l3vpn/</guid><description>Configure MPLS L3VPN on VyOS. Covers VPNv4 address family, route distinguishers, route targets, PE-CE routing, and why L3VPN provides scalable multi-tenant connectivity.</description><pubDate>Fri, 30 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Each customer needs their own routing table. Their addresses might overlap with other customers. They need to reach their own sites but not others. Managing separate physical infrastructure doesn&apos;t scale.&lt;/p&gt;
&lt;p&gt;MPLS L3VPN (Layer 3 VPN) solves this. Each customer gets a Virtual Routing and Forwarding (VRF) instance. Customer routes are distinguished by Route Distinguisher. Route Targets control which VRFs import which routes. The MPLS backbone carries traffic with label stacks identifying VPN and destination.&lt;/p&gt;
&lt;p&gt;L3VPN provides scalable multi-tenant connectivity over shared infrastructure.&lt;/p&gt;
&lt;h2&gt;L3VPN Concepts&lt;/h2&gt;
&lt;h3&gt;Key Components&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;PE (Provider Edge): Has VRFs, connects to customers
P (Provider): Core router, just MPLS forwarding
CE (Customer Edge): Customer router, no VPN awareness

[CE1] ─── [PE1] ═══ MPLS ═══ [PE2] ─── [CE2]
  Site A           Backbone           Site B
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VRF (Virtual Routing and Forwarding)&lt;/h3&gt;
&lt;p&gt;Separate routing table per customer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;VRF CustomerA: 10.0.0.0/8 → Site A
VRF CustomerB: 10.0.0.0/8 → Site B  (same addresses, different VRF)
Global: Provider infrastructure only
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Route Distinguisher (RD)&lt;/h3&gt;
&lt;p&gt;Makes routes unique in BGP (not for forwarding):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Without RD: 10.0.0.0/8 from Customer A
            10.0.0.0/8 from Customer B  ← Collision!

With RD: 65000:1:10.0.0.0/8 from Customer A
         65000:2:10.0.0.0/8 from Customer B  ← Unique
&lt;/code&gt;&lt;/pre&gt;
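&lt;p&gt;The collision avoidance can be pictured as a table keyed on (RD, prefix). A minimal Python sketch with illustrative values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# BGP keys VPNv4 routes on (RD, prefix), so identical customer
# prefixes coexist as distinct entries.
vpnv4 = {}
vpnv4[('65000:1', '10.0.0.0/8')] = 'Customer A via PE1'
vpnv4[('65000:2', '10.0.0.0/8')] = 'Customer B via PE1'

print(len(vpnv4))  # two entries, no collision
&lt;/code&gt;&lt;/pre&gt;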
&lt;h3&gt;Route Target (RT)&lt;/h3&gt;
&lt;p&gt;Controls route import/export between VRFs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Export RT: &quot;Tag this route for customers who want it&quot;
Import RT: &quot;Import routes with this tag&quot;

CustomerA-VRF exports: 65000:100
CustomerA-VRF imports: 65000:100  (import own routes at other sites)
&lt;/code&gt;&lt;/pre&gt;
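&lt;p&gt;Import filtering reduces to a set-intersection test on Route Targets. A minimal Python sketch of the matching logic — the prefixes and RT values are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;routes = [
    {'prefix': '10.1.0.0/16', 'rts': {'65000:100'}},       # Customer A
    {'prefix': '172.16.0.0/16', 'rts': {'65000:200'}},     # Customer B
    {'prefix': '192.168.100.0/24', 'rts': {'65000:999'}},  # shared services
]

def vrf_import(import_rts, all_routes):
    '''A VRF imports routes whose RTs intersect its import list.'''
    return [r['prefix'] for r in all_routes if r['rts'].intersection(import_rts)]

# Customer A imports its own RT plus the shared-services RT:
print(vrf_import({'65000:100', '65000:999'}, routes))
&lt;/code&gt;&lt;/pre&gt;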
&lt;h2&gt;Basic L3VPN Configuration&lt;/h2&gt;
&lt;h3&gt;PE Router Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create VRF for customer
set vrf name CUSTOMER-A table 100
set vrf name CUSTOMER-A description &quot;Customer A VPN&quot;

# Route Distinguisher (unique per VRF)
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast rd 65000:1

# Route Targets
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target export 65000:100
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target import 65000:100

# Assign interface to VRF
set interfaces ethernet eth1 vrf CUSTOMER-A
set interfaces ethernet eth1 address 192.168.1.1/24

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Enable VPNv4 Address Family&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BGP configuration
set protocols bgp system-as 65000
set protocols bgp parameters router-id 10.255.0.1

# VPNv4 address family with PE peers
set protocols bgp neighbor 10.255.0.2 remote-as 65000
set protocols bgp neighbor 10.255.0.2 update-source lo
set protocols bgp neighbor 10.255.0.2 address-family ipv4-vpn

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Redistribute Customer Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# In the VRF context
set vrf name CUSTOMER-A protocols bgp system-as 65000
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast redistribute connected

# Or with static routes from CE
set vrf name CUSTOMER-A protocols static route 10.1.0.0/16 next-hop 192.168.1.2

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;PE-CE Routing&lt;/h2&gt;
&lt;h3&gt;Static Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Static routes from CE
set vrf name CUSTOMER-A protocols static route 10.1.0.0/16 next-hop 192.168.1.2
set vrf name CUSTOMER-A protocols static route 10.2.0.0/16 next-hop 192.168.1.2

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;eBGP PE-CE&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BGP session with CE router
set vrf name CUSTOMER-A protocols bgp neighbor 192.168.1.2 remote-as 65100
set vrf name CUSTOMER-A protocols bgp neighbor 192.168.1.2 address-family ipv4-unicast

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF PE-CE&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF with CE
set vrf name CUSTOMER-A protocols ospf interface eth1
set vrf name CUSTOMER-A protocols ospf redistribute bgp

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Complete L3VPN Example&lt;/h2&gt;
&lt;h3&gt;Topology&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Customer A Site 1          Provider Backbone          Customer A Site 2
[CE1] ─── [PE1] ═══════════════════════════════ [PE2] ─── [CE2]
10.1.0.0/16    VRF:CUST-A                         VRF:CUST-A    10.2.0.0/16
               RD 65000:1                         RD 65000:1
               RT 65000:100                       RT 65000:100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;PE1 Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Loopback for BGP
set interfaces loopback lo address 10.255.0.1/32

# VRF
set vrf name CUST-A table 100
set vrf name CUST-A protocols bgp address-family ipv4-unicast rd 65000:1
set vrf name CUST-A protocols bgp address-family ipv4-unicast route-target export 65000:100
set vrf name CUST-A protocols bgp address-family ipv4-unicast route-target import 65000:100

# Customer interface
set interfaces ethernet eth1 vrf CUST-A
set interfaces ethernet eth1 address 192.168.1.1/24

# Core interface
set interfaces ethernet eth0 address 10.0.0.1/30

# MPLS LDP
set protocols mpls ldp router-id 10.255.0.1
set protocols mpls ldp interface eth0

# OSPF for backbone
set protocols ospf area 0 network 10.255.0.1/32
set protocols ospf area 0 network 10.0.0.0/30

# BGP
set protocols bgp system-as 65000
set protocols bgp parameters router-id 10.255.0.1
set protocols bgp neighbor 10.255.0.2 remote-as 65000
set protocols bgp neighbor 10.255.0.2 update-source lo
set protocols bgp neighbor 10.255.0.2 address-family ipv4-vpn

# VRF BGP
set vrf name CUST-A protocols bgp system-as 65000
set vrf name CUST-A protocols bgp neighbor 192.168.1.2 remote-as 65100
set vrf name CUST-A protocols bgp neighbor 192.168.1.2 address-family ipv4-unicast

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;PE2 Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Loopback
set interfaces loopback lo address 10.255.0.2/32

# VRF (same RT for same customer)
set vrf name CUST-A table 100
set vrf name CUST-A protocols bgp address-family ipv4-unicast rd 65000:1
set vrf name CUST-A protocols bgp address-family ipv4-unicast route-target export 65000:100
set vrf name CUST-A protocols bgp address-family ipv4-unicast route-target import 65000:100

# Customer interface
set interfaces ethernet eth1 vrf CUST-A
set interfaces ethernet eth1 address 192.168.2.1/24

# Core interface
set interfaces ethernet eth0 address 10.0.0.2/30

# MPLS LDP
set protocols mpls ldp router-id 10.255.0.2
set protocols mpls ldp interface eth0

# OSPF
set protocols ospf area 0 network 10.255.0.2/32
set protocols ospf area 0 network 10.0.0.0/30

# BGP
set protocols bgp system-as 65000
set protocols bgp parameters router-id 10.255.0.2
set protocols bgp neighbor 10.255.0.1 remote-as 65000
set protocols bgp neighbor 10.255.0.1 update-source lo
set protocols bgp neighbor 10.255.0.1 address-family ipv4-vpn

# VRF BGP
set vrf name CUST-A protocols bgp system-as 65000
set vrf name CUST-A protocols bgp neighbor 192.168.2.2 remote-as 65100
set vrf name CUST-A protocols bgp neighbor 192.168.2.2 address-family ipv4-unicast

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing L3VPN State&lt;/h2&gt;
&lt;h3&gt;Show VPN Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show VPNv4 routes
show bgp ipv4 vpn

# Output:
# Route Distinguisher: 65000:1
# *&amp;gt;  10.1.0.0/16      192.168.1.2     0   0    65100 i
# *&amp;gt;  10.2.0.0/16      10.255.0.2      0   0    65100 i

# Show specific VRF routes
show ip route vrf CUST-A
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Labels&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show VPN labels
show bgp ipv4 vpn labels

# Show MPLS forwarding table
show mpls table
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify Connectivity&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Ping within VRF
ping 10.2.0.1 vrf CUST-A

# Traceroute within VRF
traceroute 10.2.0.1 vrf CUST-A
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Route Target Patterns&lt;/h2&gt;
&lt;h3&gt;Hub-and-Spoke&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Hub imports all, spokes import only from hub
# Hub VRF:
set vrf name HUB protocols bgp address-family ipv4-unicast route-target export 65000:1
set vrf name HUB protocols bgp address-family ipv4-unicast route-target import 65000:2

# Spoke VRF:
set vrf name SPOKE1 protocols bgp address-family ipv4-unicast route-target export 65000:2
set vrf name SPOKE1 protocols bgp address-family ipv4-unicast route-target import 65000:1

# Traffic: Spoke → Hub → Spoke (forced through hub)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Full Mesh&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# All sites import from all sites
set vrf name SITE protocols bgp address-family ipv4-unicast route-target export 65000:100
set vrf name SITE protocols bgp address-family ipv4-unicast route-target import 65000:100

# Any-to-any connectivity
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Extranet&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Customer A can reach shared services
# Customer A VRF (import the shared-services RT):
set vrf name CUST-A protocols bgp address-family ipv4-unicast route-target import 65000:999

# Shared Services VRF:
set vrf name SHARED protocols bgp address-family ipv4-unicast route-target export 65000:999
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting L3VPN&lt;/h2&gt;
&lt;h3&gt;Routes Not Exchanged&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check VPNv4 session
show bgp ipv4 vpn summary

# Check RT configuration
show vrf CUST-A

# Verify import/export RT match between sites
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MPLS Labels Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check MPLS is enabled on core interfaces
show interfaces ethernet eth0

# Check LDP neighbor
show mpls ldp neighbor

# Check MPLS table
show mpls table
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Traffic Not Flowing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify VRF routing table
show ip route vrf CUST-A

# Check label stack
show bgp ipv4 vpn 10.2.0.0/16

# Trace path
traceroute 10.2.0.1 vrf CUST-A
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;L3VPN provides scalable multi-tenant connectivity over shared infrastructure.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without L3VPN:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Separate physical networks per customer&lt;/li&gt;
&lt;li&gt;Address overlap impossible&lt;/li&gt;
&lt;li&gt;Doesn&apos;t scale&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With L3VPN:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single MPLS backbone serves all customers&lt;/li&gt;
&lt;li&gt;VRFs provide isolation&lt;/li&gt;
&lt;li&gt;Overlapping addresses supported (different RDs)&lt;/li&gt;
&lt;li&gt;Route Targets control connectivity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Key concepts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;VRF&lt;/strong&gt;: Separate routing table per customer&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RD&lt;/strong&gt;: Makes routes globally unique&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RT&lt;/strong&gt;: Controls import/export between VRFs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VPNv4&lt;/strong&gt;: BGP carrying VPN routes with labels&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VyOS L3VPN support requires MPLS. Verify feature support in your version before production deployment.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>mpls</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>BGP-LU: Labeled Unicast for Scalable MPLS Networks</title><link>https://ashimov.com/posts/vyos-bgp-lu/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-bgp-lu/</guid><description>Configure BGP Labeled Unicast on VyOS. Covers label distribution via BGP, inter-AS MPLS, seamless MPLS concepts, and why BGP-LU replaces LDP in modern designs.</description><pubDate>Tue, 27 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;LDP distributes labels within an autonomous system. But LDP doesn&apos;t cross AS boundaries. For MPLS across multiple ASes, you need something else.&lt;/p&gt;
&lt;p&gt;BGP Labeled Unicast (BGP-LU) distributes MPLS labels via BGP — the same protocol already handling inter-AS routing. Labels follow prefixes across AS boundaries, enabling end-to-end MPLS paths spanning multiple networks.&lt;/p&gt;
&lt;p&gt;BGP-LU replaces LDP in modern scalable designs.&lt;/p&gt;
&lt;h2&gt;Why BGP-LU&lt;/h2&gt;
&lt;h3&gt;LDP Limitations&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;AS 65001              AS 65002
[PE1]──[P1]──[ASBR1]  [ASBR2]──[P2]──[PE2]
     LDP               │         LDP
                       │
                   eBGP session
                   (no labels!)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;LDP sessions don&apos;t cross AS boundaries. MPLS stops at the border.&lt;/p&gt;
&lt;h3&gt;BGP-LU Solution&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;AS 65001              AS 65002
[PE1]──[P1]──[ASBR1]══[ASBR2]──[P2]──[PE2]
                   BGP-LU
              (prefixes + labels)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;BGP carries labels along with prefixes. MPLS continues across ASes.&lt;/p&gt;
&lt;h2&gt;BGP-LU Basics&lt;/h2&gt;
&lt;h3&gt;How It Works&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. PE1 allocates label for its loopback (10.255.0.1/32)
2. PE1 advertises via BGP: 10.255.0.1/32, label 1001
3. Intermediate routers receive prefix+label
4. Traffic to 10.255.0.1 gets labeled 1001 at ingress
5. Label-switched to PE1
&lt;/code&gt;&lt;/pre&gt;
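&lt;p&gt;The end state at an ingress router is simply a per-prefix binding of next hop and label. A minimal Python sketch — addresses and label values are the illustrative numbers from above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# What BGP-LU leaves behind: a label bound to each prefix.
bgp_lu_rib = {
    '10.255.0.1/32': {'next_hop': '10.0.0.1', 'label': 1001},
    '10.255.0.2/32': {'next_hop': '10.0.0.2', 'label': 2001},
}

def ingress_action(prefix):
    e = bgp_lu_rib[prefix]
    return 'push label {} toward {}'.format(e['label'], e['next_hop'])

print(ingress_action('10.255.0.1/32'))
&lt;/code&gt;&lt;/pre&gt;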
&lt;h3&gt;BGP-LU vs LDP&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;LDP&lt;/th&gt;
&lt;th&gt;BGP-LU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Single AS&lt;/td&gt;
&lt;td&gt;Multi-AS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;Separate (LDP)&lt;/td&gt;
&lt;td&gt;Existing (BGP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Label binding&lt;/td&gt;
&lt;td&gt;FEC-based&lt;/td&gt;
&lt;td&gt;Prefix-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;IGP-limited&lt;/td&gt;
&lt;td&gt;BGP-limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational&lt;/td&gt;
&lt;td&gt;Two protocols&lt;/td&gt;
&lt;td&gt;One protocol&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Configuring BGP-LU&lt;/h2&gt;
&lt;h3&gt;Enable Labeled Unicast Address Family&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable labeled unicast for IPv4
set protocols bgp address-family ipv4-labeled-unicast

# Redistribute connected/loopback with labels
set protocols bgp address-family ipv4-labeled-unicast redistribute connected

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure BGP-LU Neighbor&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# iBGP neighbor with labeled unicast
set protocols bgp neighbor 10.255.0.2 remote-as 65001
set protocols bgp neighbor 10.255.0.2 address-family ipv4-labeled-unicast

# eBGP neighbor with labeled unicast
set protocols bgp neighbor 192.0.2.1 remote-as 65002
set protocols bgp neighbor 192.0.2.1 address-family ipv4-labeled-unicast

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Network Statement with Labels&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Advertise loopback with label
set protocols bgp address-family ipv4-labeled-unicast network 10.255.0.1/32

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing BGP-LU State&lt;/h2&gt;
&lt;h3&gt;Show Labeled Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show BGP-LU routes
show bgp ipv4 labeled-unicast

# Output:
# Network          Next Hop        Label      Path
# 10.255.0.1/32    0.0.0.0         1001       i
# 10.255.0.2/32    10.0.0.2        2001       i
# 10.255.0.3/32    192.0.2.1       3001       65002 i
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Label Bindings&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show MPLS table (FRR)
show mpls table

# Show specific prefix label
show bgp ipv4 labeled-unicast 10.255.0.2/32
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Inter-AS MPLS Options&lt;/h2&gt;
&lt;h3&gt;Option A: Back-to-Back VRF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;No BGP-LU needed:
[PE1]──MPLS──[ASBR1]─────VRF─────[ASBR2]──MPLS──[PE2]
                    IP routing
                    (MPLS restarts)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Simple but doesn&apos;t provide end-to-end MPLS.&lt;/p&gt;
&lt;h3&gt;Option B: eBGP Labeled Unicast&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;BGP-LU across ASBRs:
[PE1]──MPLS──[ASBR1]═══BGP-LU═══[ASBR2]──MPLS──[PE2]
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# ASBR1 config
set protocols bgp neighbor 192.0.2.1 remote-as 65002
set protocols bgp neighbor 192.0.2.1 address-family ipv4-labeled-unicast

# ASBR1 must swap labels at boundary
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Option C: Multihop eBGP + Labels&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;BGP-LU between PEs (via route reflector):
[PE1]════════════BGP-LU═════════════[PE2]
        (reflected through ASBRs)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Most scalable, but complex.&lt;/p&gt;
&lt;h2&gt;Seamless MPLS&lt;/h2&gt;
&lt;p&gt;Seamless MPLS uses BGP-LU to create a unified MPLS network:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌─ Access ─┐   ┌─ Aggregation ─┐   ┌─ Core ─┐
AN ── AGN ── ABR ── CR ── ABR ── AGN ── AN

AN = Access Node
AGN = Aggregation Node
ABR = Area Border Router
CR = Core Router

BGP-LU provides end-to-end label path
No LDP needed in core
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benefits&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Single label stack (not stacked LDP + BGP labels)&lt;/li&gt;
&lt;li&gt;Scales to millions of prefixes&lt;/li&gt;
&lt;li&gt;Consistent forwarding behavior&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;BGP-LU with VPN Services&lt;/h2&gt;
&lt;h3&gt;L3VPN over BGP-LU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BGP-LU for transport
set protocols bgp address-family ipv4-labeled-unicast

# VPNv4 for customer routes
set protocols bgp address-family ipv4-vpn

# VPNv4 uses BGP-LU next-hop for label stack

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Label Stack&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Outer label: BGP-LU label (transport)
Inner label: VPN label (service)

[BGP-LU Label | VPN Label | IP Packet]
 (outermost)    (bottom of stack, S=1)
&lt;/code&gt;&lt;/pre&gt;
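&lt;p&gt;The ordering matters: the transport label is outermost on the wire, and the VPN label sits at the bottom of stack with S=1. A minimal Python sketch with hypothetical label values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# (label, S-bit) pairs, outermost first.
stack = [(2001, 0),   # BGP-LU transport label, S=0: more labels follow
         (5001, 1)]   # VPN service label, S=1: bottom of stack

# Only the last entry may carry S=1.
print([s for _, s in stack])
&lt;/code&gt;&lt;/pre&gt;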
&lt;h2&gt;BGP-LU Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Use Route Reflectors&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# BGP-LU at scale needs route reflectors
# Same as regular BGP

set protocols bgp neighbor 10.255.0.100 remote-as 65001
set protocols bgp neighbor 10.255.0.100 address-family ipv4-labeled-unicast
set protocols bgp neighbor 10.255.0.100 update-source lo
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Filter at Boundaries&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Don&apos;t accept labeled routes from customers
set policy prefix-list INFRA-ONLY rule 10 action permit
set policy prefix-list INFRA-ONLY rule 10 prefix 10.255.0.0/16
set policy prefix-list INFRA-ONLY rule 10 le 32

set policy route-map LU-IN rule 10 match ip address prefix-list INFRA-ONLY
set policy route-map LU-IN rule 10 action permit
set policy route-map LU-IN rule 20 action deny

set protocols bgp neighbor 192.0.2.1 address-family ipv4-labeled-unicast route-map import LU-IN
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Consistent Label Allocation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# All routers should use consistent label allocation policy
# Usually per-prefix labeling for PE loopbacks
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting BGP-LU&lt;/h2&gt;
&lt;h3&gt;Labels Not Exchanged&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check capability negotiation
show bgp neighbors 10.0.0.2

# Look for:
# IPv4 Labeled Unicast: advertised and received

# If not:
# - Check address-family configuration
# - Check both sides support labeled unicast
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Wrong Label&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show label for specific prefix
show bgp ipv4 labeled-unicast 10.255.0.2/32

# Verify label in MPLS table
show mpls table

# Check label is being used for forwarding
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Path Not Using Labels&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify next-hop is reachable via MPLS
show bgp ipv4 labeled-unicast 10.255.0.2/32

# Check next-hop resolution
# BGP-LU requires next-hop to have label
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Migration from LDP to BGP-LU&lt;/h2&gt;
&lt;h3&gt;Parallel Operation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Run both LDP and BGP-LU during migration
# LDP handles existing paths
# BGP-LU handles new paths

# Gradually shift traffic to BGP-LU
# Remove LDP when stable
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Order of Operations&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Enable BGP-LU on all routers
2. Verify BGP-LU paths working
3. Prefer BGP-LU over LDP (if needed)
4. Disable LDP
5. Clean up LDP configuration
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;BGP-LU replaces LDP in modern scalable designs.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;LDP works within a single AS but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Doesn&apos;t cross AS boundaries&lt;/li&gt;
&lt;li&gt;Requires separate protocol maintenance&lt;/li&gt;
&lt;li&gt;Scales with IGP (limited)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;BGP-LU advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works across AS boundaries&lt;/li&gt;
&lt;li&gt;Uses existing BGP infrastructure&lt;/li&gt;
&lt;li&gt;Scales with BGP (better)&lt;/li&gt;
&lt;li&gt;Enables Seamless MPLS designs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When to use BGP-LU:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multi-AS MPLS networks&lt;/li&gt;
&lt;li&gt;Large-scale service provider networks&lt;/li&gt;
&lt;li&gt;New MPLS deployments&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VyOS BGP-LU support is functional for basic scenarios. For production SP networks, verify feature completeness in your VyOS version.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>mpls</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>MPLS Introduction: Labels, LDP, and Packet Forwarding</title><link>https://ashimov.com/posts/vyos-mpls/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-mpls/</guid><description>Understand MPLS fundamentals on VyOS. Covers label switching, LDP configuration, penultimate hop popping, MPLS forwarding, and why MPLS is still relevant for service provider networks.</description><pubDate>Fri, 23 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;IP routing makes forwarding decisions at every hop. Each router looks up the destination, checks its routing table, forwards the packet. Repeat at every hop. Works fine, but expensive at scale.&lt;/p&gt;
&lt;p&gt;MPLS (Multi-Protocol Label Switching) adds a label at network edge. Interior routers forward based on label only — a simple table lookup, no IP processing. Labels are swapped at each hop until the edge, where the label is removed and IP routing resumes.&lt;/p&gt;
&lt;p&gt;MPLS is still relevant for service provider networks — enabling VPNs, traffic engineering, and fast forwarding at scale.&lt;/p&gt;
&lt;h2&gt;MPLS Concepts&lt;/h2&gt;
&lt;h3&gt;How MPLS Works&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Without MPLS (IP forwarding):
[IP Header: dst=10.0.0.1] → Router A → [route lookup] → Router B → [route lookup] → Router C

With MPLS:
[IP Header] → Edge Router → adds [Label: 100] → Core Router → swaps [Label: 200] → Edge Router → removes label → [IP Header]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Terms&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Label&lt;/td&gt;
&lt;td&gt;20-bit identifier for forwarding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LSP&lt;/td&gt;
&lt;td&gt;Label Switched Path (tunnel through network)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LDP&lt;/td&gt;
&lt;td&gt;Label Distribution Protocol (assigns labels)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push&lt;/td&gt;
&lt;td&gt;Add label to packet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pop&lt;/td&gt;
&lt;td&gt;Remove label from packet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Swap&lt;/td&gt;
&lt;td&gt;Replace label with new one&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PHP&lt;/td&gt;
&lt;td&gt;Penultimate Hop Popping&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;MPLS Header&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────────┬────┬───┬─────┐
│ Label (20 bits) │ TC │ S │ TTL │
│     (0-1048575) │ 3b │1b │ 8b  │
└─────────────────┴────┴───┴─────┘

TC: Traffic Class (QoS)
S: Bottom of Stack (1 if last label)
TTL: Time to Live
&lt;/code&gt;&lt;/pre&gt;
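&lt;p&gt;The 32-bit shim can be packed and unpacked to make the field boundaries concrete. A minimal Python sketch (multiplication and divmod stand in for bit shifts); the field values are arbitrary:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import struct

def mpls_shim(label, tc, s, ttl):
    # label occupies the top 20 bits, then TC(3), S(1), TTL(8)
    word = label * 4096 + tc * 512 + s * 256 + ttl
    return struct.pack('!I', word)

def parse_shim(data):
    (word,) = struct.unpack('!I', data)
    label, rest = divmod(word, 4096)
    tc, rest = divmod(rest, 512)
    s, ttl = divmod(rest, 256)
    return label, tc, s, ttl

print(parse_shim(mpls_shim(label=100, tc=0, s=1, ttl=64)))
# (100, 0, 1, 64)
&lt;/code&gt;&lt;/pre&gt;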
&lt;h2&gt;VyOS MPLS Support&lt;/h2&gt;
&lt;p&gt;VyOS supports MPLS through FRRouting:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check MPLS support
cat /proc/sys/net/mpls/platform_labels
# If exists, MPLS kernel support is available

# Load MPLS modules
modprobe mpls_router
modprobe mpls_iptunnel
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Basic MPLS Configuration&lt;/h2&gt;
&lt;h3&gt;Enable MPLS on Interfaces&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable MPLS kernel support
set system sysctl parameter net.mpls.platform_labels value 100000

# Enable MPLS input on interfaces (via sysctl)
set system sysctl parameter net.mpls.conf.eth0.input value 1
set system sysctl parameter net.mpls.conf.eth1.input value 1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure LDP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable LDP router ID
set protocols mpls ldp router-id 10.255.0.1

# Configure LDP interfaces
set protocols mpls ldp interface eth0
set protocols mpls ldp interface eth1

# Optional: Discovery hello interval
set protocols mpls ldp discovery hello-interval 5
set protocols mpls ldp discovery hello-holdtime 15

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;LDP with Targeted Neighbors&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# For non-adjacent neighbors (over tunnels)
set protocols mpls ldp targeted-neighbor ipv4 address 10.255.0.2

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;LDP Operation&lt;/h2&gt;
&lt;h3&gt;LDP Session Establishment&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Router discovers neighbors via Hello packets (UDP 646)
2. TCP session established to neighbor (port 646)
3. Label mappings exchanged
4. LSPs formed
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Viewing LDP Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show LDP neighbors
show mpls ldp neighbor

# Output:
# Peer LDP Identifier: 10.255.0.2:0
#   TCP connection: 10.255.0.1:646 - 10.255.0.2:54321
#   State: Operational
#   Messages sent/received: 1234/5678

# Show LDP bindings (label mappings)
show mpls ldp binding

# Show MPLS forwarding table
show mpls table
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;MPLS Forwarding&lt;/h2&gt;
&lt;h3&gt;Label Operations&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# View MPLS forwarding table
show mpls table

# Output:
# Inbound Label  Type    Nexthop           Outbound Label
# 100            LDP     10.0.0.2          200
# 101            LDP     10.0.0.2          201
# 102            LDP     10.0.0.2          implicit-null

# implicit-null = PHP (penultimate hop popping)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Penultimate Hop Popping (PHP)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Without PHP:
[Label:100] → Router A → [Label:200] → Router B → [Label:300] → Router C → [no label] → Destination
                                                                    ↑ Two operations: pop label + IP lookup

With PHP:
[Label:100] → Router A → [Label:200] → Router B → [no label] → Router C → Destination
                                          ↑ Pop here (second-to-last hop)
                                                              ↑ Only IP lookup needed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Router C signals &quot;implicit-null&quot; label to Router B, telling it to pop the label.&lt;/p&gt;
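&lt;p&gt;The PHP decision is a one-line special case in the label table walk. A minimal Python sketch — the LFIB entry is hypothetical:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# LFIB entry on Router B for the LSP toward Router C.
lfib = {100: ('RouterC', 'implicit-null')}

def forward(in_label):
    next_hop, out_label = lfib[in_label]
    if out_label == 'implicit-null':
        return next_hop, None   # PHP: pop, send unlabeled
    return next_hop, out_label  # normal swap

print(forward(100))
&lt;/code&gt;&lt;/pre&gt;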
&lt;h2&gt;MPLS with IGP&lt;/h2&gt;
&lt;h3&gt;MPLS + OSPF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Configure OSPF
set protocols ospf area 0 network 10.0.0.0/24
set protocols ospf passive-interface default
set protocols ospf interface eth0 passive disable
set protocols ospf interface eth1 passive disable

# LDP follows OSPF paths
set protocols mpls ldp interface eth0
set protocols mpls ldp interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MPLS + IS-IS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Configure IS-IS
set protocols isis interface eth0
set protocols isis interface eth1
set protocols isis net 49.0001.0100.0100.0001.00

# LDP follows IS-IS paths
set protocols mpls ldp interface eth0
set protocols mpls ldp interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Simple MPLS Network Example&lt;/h2&gt;
&lt;h3&gt;Topology&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[CE1] ── [PE1] ═══ [P1] ═══ [PE2] ── [CE2]
         10.255.0.1   10.255.0.2   10.255.0.3

PE = Provider Edge (MPLS edge)
P = Provider (MPLS core)
CE = Customer Edge (no MPLS)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;PE1 Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Loopback for router-id
set interfaces loopback lo address 10.255.0.1/32

# WAN interface toward P1
set interfaces ethernet eth0 address 10.0.0.1/30

# Customer interface (no MPLS)
set interfaces ethernet eth1 address 192.168.1.1/24

# OSPF
set protocols ospf parameters router-id 10.255.0.1
set protocols ospf area 0 network 10.255.0.1/32
set protocols ospf area 0 network 10.0.0.0/30

# LDP
set protocols mpls ldp router-id 10.255.0.1
set protocols mpls ldp discovery transport-ipv4-address 10.255.0.1
set protocols mpls ldp interface eth0

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;P1 Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Loopback
set interfaces loopback lo address 10.255.0.2/32

# Interfaces
set interfaces ethernet eth0 address 10.0.0.2/30
set interfaces ethernet eth1 address 10.0.0.5/30

# OSPF
set protocols ospf router-id 10.255.0.2
set protocols ospf area 0 network 10.255.0.2/32
set protocols ospf area 0 network 10.0.0.0/30
set protocols ospf area 0 network 10.0.0.4/30

# LDP
set protocols mpls ldp router-id 10.255.0.2
set protocols mpls ldp discovery transport-ipv4-address 10.255.0.2
set protocols mpls ldp interface eth0
set protocols mpls ldp interface eth1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting MPLS&lt;/h2&gt;
&lt;h3&gt;LDP Neighbor Not Forming&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check interface MPLS is enabled
show interfaces ethernet eth0

# Check LDP is listening
ss -tulnp | grep 646

# Check for LDP hellos
sudo tcpdump -i eth0 udp port 646

# Check OSPF/IGP adjacency (LDP follows IGP)
show ip ospf neighbor
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Labels Not Assigned&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check LDP bindings
show mpls ldp binding

# Check MPLS forwarding table
show mpls table

# Verify MPLS modules loaded
lsmod | grep mpls
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Packets Not Label-Switched&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify ingress interface has MPLS enabled
cat /proc/sys/net/mpls/conf/eth0/input

# Should be 1; if 0, enable it:
echo 1 &amp;gt; /proc/sys/net/mpls/conf/eth0/input

# Check kernel MPLS support
cat /proc/sys/net/mpls/platform_labels
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;MPLS MTU Considerations&lt;/h2&gt;
&lt;p&gt;Each MPLS label adds 4 bytes of header overhead:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Standard Ethernet MTU: 1500
# With one MPLS label: 1500 - 4 = 1496 effective payload
# With two labels (VPN): 1500 - 8 = 1492 effective payload

# Option 1: Increase MTU on MPLS interfaces
set interfaces ethernet eth0 mtu 1508

# Option 2: Fragment at ingress (less efficient)
&lt;/code&gt;&lt;/pre&gt;
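&lt;p&gt;Why exactly 4 bytes per label? Each label-stack entry is one fixed 32-bit word (RFC 3032): 20 bits of label, 3 bits of traffic class, a bottom-of-stack flag, and an 8-bit TTL. A small Python sketch of the packing (illustrative only, not how VyOS builds packets):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import struct

def mpls_entry(label, tc=0, bottom=True, ttl=64):
    # One 4-byte label-stack entry (RFC 3032):
    # label (20 bits), traffic class (3), bottom-of-stack (1), TTL (8)
    word = label * 4096 + tc * 512 + int(bottom) * 256 + ttl
    return struct.pack('!I', word)

# Two-label stack, e.g. an L3VPN packet: transport label + VPN label
stack = mpls_entry(100, bottom=False) + mpls_entry(200)
print(len(stack))  # 8 bytes of overhead, hence 1500 - 8 = 1492 payload
&lt;/code&gt;&lt;/pre&gt;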
&lt;h2&gt;MPLS Security&lt;/h2&gt;
&lt;h3&gt;Control Plane Security&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Restrict LDP sessions
# Only accept from known neighbors
set protocols mpls ldp neighbor 10.255.0.2 password &quot;secret&quot;

# Filter LDP discovery and session traffic
# (Use firewall to limit UDP 646 hellos and TCP 646 sessions to known peers)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Data Plane Considerations&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# MPLS doesn&apos;t encrypt traffic
# Anyone on the path can read label and content

# For encryption, use:
# - IPsec over MPLS
# - MACsec at Layer 2
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;MPLS is still relevant for service provider networks.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MPLS provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fast forwarding (label lookup vs. IP lookup)&lt;/li&gt;
&lt;li&gt;VPN services (L2VPN, L3VPN)&lt;/li&gt;
&lt;li&gt;Traffic engineering (explicit paths)&lt;/li&gt;
&lt;li&gt;QoS capabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VyOS MPLS support is functional but limited:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Basic LDP works&lt;/li&gt;
&lt;li&gt;Advanced features (RSVP-TE, Segment Routing) may be limited&lt;/li&gt;
&lt;li&gt;Check VyOS version for specific feature support&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For modern networks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Small networks: IP routing is fine&lt;/li&gt;
&lt;li&gt;Large SP networks: MPLS still valuable&lt;/li&gt;
&lt;li&gt;Newer alternative: Segment Routing (SR-MPLS, SRv6)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Understand MPLS fundamentals even if you don&apos;t use it daily — many service provider networks and VPN services depend on it.&lt;/p&gt;
</content:encoded><category>vyos</category><category>mpls</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>BGP Dampening: Suppressing Route Flapping</title><link>https://ashimov.com/posts/vyos-bgp-dampening/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-bgp-dampening/</guid><description>Configure BGP route dampening on VyOS. Covers dampening parameters, penalty calculation, route suppression, reuse thresholds, and why dampening prevents unstable routes from destabilizing your network.</description><pubDate>Tue, 20 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A remote router has a flaky connection. Route appears, disappears, appears again — 10 times per minute. Each flap propagates through BGP. Your router processes updates. Your peers process updates. The entire internet processes updates. All for a route that will flap again in seconds.&lt;/p&gt;
&lt;p&gt;BGP dampening penalizes routes that flap frequently. After enough flaps, the route is suppressed — temporarily hidden until it proves stable. This protects your network from chasing unstable routes.&lt;/p&gt;
&lt;p&gt;Dampening prevents unstable routes from destabilizing your network.&lt;/p&gt;
&lt;h2&gt;How Dampening Works&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Route flap detected:
  → Penalty added (e.g., +1000)
  → If penalty &amp;gt; suppress threshold: route suppressed
  → Penalty decays over time
  → If penalty &amp;lt; reuse threshold: route unsuppressed
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Parameters&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Typical Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Half-life&lt;/td&gt;
&lt;td&gt;Time for penalty to decay by half&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reuse&lt;/td&gt;
&lt;td&gt;Penalty below which route is reused&lt;/td&gt;
&lt;td&gt;750&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Suppress&lt;/td&gt;
&lt;td&gt;Penalty above which route is suppressed&lt;/td&gt;
&lt;td&gt;2000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max-suppress&lt;/td&gt;
&lt;td&gt;Maximum suppression time&lt;/td&gt;
&lt;td&gt;60 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Example Timeline&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Time 0:00 - Route withdrawn  → Penalty: 1000
Time 0:01 - Route announced  → Penalty: 2000 (flap)
Time 0:02 - Route withdrawn  → Penalty: 3000 → SUPPRESSED (&amp;gt; 2000)
Time 0:03 - Route announced  → Still suppressed, penalty +1000 = 4000

Time 15:00 - Penalty decayed to 2000 (half-life)
Time 30:00 - Penalty decayed to 1000
Time 35:00 - Penalty decayed to ~750 → REUSED (&amp;lt; 750)
&lt;/code&gt;&lt;/pre&gt;
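&lt;p&gt;The decay is plain exponential half-life, so the reuse delay can be computed directly. A quick Python sketch (simplified model using the parameters from the table above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import math

HALF_LIFE_MIN = 15.0   # penalty halves every 15 minutes
REUSE = 750            # route reused below this penalty
FLAP_PENALTY = 1000    # penalty added per flap

# Three quick flaps accumulate 3000 penalty (suppressed, threshold 2000)
penalty = 3 * FLAP_PENALTY

# Decay: penalty * 0.5 ** (t / half_life); solve for the reuse time
minutes = HALF_LIFE_MIN * math.log2(penalty / REUSE)
print(minutes)  # 30.0 minutes until a 3000 penalty decays to 750
&lt;/code&gt;&lt;/pre&gt;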
&lt;h2&gt;Basic Dampening Configuration&lt;/h2&gt;
&lt;h3&gt;Enable Dampening Globally&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable with default parameters
set protocols bgp address-family ipv4-unicast dampening

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Custom Parameters&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Custom dampening parameters
set protocols bgp address-family ipv4-unicast dampening half-life 15
set protocols bgp address-family ipv4-unicast dampening re-use 750
set protocols bgp address-family ipv4-unicast dampening start-suppress-time 2000
set protocols bgp address-family ipv4-unicast dampening max-suppress-time 60

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Per-Neighbor Dampening&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Different dampening for different peers
# Typically done via route-map

configure

# More aggressive for untrusted peers
set policy route-map DAMPENING-AGGRESSIVE rule 10 action permit
set policy route-map DAMPENING-AGGRESSIVE rule 10 set dampening half-life 10
set policy route-map DAMPENING-AGGRESSIVE rule 10 set dampening re-use 500
set policy route-map DAMPENING-AGGRESSIVE rule 10 set dampening start-suppress-time 1500

# Apply to peer
set protocols bgp neighbor 10.0.0.2 address-family ipv4-unicast route-map import DAMPENING-AGGRESSIVE

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing Dampening Status&lt;/h2&gt;
&lt;h3&gt;Show Dampened Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all dampened routes
show bgp ipv4 unicast dampening dampened-paths

# Output:
# Network          From             Reuse    Path
# 203.0.113.0/24   10.0.0.2        00:25:00 65002 65003
# 198.51.100.0/24  10.0.0.2        00:45:00 65002 65004
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Flap Statistics&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show routes with flap history
show bgp ipv4 unicast dampening flap-statistics

# Output:
# Network          From             Flaps Duration Reuse    Path
# 203.0.113.0/24   10.0.0.2        15    01:30:00 00:25:00 65002
# 198.51.100.0/24  10.0.0.2        8     00:45:00 00:45:00 65002
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Dampening Parameters&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show configured parameters
show bgp ipv4 unicast dampening parameters

# Output:
# Half-life time: 15 minutes
# Reuse penalty: 750
# Suppress penalty: 2000
# Max suppress time: 60 minutes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Clearing Dampening&lt;/h2&gt;
&lt;h3&gt;Clear Specific Route&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Clear dampening for specific prefix
clear bgp ipv4 unicast dampening 203.0.113.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Clear All Dampened Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Clear all dampening history
clear bgp ipv4 unicast dampening
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Parameter Tuning&lt;/h2&gt;
&lt;h3&gt;Aggressive Dampening&lt;/h3&gt;
&lt;p&gt;For untrusted peers or known-unstable sources:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Quick to suppress, slow to recover
set protocols bgp address-family ipv4-unicast dampening half-life 10
set protocols bgp address-family ipv4-unicast dampening re-use 500
set protocols bgp address-family ipv4-unicast dampening start-suppress-time 1000
set protocols bgp address-family ipv4-unicast dampening max-suppress-time 60

# Suppress after ~1-2 flaps
# Stay suppressed for up to 1 hour
# Be very stable to return

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Lenient Dampening&lt;/h3&gt;
&lt;p&gt;For trusted peers or critical routes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Harder to suppress, quick to recover
set protocols bgp address-family ipv4-unicast dampening half-life 20
set protocols bgp address-family ipv4-unicast dampening re-use 1000
set protocols bgp address-family ipv4-unicast dampening start-suppress-time 3000
set protocols bgp address-family ipv4-unicast dampening max-suppress-time 30

# Suppress after ~3+ flaps
# Return to use faster
# Maximum 30 minutes suppression

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Calculating Flaps to Suppress&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Suppress threshold / Penalty per flap = Flaps to suppress

Example:
2000 / 1000 = 2 flaps → suppressed

With decay during flapping, actual number varies
&lt;/code&gt;&lt;/pre&gt;
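&lt;p&gt;The same arithmetic in Python (ceiling division, decay between flaps ignored):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def flaps_to_suppress(suppress_threshold, per_flap=1000):
    # Smallest number of flaps whose accumulated penalty
    # reaches the suppress threshold (ceiling division)
    return -(-suppress_threshold // per_flap)

print(flaps_to_suppress(2000))  # 2  (default parameters)
print(flaps_to_suppress(3000))  # 3  (lenient profile)
print(flaps_to_suppress(2500))  # 3  (partial flaps round up)
&lt;/code&gt;&lt;/pre&gt;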
&lt;h2&gt;Selective Dampening&lt;/h2&gt;
&lt;h3&gt;Dampen Only Specific Prefixes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Only dampen longer prefixes (more specific routes)
set policy prefix-list DAMPEN-TARGETS rule 10 action permit
set policy prefix-list DAMPEN-TARGETS rule 10 prefix 0.0.0.0/0
set policy prefix-list DAMPEN-TARGETS rule 10 ge 24

set policy route-map SELECTIVE-DAMPEN rule 10 match ip address prefix-list DAMPEN-TARGETS
set policy route-map SELECTIVE-DAMPEN rule 10 action permit
set policy route-map SELECTIVE-DAMPEN rule 10 set dampening half-life 15

set policy route-map SELECTIVE-DAMPEN rule 20 action permit
# No dampening for other routes

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Exclude Critical Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Don&apos;t dampen default route or critical prefixes
set policy prefix-list NO-DAMPEN rule 10 action permit
set policy prefix-list NO-DAMPEN rule 10 prefix 0.0.0.0/0
# Critical DNS prefix
set policy prefix-list NO-DAMPEN rule 20 action permit
set policy prefix-list NO-DAMPEN rule 20 prefix 8.8.8.0/24

set policy route-map SAFE-DAMPEN rule 10 match ip address prefix-list NO-DAMPEN
set policy route-map SAFE-DAMPEN rule 10 action permit
# No dampening for matched routes

set policy route-map SAFE-DAMPEN rule 20 action permit
set policy route-map SAFE-DAMPEN rule 20 set dampening half-life 15

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Dampening Considerations&lt;/h2&gt;
&lt;h3&gt;When to Use Dampening&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;✓ Edge routers receiving external routes&lt;/li&gt;
&lt;li&gt;✓ Networks with known flapping sources&lt;/li&gt;
&lt;li&gt;✓ Protection against propagating instability&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When Not to Use Dampening&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;✗ Internal BGP (iBGP) — hides real problems&lt;/li&gt;
&lt;li&gt;✗ Critical routes where availability is paramount&lt;/li&gt;
&lt;li&gt;✗ Networks with known slow convergence (dampening adds to it)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Dampening vs BFD&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;BFD: Detect failures FAST
Dampening: Suppress UNSTABLE routes

They solve different problems:
- BFD: &quot;Quickly know when peer is dead&quot;
- Dampening: &quot;Don&apos;t trust peers that keep dying&quot;

Use both together for:
- Fast failure detection (BFD)
- Protection from flapping (dampening)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring and Alerting&lt;/h2&gt;
&lt;h3&gt;Monitor Dampening Events&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Log when routes are suppressed/reused
show log | grep -i dampen

# Track frequently dampened prefixes
show bgp ipv4 unicast dampening flap-statistics
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Alert on Persistent Dampening&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If route stays dampened, investigate source
# Persistent dampening = persistent instability somewhere

# Check which peer is source
show bgp ipv4 unicast 203.0.113.0/24
# Note the &quot;from&quot; peer
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Route Stays Suppressed&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check current penalty and reuse time
show bgp ipv4 unicast dampening dampened-paths

# If penalty not decaying:
# - Recent flaps reset penalty
# - Half-life too long

# Manual clear if needed
clear bgp ipv4 unicast dampening 203.0.113.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Expected Route Not Appearing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Might be dampened
show bgp ipv4 unicast dampening dampened-paths | grep &amp;lt;prefix&amp;gt;

# If dampened, wait for reuse or clear
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Dampening Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify dampening is enabled
show bgp ipv4 unicast dampening parameters

# Check route-map is applied
show configuration commands | grep dampening
show configuration commands | grep route-map
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Start Conservative&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Default parameters are reasonable starting point
set protocols bgp address-family ipv4-unicast dampening

# Monitor before tuning
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Different Parameters for Different Sources&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Tier 1 transit: Lenient (trusted)
# Tier 2 transit: Standard
# Peers: Standard
# Customers: Standard or aggressive (their flaps propagate through you)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Monitor Dampening Statistics&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Regular review of what&apos;s being dampened
# Persistent dampening = investigate root cause
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Don&apos;t Dampen Everything&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Critical routes shouldn&apos;t be dampened
# Internal routes shouldn&apos;t be dampened
# Only dampen external routes from untrusted/unknown sources
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Dampening prevents unstable routes from destabilizing your network.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without dampening:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Flapping route → constant updates&lt;/li&gt;
&lt;li&gt;Updates propagate to all peers&lt;/li&gt;
&lt;li&gt;CPU and memory consumed processing junk&lt;/li&gt;
&lt;li&gt;Potentially affects routing for stable routes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With dampening:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Flapping route → penalty accumulates&lt;/li&gt;
&lt;li&gt;After threshold → route suppressed&lt;/li&gt;
&lt;li&gt;Network ignores unstable route&lt;/li&gt;
&lt;li&gt;Stable routes unaffected&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The tradeoff: Dampening delays convergence for routes that are legitimately changing. A real path change looks like a flap. Too aggressive dampening can hide valid routes.&lt;/p&gt;
&lt;p&gt;Balance: Conservative dampening on external routes, no dampening on internal/critical routes.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>networking</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>ECMP and Multipath: Load Balancing at the Routing Layer</title><link>https://ashimov.com/posts/vyos-ecmp/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-ecmp/</guid><description>Configure ECMP on VyOS for route-level load balancing. Covers equal-cost paths, multipath BGP, hash algorithms, troubleshooting uneven distribution, and why ECMP is simple but requires understanding.</description><pubDate>Fri, 16 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Two paths to the same destination. Same cost. Traditional routing picks one. ECMP (Equal-Cost Multi-Path) uses both, spreading traffic across available paths.&lt;/p&gt;
&lt;p&gt;The concept is simple: multiple equal routes, traffic distributed. The implementation has nuances: how traffic is distributed, what makes routes &quot;equal,&quot; and why some flows always use the same path.&lt;/p&gt;
&lt;p&gt;ECMP is simple but requires understanding to use effectively.&lt;/p&gt;
&lt;h2&gt;How ECMP Works&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Without ECMP:
                    [Path A: cost 10] → (used)
Host → Router →
                    [Path B: cost 10] → (ignored)

With ECMP:
                    [Path A: cost 10] → (50% traffic)
Host → Router →
                    [Path B: cost 10] → (50% traffic)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Traffic is distributed per-flow, not per-packet. All packets for the same flow use the same path (preventing reordering).&lt;/p&gt;
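&lt;p&gt;A toy Python model of per-flow selection; any stable hash of the 5-tuple gives this property (the kernel uses its own hash function, not SHA-256):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import hashlib

PATHS = ['via 192.168.1.1', 'via 192.168.1.2']

def pick_path(src_ip, dst_ip, sport, dport, proto='tcp'):
    # Hash the 5-tuple; every packet of a flow maps to the same path
    key = '%s %s %s %s %s' % (src_ip, dst_ip, sport, dport, proto)
    digest = hashlib.sha256(key.encode()).digest()
    return PATHS[digest[0] % len(PATHS)]

# Same flow, same path every time (no reordering):
first = pick_path('10.1.1.1', '10.2.2.2', 40000, 443)
assert first == pick_path('10.1.1.1', '10.2.2.2', 40000, 443)
# A different source port may land on the other path
&lt;/code&gt;&lt;/pre&gt;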
&lt;h2&gt;Basic ECMP Configuration&lt;/h2&gt;
&lt;h3&gt;Static Routes ECMP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Two equal-cost static routes
set protocols static route 10.0.0.0/8 next-hop 192.168.1.1
set protocols static route 10.0.0.0/8 next-hop 192.168.1.2

# VyOS automatically installs both if costs equal
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify ECMP Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show ip route 10.0.0.0/8

# Output:
# S&amp;gt;* 10.0.0.0/8 [1/0] via 192.168.1.1, eth0, weight 1, 00:05:00
#                 via 192.168.1.2, eth1, weight 1, 00:05:00
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Multiple next-hops shown = ECMP active.&lt;/p&gt;
&lt;h2&gt;ECMP with BGP&lt;/h2&gt;
&lt;h3&gt;Enable Multipath&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable ECMP for eBGP
set protocols bgp address-family ipv4-unicast maximum-paths ebgp 4

# Enable ECMP for iBGP
set protocols bgp address-family ipv4-unicast maximum-paths ibgp 4

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP Path Requirements&lt;/h3&gt;
&lt;p&gt;For BGP paths to be ECMP-eligible, they must have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Same AS_PATH length&lt;/li&gt;
&lt;li&gt;Same origin (IGP/EGP/incomplete)&lt;/li&gt;
&lt;li&gt;Same MED (or MED comparison disabled)&lt;/li&gt;
&lt;li&gt;Same local preference&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# Compare paths
show bgp ipv4 unicast 10.0.0.0/8

# If paths differ in AS_PATH length, not ECMP-eligible
# Path 1: AS_PATH 65001 65002 (length 2)
# Path 2: AS_PATH 65001 (length 1)  ← shorter, wins alone
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Allow Multipath from Same AS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# For multiple connections to same AS
set protocols bgp address-family ipv4-unicast maximum-paths ebgp 4
set protocols bgp parameters bestpath as-path multipath-relax

# multipath-relax: allows ECMP even if AS_PATH contents differ (same length)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ECMP with OSPF&lt;/h2&gt;
&lt;h3&gt;Enable OSPF ECMP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF supports ECMP by default
# Configure maximum paths
set protocols ospf parameters maximum-paths 4

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;OSPF naturally creates ECMP when multiple paths have equal cost.&lt;/p&gt;
&lt;h2&gt;Hash Algorithm&lt;/h2&gt;
&lt;p&gt;Traffic distribution uses hash of packet headers. Same hash = same path.&lt;/p&gt;
&lt;h3&gt;Hash Inputs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Default hash inputs:
- Source IP
- Destination IP
- Source port
- Destination port
- Protocol

Hash result → selects path
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure Hash Algorithm&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VyOS uses kernel&apos;s fib_multipath_hash_policy
# 0 = Layer 3 only (src/dst IP)
# 1 = Layer 4 (src/dst IP + ports)
# 2 = Layer 3 of inner packet for tunneled/encapsulated traffic

configure
set system sysctl parameter net.ipv4.fib_multipath_hash_policy value 1
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Layer 3 vs Layer 4 Hash&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Layer 3 only:
# Same src/dst IP pair always uses same path
# Different src IPs spread across paths

# Layer 4:
# Same src/dst IP but different ports can use different paths
# Better distribution for single-host scenarios
&lt;/code&gt;&lt;/pre&gt;
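&lt;p&gt;The layer 3 vs layer 4 difference is easy to demonstrate with a toy two-path hash (the kernel&apos;s actual hash differs, but the behavior is the same):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import hashlib

def bucket(*fields):
    # Toy 2-path hash over whichever header fields are included
    key = ' '.join(str(f) for f in fields).encode()
    return hashlib.sha256(key).digest()[0] % 2

# Layer 3: one host pair always hashes to one path, ports ignored
l3 = {bucket('10.0.0.1', '10.0.0.2') for _ in range(100)}
print(len(l3))  # 1

# Layer 4: varying source ports spreads the same host pair
l4 = {bucket('10.0.0.1', '10.0.0.2', sport, 443) for sport in range(1024, 1124)}
print(len(l4))  # almost certainly 2 across 100 distinct ports
&lt;/code&gt;&lt;/pre&gt;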
&lt;h2&gt;Troubleshooting ECMP&lt;/h2&gt;
&lt;h3&gt;Issue: Uneven Distribution&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# One path getting most traffic

# Causes:
# 1. Hash algorithm + traffic pattern = uneven
# 2. Not actually ECMP (one path preferred)
# 3. Few unique flows (small sample)

# Check if actually ECMP
show ip route 10.0.0.0/8
# Must show multiple next-hops

# Monitor per-path traffic
# Use interface counters
watch -n 1 &apos;show interfaces ethernet eth0; show interfaces ethernet eth1&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Issue: Single Flow Always Same Path&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# This is expected behavior!
# ECMP hashes per-flow, not per-packet

# Same src/dst/port always hashes to same path
# Prevents packet reordering

# For testing, use different source ports
nc -p 10001 server 80
nc -p 10002 server 80
# May use different paths
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Issue: Paths Not Equal&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# BGP paths not becoming ECMP

show bgp ipv4 unicast 10.0.0.0/8 bestpath

# Check what makes them unequal:
# - AS_PATH length different?
# - MED different?
# - Local preference different?

# Fix the inequality or enable multipath-relax
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Issue: Route Flapping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# One path keeps appearing/disappearing

# ECMP recalculates when paths change
# Can cause flow redistribution

# Solution: Stabilize the flapping path
# Or implement dampening
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Weighted ECMP&lt;/h2&gt;
&lt;p&gt;Not all paths are equal? Use weights:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Equal distance on both next-hops = plain 50/50 ECMP, not weighting
set protocols static route 10.0.0.0/8 next-hop 192.168.1.1 distance 1
set protocols static route 10.0.0.0/8 next-hop 192.168.1.2 distance 1

# Different distances give active/backup, not ECMP at all
# VyOS static routes don&apos;t expose a per-next-hop weight

# For weighted distribution, consider:
# - BGP with link-bandwidth (where supported)
# - Policy routing with firewall marks
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ECMP Failure Handling&lt;/h2&gt;
&lt;h3&gt;When One Path Fails&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ECMP automatically removes failed path
# Traffic redistributes to remaining paths

# Flow rehashing happens:
# - Some flows move to different paths
# - Brief reordering possible during transition
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BFD for Fast Failure Detection&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Use BFD to quickly detect path failure
set protocols bfd peer 192.168.1.1
set protocols bfd peer 192.168.1.2

# When BFD detects failure, route withdrawn immediately
# ECMP recalculates faster than waiting for routing protocol
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ECMP Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Match Bandwidth&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ECMP assumes equal paths
# 10G + 1G ECMP = poor utilization

# Either:
# - Use paths with equal bandwidth
# - Use weighted/unequal ECMP if available
# - Different approach (LAG, policy routing)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Enable Layer 4 Hash&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Better distribution for typical traffic
set system sysctl parameter net.ipv4.fib_multipath_hash_policy value 1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Monitor Both Paths&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Dashboard showing:
# - Traffic per path
# - Errors per path
# - ECMP route status
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Test Failover&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Regularly test:
# 1. Disable one path
# 2. Verify traffic flows via remaining path
# 3. Re-enable path
# 4. Verify ECMP resumes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ECMP vs LAG&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;ECMP&lt;/th&gt;
&lt;th&gt;LAG (Bond)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Layer&lt;/td&gt;
&lt;td&gt;3 (routing)&lt;/td&gt;
&lt;td&gt;2 (switching)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Different paths via different routers&lt;/td&gt;
&lt;td&gt;Multiple links to one device&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure detection&lt;/td&gt;
&lt;td&gt;Routing protocol&lt;/td&gt;
&lt;td&gt;LACP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration&lt;/td&gt;
&lt;td&gt;Routing config&lt;/td&gt;
&lt;td&gt;Interface config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Many paths&lt;/td&gt;
&lt;td&gt;Limited ports&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;# ECMP: Different next-hop routers
# LAG: Same router, bundled interfaces

# Use LAG for link aggregation to single device
# Use ECMP for path diversity across network
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Checking ECMP Status&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# View kernel routing table
ip route show 10.0.0.0/8

# Show with ECMP details
ip route show 10.0.0.0/8 | grep -i nexthop

# Count ECMP paths
ip route show 10.0.0.0/8 | grep -c nexthop

# Test which path a flow would take
ip route get 10.0.0.100 from 192.168.1.50
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;ECMP is simple but requires understanding to use effectively.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;What ECMP gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Automatic load distribution across equal paths&lt;/li&gt;
&lt;li&gt;Redundancy (path failure → automatic reroute)&lt;/li&gt;
&lt;li&gt;Increased aggregate bandwidth&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What ECMP doesn&apos;t give you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Per-packet load balancing (would cause reordering)&lt;/li&gt;
&lt;li&gt;Intelligent traffic distribution (hash-based, may be uneven)&lt;/li&gt;
&lt;li&gt;Weighted distribution (standard ECMP is equal-cost)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Key understanding:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Paths must be truly equal (cost, metrics)&lt;/li&gt;
&lt;li&gt;Distribution is per-flow, not per-packet&lt;/li&gt;
&lt;li&gt;Hash algorithm determines distribution&lt;/li&gt;
&lt;li&gt;Layer 4 hash usually gives better distribution&lt;/li&gt;
&lt;li&gt;Uneven traffic is normal with few flows&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Configure it, understand it, monitor it. ECMP works well when you know what to expect.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>ospf</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Route Leaking Between VRFs: Controlled Connectivity</title><link>https://ashimov.com/posts/vyos-vrf-leaking/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-vrf-leaking/</guid><description>Configure route leaking between VRFs on VyOS. Covers import/export policies, selective leaking, shared services, and why route leaking provides controlled cross-VRF connectivity.</description><pubDate>Tue, 13 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;VRFs isolate routing tables. Customer A can&apos;t reach Customer B. But what about shared services? The DNS server, the management network, the internet gateway — they need to be reachable from all VRFs.&lt;/p&gt;
&lt;p&gt;Route leaking imports routes from one VRF into another. Not all routes — just the ones you explicitly allow. DNS server in VRF SHARED becomes reachable from VRF CUSTOMER-A without full interconnection.&lt;/p&gt;
&lt;p&gt;Route leaking provides controlled cross-VRF connectivity.&lt;/p&gt;
&lt;h2&gt;Why Route Leaking&lt;/h2&gt;
&lt;h3&gt;Without Route Leaking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;VRF CUSTOMER-A:  10.1.0.0/16
VRF CUSTOMER-B:  10.2.0.0/16
VRF SHARED:      10.100.0.0/16 (DNS, NTP, etc.)

Problem: Customers can&apos;t reach shared services
Solution 1: NAT (complex, breaks some apps)
Solution 2: Route leaking (cleaner)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;With Route Leaking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;VRF CUSTOMER-A:
  - 10.1.0.0/16 (local)
  - 10.100.0.0/16 (leaked from SHARED)

VRF SHARED:
  - 10.100.0.0/16 (local)
  - 10.1.0.0/16 (leaked from CUSTOMER-A, if needed)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Basic Route Leaking&lt;/h2&gt;
&lt;h3&gt;VRF Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create VRFs
set vrf name CUSTOMER-A table 10
set vrf name SHARED table 20

# Assign interfaces
set interfaces ethernet eth1 vrf CUSTOMER-A
set interfaces ethernet eth2 vrf SHARED

# Configure addresses
set interfaces ethernet eth1 address 10.1.0.1/24
set interfaces ethernet eth2 address 10.100.0.1/24

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Leak Routes from SHARED to CUSTOMER-A&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define what to leak (prefix list)
set policy prefix-list SHARED-SERVICES rule 10 prefix 10.100.0.0/16
set policy prefix-list SHARED-SERVICES rule 10 action permit

# Route map for import
set policy route-map IMPORT-SHARED rule 10 match ip address prefix-list SHARED-SERVICES
set policy route-map IMPORT-SHARED rule 10 action permit

# Import into CUSTOMER-A from SHARED
# (the prefix-list/route-map above apply when leaking via BGP;
#  a simple static inter-VRF route is shown here)
set vrf name CUSTOMER-A protocols static route 10.100.0.0/16 interface eth2 vrf SHARED

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Using BGP for Route Leaking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BGP in each VRF
set vrf name CUSTOMER-A protocols bgp system-as 65000
set vrf name SHARED protocols bgp system-as 65000

# Route distinguisher (unique per VRF)
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast rd vpn export 65000:10
set vrf name SHARED protocols bgp address-family ipv4-unicast rd vpn export 65000:20

# Enable VPN import/export processing in each VRF
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast import vpn
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast export vpn
set vrf name SHARED protocols bgp address-family ipv4-unicast import vpn
set vrf name SHARED protocols bgp address-family ipv4-unicast export vpn

# Route targets for import/export
# CUSTOMER-A imports routes tagged 65000:20, exports its own with 65000:10
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target vpn import 65000:20
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target vpn export 65000:10

# SHARED exports to all customers
set vrf name SHARED protocols bgp address-family ipv4-unicast route-target vpn export 65000:20

commit
&lt;/code&gt;&lt;/pre&gt;
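&lt;p&gt;Route-target matching itself is simple tag filtering: a route is imported into a VRF when its export route-targets intersect the VRF&apos;s import set. A toy Python model of that decision:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;routes = [
    {'prefix': '10.100.0.0/16', 'rt': {'65000:20'}},  # exported by SHARED
    {'prefix': '10.2.0.0/16',   'rt': {'65000:30'}},  # exported by CUSTOMER-B
]
customer_a_import = {'65000:20'}

leaked = [r['prefix'] for r in routes
          if r['rt'].intersection(customer_a_import)]
print(leaked)  # ['10.100.0.0/16']
&lt;/code&gt;&lt;/pre&gt;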
&lt;h2&gt;Selective Route Leaking&lt;/h2&gt;
&lt;h3&gt;Leak Only Specific Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Only leak DNS server
set policy prefix-list DNS-ONLY rule 10 prefix 10.100.0.10/32
set policy prefix-list DNS-ONLY rule 10 action permit

# Route map for selective import
set policy route-map IMPORT-DNS rule 10 match ip address prefix-list DNS-ONLY
set policy route-map IMPORT-DNS rule 10 action permit

# Apply to import
# (Implementation depends on leaking method)

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Leak with Modified Attributes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Import but with lower preference
set policy route-map IMPORT-BACKUP rule 10 action permit
set policy route-map IMPORT-BACKUP rule 10 set local-preference 50

# Routes leaked as backup paths

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Bidirectional Leaking&lt;/h2&gt;
&lt;h3&gt;Customer Needs to Reach Shared, Shared Needs to Reach Customer&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# CUSTOMER-A → SHARED
set vrf name SHARED protocols static route 10.1.0.0/24 interface eth1 vrf CUSTOMER-A

# SHARED → CUSTOMER-A
set vrf name CUSTOMER-A protocols static route 10.100.0.0/24 interface eth2 vrf SHARED

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;With Route Targets (Symmetric)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# CUSTOMER-A
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target import 65000:20
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target export 65000:10

# SHARED
set vrf name SHARED protocols bgp address-family ipv4-unicast route-target import 65000:10
set vrf name SHARED protocols bgp address-family ipv4-unicast route-target export 65000:20

# Now bidirectional

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Patterns&lt;/h2&gt;
&lt;h3&gt;Pattern 1: Shared Services VRF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;        ┌─────────────────┐
        │   VRF SHARED    │
        │  DNS, NTP, etc  │
        └────────┬────────┘
                 │ (leaked to all)
    ┌────────────┼────────────┐
    │            │            │
┌───┴───┐   ┌───┴───┐   ┌───┴───┐
│ VRF A │   │ VRF B │   │ VRF C │
└───────┘   └───────┘   └───────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Leak SHARED to all customer VRFs
# Each customer VRF imports 65000:SHARED
# SHARED exports to all
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern 2: Internet Gateway&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;        ┌─────────────────┐
        │   VRF INTERNET  │
        │   (default gw)  │
        └────────┬────────┘
                 │ (default route leaked)
    ┌────────────┼────────────┐
    │            │            │
┌───┴───┐   ┌───┴───┐   ┌───┴───┐
│ VRF A │   │ VRF B │   │ VRF C │
└───────┘   └───────┘   └───────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;configure

# Internet VRF has default route
set vrf name INTERNET protocols static route 0.0.0.0/0 next-hop 203.0.113.1

# Leak default to customers
set vrf name CUSTOMER-A protocols static route 0.0.0.0/0 next-hop 10.100.0.1 vrf INTERNET

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern 3: Management VRF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;        ┌─────────────────┐
        │   VRF MGMT      │
        │  (admin access) │
        └────────┬────────┘
                 │ (limited leak)
    ┌────────────┼────────────┐
    │            │            │
┌───┴───┐   ┌───┴───┐   ┌───┴───┐
│ VRF A │   │ VRF B │   │ VRF C │
└───────┘   └───────┘   └───────┘

Management can reach all VRFs
VRFs cannot reach management
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# Asymmetric: MGMT can reach customers
set vrf name MGMT protocols static route 10.1.0.0/16 interface eth1 vrf CUSTOMER-A
set vrf name MGMT protocols static route 10.2.0.0/16 interface eth2 vrf CUSTOMER-B

# But customers don&apos;t have a route to MGMT (no reverse leak)
# Caveat: replies from customer VRFs still need a return route, so in
# practice you also leak the MGMT prefix back and enforce one-way
# access with firewall rules at the leak points
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Preventing Leakage Problems&lt;/h2&gt;
&lt;h3&gt;Problem: Overlapping Addresses&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# CUSTOMER-A: 10.0.0.0/8
# CUSTOMER-B: 10.0.0.0/8  (same!)

# Can&apos;t leak between them - address collision

# Solution: NAT before leaking, or don&apos;t overlap
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem: Routing Loops&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VRF A leaks to VRF B
# VRF B leaks to VRF C
# VRF C leaks to VRF A

# If routes propagate, possible loop

# Solution: Mark leaked routes, don&apos;t re-leak
&lt;/code&gt;&lt;/pre&gt;
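&lt;p&gt;One way to mark leaked routes, sketched with a hypothetical tag community &lt;code&gt;65000:999&lt;/code&gt; (where exactly the route-map attaches depends on your leaking method):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Routes already carrying the tag were leaked once - do not leak again
set policy community-list ALREADY-LEAKED rule 10 action permit
set policy community-list ALREADY-LEAKED rule 10 regex &quot;65000:999&quot;
set policy route-map MARK-LEAKED rule 5 match community community-list ALREADY-LEAKED
set policy route-map MARK-LEAKED rule 5 action deny

# Everything else is leaked and tagged on the way through
set policy route-map MARK-LEAKED rule 10 action permit
set policy route-map MARK-LEAKED rule 10 set community &quot;65000:999 additive&quot;

commit
&lt;/code&gt;&lt;/pre&gt;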
&lt;h3&gt;Problem: Unintended Connectivity&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Leaked too much - customers can now reach each other

# Solution: Use strict prefix lists
set policy prefix-list SHARED-ONLY rule 10 prefix 10.100.0.0/24
set policy prefix-list SHARED-ONLY rule 10 action permit
set policy prefix-list SHARED-ONLY rule 999 prefix 0.0.0.0/0 le 32
set policy prefix-list SHARED-ONLY rule 999 action deny

# Only leak exactly what&apos;s needed
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Verifying Route Leaking&lt;/h2&gt;
&lt;h3&gt;Check VRF Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show routes in specific VRF
show ip route vrf CUSTOMER-A

# Look for routes with different VRF next-hop
# 10.100.0.0/24 via 10.100.0.1, eth2 (vrf SHARED)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check BGP VPN Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If using BGP for leaking
show bgp ipv4 vpn
show bgp ipv4 vpn rd 65000:10
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test Connectivity&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Ping across VRFs
ping 10.100.0.10 vrf CUSTOMER-A

# Traceroute across VRFs
traceroute 10.100.0.10 vrf CUSTOMER-A
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Routes Not Appearing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check source VRF has the route
show ip route vrf SHARED

# Check route target configuration
show vrf CUSTOMER-A
# Look for RT import/export

# Check prefix list matches
show policy prefix-list SHARED-SERVICES
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Connectivity Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Routes exist but traffic fails

# Check return path exists
show ip route vrf SHARED 10.1.0.0/24
# Shared must have route back to customer

# Check firewall allows cross-VRF
show firewall
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP VPN Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check RD uniqueness
show bgp ipv4 vpn summary

# Check RT import/export match
# Export RT on one side must match import RT on other
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Document Leaking Policy&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Route Leaking Policy

## VRF SHARED (10.100.0.0/16)
- Leaked TO: All customer VRFs
- Leaked FROM: None (customers can&apos;t reach each other via SHARED)

## VRF INTERNET (0.0.0.0/0)
- Leaked TO: Customer VRFs with internet package
- Leaked FROM: All (for return traffic)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Use Prefix Lists&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Never leak &quot;everything&quot;
# Explicit prefix lists only

set policy prefix-list LEAK-TO-CUSTOMER rule 10 prefix 10.100.0.0/24
set policy prefix-list LEAK-TO-CUSTOMER rule 10 action permit
set policy prefix-list LEAK-TO-CUSTOMER rule 20 prefix 10.100.1.0/24
set policy prefix-list LEAK-TO-CUSTOMER rule 20 action permit
set policy prefix-list LEAK-TO-CUSTOMER rule 999 prefix 0.0.0.0/0 le 32
set policy prefix-list LEAK-TO-CUSTOMER rule 999 action deny
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Consider Security&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Leaked routes bypass VRF isolation
# Firewall rules should exist at leak points

set firewall ipv4 name SHARED-TO-CUSTOMER default-action drop
set firewall ipv4 name SHARED-TO-CUSTOMER rule 10 action accept
set firewall ipv4 name SHARED-TO-CUSTOMER rule 10 destination port 53
set firewall ipv4 name SHARED-TO-CUSTOMER rule 10 protocol udp
# Only DNS allowed - everything else hits the explicit default drop
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Monitor Leaked Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Regular check that only expected routes are leaked
show ip route vrf CUSTOMER-A | grep &quot;vrf SHARED&quot;

# Alert if unexpected routes appear
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Route leaking provides controlled cross-VRF connectivity.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;VRFs isolate by default. Sometimes you need controlled exceptions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Shared DNS server&lt;/li&gt;
&lt;li&gt;Internet gateway&lt;/li&gt;
&lt;li&gt;Management access&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Route leaking gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Selective connectivity (only what you allow)&lt;/li&gt;
&lt;li&gt;Clear separation (customers still isolated from each other)&lt;/li&gt;
&lt;li&gt;Flexibility (import/export per VRF)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key word is &lt;em&gt;controlled&lt;/em&gt;. Leak only what&apos;s necessary. Use prefix lists. Verify bidirectional if needed.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VRFs without route leaking: perfect isolation&lt;/li&gt;
&lt;li&gt;VRFs with careless leaking: no isolation&lt;/li&gt;
&lt;li&gt;VRFs with careful leaking: controlled exceptions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Design your leaking policy before implementing. Document what goes where and why.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>BGP Communities: Signaling Intent Across Networks</title><link>https://ashimov.com/posts/vyos-bgp-communities/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-bgp-communities/</guid><description>Master BGP communities on VyOS. Covers standard, extended, and large communities, common use cases, community-based filtering, and why communities are the language networks speak.</description><pubDate>Fri, 09 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;BGP routes carry more than just prefix and next-hop. Communities are tags attached to routes, signaling intent to other networks. &quot;This route is for backup only.&quot; &quot;Prepend this route 3 times to peers.&quot; &quot;Don&apos;t announce this outside the region.&quot;&lt;/p&gt;
&lt;p&gt;Without communities, you&apos;d need separate sessions, manual filters, or constant coordination. With communities, you tag once, everyone who understands acts accordingly.&lt;/p&gt;
&lt;p&gt;Communities are the language networks speak to each other.&lt;/p&gt;
&lt;h2&gt;Community Types&lt;/h2&gt;
&lt;h3&gt;Standard Communities&lt;/h3&gt;
&lt;p&gt;32-bit value, formatted as &lt;code&gt;ASN:value&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;65000:100  - AS 65000, value 100
65000:1000 - AS 65000, value 1000
&lt;/code&gt;&lt;/pre&gt;
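&lt;p&gt;The &lt;code&gt;ASN:value&lt;/code&gt; notation is just a rendering of a single 32-bit integer: the ASN fills the high 16 bits and the value the low 16. A quick Python sketch of the packing (helper names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def encode_community(asn, value):
    # Pack the 16-bit ASN into the high half and the 16-bit value into the low half
    return asn * 65536 + value

def decode_community(raw):
    # divmod splits the packed integer back into its (ASN, value) halves
    return divmod(raw, 65536)

print(encode_community(65000, 100))   # 4259840100
print(decode_community(4259840100))   # (65000, 100)
&lt;/code&gt;&lt;/pre&gt;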
&lt;h3&gt;Well-Known Communities&lt;/h3&gt;
&lt;p&gt;Predefined, universal meaning:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Community&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;no-export&lt;/td&gt;
&lt;td&gt;65535:65281&lt;/td&gt;
&lt;td&gt;Don&apos;t export outside AS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no-advertise&lt;/td&gt;
&lt;td&gt;65535:65282&lt;/td&gt;
&lt;td&gt;Don&apos;t advertise to any peer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;local-as&lt;/td&gt;
&lt;td&gt;65535:65283&lt;/td&gt;
&lt;td&gt;Don&apos;t export outside local confederation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no-peer&lt;/td&gt;
&lt;td&gt;65535:65284&lt;/td&gt;
&lt;td&gt;Don&apos;t advertise to EBGP peers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
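&lt;p&gt;On VyOS, well-known communities attach like any other community; a minimal sketch, assuming your build accepts the symbolic name (the same form appears in the regional-filtering example later in this post):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Tag routes that must not leave the local AS
set policy route-map KEEP-LOCAL rule 10 action permit
set policy route-map KEEP-LOCAL rule 10 set community &quot;no-export&quot;

commit
&lt;/code&gt;&lt;/pre&gt;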
&lt;h3&gt;Extended Communities&lt;/h3&gt;
&lt;p&gt;64-bit, more structure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rt:65000:100     - Route Target
soo:65000:100    - Site of Origin
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Large Communities&lt;/h3&gt;
&lt;p&gt;96-bit for 4-byte ASNs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;4200000000:1:100  - Global Admin : Local Data 1 : Local Data 2
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Matching Communities&lt;/h2&gt;
&lt;h3&gt;Define Community List&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Match single community
set policy community-list BACKUP rule 10 action permit
set policy community-list BACKUP rule 10 regex &quot;65000:100&quot;

# Match any from a set (65000:100 through 65000:199)
set policy community-list CUSTOMER rule 10 action permit
set policy community-list CUSTOMER rule 10 regex &quot;65000:1[0-9][0-9]&quot;

# Match well-known
set policy community-list NO-EXPORT rule 10 action permit
set policy community-list NO-EXPORT rule 10 regex &quot;no-export&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use in Route Map&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Match routes with community
set policy route-map FILTER-IN rule 10 match community community-list BACKUP
set policy route-map FILTER-IN rule 10 action permit
set policy route-map FILTER-IN rule 10 set local-preference 50

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Setting Communities&lt;/h2&gt;
&lt;h3&gt;Add Community to Route&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Set community when advertising
set policy route-map SET-COMMUNITY rule 10 action permit
set policy route-map SET-COMMUNITY rule 10 set community &quot;65000:100&quot;

# Add community (keep existing)
set policy route-map ADD-COMMUNITY rule 10 action permit
set policy route-map ADD-COMMUNITY rule 10 set community &quot;65000:200 additive&quot;

# Set multiple communities
set policy route-map MULTI-COMMUNITY rule 10 action permit
set policy route-map MULTI-COMMUNITY rule 10 set community &quot;65000:100 65000:200&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Apply to Neighbor&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Apply route-map to neighbor
set protocols bgp neighbor 10.0.0.2 address-family ipv4-unicast route-map export SET-COMMUNITY
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Use Cases&lt;/h2&gt;
&lt;h3&gt;Use Case 1: Traffic Engineering&lt;/h3&gt;
&lt;p&gt;Tell upstream how to prefer your routes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Community convention with ISP:
# 65000:90 = set local-pref 90 (less preferred)
# 65000:100 = set local-pref 100 (normal)
# 65000:110 = set local-pref 110 (more preferred)

configure

# Mark backup link routes as less preferred
set policy route-map TO-ISP-BACKUP rule 10 action permit
set policy route-map TO-ISP-BACKUP rule 10 set community &quot;65000:90&quot;

set protocols bgp neighbor 10.0.0.2 address-family ipv4-unicast route-map export TO-ISP-BACKUP

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 2: Prepending Control&lt;/h3&gt;
&lt;p&gt;Ask upstream to prepend your routes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Community convention:
# 65000:3001 = prepend 1x to all peers
# 65000:3002 = prepend 2x to all peers
# 65000:3003 = prepend 3x to all peers

configure

# Request 2x prepend on backup routes
set policy route-map PREPEND-REQUEST rule 10 action permit
set policy route-map PREPEND-REQUEST rule 10 set community &quot;65000:3002&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 3: Regional Filtering&lt;/h3&gt;
&lt;p&gt;Announce only within region:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Community convention:
# 65000:1000 = US region
# 65000:2000 = EU region
# 65000:3000 = APAC region

configure

# Mark route as US-only
set policy route-map US-ONLY rule 10 action permit
set policy route-map US-ONLY rule 10 set community &quot;65000:1000 no-export&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 4: Customer vs Peer vs Transit&lt;/h3&gt;
&lt;p&gt;Tag routes by source:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Internal convention:
# 65000:100 = customer route
# 65000:200 = peer route
# 65000:300 = transit route

configure

# Tag customer routes
set policy route-map FROM-CUSTOMER rule 10 action permit
set policy route-map FROM-CUSTOMER rule 10 set community &quot;65000:100&quot;

# Use for policy decisions
set policy route-map TO-PEER rule 10 match community community-list CUSTOMER
set policy route-map TO-PEER rule 10 action permit
# Only advertise customer routes to peers

set policy route-map TO-PEER rule 20 action deny
# Everything else (peer and transit routes) is denied - no transit via peers

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Use Case 5: Blackhole&lt;/h3&gt;
&lt;p&gt;Signal upstream to blackhole traffic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Blackhole community: RFC 7999 reserves 65535:666, but many
# ISPs use their own ASN:666 - check with your provider

configure

# Mark route for blackholing
set policy route-map BLACKHOLE rule 10 action permit
set policy route-map BLACKHOLE rule 10 match ip address prefix-list ATTACK-PREFIX
set policy route-map BLACKHOLE rule 10 set community &quot;65000:666&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Extended Communities&lt;/h2&gt;
&lt;h3&gt;Route Targets (for VRF/VPN)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable the VPN address-family on the global BGP instance
set protocols bgp address-family ipv4-vpn

# Per-VRF RD and route targets (same form as in the route-leaking setup)
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast rd 65000:1
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target import 65000:1
set vrf name CUSTOMER-A protocols bgp address-family ipv4-unicast route-target export 65000:1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Site of Origin&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Prevent routing loops in multi-homed sites
# Routes from site won&apos;t be sent back to same site

set policy route-map SET-SOO rule 10 action permit
set policy route-map SET-SOO rule 10 set extcommunity soo &quot;65000:100&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Large Communities&lt;/h2&gt;
&lt;p&gt;For networks with 4-byte ASNs or needing more structure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define large community list
set policy large-community-list CUSTOMER rule 10 action permit
set policy large-community-list CUSTOMER rule 10 regex &quot;4200000000:1:.*&quot;

# Set large community
set policy route-map SET-LARGE rule 10 action permit
set policy route-map SET-LARGE rule 10 set large-community &quot;4200000000:1:100&quot;

# Match large community
set policy route-map MATCH-LARGE rule 10 match large-community large-community-list CUSTOMER

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing Communities&lt;/h2&gt;
&lt;h3&gt;Show Communities on Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show BGP routes with communities
show bgp ipv4 unicast community

# Show specific prefix with communities
show bgp ipv4 unicast 203.0.113.0/24

# Output includes:
# Community: 65000:100 65000:200

# Filter by community
show bgp ipv4 unicast community 65000:100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Show Community Lists&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show defined community lists
show policy community-list
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Stripping Communities&lt;/h2&gt;
&lt;h3&gt;Remove Specific Communities&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Delete specific community
set policy route-map STRIP-INTERNAL rule 10 action permit
set policy route-map STRIP-INTERNAL rule 10 set community delete community-list INTERNAL

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Remove All Communities&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Remove all communities (nuclear option)
set policy route-map STRIP-ALL rule 10 action permit
set policy route-map STRIP-ALL rule 10 set community none

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Community Design Principles&lt;/h2&gt;
&lt;h3&gt;1. Document Your Scheme&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Community Scheme for AS65000

## Route Type (65000:1xx)
- 65000:100 = Customer route
- 65000:110 = Peer route
- 65000:120 = Transit route

## Traffic Engineering (65000:2xx)
- 65000:200 = Normal preference
- 65000:210 = Higher preference
- 65000:220 = Lower preference

## Regional (65000:3xx)
- 65000:300 = All regions
- 65000:310 = US only
- 65000:320 = EU only

## Action Requests (65000:4xx)
- 65000:410 = Prepend 1x
- 65000:420 = Prepend 2x
- 65000:430 = Prepend 3x
- 65000:499 = Blackhole
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Use Consistent Patterns&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Good: Predictable scheme (matching the documented ranges above)
# 65000:1xx = route type
# 65000:2xx = preference
# 65000:3xx = regional
# 65000:4xx = actions

# Bad: Random assignment
# 65000:42 = customer
# 65000:7 = blackhole
# 65000:1234 = US
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Don&apos;t Trust External Communities&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Strip customer communities on ingress
set policy route-map FROM-CUSTOMER rule 1 action permit
set policy route-map FROM-CUSTOMER rule 1 set community delete community-list ALL-INTERNAL
set policy route-map FROM-CUSTOMER rule 1 set community &quot;65000:100 additive&quot;

# Then apply customer tag
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Communities are the language networks speak to each other.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without communities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Manual coordination for traffic engineering&lt;/li&gt;
&lt;li&gt;Separate sessions for different policies&lt;/li&gt;
&lt;li&gt;No way to signal intent across ASes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With communities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tag routes with meaning&lt;/li&gt;
&lt;li&gt;Upstream acts on tags automatically&lt;/li&gt;
&lt;li&gt;Complex policies become simple&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Design your community scheme before you need it. Document it. Use consistent numbering. Make it extensible.&lt;/p&gt;
&lt;p&gt;Communities scale your network&apos;s communication without scaling your operational overhead.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>networking</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Network Automation with Ansible: From Manual CLI to Infrastructure as Code</title><link>https://ashimov.com/posts/network-automation-ansible/</link><guid isPermaLink="true">https://ashimov.com/posts/network-automation-ansible/</guid><description>A practical guide to automating network infrastructure using Ansible. Real examples from production environments including device configuration, backup strategies, and compliance checking.</description><pubDate>Thu, 08 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After years of manually configuring network devices through CLI, I made the switch to network automation. The transformation wasn&apos;t just about saving time — it fundamentally changed how we approach network operations. Here&apos;s what I learned implementing Ansible-based automation across enterprise networks.&lt;/p&gt;
&lt;h2&gt;Why Network Automation Matters&lt;/h2&gt;
&lt;p&gt;The traditional approach to network management has serious limitations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Configuration drift&lt;/strong&gt;: Manual changes accumulate inconsistencies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human error&lt;/strong&gt;: Typos in CLI commands cause outages&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit challenges&lt;/strong&gt;: No clear record of who changed what and when&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slow disaster recovery&lt;/strong&gt;: Rebuilding configurations from scratch takes hours&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Automation addresses all of these while enabling practices that were previously impractical.&lt;/p&gt;
&lt;h2&gt;Starting with Ansible for Networks&lt;/h2&gt;
&lt;p&gt;Ansible works differently for network devices than for servers. Most network gear doesn&apos;t run Python, so Ansible connects via SSH and sends commands directly. This makes it accessible even for legacy equipment.&lt;/p&gt;
&lt;h3&gt;Basic Inventory Structure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# inventory/hosts.ini
[core_switches]
core-sw-01 ansible_host=10.0.1.1
core-sw-02 ansible_host=10.0.1.2

[access_switches]
# hostname range expands to access-sw-01 ... access-sw-24;
# per-host IPs come from DNS or host_vars (Jinja loops are not valid in inventory)
access-sw-[01:24]

[firewalls]
fw-primary ansible_host=10.0.0.1
fw-secondary ansible_host=10.0.0.2

[all:vars]
ansible_network_os=ios
ansible_connection=network_cli
ansible_user=automation
ansible_ssh_private_key_file=~/.ssh/network_automation
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Group Variables for Configuration Standards&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# group_vars/all.yml
---
ntp_servers:
  - 10.0.0.10
  - 10.0.0.11

dns_servers:
  - 10.0.0.20
  - 10.0.0.21

syslog_servers:
  - 10.0.0.30

snmp_community: &quot;{{ vault_snmp_community }}&quot;

banner_motd: |
  ********************************************
  *  AUTHORIZED ACCESS ONLY                  *
  *  All activity is monitored and logged    *
  ********************************************
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Practical Playbook Examples&lt;/h2&gt;
&lt;h3&gt;Configuration Backup&lt;/h3&gt;
&lt;p&gt;This is often the first automation task — backing up all device configurations nightly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# playbooks/backup_configs.yml
---
- name: Backup Network Device Configurations
  hosts: all
  gather_facts: no

  vars:
    backup_root: &quot;/backup/network/{{ inventory_hostname }}&quot;

  tasks:
    - name: Create backup directory
      delegate_to: localhost
      file:
        path: &quot;{{ backup_root }}&quot;
        state: directory

    - name: Get current configuration
      ios_command:
        commands: show running-config
      register: config_output

    - name: Save configuration to file
      delegate_to: localhost
      copy:
        content: &quot;{{ config_output.stdout[0] }}&quot;
        # strftime works without facts; ansible_date_time is undefined when gather_facts is off
        dest: &quot;{{ backup_root }}/{{ inventory_hostname }}_{{ &apos;%Y-%m-%d&apos; | strftime }}.cfg&quot;

    - name: Compare with previous backup
      delegate_to: localhost
      shell: |
        latest=$(ls -t {{ backup_root }}/*.cfg | head -2 | tail -1)
        current={{ backup_root }}/{{ inventory_hostname }}_{{ &apos;%Y-%m-%d&apos; | strftime }}.cfg
        if [ -f &quot;$latest&quot; ] &amp;amp;&amp;amp; [ &quot;$latest&quot; != &quot;$current&quot; ]; then
          diff -u &quot;$latest&quot; &quot;$current&quot; || true
        fi
      register: config_diff
      changed_when: config_diff.stdout != &quot;&quot;

    - name: Send notification if config changed
      delegate_to: localhost
      mail:
        to: netops@company.com
        subject: &quot;Config change detected: {{ inventory_hostname }}&quot;
        body: &quot;{{ config_diff.stdout }}&quot;
      when: config_diff.changed
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Standardizing NTP Configuration&lt;/h3&gt;
&lt;p&gt;Ensuring all devices use the correct time servers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# playbooks/configure_ntp.yml
---
- name: Standardize NTP Configuration
  hosts: all
  gather_facts: no

  tasks:
    - name: Remove existing NTP servers
      ios_config:
        lines:
          - no ntp server {{ item }}
      loop: &quot;{{ existing_ntp_servers | default([]) }}&quot;

    - name: Configure standard NTP servers
      ios_config:
        lines:
          - ntp server {{ item }} prefer
      loop: &quot;{{ ntp_servers }}&quot;

    - name: Set timezone
      ios_config:
        lines:
          - clock timezone UTC 0

    - name: Verify NTP synchronization
      ios_command:
        commands: show ntp status
      register: ntp_status

    - name: Report NTP status
      debug:
        msg: &quot;NTP synchronized: {{ &apos;Clock is synchronized&apos; in ntp_status.stdout[0] }}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VLAN Deployment Across Multiple Switches&lt;/h3&gt;
&lt;p&gt;When adding a new VLAN to multiple devices:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# playbooks/deploy_vlan.yml
---
- name: Deploy VLAN to Access Switches
  hosts: access_switches
  gather_facts: no

  vars_prompt:
    - name: vlan_id
      prompt: &quot;Enter VLAN ID&quot;
      private: no
    - name: vlan_name
      prompt: &quot;Enter VLAN name&quot;
      private: no

  tasks:
    - name: Create VLAN
      ios_vlans:
        config:
          - vlan_id: &quot;{{ vlan_id | int }}&quot;
            name: &quot;{{ vlan_name }}&quot;
            state: active
        state: merged

    - name: Verify VLAN creation
      ios_command:
        commands: &quot;show vlan id {{ vlan_id }}&quot;
      register: vlan_check

    - name: Display result
      debug:
        var: vlan_check.stdout_lines
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Compliance Checking&lt;/h2&gt;
&lt;p&gt;Beyond configuration, Ansible can verify devices meet security standards:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# playbooks/compliance_check.yml
---
- name: Security Compliance Audit
  hosts: all
  gather_facts: no

  vars:
    compliance_report: []

  tasks:
    - name: Check SSH version 2
      ios_command:
        commands: show ip ssh
      register: ssh_config

    - name: Verify SSHv2 enabled
      set_fact:
        compliance_report: &quot;{{ compliance_report + [{&apos;check&apos;: &apos;SSH Version&apos;, &apos;status&apos;: &apos;PASS&apos; if &apos;SSH version 2&apos; in ssh_config.stdout[0] else &apos;FAIL&apos;}] }}&quot;

    - name: Check password encryption
      ios_command:
        commands: show running-config | include service password
      register: password_config

    - name: Verify password encryption
      set_fact:
        compliance_report: &quot;{{ compliance_report + [{&apos;check&apos;: &apos;Password Encryption&apos;, &apos;status&apos;: &apos;PASS&apos; if &apos;service password-encryption&apos; in password_config.stdout[0] else &apos;FAIL&apos;}] }}&quot;

    - name: Check login banner
      ios_command:
        commands: show running-config | section banner
      register: banner_config

    - name: Verify login banner exists
      set_fact:
        compliance_report: &quot;{{ compliance_report + [{&apos;check&apos;: &apos;Login Banner&apos;, &apos;status&apos;: &apos;PASS&apos; if banner_config.stdout[0] | length &amp;gt; 10 else &apos;FAIL&apos;}] }}&quot;

    - name: Check unused ports disabled
      ios_command:
        commands: show interfaces status | include notconnect
      register: unused_ports

    - name: Generate compliance report
      delegate_to: localhost
      template:
        src: compliance_report.j2
        dest: &quot;/reports/compliance/{{ inventory_hostname }}_{{ &apos;%Y-%m-%d&apos; | strftime }}.html&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Handling Different Vendors&lt;/h2&gt;
&lt;p&gt;Real networks have equipment from multiple vendors. Ansible handles this with platform-specific modules:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# playbooks/multi_vendor_backup.yml
---
- name: Multi-Vendor Configuration Backup
  hosts: all
  gather_facts: no

  tasks:
    - name: Backup Cisco IOS
      ios_command:
        commands: show running-config
      register: ios_out
      when: ansible_network_os == &apos;ios&apos;

    - name: Backup Cisco NX-OS
      nxos_command:
        commands: show running-config
      register: nxos_out
      when: ansible_network_os == &apos;nxos&apos;

    - name: Backup Juniper JunOS
      junos_command:
        commands: show configuration
      register: junos_out
      when: ansible_network_os == &apos;junos&apos;

    - name: Backup Arista EOS
      eos_command:
        commands: show running-config
      register: eos_out
      when: ansible_network_os == &apos;eos&apos;

    # register fires even for skipped tasks, so sharing &apos;register: config&apos;
    # across the four tasks would let a skipped task clobber the real output;
    # pick whichever platform task actually produced stdout
    - name: Select platform output
      set_fact:
        config: &quot;{{ [ios_out, nxos_out, junos_out, eos_out] | selectattr(&apos;stdout&apos;, &apos;defined&apos;) | first }}&quot;

    - name: Save configuration
      delegate_to: localhost
      copy:
        content: &quot;{{ config.stdout[0] }}&quot;
        dest: &quot;/backup/{{ inventory_hostname }}.cfg&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Error Handling and Rollback&lt;/h2&gt;
&lt;p&gt;Network changes need careful error handling:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# playbooks/safe_config_change.yml
---
- name: Safe Configuration Change with Rollback
  hosts: &quot;{{ target_device }}&quot;
  gather_facts: no
  serial: 1

  tasks:
    - name: Backup current config
      ios_command:
        commands: show running-config
      register: backup_config

    - name: Save backup locally
      delegate_to: localhost
      copy:
        content: &quot;{{ backup_config.stdout[0] }}&quot;
        dest: &quot;/tmp/{{ inventory_hostname }}_rollback.cfg&quot;

    - name: Apply configuration changes
      ios_config:
        src: &quot;{{ config_template }}&quot;
        save_when: never
      register: config_result

    - name: Verify connectivity after change
      wait_for:
        host: &quot;{{ ansible_host }}&quot;
        port: 22
        timeout: 30
      delegate_to: localhost
      register: connectivity
      ignore_errors: yes

    - name: Rollback if connectivity lost
      block:
        - name: Wait for device recovery
          wait_for:
            host: &quot;{{ ansible_host }}&quot;
            port: 22
            timeout: 300
          delegate_to: localhost

        - name: Restore previous configuration
          ios_config:
            src: &quot;/tmp/{{ inventory_hostname }}_rollback.cfg&quot;

        - name: Notify about rollback
          mail:
            to: netops@company.com
            subject: &quot;ROLLBACK: {{ inventory_hostname }}&quot;
            body: &quot;Configuration change failed and was rolled back&quot;
          delegate_to: localhost
      when: connectivity is failed

    - name: Save configuration if successful
      ios_command:
        commands: write memory
      when: connectivity is succeeded
&lt;/code&gt;&lt;/pre&gt;
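&lt;p&gt;Because the play targets &lt;code&gt;{{ target_device }}&lt;/code&gt; and reads &lt;code&gt;{{ config_template }}&lt;/code&gt;, both are supplied at run time; a typical invocation (the template path is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ansible-playbook playbooks/safe_config_change.yml \
  -e target_device=core-sw-01 \
  -e config_template=templates/ntp_update.cfg
&lt;/code&gt;&lt;/pre&gt;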
&lt;h2&gt;Integration with CI/CD&lt;/h2&gt;
&lt;p&gt;Network changes can flow through the same CI/CD pipelines as application code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# .gitlab-ci.yml
stages:
  - validate
  - test
  - deploy

validate_syntax:
  stage: validate
  script:
    - ansible-playbook --syntax-check playbooks/*.yml
    - ansible-lint playbooks/

test_in_lab:
  stage: test
  script:
    - ansible-playbook -i inventory/lab playbooks/deploy_changes.yml
    - ansible-playbook -i inventory/lab playbooks/run_tests.yml
  environment:
    name: lab

deploy_production:
  stage: deploy
  script:
    - ansible-playbook -i inventory/production playbooks/deploy_changes.yml
  environment:
    name: production
  when: manual
  only:
    - main
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Lessons Learned&lt;/h2&gt;
&lt;p&gt;After implementing automation across several networks:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Start small&lt;/strong&gt;: Begin with read-only tasks like backups and compliance checks. Build confidence before making changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Version control everything&lt;/strong&gt;: Playbooks, inventory, variables — all in Git. This provides audit trail and enables code review for network changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test in lab first&lt;/strong&gt;: Even simple playbooks can have unexpected effects. A lab environment (even virtual) is essential.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use check mode&lt;/strong&gt;: Always run with &lt;code&gt;--check&lt;/code&gt; first in production to see what would change.&lt;/p&gt;
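&lt;p&gt;In practice, that dry run can look like this (the playbook and inventory paths echo the CI/CD example below; the device name &lt;code&gt;core-sw-01&lt;/code&gt; is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Preview what would change, without touching devices
ansible-playbook -i inventory/production playbooks/deploy_changes.yml --check --diff

# Then run for real, against a single device first
ansible-playbook -i inventory/production playbooks/deploy_changes.yml --limit core-sw-01
&lt;/code&gt;&lt;/pre&gt;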
&lt;p&gt;&lt;strong&gt;Document the manual fallback&lt;/strong&gt;: Automation will fail eventually. Document how to perform critical tasks manually.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monitor playbook execution&lt;/strong&gt;: Log all runs, track success rates, alert on failures.&lt;/p&gt;
&lt;h2&gt;The Bigger Picture&lt;/h2&gt;
&lt;p&gt;Network automation isn&apos;t just about efficiency — it&apos;s about treating network infrastructure with the same rigor as application code. When configurations are versioned, tested, and deployed through pipelines, networks become more reliable and easier to manage.&lt;/p&gt;
&lt;p&gt;The shift from &quot;network engineer who uses CLI&quot; to &quot;network engineer who writes code&quot; isn&apos;t always comfortable, but it&apos;s increasingly necessary. The skills transfer: understanding of networking fundamentals remains essential; you&apos;re just expressing that knowledge differently.&lt;/p&gt;
&lt;p&gt;Start with backups. Move to compliance checking. Then tackle configuration standardization. Each step builds confidence for the next. Before long, you&apos;ll wonder how you managed without it.&lt;/p&gt;
</content:encoded><category>automation</category><category>ansible</category><category>networking</category><category>devops</category><category>infrastructure</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Building Reliable Infrastructure: Lessons from 15 Years in Operations</title><link>https://ashimov.com/posts/Building%20Reliable%20Infrastructure:%20Lessons%20from%2015%20Years%20in%20Operations/</link><guid isPermaLink="true">https://ashimov.com/posts/Building%20Reliable%20Infrastructure:%20Lessons%20from%2015%20Years%20in%20Operations/</guid><description>Key principles for building infrastructure that survives failures, scales gracefully, and lets you sleep at night. Real lessons from production environments.</description><pubDate>Wed, 07 Jan 2026 00:00:00 GMT</pubDate><content:encoded>
&lt;p&gt;After 15 years of managing production infrastructure — from small business servers to enterprise payment systems with 99.99% uptime requirements — I&apos;ve learned that reliability isn&apos;t about preventing failures. It&apos;s about designing systems that handle failures gracefully.&lt;/p&gt;
&lt;h2&gt;The Three Pillars of Reliable Infrastructure&lt;/h2&gt;
&lt;p&gt;Every reliable system I&apos;ve built or maintained rests on three pillars:&lt;/p&gt;
&lt;h3&gt;1. Eliminate Single Points of Failure&lt;/h3&gt;
&lt;p&gt;The question isn&apos;t &quot;will this component fail?&quot; but &quot;when it fails, what happens?&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Network layer:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dual uplinks with BGP or VRRP failover&lt;/li&gt;
&lt;li&gt;Redundant switches in stack or MLAG configuration&lt;/li&gt;
&lt;li&gt;Multiple paths to critical services&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Compute layer:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VM anti-affinity rules across hypervisor hosts&lt;/li&gt;
&lt;li&gt;Database replicas in different failure domains&lt;/li&gt;
&lt;li&gt;Load balancers in active-passive or active-active pairs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Storage layer:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;RAID for local storage (RAID 10 for databases)&lt;/li&gt;
&lt;li&gt;Replicated storage backends&lt;/li&gt;
&lt;li&gt;Off-site backups with tested recovery procedures&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Common Mistake:&lt;/strong&gt; Redundancy without automatic failover is just expensive complexity. If someone needs to SSH in at 3 AM to switch traffic, your redundancy has failed.&lt;/p&gt;
&lt;h3&gt;2. Monitor What Matters&lt;/h3&gt;
&lt;p&gt;I&apos;ve seen monitoring setups with 10,000 metrics where teams still miss critical outages. The problem isn&apos;t lack of data — it&apos;s lack of focus.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Effective monitoring hierarchy:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Business metrics (revenue, user transactions)
    ↓
Service metrics (latency, error rates, throughput)
    ↓
Infrastructure metrics (CPU, memory, disk, network)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start from the top. If business metrics are healthy, infrastructure alerts can wait. If business metrics drop, you need to know immediately — even if all infrastructure metrics look green.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key principles:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Alert on symptoms, not causes&lt;/li&gt;
&lt;li&gt;Every alert should be actionable&lt;/li&gt;
&lt;li&gt;If you ignore an alert twice, fix or delete it&lt;/li&gt;
&lt;li&gt;On-call should get fewer than 5 pages per week&lt;/li&gt;
&lt;/ul&gt;
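&lt;p&gt;As an illustration, a symptom-level alert in Prometheus might look like this (the metric names are assumptions about your instrumentation, not a prescription):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;groups:
  - name: symptoms
    rules:
      - alert: HighErrorRate
        # Symptom: users are seeing errors - regardless of which component caused them
        expr: sum(rate(http_requests_total{status=~&quot;5..&quot;}[5m]))
              / sum(rate(http_requests_total[5m])) &gt; 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: &quot;Error rate above 1% for 5 minutes&quot;
&lt;/code&gt;&lt;/pre&gt;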
&lt;h3&gt;3. Standardize Everything&lt;/h3&gt;
&lt;p&gt;The most reliable environments I&apos;ve managed weren&apos;t the most sophisticated — they were the most boring. Same OS version everywhere. Same configuration management. Same deployment process.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What to standardize:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Base OS images with security hardening&lt;/li&gt;
&lt;li&gt;Network configurations (use templates)&lt;/li&gt;
&lt;li&gt;Monitoring and logging agents&lt;/li&gt;
&lt;li&gt;Backup schedules and retention&lt;/li&gt;
&lt;li&gt;Patch management windows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Faster troubleshooting (you&apos;ve seen this before)&lt;/li&gt;
&lt;li&gt;Easier automation (one playbook fits all)&lt;/li&gt;
&lt;li&gt;Reduced cognitive load (less to remember)&lt;/li&gt;
&lt;li&gt;Simpler compliance (consistent baselines)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Real-World Example: Payment System Architecture&lt;/h2&gt;
&lt;p&gt;One system I helped design processes financial transactions across two data centers. Here&apos;s what makes it reliable:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Primary&lt;/th&gt;
&lt;th&gt;Failover&lt;/th&gt;
&lt;th&gt;RTO&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;DC1 (sync replica)&lt;/td&gt;
&lt;td&gt;DC2 (async replica)&lt;/td&gt;
&lt;td&gt;&amp;lt; 30s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Application&lt;/td&gt;
&lt;td&gt;Active-active&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Balancer&lt;/td&gt;
&lt;td&gt;DC1&lt;/td&gt;
&lt;td&gt;DC2 (DNS failover)&lt;/td&gt;
&lt;td&gt;&amp;lt; 60s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;ISP A + ISP B&lt;/td&gt;
&lt;td&gt;BGP rerouting&lt;/td&gt;
&lt;td&gt;&amp;lt; 10s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Key design decisions:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Synchronous replication&lt;/strong&gt; for transactions (data consistency over availability)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Asynchronous replication&lt;/strong&gt; for reporting databases (availability over consistency)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Health checks every 5 seconds&lt;/strong&gt; with 3 failures before failover&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated failover&lt;/strong&gt; for network and load balancers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manual failover&lt;/strong&gt; for database (intentional — prevents split-brain)&lt;/li&gt;
&lt;/ol&gt;
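&lt;p&gt;The health-check cadence in point 3 maps directly onto load-balancer configuration. In HAProxy terms it might look like this (backend and server names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;backend payments
    # probe every 5 seconds; mark down after 3 consecutive failures
    server app-dc1 10.0.1.10:8443 check inter 5s fall 3 rise 2
    server app-dc2 10.0.2.10:8443 check inter 5s fall 3 rise 2
&lt;/code&gt;&lt;/pre&gt;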
&lt;p&gt;&lt;strong&gt;The 99.99% Reality:&lt;/strong&gt; 99.99% uptime means less than 53 minutes of downtime per year. That&apos;s about 4 minutes per month. Every design decision must account for this budget.&lt;/p&gt;
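&lt;p&gt;The budget arithmetic is worth writing out:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;525,600 minutes/year × (1 − 0.9999) ≈ 52.6 minutes/year
52.6 minutes/year ÷ 12 ≈ 4.4 minutes/month
&lt;/code&gt;&lt;/pre&gt;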
&lt;h2&gt;Operational Practices That Actually Work&lt;/h2&gt;
&lt;p&gt;Beyond architecture, these practices have saved me countless times:&lt;/p&gt;
&lt;h3&gt;Change Management&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;No changes on Fridays (or before holidays)&lt;/li&gt;
&lt;li&gt;All changes documented and reversible&lt;/li&gt;
&lt;li&gt;Staged rollouts: dev → staging → canary → production&lt;/li&gt;
&lt;li&gt;Post-change monitoring period (15-30 minutes minimum)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Incident Response&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Clear escalation paths documented&lt;/li&gt;
&lt;li&gt;Runbooks for common failures (not just &quot;restart the service&quot;)&lt;/li&gt;
&lt;li&gt;Blameless postmortems focused on systemic improvements&lt;/li&gt;
&lt;li&gt;Regular disaster recovery drills (at least quarterly)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Capacity Planning&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Track growth trends monthly&lt;/li&gt;
&lt;li&gt;Provision for 2x expected peak load&lt;/li&gt;
&lt;li&gt;Set alerts at 70% capacity (time to plan expansion)&lt;/li&gt;
&lt;li&gt;Review capacity quarterly with business stakeholders&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What I&apos;ve Learned&lt;/h2&gt;
&lt;p&gt;If I could summarize 15 years into a few principles:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Simple systems fail less&lt;/strong&gt; — Every component is a potential failure point&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automation saves you at 3 AM&lt;/strong&gt; — If it&apos;s not automated, it won&apos;t happen correctly under pressure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation is for future you&lt;/strong&gt; — Write it like you&apos;ll be on vacation when things break&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test your backups&lt;/strong&gt; — Untested backups are just hopes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Learn from every incident&lt;/strong&gt; — The same failure twice is an organizational failure&lt;/li&gt;
&lt;/ol&gt;
&lt;hr /&gt;
&lt;p&gt;This is the first post on my new blog. I&apos;ll be sharing more operational knowledge — monitoring setups, network automation, security practices, and lessons from real incidents.&lt;/p&gt;
&lt;p&gt;Questions or topics you&apos;d like me to cover? Reach out on &lt;a href=&quot;https://www.linkedin.com/in/berik-ashimov/&quot;&gt;LinkedIn&lt;/a&gt; or &lt;a href=&quot;https://t.me/B3r1k&quot;&gt;Telegram&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>infrastructure</category><category>reliability</category><category>operations</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Graceful Restart: Maintaining Forwarding During Protocol Restarts</title><link>https://ashimov.com/posts/vyos-graceful-restart/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-graceful-restart/</guid><description>Configure OSPF and BGP graceful restart on VyOS. Covers GR mechanics, helper mode, restart timers, and why graceful restart prevents traffic loss during maintenance.</description><pubDate>Tue, 06 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;You need to restart the routing daemon. Maybe for an upgrade, maybe for a config reload. Normal behavior: neighbors detect the restart, withdraw routes, traffic reroutes. Convergence takes seconds to minutes.&lt;/p&gt;
&lt;p&gt;Graceful restart keeps forwarding while the control plane restarts. Neighbors know you&apos;re restarting (not dead) and keep routes. Data plane continues forwarding. After restart, routing state resynchronizes. No traffic loss.&lt;/p&gt;
&lt;p&gt;Graceful restart prevents traffic loss during planned maintenance.&lt;/p&gt;
&lt;h2&gt;How Graceful Restart Works&lt;/h2&gt;
&lt;h3&gt;Normal Restart (Without GR)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Router A routing daemon restarts
2. Router B detects adjacency down
3. Router B withdraws all routes from A
4. Traffic reconverges to alternate paths
5. Router A comes back up
6. Adjacency re-established
7. Routes re-learned
8. Traffic returns to original path

Impact: Minutes of reconvergence, possible blackhole
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Graceful Restart&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Router A signals &quot;entering graceful restart&quot;
2. Router A daemon restarts, forwarding plane continues
3. Router B (helper) keeps routes, marks them &quot;stale&quot;
4. Router A comes back up quickly
5. Router A re-establishes adjacency, refreshes routes
6. Router B removes &quot;stale&quot; flag
7. No route withdrawal, no reconvergence

Impact: Near-zero traffic disruption
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;BGP Graceful Restart&lt;/h2&gt;
&lt;h3&gt;Basic Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable graceful restart for BGP
set protocols bgp parameters graceful-restart

# Optional: Set restart time (how long helper waits)
set protocols bgp parameters graceful-restart restart-time 120

# Optional: Set stalepath time (how long to keep stale routes)
set protocols bgp parameters graceful-restart stalepath-time 360

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Per-Neighbor Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable/disable per neighbor
set protocols bgp neighbor 10.0.0.2 capability graceful-restart

# Some neighbors might not support GR
# Disable for specific neighbor:
set protocols bgp neighbor 10.0.0.3 capability graceful-restart disable
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP GR Timers&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Timer&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;restart-time&lt;/td&gt;
&lt;td&gt;Time helper waits for restart&lt;/td&gt;
&lt;td&gt;120s&lt;/td&gt;
&lt;td&gt;1-4095s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stalepath-time&lt;/td&gt;
&lt;td&gt;Time to keep stale routes&lt;/td&gt;
&lt;td&gt;360s&lt;/td&gt;
&lt;td&gt;1-4095s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;# Adjust timers
set protocols bgp parameters graceful-restart restart-time 180
set protocols bgp parameters graceful-restart stalepath-time 600
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify BGP GR&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check neighbor capabilities
show bgp neighbors 10.0.0.2

# Look for:
# Graceful Restart Capability: advertised and received
# Remote Restart timer is 120 seconds
# Address families by peer:
#   IPv4 Unicast(Preserved)

# Check current GR state
show bgp neighbors 10.0.0.2 graceful-restart
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;OSPF Graceful Restart&lt;/h2&gt;
&lt;h3&gt;Basic Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable OSPF graceful restart
set protocols ospf graceful-restart

# Set grace period
set protocols ospf graceful-restart grace-period 180

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF GR Helper Mode&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Helper mode (support other routers restarting)
set protocols ospf graceful-restart helper enable

# Can restrict helper mode
set protocols ospf graceful-restart helper strict-lsa-checking
# If LSA changes during restart, exit GR (safer)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF GR Timers&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Timer&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;grace-period&lt;/td&gt;
&lt;td&gt;Time to complete restart&lt;/td&gt;
&lt;td&gt;180s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;# Adjust grace period
set protocols ospf graceful-restart grace-period 300
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify OSPF GR&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check OSPF graceful restart status
show ip ospf graceful-restart

# Check neighbor state during restart
show ip ospf neighbor

# During GR, neighbor might show special state
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Testing Graceful Restart&lt;/h2&gt;
&lt;h3&gt;Test BGP GR&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Terminal 1: Watch BGP neighbor
watch -n 1 &apos;vtysh -c &quot;show bgp neighbors 10.0.0.2&quot;&apos;

# Terminal 2: Restart BGP
systemctl restart frr

# Observe:
# - Neighbor should stay established (or show &quot;Restart&quot;)
# - Routes should not be withdrawn
# - Quick re-establishment
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test OSPF GR&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Terminal 1: Watch OSPF neighbor
watch -n 1 &apos;vtysh -c &quot;show ip ospf neighbor&quot;&apos;

# Terminal 2: Restart OSPF
systemctl restart frr

# Observe:
# - Neighbor should not go Down
# - Routes should persist
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify Forwarding Continues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# From another host, continuous ping through router
ping -i 0.1 destination-through-router

# During restart:
# Without GR: Packet loss during convergence
# With GR: Zero or minimal packet loss
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Long-Lived Graceful Restart (LLGR)&lt;/h2&gt;
&lt;p&gt;For BGP, LLGR extends stale route retention:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable LLGR
set protocols bgp parameters graceful-restart long-lived

# Set LLGR stale time (much longer than regular)
set protocols bgp parameters graceful-restart long-lived stale-time 86400

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;LLGR keeps routes even longer, tagged with the LLGR_STALE community so peers treat them with lower preference. Useful for edge cases where a restart takes a very long time.&lt;/p&gt;
&lt;h2&gt;Graceful Restart vs BFD&lt;/h2&gt;
&lt;p&gt;They serve different purposes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Graceful Restart&lt;/th&gt;
&lt;th&gt;BFD&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Purpose&lt;/td&gt;
&lt;td&gt;Survive planned restarts&lt;/td&gt;
&lt;td&gt;Detect failures fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trigger&lt;/td&gt;
&lt;td&gt;Control plane restart&lt;/td&gt;
&lt;td&gt;Link/peer failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response&lt;/td&gt;
&lt;td&gt;Keep routes&lt;/td&gt;
&lt;td&gt;Withdraw routes fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use together&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;# Use both!
# BFD: Detect actual failures quickly
# GR: Survive planned restarts

set protocols bgp neighbor 10.0.0.2 bfd
set protocols bgp parameters graceful-restart
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;When GR Doesn&apos;t Help&lt;/h2&gt;
&lt;h3&gt;Unplanned Failures&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Router crashes (not graceful)
# Forwarding plane also fails
# GR signal never sent

# Solution: BFD detects quickly, traffic reroutes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Forwarding Plane Restart&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If forwarding (kernel/hardware) restarts:
# GR won&apos;t help - traffic still disrupted

# GR only helps when:
# - Control plane (FRR) restarts
# - Forwarding (kernel routes) continues
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configuration Changes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Major config change might require route refresh anyway
# GR preserves old routes, but new config applies

# Be careful: GR might keep stale config briefly
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting GR&lt;/h2&gt;
&lt;h3&gt;GR Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check if GR capability exchanged
show bgp neighbors 10.0.0.2 | grep -i graceful

# If &quot;not received&quot;:
# - Peer doesn&apos;t support GR
# - Peer has GR disabled

# Check OSPF GR status
show ip ospf graceful-restart

# If disabled, check config:
show configuration commands | grep graceful
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Routes Withdrawn Anyway&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Possible causes:
# 1. Restart took too long (exceeded timer)
# 2. Helper router cleared routes
# 3. GR not properly negotiated

# Check timers
show bgp neighbors 10.0.0.2 | grep -i timer
show bgp neighbors 10.0.0.2 | grep -i restart

# Increase restart-time if needed
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Helper Not Preserving Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check helper configuration
show configuration commands | grep helper

# OSPF might need explicit helper mode
set protocols ospf graceful-restart helper enable
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Enable on All Routers&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# GR is peer-to-peer negotiation
# Both sides should have it enabled

# Without GR on peer:
# - Your restart withdraws routes from peer
# - Peer&apos;s restart withdraws routes from you
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Test Before Production&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Test GR in lab/staging
# Verify:
# - Capabilities exchanged
# - Routes preserved during restart
# - Forwarding continues
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Monitor During Maintenance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# During planned restart, monitor:
show bgp summary
show ip ospf neighbor

# Watch for state changes
# Verify quick re-establishment
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Tune Timers for Your Environment&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Fast restart (SSD, modern hardware)
set protocols bgp parameters graceful-restart restart-time 60

# Slow restart (older hardware, large config)
set protocols bgp parameters graceful-restart restart-time 300
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Configuration Summary&lt;/h2&gt;
&lt;h3&gt;BGP Graceful Restart&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Basic GR
set protocols bgp parameters graceful-restart

# Timers
set protocols bgp parameters graceful-restart restart-time 120
set protocols bgp parameters graceful-restart stalepath-time 360

# Per-neighbor (optional)
set protocols bgp neighbor 10.0.0.2 capability graceful-restart

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF Graceful Restart&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Basic GR
set protocols ospf graceful-restart
set protocols ospf graceful-restart grace-period 180

# Helper mode
set protocols ospf graceful-restart helper enable

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Graceful restart prevents traffic loss during planned maintenance.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without GR:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Daemon restart = all routes withdrawn&lt;/li&gt;
&lt;li&gt;Traffic reconverges (seconds to minutes)&lt;/li&gt;
&lt;li&gt;Users see disruption&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With GR:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Daemon restart signaled to neighbors&lt;/li&gt;
&lt;li&gt;Neighbors keep routes (marked stale)&lt;/li&gt;
&lt;li&gt;Forwarding continues&lt;/li&gt;
&lt;li&gt;Daemon comes back, routes refreshed&lt;/li&gt;
&lt;li&gt;Users notice nothing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every production router should have graceful restart enabled. It&apos;s free insurance for maintenance windows.&lt;/p&gt;
&lt;p&gt;The 30 seconds you spend configuring GR saves minutes of disruption every time you restart the routing daemon.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>ha</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>BFD: Fast Failover Detection for Routing Protocols</title><link>https://ashimov.com/posts/vyos-bfd/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-bfd/</guid><description>Implement BFD on VyOS for sub-second failure detection. Covers BFD timers, integration with BGP and OSPF, multihop BFD, and why routing protocol keepalives are too slow.</description><pubDate>Fri, 02 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;BGP default keepalive: 60 seconds. Hold time: 180 seconds. That&apos;s 3 minutes before your router notices a peer is dead. Three minutes of blackholing traffic.&lt;/p&gt;
&lt;p&gt;OSPF default dead interval: 40 seconds. Better, but still 40 seconds of packets going nowhere.&lt;/p&gt;
&lt;p&gt;BFD (Bidirectional Forwarding Detection) runs alongside routing protocols, detecting failures in milliseconds. When BFD sees the neighbor is dead, it tells BGP/OSPF immediately. Failover happens in under a second.&lt;/p&gt;
&lt;p&gt;Routing protocol keepalives are too slow. BFD fixes this.&lt;/p&gt;
&lt;h2&gt;How BFD Works&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Normal state:
Router A ←→ Router B
BFD packets every 300ms
Both routers: &quot;Peer is alive&quot;

Failure:
Router A → X → Router B (link fails)
Router A: No BFD response for 900ms (3 × 300ms)
Router A: &quot;Peer is dead, notify BGP/OSPF&quot;
BGP/OSPF: Immediately withdraw routes
Total detection time: ~1 second
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;BFD is protocol-independent. It just says &quot;neighbor reachable&quot; or &quot;neighbor unreachable.&quot; Routing protocols react to this signal.&lt;/p&gt;
&lt;h2&gt;BFD Timers&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Typical Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;interval&lt;/td&gt;
&lt;td&gt;Time between BFD packets&lt;/td&gt;
&lt;td&gt;300ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;min-rx&lt;/td&gt;
&lt;td&gt;Minimum receive interval&lt;/td&gt;
&lt;td&gt;300ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;multiplier&lt;/td&gt;
&lt;td&gt;Missed packets before failure&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Detection time = interval × multiplier = 300ms × 3 = 900ms&lt;/p&gt;
&lt;h2&gt;Basic BFD Configuration&lt;/h2&gt;
&lt;h3&gt;Enable BFD Globally&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define BFD profile
set protocols bfd profile FAST interval 300
set protocols bfd profile FAST min-rx 300
set protocols bfd profile FAST multiplier 3

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BFD with BGP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Configure BGP neighbor
set protocols bgp neighbor 10.0.0.2 remote-as 65002
set protocols bgp neighbor 10.0.0.2 address-family ipv4-unicast

# Enable BFD for this neighbor
set protocols bgp neighbor 10.0.0.2 bfd

# Or with specific profile
set protocols bgp neighbor 10.0.0.2 bfd profile FAST

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BFD with OSPF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Configure OSPF
set protocols ospf area 0 network 10.0.0.0/24

# Enable BFD for all OSPF neighbors (interface level)
set protocols ospf interface eth0 bfd

# Or enable globally for all OSPF interfaces
set protocols ospf parameters bfd all-interfaces

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Multihop BFD&lt;/h2&gt;
&lt;p&gt;For eBGP peers not directly connected:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Multihop BGP neighbor
set protocols bgp neighbor 192.0.2.1 remote-as 65100
set protocols bgp neighbor 192.0.2.1 ebgp-multihop 3

# Multihop BFD (specify source)
set protocols bfd peer 192.0.2.1 source address 198.51.100.1
set protocols bfd peer 192.0.2.1 multihop
set protocols bfd peer 192.0.2.1 profile FAST

# Link BGP to BFD peer
set protocols bgp neighbor 192.0.2.1 bfd

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;BFD Profiles&lt;/h2&gt;
&lt;p&gt;Create profiles for different use cases:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Aggressive (datacenter, low latency links)
set protocols bfd profile AGGRESSIVE interval 100
set protocols bfd profile AGGRESSIVE min-rx 100
set protocols bfd profile AGGRESSIVE multiplier 3
# Detection: 300ms

# Standard (most links)
set protocols bfd profile STANDARD interval 300
set protocols bfd profile STANDARD min-rx 300
set protocols bfd profile STANDARD multiplier 3
# Detection: 900ms

# Conservative (unstable links, prevent flapping)
set protocols bfd profile CONSERVATIVE interval 1000
set protocols bfd profile CONSERVATIVE min-rx 1000
set protocols bfd profile CONSERVATIVE multiplier 5
# Detection: 5 seconds

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Apply Profiles&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# BGP neighbor with specific profile
set protocols bgp neighbor 10.0.0.2 bfd profile AGGRESSIVE

# OSPF interface with specific profile
set protocols ospf interface eth0 bfd profile STANDARD
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring BFD&lt;/h2&gt;
&lt;h3&gt;View BFD Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all BFD peers
show bfd peers

# Output:
# BFD Peers:
#     peer 10.0.0.2
#         ID: 1234567890
#         Status: up
#         Uptime: 2 hours 15 minutes
#         Diagnostics: ok
#         Local timers:
#             Interval: 300ms
#             Echo interval: disabled
#             Multiplier: 3
#         Remote timers:
#             Interval: 300ms
#             Multiplier: 3
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;View BFD with Routing Protocol&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# BGP neighbor with BFD status
show bgp neighbors 10.0.0.2

# Look for:
# BFD: enabled
# BFD status: Up

# OSPF neighbor with BFD
show ip ospf neighbor

# BFD column shows: Up/Down
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BFD Counters&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show BFD statistics
show bfd peers counters

# Control packet statistics
# Session state change count
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Echo Mode&lt;/h2&gt;
&lt;p&gt;BFD echo mode reduces CPU load by having the remote peer echo packets back:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable echo mode
set protocols bfd peer 10.0.0.2 echo-mode

# Set echo interval
set protocols bfd peer 10.0.0.2 echo-interval 50

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Echo Mode Considerations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Lower CPU usage (echo packets handled in fast path)&lt;/li&gt;
&lt;li&gt;Requires symmetric forwarding&lt;/li&gt;
&lt;li&gt;May not work across some network devices&lt;/li&gt;
&lt;li&gt;Not available for multihop BFD&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;BFD and High Availability&lt;/h2&gt;
&lt;h3&gt;BFD in Redundant Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;         ISP A
           |
     [10.0.0.2]
           |
    VyOS Router (BFD to both)
           |
     [10.0.1.2]
           |
         ISP B
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;configure

# Primary ISP - aggressive detection
set protocols bgp neighbor 10.0.0.2 remote-as 65001
set protocols bgp neighbor 10.0.0.2 bfd profile AGGRESSIVE

# Backup ISP - also fast detection
set protocols bgp neighbor 10.0.1.2 remote-as 65002
set protocols bgp neighbor 10.0.1.2 bfd profile AGGRESSIVE

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When primary fails, BFD detects in ~300ms, BGP converges, backup takes over.&lt;/p&gt;
&lt;h3&gt;BFD with VRRP&lt;/h3&gt;
&lt;p&gt;BFD can trigger faster VRRP failover:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Not directly integrated, but:
# - BFD detects link failure
# - Track script checks BFD status
# - VRRP priority adjusted based on BFD
&lt;/code&gt;&lt;/pre&gt;
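&lt;p&gt;A minimal sketch of that glue, assuming keepalived provides VRRP and FRR&apos;s &lt;code&gt;vtysh&lt;/code&gt; is available (the script path and peer address are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/sh
# /config/scripts/check-bfd.sh - exit 0 if the BFD session to the peer is up
vtysh -c &quot;show bfd peer 10.0.0.2&quot; | grep -q &quot;Status: up&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;keepalived then runs the script periodically and demotes VRRP priority when it fails:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;vrrp_script check_bfd {
    script &quot;/config/scripts/check-bfd.sh&quot;
    interval 2
    weight -50   # lower priority when BFD reports the peer down
}
&lt;/code&gt;&lt;/pre&gt;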
&lt;h2&gt;Troubleshooting BFD&lt;/h2&gt;
&lt;h3&gt;BFD Session Not Establishing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check if BFD packets are exchanged
sudo tcpdump -i eth0 udp port 3784

# BFD control: UDP port 3784
# BFD echo: UDP port 3785

# Common issues:
# - Firewall blocking BFD ports
# - Source address mismatch
# - Timer mismatch (negotiation fails)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BFD Flapping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Session up/down repeatedly
show log | grep -i bfd

# Causes:
# - Timers too aggressive for link quality
# - Congestion causing packet loss
# - MTU issues

# Solution: Increase timers
set protocols bfd profile STABLE interval transmit 500
set protocols bfd profile STABLE interval receive 500
set protocols bfd profile STABLE interval multiplier 5
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;One-Way BFD&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# BFD shows &quot;Down&quot; but packets sent

# Check for asymmetric routing
# BFD packets might take different return path

# For multihop BFD, ensure:
# - Source address configured correctly
# - Routing is symmetric
# - TTL is sufficient
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;BFD Firewall Rules&lt;/h2&gt;
&lt;p&gt;If firewall is enabled, allow BFD:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Allow BFD control packets
set firewall ipv4 name ROUTER-IN rule 20 action accept
set firewall ipv4 name ROUTER-IN rule 20 protocol udp
set firewall ipv4 name ROUTER-IN rule 20 destination port 3784
set firewall ipv4 name ROUTER-IN rule 20 description &quot;BFD Control&quot;

# Allow BFD echo packets
set firewall ipv4 name ROUTER-IN rule 21 action accept
set firewall ipv4 name ROUTER-IN rule 21 protocol udp
set firewall ipv4 name ROUTER-IN rule 21 destination port 3785
set firewall ipv4 name ROUTER-IN rule 21 description &quot;BFD Echo&quot;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Match Timers on Both Sides&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Both routers should have compatible timers
# BFD negotiates, but similar values work best

# Router A
set protocols bfd profile STANDARD interval transmit 300
set protocols bfd profile STANDARD interval receive 300
set protocols bfd profile STANDARD interval multiplier 3

# Router B - same settings
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Consider Link Quality&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# High-quality datacenter links
# → Aggressive timers (100-300ms)

# WAN/Internet links
# → Conservative timers (500ms-1s)

# Satellite/high-latency links
# → Very conservative (1s+, higher multiplier)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Don&apos;t Be Too Aggressive&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# 50ms timers sound great until:
# - Minor congestion triggers failover
# - Route flapping destabilizes network
# - CPU can&apos;t keep up with BFD packets

# Start conservative, tune down if needed
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Monitor BFD State&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Alert on BFD state changes
# Track BFD flapping frequency
# Correlate with network events
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;BFD Timer Calculation&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Detection Time = interval × multiplier

Examples:
100ms × 3 = 300ms detection
300ms × 3 = 900ms detection
500ms × 5 = 2.5s detection
1000ms × 3 = 3s detection

Compare to:
BGP default: 180 seconds
OSPF default: 40 seconds
&lt;/code&gt;&lt;/pre&gt;
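&lt;p&gt;The arithmetic is trivial, but a tiny helper keeps timer planning honest (function name is illustrative):&lt;/p&gt;

```shell
# Detection time = interval x multiplier (both sides assumed symmetric)
bfd_detect_ms() {
    local interval_ms=$1 multiplier=$2
    echo $(( interval_ms * multiplier ))
}

bfd_detect_ms 300 3     # → 900  (standard profile)
bfd_detect_ms 1000 5    # → 5000 (conservative WAN profile)
```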
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Routing protocol keepalives are too slow. BFD fixes this.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without BFD:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;BGP: 180 seconds to detect dead peer&lt;/li&gt;
&lt;li&gt;OSPF: 40 seconds to detect dead neighbor&lt;/li&gt;
&lt;li&gt;Traffic blackholed during detection&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With BFD:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Detection in sub-second (300ms-1s typical)&lt;/li&gt;
&lt;li&gt;Routing protocols react immediately&lt;/li&gt;
&lt;li&gt;Failover happens before users notice&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;BFD is simple to configure, low overhead, and dramatically improves convergence time. Every production BGP session and OSPF adjacency should have BFD enabled.&lt;/p&gt;
&lt;p&gt;The only question is timer values: aggressive for reliable links, conservative for flaky links. Start with 300ms/3, adjust based on your network.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>ha</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Policy Routing Debug: Why Traffic Takes the Wrong Path</title><link>https://ashimov.com/posts/vyos-pbr-debug/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-pbr-debug/</guid><description>Debug policy-based routing on VyOS. Covers rule evaluation order, mark verification, table inspection, common misconfigurations, and why PBR debugging needs systematic verification.</description><pubDate>Tue, 30 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Policy routing configured. Traffic still takes the default route. You add more rules. Still doesn&apos;t work. You start guessing.&lt;/p&gt;
&lt;p&gt;Policy-based routing (PBR) is simple in concept but has multiple points of failure. Each must be correct: match criteria, firewall marks, routing tables, rule priority. Miss one, and traffic ignores your policy.&lt;/p&gt;
&lt;p&gt;PBR debugging needs systematic verification, not guessing.&lt;/p&gt;
&lt;h2&gt;PBR Components&lt;/h2&gt;
&lt;p&gt;Policy routing has four parts. All must be correct:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1. Policy: Match traffic and set mark
   └── firewall rules with mark action

2. Mark: Identify traffic for routing decision
   └── fwmark value (0x1, 0x2, etc.)

3. Table: Alternative routing table
   └── custom routes separate from main

4. Rule: Match mark and use table
   └── ip rule connecting mark to table
&lt;/code&gt;&lt;/pre&gt;
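&lt;p&gt;VyOS generates the underlying Linux plumbing for you; a rough sketch of the same four parts as raw &lt;code&gt;iptables&lt;/code&gt;/&lt;code&gt;iproute2&lt;/code&gt; commands (values illustrative, not what VyOS emits verbatim) makes the chain visible:&lt;/p&gt;

```shell
# 1+2. Policy marks matching traffic (mangle PREROUTING)
iptables -t mangle -A PREROUTING -i eth1 -d 10.0.0.0/8 -j MARK --set-mark 0x1

# 3. Alternative routing table with its own default route
ip route add default via 10.10.0.1 dev tun0 table 10

# 4. Rule sends marked packets to that table
ip rule add fwmark 0x1 lookup 10
```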
&lt;h2&gt;Verification Workflow&lt;/h2&gt;
&lt;h3&gt;Step 1: Verify Policy Matches&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check policy is applied to interface
show configuration commands | grep policy

# Expected:
# set policy route VPN-TRAFFIC rule 10 set mark 0x1
# set interfaces ethernet eth1 policy route VPN-TRAFFIC

# Verify policy rules
show configuration commands | grep &quot;policy route&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Verify Traffic Gets Marked&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check firewall counters (if logging enabled)
show firewall

# Check if marking is happening with iptables
sudo iptables -t mangle -L -v -n

# Output should show packet counts on MARK rules:
# pkts bytes target     prot  in     out  source   destination
# 1234 5678K MARK       all   eth1   *    0.0.0.0/0  10.0.0.0/24   MARK set 0x1
#                                                                      ↑ Packets matched

# If counter is zero → policy not matching traffic
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Verify Routing Table Exists&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show custom routing table
show ip route table 10

# Or directly:
ip route show table 10

# Should show routes:
# default via 10.10.0.1 dev tun0
# 10.10.0.0/24 dev tun0 proto kernel scope link src 10.10.0.2
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Verify Rule Connects Mark to Table&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all rules
ip rule show

# Expected output:
# 0:      from all lookup local
# 32765:  from all fwmark 0x1 lookup 10    ← Your rule
# 32766:  from all lookup main
# 32767:  from all lookup default

# If your fwmark rule is missing → rule not created
# If rule priority is wrong → might be evaluated after main table
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 5: Test End-to-End&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Simulate marked packet lookup
ip route get 8.8.8.8 mark 0x1

# Expected:
# 8.8.8.8 via 10.10.0.1 dev tun0 table 10 mark 0x1

# If it shows main table route → mark not working
&lt;/code&gt;&lt;/pre&gt;
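&lt;p&gt;The end-to-end check can be scripted; this helper (name is illustrative) parses &lt;code&gt;ip route get&lt;/code&gt; output from stdin, so it is testable without live routes:&lt;/p&gt;

```shell
# Succeed if the route lookup landed in the given table
pbr_uses_table() {
    grep -q "table $1"
}

# Usage:
#   if ip route get 8.8.8.8 mark 0x1 | pbr_uses_table 10; then
#       echo "policy table used"
#   else
#       echo "main table used"
#   fi
```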
&lt;h2&gt;Common Problems&lt;/h2&gt;
&lt;h3&gt;Problem 1: Policy Not Applied to Interface&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Traffic not marked

# Check interface has policy
show interfaces ethernet eth1

# Should show:
# policy {
#     route VPN-TRAFFIC
# }

# If missing:
configure
set interfaces ethernet eth1 policy route VPN-TRAFFIC
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 2: Wrong Match Criteria&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Policy exists but doesn&apos;t match traffic

# Show policy details
show configuration commands | grep &quot;policy route VPN-TRAFFIC&quot;

# Common mistakes:
# - Source instead of destination (or vice versa)
# - Wrong subnet mask
# - Wrong protocol/port
# - Rule disabled

# Test what traffic should match:
# Rule says: source 192.168.1.0/24, destination 10.0.0.0/8
# Traffic is: source 192.168.2.100 → Won&apos;t match!
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 3: Mark Not Set&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Rule matches but no mark

# Check iptables for mark rules
sudo iptables -t mangle -L PREROUTING -v -n

# Look for MARK target
# If MARK target shows 0 packets → not matching
# If no MARK rule → policy not generating iptables rules

# Verify mark is in policy:
show configuration commands | grep &quot;set mark&quot;
# set policy route VPN-TRAFFIC rule 10 set mark 0x1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 4: Table Missing or Empty&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Marked traffic uses main table

# Check table exists
ip route show table 10

# If empty or missing:
configure

# Add the table (VyOS creates it automatically when you add a static route to it)
set protocols static table 10 route 0.0.0.0/0 next-hop 10.10.0.1

# Or for interface-based:
set protocols static table 10 route 0.0.0.0/0 interface tun0

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 5: Rule Priority Wrong&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Table has routes but not used

ip rule show
# 32765:  from all lookup main
# 32766:  from all fwmark 0x1 lookup 10    ← Too late!
# 32767:  from all lookup default

# Main table is checked before fwmark rule
# Traffic matches in main, never reaches your rule

# VyOS should set correct priority, but verify
# Lower number = higher priority
# fwmark rules should be before main (32766)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 6: Return Traffic Not Marked&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Outbound works, return traffic takes wrong path

# PBR typically marks only initiating direction
# Return traffic must be handled by:
# - Conntrack (automatic if stateful)
# - Separate marking rule for return

# Check if conntrack is preserving marks
sudo conntrack -L | grep mark

# Enable connection mark restore:
# Usually automatic with VyOS, but can verify in iptables
sudo iptables -t mangle -L -v
&lt;/code&gt;&lt;/pre&gt;
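&lt;p&gt;For reference, the usual conntrack pattern is CONNMARK save/restore. VyOS normally installs equivalent rules itself, so this plain-iptables sketch only shows what to look for:&lt;/p&gt;

```shell
# Restore any saved connection mark early, so reply packets inherit
# the mark their connection was classified with
iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark

# ... policy MARK rules evaluate here ...

# Save the packet mark back onto the connection for future packets
iptables -t mangle -A POSTROUTING -j CONNMARK --save-mark
```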
&lt;h2&gt;Debugging Commands&lt;/h2&gt;
&lt;h3&gt;Check What Route Traffic Would Take&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Without mark (normal routing)
ip route get 8.8.8.8

# With mark (policy routing)
ip route get 8.8.8.8 mark 0x1

# Compare outputs to see if PBR is working
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Packet Counts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# How many packets matched policy?
sudo iptables -t mangle -L PREROUTING -v -n | grep MARK

# Reset counters and test
sudo iptables -t mangle -Z
# Generate test traffic
curl http://10.0.0.100/
# Check counters again
sudo iptables -t mangle -L PREROUTING -v -n | grep MARK
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Trace Packet Path&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable netfilter trace (temporary debug)
sudo modprobe nf_log_ipv4
sudo sysctl -w net.netfilter.nf_log.2=nf_log_ipv4

# Add trace rule for specific traffic
sudo iptables -t raw -A PREROUTING -s 192.168.1.100 -j TRACE

# Watch kernel log
dmesg -w

# Remove trace when done
sudo iptables -t raw -D PREROUTING -s 192.168.1.100 -j TRACE
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Firewall Flow&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# See where packet is in firewall processing
sudo iptables -t mangle -L -v -n  # Marking happens here
sudo iptables -t nat -L -v -n      # NAT happens here
sudo iptables -t filter -L -v -n   # Filtering happens here

# PBR marks in mangle PREROUTING
# Routing decision happens after mangle
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VyOS PBR Configuration Reference&lt;/h2&gt;
&lt;h3&gt;Complete Working Example&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# 1. Create routing table with routes
set protocols static table 10 route 0.0.0.0/0 next-hop 10.10.0.1

# 2. Create policy with mark
set policy route PBR-TO-VPN rule 10 destination address 10.0.0.0/8
set policy route PBR-TO-VPN rule 10 set mark 0x1
set policy route PBR-TO-VPN rule 10 set table 10

# 3. Apply policy to interface
set interfaces ethernet eth1 policy route PBR-TO-VPN

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify Each Component&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# 1. Table has routes
ip route show table 10
# → Should show default via 10.10.0.1

# 2. Policy creates iptables rules
sudo iptables -t mangle -L PREROUTING -v -n | grep -i mark
# → Should show MARK rule

# 3. IP rule connects mark to table
ip rule show | grep fwmark
# → Should show: fwmark 0x1 lookup 10

# 4. Test packet routing
ip route get 10.0.0.100 mark 0x1
# → Should show: via 10.10.0.1 table 10
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Advanced Debugging&lt;/h2&gt;
&lt;h3&gt;Multiple Tables&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If using multiple tables:
ip route show table 10
ip route show table 20

# Verify rules don&apos;t conflict:
ip rule show

# Each mark should have unique table
# 32765:  fwmark 0x1 lookup 10
# 32764:  fwmark 0x2 lookup 20
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Source-Based Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If routing by source:
set policy route BY-SOURCE rule 10 source address 192.168.1.0/24
set policy route BY-SOURCE rule 10 set table 10

# Verify source matches
sudo iptables -t mangle -L PREROUTING -v -n
# Should show source match
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;DSCP/TOS Marking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If matching on DSCP:
set policy route QOS-ROUTING rule 10 dscp 46
set policy route QOS-ROUTING rule 10 set table 10

# Verify packet has expected DSCP
sudo tcpdump -i eth1 -v | grep &quot;tos&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Testing Strategy&lt;/h2&gt;
&lt;h3&gt;Minimal Test&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# 1. Create simple policy
set policy route TEST rule 10 destination address 8.8.8.8/32
set policy route TEST rule 10 set table 10
set protocols static table 10 route 0.0.0.0/0 blackhole

# 2. Apply to interface
set interfaces ethernet eth1 policy route TEST

# 3. Test
ping 8.8.8.8  # Should fail (blackhole)
ping 8.8.4.4  # Should work (not matched)

# 4. Clean up
delete policy route TEST
delete interfaces ethernet eth1 policy route
delete protocols static table 10
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Incremental Testing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Test each component in order:

# Test 1: Does table work?
ip route add blackhole 8.8.8.8 table 10
ip route get 8.8.8.8  # Uses main table → should work
# Clean: ip route del blackhole 8.8.8.8 table 10

# Test 2: Does rule work?
ip rule add fwmark 0x99 table 10
ip route add blackhole 8.8.8.8 table 10
ip route get 8.8.8.8 mark 0x99  # Should show table 10, blackhole
# Clean: ip rule del fwmark 0x99 table 10; ip route del blackhole 8.8.8.8 table 10

# Test 3: Does policy create mark?
# Apply policy, check iptables counters
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;PBR debugging needs systematic verification, not guessing.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When policy routing doesn&apos;t work:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Verify policy applied&lt;/strong&gt; to correct interface&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify traffic matches&lt;/strong&gt; policy rules&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify mark is set&lt;/strong&gt; (check iptables counters)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify table exists&lt;/strong&gt; with correct routes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify rule connects&lt;/strong&gt; mark to table&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test with &lt;code&gt;ip route get ... mark&lt;/code&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each step depends on the previous. One failure breaks everything after it.&lt;/p&gt;
&lt;p&gt;Don&apos;t add more rules hoping it helps. Verify each component. Find the broken step. Fix that one thing.&lt;/p&gt;
&lt;p&gt;PBR is a chain. Find the broken link.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>routing</category><category>troubleshooting</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>ARP and Neighbor Discovery: Troubleshooting Layer 2 Problems</title><link>https://ashimov.com/posts/vyos-arp-nd/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-arp-nd/</guid><description>Debug ARP and IPv6 ND issues on VyOS. Covers ARP table analysis, stale entries, duplicate IP detection, proxy ARP, neighbor discovery, and why Layer 2 problems look like Layer 3 failures.</description><pubDate>Fri, 26 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Routing is correct. Firewall allows traffic. Ping fails. You spend an hour checking Layer 3 and 4. The problem is Layer 2.&lt;/p&gt;
&lt;p&gt;ARP (IPv4) and Neighbor Discovery (IPv6) map IP addresses to MAC addresses. When this mapping fails, packets can&apos;t be delivered — even though routing looks perfect.&lt;/p&gt;
&lt;p&gt;Layer 2 problems look like Layer 3 failures. Always check ARP.&lt;/p&gt;
&lt;h2&gt;Understanding ARP&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Host A wants to send packet to 192.168.1.100:

1. Check ARP cache: &quot;Do I know the MAC for 192.168.1.100?&quot;
2. If no: Send ARP request (broadcast)
   &quot;Who has 192.168.1.100? Tell 192.168.1.1&quot;
3. 192.168.1.100 replies (unicast)
   &quot;192.168.1.100 is at aa:bb:cc:dd:ee:ff&quot;
4. Cache entry created, packet sent
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If ARP fails, the IP packet can&apos;t be sent. It looks like a routing problem, but it&apos;s really MAC resolution.&lt;/p&gt;
&lt;h2&gt;Viewing ARP Table&lt;/h2&gt;
&lt;h3&gt;VyOS Commands&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show ARP table
show arp

# Output:
# IP Address      HW Address         Flags  Interface
# 192.168.1.100   aa:bb:cc:dd:ee:ff  C      eth1
# 192.168.1.101   bb:cc:dd:ee:ff:00  C      eth1
# 192.168.1.1     (incomplete)              eth1    ← Problem!

# C = Complete (resolved)
# (incomplete) = ARP request sent, no reply
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Detailed ARP Info&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Using ip command
ip neigh show

# Output:
# 192.168.1.100 dev eth1 lladdr aa:bb:cc:dd:ee:ff REACHABLE
# 192.168.1.101 dev eth1 lladdr bb:cc:dd:ee:ff:00 STALE
# 192.168.1.102 dev eth1  FAILED

# States:
# REACHABLE - Recently verified
# STALE - Not verified recently
# DELAY - Verification pending
# PROBE - Actively verifying
# FAILED - ARP resolution failed
# PERMANENT - Static entry
&lt;/code&gt;&lt;/pre&gt;
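&lt;p&gt;A quick way to spot trouble is summarizing entries by state; the helper (name is illustrative) reads stdin, so it also works on saved output:&lt;/p&gt;

```shell
# Count neighbor entries per state; feed it `ip neigh show` output
neigh_state_summary() {
    awk '{print $NF}' | sort | uniq -c | sort -rn
}

# Usage: ip -4 neigh show | neigh_state_summary
# A sudden burst of STALE/FAILED entries points at a Layer 2 problem.
```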
&lt;h3&gt;Filter by Interface&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ARP entries for specific interface
ip neigh show dev eth1

# ARP entries for specific IP
ip neigh show 192.168.1.100
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ARP Problems and Solutions&lt;/h2&gt;
&lt;h3&gt;Problem 1: Incomplete ARP Entry&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom:
show arp
# 192.168.1.100   (incomplete)        eth1

# Causes:
# - Target host is down
# - Target host has wrong IP
# - Target host on different VLAN
# - Network issue between hosts

# Debug:
# 1. Capture ARP traffic
sudo tcpdump -i eth1 arp

# 2. See if requests go out, responses come back
# 08:30:01 ARP, Request who-has 192.168.1.100 tell 192.168.1.1
# (no reply = host unreachable at Layer 2)

# 3. Verify VLAN tagging
show interfaces ethernet eth1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 2: Wrong MAC Address&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Traffic goes to wrong host

# Check ARP for expected IP
show arp | grep 192.168.1.100

# If MAC doesn&apos;t match expected host:
# - Duplicate IP (two hosts same IP)
# - IP moved to different host
# - ARP spoofing attack

# Clear entry and let it re-resolve
ip neigh del 192.168.1.100 dev eth1
ping 192.168.1.100
show arp | grep 192.168.1.100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 3: Stale ARP Entries&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: Intermittent connectivity after IP change

# Old MAC cached, traffic goes to wrong place
ip neigh show
# 192.168.1.100 dev eth1 lladdr aa:bb:cc:dd:ee:ff STALE

# Flush stale entry
ip neigh flush 192.168.1.100

# Or flush all on interface
ip neigh flush dev eth1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 4: ARP Table Full&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptom: New hosts can&apos;t connect

# Check table size
cat /proc/sys/net/ipv4/neigh/default/gc_thresh3
# Default: 1024

# If many hosts, increase:
configure
set system sysctl parameter net.ipv4.neigh.default.gc_thresh3 value 4096
commit

# Or via sysctl directly (not persistent across reboots):
sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=4096
&lt;/code&gt;&lt;/pre&gt;
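&lt;p&gt;The three &lt;code&gt;gc_thresh&lt;/code&gt; values work together (defaults 128/512/1024: soft floor, soft ceiling, hard limit), so scale them together rather than raising only the hard limit. The sizes below are an assumption — pick roughly 2× your host count:&lt;/p&gt;

```shell
configure

# Illustrative values - size for roughly 2x your host count
set system sysctl parameter net.ipv4.neigh.default.gc_thresh1 value 1024
set system sysctl parameter net.ipv4.neigh.default.gc_thresh2 value 2048
set system sysctl parameter net.ipv4.neigh.default.gc_thresh3 value 4096

commit
```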
&lt;h2&gt;Static ARP Entries&lt;/h2&gt;
&lt;p&gt;For critical hosts, use static ARP to prevent spoofing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add static ARP entry
set protocols static arp interface eth1 address 192.168.1.100 mac aa:bb:cc:dd:ee:ff

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use Static ARP&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Critical servers (DNS, gateway)&lt;/li&gt;
&lt;li&gt;Security-sensitive hosts&lt;/li&gt;
&lt;li&gt;Environments with ARP spoofing risk&lt;/li&gt;
&lt;li&gt;Fixed infrastructure (won&apos;t change MAC)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Proxy ARP&lt;/h2&gt;
&lt;p&gt;Router answers ARP on behalf of other networks:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check if proxy ARP is enabled
cat /proc/sys/net/ipv4/conf/eth1/proxy_arp

# Enable proxy ARP on interface
configure
set interfaces ethernet eth1 ip enable-proxy-arp
commit

# Use case: When hosts on different subnets share same VLAN
# Router answers ARP for remote subnet, forwards traffic
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Proxy ARP Risks&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Breaks subnet boundaries&lt;/li&gt;
&lt;li&gt;Can cause routing confusion&lt;/li&gt;
&lt;li&gt;Security implications (answers for others)&lt;/li&gt;
&lt;li&gt;Usually sign of network design problem&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;IPv6 Neighbor Discovery&lt;/h2&gt;
&lt;p&gt;IPv6 uses ICMPv6 Neighbor Discovery instead of ARP:&lt;/p&gt;
&lt;h3&gt;View Neighbor Table&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show IPv6 neighbors
ip -6 neigh show

# Output:
# fe80::1 dev eth1 lladdr aa:bb:cc:dd:ee:ff REACHABLE
# 2001:db8::100 dev eth1 lladdr bb:cc:dd:ee:ff:00 STALE
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Neighbor Discovery Types&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;NDP Message Types:
- Neighbor Solicitation (NS): &quot;Who has this IPv6?&quot;
- Neighbor Advertisement (NA): &quot;I have this IPv6&quot;
- Router Solicitation (RS): &quot;Are there any routers?&quot;
- Router Advertisement (RA): &quot;I&apos;m a router, here&apos;s the prefix&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Debug ND Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture NDP traffic
sudo tcpdump -i eth1 icmp6

# Filter for specific types
sudo tcpdump -i eth1 &apos;icmp6 and ip6[40] == 135&apos;  # Neighbor Solicitation
sudo tcpdump -i eth1 &apos;icmp6 and ip6[40] == 136&apos;  # Neighbor Advertisement
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;IPv6 ND Problems&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Problem: Duplicate Address Detection fails
# Host won&apos;t configure IPv6 address

# Check for duplicate:
sudo tcpdump -i eth1 &apos;icmp6 and ip6[40] == 136&apos;

# Problem: No router advertisements
# Hosts can&apos;t find gateway

# Check RA on interface:
sudo tcpdump -i eth1 &apos;icmp6 and ip6[40] == 134&apos;

# VyOS sends RA if configured:
set service router-advert interface eth1 prefix 2001:db8::/64
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Duplicate IP Detection&lt;/h2&gt;
&lt;h3&gt;Detecting Duplicates&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check for multiple MACs responding to same IP
arping -D -I eth1 192.168.1.100

# If duplicate exists, arping gets responses from multiple MACs

# Or capture ARP and look for different MAC sources
sudo tcpdump -i eth1 arp and host 192.168.1.100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Gratuitous ARP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Send gratuitous ARP (announce IP)
arping -A -I eth1 192.168.1.1 -c 1

# Use after IP address change or failover
# Updates ARP caches network-wide
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Scenarios&lt;/h2&gt;
&lt;h3&gt;Scenario 1: New Server Not Reachable&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Server configured, can&apos;t reach from router
ping 192.168.1.100
# PING 192.168.1.100: 56 data bytes
# (no response)

show arp | grep 192.168.1.100
# 192.168.1.100   (incomplete)

# ARP not resolving:
# - Server on wrong VLAN?
# - Server IP configured wrong?
# - Server interface down?

# From server side:
# ip addr show
# Check if IP is on correct interface
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 2: Traffic Goes to Wrong Host&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Application connecting to wrong server
show arp | grep 10.0.0.50
# 10.0.0.50   aa:bb:cc:dd:ee:ff   C   eth1

# But expected MAC was bb:cc:dd:ee:ff:00
# Duplicate IP! Two hosts have 10.0.0.50

# Solution:
# 1. Find both hosts
# 2. Remove duplicate IP from wrong host
# 3. Flush ARP
ip neigh flush 10.0.0.50
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 3: Connectivity Works Then Fails&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Works initially, fails after some time

# Check ARP timeout
cat /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms
# 30000 (30 seconds)

# Entry goes STALE, then needs refresh
# If refresh fails → connectivity lost

# Debug:
watch -n 1 &apos;ip neigh show 192.168.1.100&apos;
# Watch state transition
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 4: After Failover, Old IP Unreachable&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Failover happened, but clients still sending to old MAC

# Need gratuitous ARP from new server:
arping -A -I eth1 192.168.1.100 -c 3

# Or clear ARP cache on clients/routers:
ip neigh flush 192.168.1.100
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring ARP&lt;/h2&gt;
&lt;h3&gt;Watch ARP Table&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Continuous monitoring
watch -n 2 &apos;ip neigh show dev eth1&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Log ARP Changes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Linux doesn&apos;t log ARP by default
# Use arpwatch for monitoring:

apt install arpwatch
arpwatch -i eth1 -f /var/lib/arpwatch/eth1.dat

# Logs to syslog:
# new station 192.168.1.100 aa:bb:cc:dd:ee:ff
# changed ethernet address 192.168.1.100 old:mac new:mac
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Static ARP for Critical Infrastructure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Gateway, DNS, critical servers
set protocols static arp interface eth1 address 192.168.1.1 mac aa:bb:cc:dd:ee:ff
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Monitor for Duplicates&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Sweep the subnet for in-use addresses
# (iputils arping -D exits non-zero when a reply is received)
for ip in $(seq 1 254); do
    arping -D -q -c 1 -I eth1 192.168.1.$ip || echo &quot;192.168.1.$ip answered&quot;
done
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Clear ARP During Troubleshooting&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# When changing IPs or after failover
ip neigh flush dev eth1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Check ARP First&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Before deep Layer 3 debugging
show arp | grep &amp;lt;problem-ip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Layer 2 problems look like Layer 3 failures. Always check ARP.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When ping fails:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Is there an ARP entry?&lt;/li&gt;
&lt;li&gt;Is it complete or incomplete?&lt;/li&gt;
&lt;li&gt;Is the MAC address correct?&lt;/li&gt;
&lt;li&gt;Is the entry REACHABLE or STALE/FAILED?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Layer 2 issues cause:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Intermittent connectivity (stale entries)&lt;/li&gt;
&lt;li&gt;Wrong destination (wrong MAC)&lt;/li&gt;
&lt;li&gt;Complete failure (no entry)&lt;/li&gt;
&lt;li&gt;Slow performance (ARP delays)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;ARP is simple but foundational. When it breaks, nothing above it works. Check it first, not last.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>troubleshooting</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Packet Capture on VyOS: tcpdump Techniques for Real Debugging</title><link>https://ashimov.com/posts/vyos-packet-capture/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-packet-capture/</guid><description>Master packet capture on VyOS for troubleshooting. Covers tcpdump filters, capture strategies, decoding protocols, saving and analyzing captures, and why packets never lie.</description><pubDate>Tue, 23 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Logs say everything is fine. Routing table looks correct. Firewall rules seem right. But traffic still doesn&apos;t flow. What&apos;s actually happening?&lt;/p&gt;
&lt;p&gt;Packet capture shows you the truth. The actual packets. Not what the logs say happened, not what should happen according to config — what actually happens on the wire.&lt;/p&gt;
&lt;p&gt;Packets never lie.&lt;/p&gt;
&lt;h2&gt;Basic Packet Capture&lt;/h2&gt;
&lt;h3&gt;VyOS Monitor Traffic Command&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VyOS provides a wrapper around tcpdump
monitor traffic interface eth0

# Stop with Ctrl+C
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Direct tcpdump&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# More control with tcpdump directly
sudo tcpdump -i eth0

# Common options
sudo tcpdump -i eth0 -n          # Don&apos;t resolve names (faster)
sudo tcpdump -i eth0 -v          # Verbose
sudo tcpdump -i eth0 -vv         # More verbose
sudo tcpdump -i eth0 -c 100      # Capture 100 packets then stop
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Capture Filters&lt;/h2&gt;
&lt;h3&gt;By IP Address&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Traffic from specific source
sudo tcpdump -i eth0 -n src 192.168.1.100

# Traffic to specific destination
sudo tcpdump -i eth0 -n dst 8.8.8.8

# Traffic to or from host
sudo tcpdump -i eth0 -n host 192.168.1.100

# Traffic between two hosts
sudo tcpdump -i eth0 -n host 192.168.1.100 and host 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;By Network&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Traffic to/from subnet
sudo tcpdump -i eth0 -n net 192.168.1.0/24

# Traffic NOT from subnet
sudo tcpdump -i eth0 -n not net 192.168.1.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;By Protocol&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ICMP only
sudo tcpdump -i eth0 -n icmp

# TCP only
sudo tcpdump -i eth0 -n tcp

# UDP only
sudo tcpdump -i eth0 -n udp

# ARP
sudo tcpdump -i eth0 -n arp

# OSPF
sudo tcpdump -i eth0 -n proto ospf

# BGP (TCP 179)
sudo tcpdump -i eth0 -n tcp port 179
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;By Port&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Specific port
sudo tcpdump -i eth0 -n port 80
sudo tcpdump -i eth0 -n port 443

# Source port
sudo tcpdump -i eth0 -n src port 22

# Destination port
sudo tcpdump -i eth0 -n dst port 80

# Port range
sudo tcpdump -i eth0 -n portrange 1-1024
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Combined Filters&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# HTTP traffic from specific host
sudo tcpdump -i eth0 -n host 192.168.1.100 and port 80

# SSH excluding specific host
sudo tcpdump -i eth0 -n port 22 and not host 192.168.1.1

# All traffic except SSH (while connected via SSH)
sudo tcpdump -i eth0 -n not port 22

# TCP SYN packets (new connections)
sudo tcpdump -i eth0 -n &apos;tcp[tcpflags] &amp;amp; (tcp-syn) != 0&apos;

# TCP RST packets (connection resets)
sudo tcpdump -i eth0 -n &apos;tcp[tcpflags] &amp;amp; (tcp-rst) != 0&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Saving Captures&lt;/h2&gt;
&lt;h3&gt;Write to File&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Save to pcap file
sudo tcpdump -i eth0 -n -w /tmp/capture.pcap

# Save with rotation (10 files of 100MB each)
sudo tcpdump -i eth0 -n -w /tmp/capture.pcap -C 100 -W 10

# Save with timestamp in filename
sudo tcpdump -i eth0 -n -w /tmp/capture-$(date +%Y%m%d-%H%M%S).pcap
&lt;/code&gt;&lt;/pre&gt;
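&lt;p&gt;Time-based rotation is also available: &lt;code&gt;-G&lt;/code&gt; starts a new file every N seconds, and the &lt;code&gt;-w&lt;/code&gt; filename then accepts &lt;code&gt;strftime&lt;/code&gt; patterns:&lt;/p&gt;

```shell
# New capture file every hour, named by timestamp
sudo tcpdump -i eth0 -n -G 3600 -w '/tmp/capture-%Y%m%d-%H%M%S.pcap'

# With -W, stop after that many rotated files (here: one day)
sudo tcpdump -i eth0 -n -G 3600 -W 24 -w '/tmp/capture-%Y%m%d-%H%M%S.pcap'
```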
&lt;h3&gt;Read from File&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Read saved capture
sudo tcpdump -r /tmp/capture.pcap

# Read with filter
sudo tcpdump -r /tmp/capture.pcap tcp port 80

# Transfer to workstation for Wireshark analysis
scp admin@router:/tmp/capture.pcap .
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Capture Strategies&lt;/h2&gt;
&lt;h3&gt;Strategy 1: Two-Point Capture&lt;/h3&gt;
&lt;p&gt;Capture at both ends to see what&apos;s happening:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On router ingress
sudo tcpdump -i eth0 -n -w /tmp/ingress.pcap host 192.168.1.100

# On router egress
sudo tcpdump -i eth1 -n -w /tmp/egress.pcap host 192.168.1.100

# Compare: Did packet arrive? Did it leave?
&lt;/code&gt;&lt;/pre&gt;
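&lt;p&gt;To line the two files up, count packets on each side, or merge them into a single chronological file with &lt;code&gt;mergecap&lt;/code&gt; (part of the Wireshark package, assumed installed here):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Count packets seen on each side
tcpdump -r /tmp/ingress.pcap -n | wc -l
tcpdump -r /tmp/egress.pcap -n | wc -l

# Merge both captures into one time-sorted file
mergecap -w /tmp/combined.pcap /tmp/ingress.pcap /tmp/egress.pcap
&lt;/code&gt;&lt;/pre&gt;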
&lt;h3&gt;Strategy 2: All-Interface Capture&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture on all interfaces
sudo tcpdump -i any -n host 192.168.1.100

# Shows which interface packets traverse
# Note: May see packet twice (in and out)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Strategy 3: Before/After NAT&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Inside interface (pre-NAT source IP)
sudo tcpdump -i eth1 -n src 192.168.1.100

# Outside interface (post-NAT source IP)
sudo tcpdump -i eth0 -n src &amp;lt;public-ip&amp;gt;

# Verify NAT translation is happening
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Strategy 4: Firewall Debug&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture traffic that should be allowed
sudo tcpdump -i eth0 -n dst port 443 and dst 192.168.1.100

# If packets arrive but connection fails:
# - Firewall blocking
# - No return route
# - Server not listening
&lt;/code&gt;&lt;/pre&gt;
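&lt;p&gt;To separate those three causes, a quick sketch using standard tools (&lt;code&gt;ss&lt;/code&gt; on the server, &lt;code&gt;conntrack&lt;/code&gt; on the router if installed):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On the server: is anything listening on 443?
ss -tln | grep :443

# On the router: did a conntrack entry get created?
sudo conntrack -L -d 192.168.1.100 | grep 443

# Entry stuck in SYN_SENT = no reply coming back from the server side
&lt;/code&gt;&lt;/pre&gt;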
&lt;h2&gt;Reading tcpdump Output&lt;/h2&gt;
&lt;h3&gt;TCP Three-Way Handshake&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Normal connection:
10:15:01 IP 192.168.1.100.54321 &amp;gt; 8.8.8.8.80: Flags [S], seq 1000
10:15:01 IP 8.8.8.8.80 &amp;gt; 192.168.1.100.54321: Flags [S.], seq 2000, ack 1001
10:15:01 IP 192.168.1.100.54321 &amp;gt; 8.8.8.8.80: Flags [.], ack 2001

# [S] = SYN
# [S.] = SYN-ACK
# [.] = ACK
# [P.] = PUSH-ACK (data)
# [F.] = FIN-ACK
# [R.] = RST-ACK
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Connection Refused&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# RST immediately after SYN
10:15:01 IP 192.168.1.100.54321 &amp;gt; 8.8.8.8.80: Flags [S], seq 1000
10:15:01 IP 8.8.8.8.80 &amp;gt; 192.168.1.100.54321: Flags [R.], ack 1001

# Means: port closed, or a firewall/IPS actively rejecting the connection
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Connection Timeout&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# SYN retransmits, no response
10:15:01 IP 192.168.1.100.54321 &amp;gt; 8.8.8.8.80: Flags [S], seq 1000
10:15:02 IP 192.168.1.100.54321 &amp;gt; 8.8.8.8.80: Flags [S], seq 1000
10:15:04 IP 192.168.1.100.54321 &amp;gt; 8.8.8.8.80: Flags [S], seq 1000

# Means: Packets not reaching destination or response not returning
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Protocol-Specific Captures&lt;/h2&gt;
&lt;h3&gt;DNS Debugging&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture DNS queries and responses
sudo tcpdump -i eth0 -n port 53

# Verbose to see query details
sudo tcpdump -i eth0 -n -v port 53

# Example output:
# 192.168.1.100.12345 &amp;gt; 8.8.8.8.53: 12345+ A? example.com.
# 8.8.8.8.53 &amp;gt; 192.168.1.100.12345: 12345 1/0/0 A 93.184.216.34
&lt;/code&gt;&lt;/pre&gt;
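&lt;p&gt;To generate traffic on demand, run the capture in one terminal and trigger a single query from another (&lt;code&gt;dig&lt;/code&gt; assumed available):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Terminal 1: stop after query + response
sudo tcpdump -i eth0 -n -c 2 port 53

# Terminal 2: trigger one lookup
dig @8.8.8.8 example.com +short
&lt;/code&gt;&lt;/pre&gt;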
&lt;h3&gt;HTTP Debugging&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture HTTP traffic
sudo tcpdump -i eth0 -n -A port 80

# -A shows ASCII content (HTTP headers)
# WARNING: May capture sensitive data
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP Session&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture BGP traffic
sudo tcpdump -i eth0 -n tcp port 179

# See BGP OPEN, UPDATE, KEEPALIVE messages
sudo tcpdump -i eth0 -n -v tcp port 179
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSPF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture OSPF traffic (IP protocol 89)
sudo tcpdump -i eth0 -n ip proto ospf

# See Hello, LSA, DBD packets
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;IPsec&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture IKE negotiation (UDP 500/4500)
sudo tcpdump -i eth0 -n udp port 500 or udp port 4500

# Capture ESP packets
sudo tcpdump -i eth0 -n proto esp
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Troubleshooting Scenarios&lt;/h2&gt;
&lt;h3&gt;Scenario 1: Traffic Not Reaching Destination&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Step 1: Capture at source
sudo tcpdump -i eth1 -n src 192.168.1.100 and dst 8.8.8.8

# Step 2: Capture at exit interface
sudo tcpdump -i eth0 -n src 192.168.1.100 and dst 8.8.8.8
# (use NAT source if applicable)

# If packets on eth1 but not eth0:
# - Firewall blocking
# - Routing issue
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 2: Asymmetric Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture on both interfaces
sudo tcpdump -i eth0 -n host 192.168.1.100
sudo tcpdump -i eth1 -n host 192.168.1.100

# If request on eth0, response on eth1:
# - Asymmetric routing
# - Might be dropped by stateful firewall
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 3: Connection Resets&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Find who sends RST
sudo tcpdump -i eth0 -n &apos;tcp[tcpflags] &amp;amp; (tcp-rst) != 0&apos;

# If RST from destination: Port closed or application error
# If RST from middle: Firewall, IPS, or timeout
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 4: MTU Problems&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Look for fragments (MF flag set or nonzero fragment offset)
sudo tcpdump -i eth0 -n &apos;ip[6:2] &amp;amp; 0x3fff != 0&apos;

# Look for ICMP fragmentation needed
sudo tcpdump -i eth0 -n &apos;icmp[0] = 3 and icmp[1] = 4&apos;

# If you see these, MTU/MSS issue
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Advanced Techniques&lt;/h2&gt;
&lt;h3&gt;Capture Only Headers&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture only first 96 bytes (headers)
sudo tcpdump -i eth0 -n -s 96 -w /tmp/headers-only.pcap

# Reduces file size, still useful for analysis
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Ring Buffer for Continuous Capture&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Keep last 100MB of traffic
sudo tcpdump -i eth0 -n -w /tmp/capture.pcap -C 10 -W 10

# 10 files × 10MB = 100MB rotating buffer
# Useful for catching intermittent issues
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Trigger-Based Capture&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Start capture when problem detected
# /config/scripts/triggered-capture.sh

while true; do
    # Check for symptom (e.g., high conntrack)
    if [ &quot;$(cat /proc/sys/net/netfilter/nf_conntrack_count)&quot; -gt 50000 ]; then
        timeout 60 tcpdump -i eth0 -n -w /tmp/triggered-$(date +%s).pcap
    fi
    sleep 10
done
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Filter Early&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: Capture everything, filter later
sudo tcpdump -i eth0 -w /tmp/huge.pcap

# Good: Filter during capture
sudo tcpdump -i eth0 -n -w /tmp/small.pcap host 192.168.1.100 and port 80
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Exclude SSH (When Connected via SSH)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Avoid capturing your own session
sudo tcpdump -i eth0 -n not port 22
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Use Names for Saved Files&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Include date, interface, purpose
sudo tcpdump -i eth0 -n host 192.168.1.100 \
  -w /tmp/eth0-192.168.1.100-$(date +%Y%m%d-%H%M%S).pcap
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Know When to Stop&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Set packet count limit
sudo tcpdump -i eth0 -n -c 1000 -w /tmp/capture.pcap

# Set time limit
timeout 60 sudo tcpdump -i eth0 -n -w /tmp/capture.pcap
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Packets never lie.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When troubleshooting fails with logs and commands, packet capture shows exactly what&apos;s happening:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Is the traffic arriving?&lt;/li&gt;
&lt;li&gt;Is it leaving?&lt;/li&gt;
&lt;li&gt;Is the firewall changing it?&lt;/li&gt;
&lt;li&gt;Is NAT translating it?&lt;/li&gt;
&lt;li&gt;Is the destination responding?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The capture tells you what logs cannot: the actual bytes on the wire.&lt;/p&gt;
&lt;p&gt;Every network engineer should be comfortable with tcpdump. Not Wireshark on a desktop — tcpdump on the router where the problem is.&lt;/p&gt;
&lt;p&gt;Start simple: &lt;code&gt;tcpdump -i eth0 -n host problem-ip&lt;/code&gt;. Build filters from there. Save captures for complex analysis. But start by looking at the packets.&lt;/p&gt;
&lt;p&gt;They never lie.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>troubleshooting</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Conntrack Deep Dive: Connection Tables, Limits, and Debugging</title><link>https://ashimov.com/posts/vyos-conntrack/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-conntrack/</guid><description>Master VyOS connection tracking internals. Covers conntrack tables, tuning limits, timeout configuration, debugging full tables, and why conntrack is the invisible stateful firewall engine.</description><pubDate>Fri, 19 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Your stateful firewall silently tracks every connection. Allowing &quot;established, related&quot; traffic to return requires remembering that you initiated the connection. That&apos;s conntrack.&lt;/p&gt;
&lt;p&gt;When the conntrack table fills up, new connections fail mysteriously. No error, just dropped packets. The firewall has no room to track new connections, so it drops them.&lt;/p&gt;
&lt;p&gt;Conntrack is the invisible stateful firewall engine. When it fails, everything fails.&lt;/p&gt;
&lt;h2&gt;What Conntrack Does&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Client → Router → Server

1. Client sends SYN to server
2. Router creates conntrack entry: NEW
3. Server responds with SYN-ACK
4. Router updates entry: ESTABLISHED
5. Traffic flows, entry tracks state
6. Connection closes
7. Entry times out, removed
&lt;/code&gt;&lt;/pre&gt;
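&lt;p&gt;You can watch this lifecycle live with the &lt;code&gt;conntrack&lt;/code&gt; tool&apos;s event mode (assuming the tool is installed):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Stream entries as they are created, updated, and destroyed
sudo conntrack -E

# Only creation/teardown events for one host
sudo conntrack -E -e NEW,DESTROY -s 192.168.1.100
&lt;/code&gt;&lt;/pre&gt;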
&lt;p&gt;Without conntrack, the firewall can&apos;t know that a packet from the server is a response to your request vs. unsolicited traffic.&lt;/p&gt;
&lt;h2&gt;Viewing Conntrack Table&lt;/h2&gt;
&lt;h3&gt;Basic Commands&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all connections
show conntrack table

# Show IPv4 connections only
show conntrack table ipv4

# Show IPv6 connections only
show conntrack table ipv6

# Show connection count
show conntrack table statistics
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Filtering Conntrack&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show connections to specific IP
show conntrack table ipv4 | grep &quot;192.168.1.100&quot;

# Show connections by protocol
show conntrack table ipv4 | grep &quot;tcp&quot;
show conntrack table ipv4 | grep &quot;udp&quot;

# Show connections in specific state
show conntrack table ipv4 | grep &quot;ESTABLISHED&quot;
show conntrack table ipv4 | grep &quot;TIME_WAIT&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Direct Conntrack Commands&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Using conntrack tool directly
sudo conntrack -L                    # List all
sudo conntrack -L -p tcp             # TCP only
sudo conntrack -L -s 192.168.1.100   # Source IP
sudo conntrack -L -d 8.8.8.8         # Destination IP
sudo conntrack -C                    # Count entries
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conntrack Statistics&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# View statistics
show conntrack table statistics

# Or directly
sudo conntrack -S

# Output:
# cpu=0     found=12345 invalid=0 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0
# cpu=1     found=12340 invalid=2 ignore=0 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=0

# Key metrics:
# drop: Packets dropped due to conntrack issues
# early_drop: Entries dropped to make room for new ones
# insert_failed: Failed to create new entry (table full)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conntrack Table Size&lt;/h2&gt;
&lt;h3&gt;Check Current Settings&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Current maximum entries
cat /proc/sys/net/netfilter/nf_conntrack_max

# Current count
cat /proc/sys/net/netfilter/nf_conntrack_count

# Hash table size
cat /proc/sys/net/netfilter/nf_conntrack_buckets
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Increase Table Size&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Increase max connections
set system conntrack table-size 262144  # Default is often 65536

# Adjust hash table size (should be ~25% of table-size)
set system conntrack hash-size 65536

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Calculate Requirements&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Rule of thumb:
- Each connection uses ~350 bytes
- 65536 entries ≈ 22 MB RAM
- 262144 entries ≈ 90 MB RAM

# For NAT gateway with 1000 users:
# Assume 100 connections per user
# 1000 × 100 = 100,000 connections
# Set table-size to at least 150,000 (with headroom)
&lt;/code&gt;&lt;/pre&gt;
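&lt;p&gt;The same sizing math as a quick shell calculation (the 350-byte figure is an approximation and varies by kernel):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;USERS=1000; CONNS_PER_USER=100
PEAK=$((USERS * CONNS_PER_USER))
TABLE=$((PEAK * 3 / 2))                  # 50% headroom
MEM_MB=$((TABLE * 350 / 1024 / 1024))
echo &quot;peak=$PEAK table-size=$TABLE ram≈${MEM_MB}MB&quot;
# peak=100000 table-size=150000 ram≈50MB
&lt;/code&gt;&lt;/pre&gt;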
&lt;h2&gt;Conntrack Timeouts&lt;/h2&gt;
&lt;h3&gt;View Current Timeouts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# TCP timeouts
cat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established
cat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait
cat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_close_wait

# UDP timeouts
cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout
cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream

# ICMP timeout
cat /proc/sys/net/netfilter/nf_conntrack_icmp_timeout
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure Timeouts in VyOS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# TCP timeouts
set system conntrack timeout tcp established 7200  # 2 hours (default 5 days!)
set system conntrack timeout tcp close 10
set system conntrack timeout tcp close-wait 60
set system conntrack timeout tcp fin-wait 120
set system conntrack timeout tcp last-ack 30
set system conntrack timeout tcp syn-recv 60
set system conntrack timeout tcp syn-sent 120
set system conntrack timeout tcp time-wait 120

# UDP timeouts
set system conntrack timeout udp other 30
set system conntrack timeout udp stream 180

# ICMP timeout
set system conntrack timeout icmp 30

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Aggressive Timeouts (For Busy NAT)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# For high-traffic NAT gateways with limited table size
set system conntrack timeout tcp established 3600   # 1 hour
set system conntrack timeout tcp time-wait 30       # Clear quickly
set system conntrack timeout udp other 30
set system conntrack timeout udp stream 60

# Warning: Too aggressive can break long-lived connections
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conntrack Problems&lt;/h2&gt;
&lt;h3&gt;Problem 1: Table Full&lt;/h3&gt;
&lt;p&gt;Symptoms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New connections fail randomly&lt;/li&gt;
&lt;li&gt;Established connections work&lt;/li&gt;
&lt;li&gt;Log shows &quot;nf_conntrack: table full&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# Check table status
show conntrack table statistics

# Look for:
# early_drop &amp;gt; 0
# insert_failed &amp;gt; 0

# Check current vs max
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Fix: Increase size
set system conntrack table-size 524288
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 2: Memory Exhaustion&lt;/h3&gt;
&lt;p&gt;Large conntrack tables consume RAM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Calculate memory needed
# 262144 entries × 350 bytes ≈ 90 MB

# If short on memory, reduce table or add RAM
# Or reduce timeouts to expire entries faster
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 3: Stale Entries&lt;/h3&gt;
&lt;p&gt;Connections closed but entries remain:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Clear specific entry
sudo conntrack -D -s 192.168.1.100 -d 8.8.8.8

# Clear all entries (dangerous in production!)
sudo conntrack -F

# Clear entries by protocol
sudo conntrack -D -p udp
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 4: Conntrack Not Tracking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Some traffic bypasses conntrack (NOTRACK)
# Check if tracking is enabled

show firewall

# Look for notrack rules:
# set firewall ipv4 name RAW rule 10 action notrack
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;NOTRACK Rules&lt;/h2&gt;
&lt;p&gt;For high-bandwidth traffic that doesn&apos;t need stateful inspection:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Skip tracking for specific traffic
set firewall ipv4 name RAW default-action accept
set firewall ipv4 name RAW rule 10 action notrack
set firewall ipv4 name RAW rule 10 destination address 224.0.0.0/4
set firewall ipv4 name RAW rule 10 description &quot;Skip tracking for multicast&quot;

# Apply to raw table
set firewall ipv4 input filter raw RAW

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use NOTRACK&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Multicast/broadcast traffic&lt;/li&gt;
&lt;li&gt;High-bandwidth services that don&apos;t need state&lt;/li&gt;
&lt;li&gt;Traffic between trusted internal segments&lt;/li&gt;
&lt;li&gt;When conntrack table is bottleneck&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Risks of NOTRACK&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;No stateful filtering for that traffic&lt;/li&gt;
&lt;li&gt;&quot;Established, related&quot; rules won&apos;t work&lt;/li&gt;
&lt;li&gt;Must use stateless rules for that traffic&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conntrack Helpers&lt;/h2&gt;
&lt;p&gt;Conntrack helpers track multi-connection protocols (FTP, SIP):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# View loaded helper modules
lsmod | grep nf_conntrack_

configure

# Enable FTP helper
set system conntrack modules ftp

# Enable SIP helper
set system conntrack modules sip

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;FTP Active Mode Fix&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# FTP active mode requires helper to track data connection
set system conntrack modules ftp

# Helper allows firewall to recognize FTP data connections
# as related to control connection
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring Conntrack&lt;/h2&gt;
&lt;h3&gt;Watch Table Fill&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Monitor in real-time
watch -n 1 &apos;cat /proc/sys/net/netfilter/nf_conntrack_count&apos;

# Alert script
#!/bin/bash
MAX=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
CURRENT=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
THRESHOLD=80

USAGE=$((CURRENT * 100 / MAX))

if [ $USAGE -gt $THRESHOLD ]; then
    echo &quot;Conntrack ${USAGE}% full (${CURRENT}/${MAX})&quot;
fi
&lt;/code&gt;&lt;/pre&gt;
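&lt;p&gt;To run that check every minute, a crontab entry like this works (the script path is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /etc/cron.d/conntrack-alert
* * * * * root /config/scripts/conntrack-alert.sh | logger -t conntrack
&lt;/code&gt;&lt;/pre&gt;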
&lt;h3&gt;Prometheus Metrics&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Node exporter exposes conntrack metrics:
# node_nf_conntrack_entries
# node_nf_conntrack_entries_limit

# Alert when &amp;gt; 80% full
# expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit &amp;gt; 0.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conntrack by Connection Type&lt;/h2&gt;
&lt;h3&gt;Heavy NAT Users&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Find IPs with most connections (TCP entries; field 5 is src=)
sudo conntrack -L -p tcp | awk &apos;{print $5}&apos; | cut -d= -f2 | sort | uniq -c | sort -rn | head

# Output:
# 5234 192.168.1.50
# 3421 192.168.1.51
# ...
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Connection State Distribution&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Count by TCP state (field 4 on tcp entries)
sudo conntrack -L -p tcp | awk &apos;{print $4}&apos; | sort | uniq -c | sort -rn

# Output:
# 45234 ESTABLISHED
#  2341 TIME_WAIT
#   123 SYN_SENT
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Size for Peak Load&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Calculate peak connections
# Add 50% headroom
set system conntrack table-size 262144  # More is safer
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Tune Timeouts for Traffic&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Long-lived connections (database, SSH)
set system conntrack timeout tcp established 86400  # 24h

# Short-lived connections (HTTP)
set system conntrack timeout tcp established 3600   # 1h
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Monitor Always&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Alert before table fills
# Dashboard showing:
# - Current entries
# - Max entries
# - Entries/second rate
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. NOTRACK Where Safe&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Reduce load by not tracking:
# - Internal trusted traffic
# - High-bandwidth transfers
# - Multicast
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Conntrack is the invisible stateful firewall engine. When it fails, everything fails.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Every &quot;allow established, related&quot; rule depends on conntrack. Every NAT translation depends on conntrack. Without it, no stateful firewalling.&lt;/p&gt;
&lt;p&gt;When conntrack table fills:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New connections silently fail&lt;/li&gt;
&lt;li&gt;No error message to user&lt;/li&gt;
&lt;li&gt;Existing connections keep working&lt;/li&gt;
&lt;li&gt;Very confusing symptoms&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Prevention:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Size table for expected load + headroom&lt;/li&gt;
&lt;li&gt;Tune timeouts for your traffic patterns&lt;/li&gt;
&lt;li&gt;Monitor table usage constantly&lt;/li&gt;
&lt;li&gt;Alert before it fills, not after&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The connection table is limited. Plan for it.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>networking</category><category>troubleshooting</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>TCP MSS Clamping: When and Why to Adjust Segment Size</title><link>https://ashimov.com/posts/vyos-tcp-mss/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-tcp-mss/</guid><description>Master TCP MSS clamping on VyOS for tunnels and PPPoE. Covers MSS vs MTU, clamping configuration, troubleshooting fragmentation, and why MSS clamping fixes problems MTU changes cannot.</description><pubDate>Tue, 16 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;TCP connections work. Then you add a VPN. Suddenly large transfers fail while small ones work. SSH connects, but SCP stalls. Websites load headers but not content.&lt;/p&gt;
&lt;p&gt;The culprit: MTU mismatch. Your tunnel has overhead. Packets get fragmented or dropped. ICMP &quot;fragmentation needed&quot; messages get filtered. TCP never learns the path MTU.&lt;/p&gt;
&lt;p&gt;MSS clamping fixes this by telling TCP to use smaller segments in the first place. No fragmentation needed, no ICMP required.&lt;/p&gt;
&lt;h2&gt;MTU vs MSS&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;MTU&lt;/strong&gt; (Maximum Transmission Unit): Maximum IP packet size an interface can send.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ethernet default: 1500 bytes&lt;/li&gt;
&lt;li&gt;Includes IP header (20) and TCP header (20)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MSS&lt;/strong&gt; (Maximum Segment Size): Maximum TCP payload size.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Announced during TCP handshake&lt;/li&gt;
&lt;li&gt;MSS = MTU - 40 (IPv4) or MTU - 60 (IPv6)&lt;/li&gt;
&lt;li&gt;Ethernet default MSS: 1460 bytes (IPv4)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;MTU 1500:
┌─────────────────────────────────────────────────┐
│ IP Header │ TCP Header │      TCP Data         │
│   20 B    │    20 B    │       1460 B          │
└─────────────────────────────────────────────────┘
                           ↑ This is MSS
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Why Tunnels Break Large Transfers&lt;/h2&gt;
&lt;p&gt;Tunnels add encapsulation overhead:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Normal Ethernet (MTU 1500):
[ IP | TCP | Data 1460 bytes ]  = 1500 bytes ✓

GRE Tunnel (24 bytes overhead):
[ Outer IP | GRE | Inner IP | TCP | Data 1460 bytes ] = 1524 bytes ✗
                                                         ↑ Exceeds MTU!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fragment packets (slow, can fail)&lt;/li&gt;
&lt;li&gt;Lower tunnel MTU (requires end-to-end coordination)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MSS clamping&lt;/strong&gt; (transparent, works without coordination)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;MSS Clamping in VyOS&lt;/h2&gt;
&lt;h3&gt;Interface-Based Clamping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Clamp MSS on specific interface
set firewall options interface eth0 adjust-mss 1360

# For tunnel interfaces
set firewall options interface tun0 adjust-mss 1360
set firewall options interface wg0 adjust-mss 1380

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Calculating Correct MSS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Formula: MSS = Tunnel_MTU - 40 (IPv4)
# Or:      MSS = Tunnel_MTU - 60 (IPv6)

# Common scenarios:
# PPPoE (MTU 1492):    MSS = 1492 - 40 = 1452
# GRE (MTU 1476):      MSS = 1476 - 40 = 1436
# IPsec (~MTU 1400):   MSS = 1400 - 40 = 1360
# WireGuard (MTU 1420): MSS = 1420 - 40 = 1380
# VXLAN (MTU 1450):    MSS = 1450 - 40 = 1410
&lt;/code&gt;&lt;/pre&gt;
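&lt;p&gt;The formula is simple enough to script; a tiny helper (name is illustrative) that prints both address families:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mss() { echo &quot;MTU $1 → IPv4 MSS $(($1 - 40)), IPv6 MSS $(($1 - 60))&quot;; }

mss 1492   # PPPoE:     MTU 1492 → IPv4 MSS 1452, IPv6 MSS 1432
mss 1420   # WireGuard: MTU 1420 → IPv4 MSS 1380, IPv6 MSS 1360
&lt;/code&gt;&lt;/pre&gt;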
&lt;h3&gt;Global MSS Clamping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Apply to all interfaces (less targeted but simpler)
set firewall options all-interfaces adjust-mss 1360

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MSS Clamping by Zone/Direction&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Clamp only for traffic leaving via tunnel
set firewall options interface tun0 adjust-mss 1360

# Clamp for traffic entering from LAN toward tunnel
set firewall options interface eth1 adjust-mss 1360  # LAN interface
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;PPPoE Configuration&lt;/h2&gt;
&lt;p&gt;PPPoE is the most common MSS clamping scenario:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# PPPoE interface setup
set interfaces ethernet eth0 pppoe 0 default-route auto
set interfaces ethernet eth0 pppoe 0 mtu 1492
set interfaces ethernet eth0 pppoe 0 name-server auto

# MSS clamping for PPPoE
set firewall options interface pppoe0 adjust-mss 1452

# Or use &apos;clamp-mss-to-pmtu&apos; to auto-calculate
set firewall options interface pppoe0 adjust-mss clamp-mss-to-pmtu

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VPN Tunnel Configuration&lt;/h2&gt;
&lt;h3&gt;IPsec&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# IPsec has variable overhead depending on encryption
# ESP header + encryption padding: ~50-80 bytes

# Conservative MSS for IPsec
set firewall options interface vti0 adjust-mss 1360

# Or on LAN-facing interface for traffic going to VPN
set firewall options interface eth1 adjust-mss 1360
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;WireGuard&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# WireGuard overhead: 60 bytes (IPv4) or 80 bytes (IPv6)
# Default WireGuard MTU: 1420

set firewall options interface wg0 adjust-mss 1380
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;GRE&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# GRE overhead: 24 bytes (basic) to 28+ (with key/sequence)
# GRE over IPsec: even more overhead

set interfaces tunnel tun0 encapsulation gre
set interfaces tunnel tun0 mtu 1400
set firewall options interface tun0 adjust-mss 1360
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting MSS Issues&lt;/h2&gt;
&lt;h3&gt;Symptom: Large Transfers Fail&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Small files/pings work
ping -s 64 remote-host    # Works

# Large packets with the DF bit set fail or hang
ping -M do -s 1400 remote-host  # Fails or hangs

# Solution: Add MSS clamping
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Detecting Current MSS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Capture TCP SYN packets to see advertised MSS
sudo tcpdump -i eth0 -n -v &apos;tcp[tcpflags] &amp;amp; (tcp-syn) != 0&apos;

# Look for &quot;mss 1460&quot; or similar in output
# 14:32:15 IP host.port &amp;gt; dest.port: Flags [S], ... mss 1460 ...
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify Clamping is Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Before clamping:
# Client sends: mss 1460
# After clamping to 1360:
# Router modifies: mss 1360

# Capture on LAN side
sudo tcpdump -i eth1 -n -v &apos;tcp[tcpflags] &amp;amp; (tcp-syn) != 0&apos;

# Capture on tunnel side
sudo tcpdump -i tun0 -n -v &apos;tcp[tcpflags] &amp;amp; (tcp-syn) != 0&apos;

# Compare MSS values
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test Path MTU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Find actual path MTU
tracepath remote-host

# Manual test with DF bit
ping -M do -s 1372 remote-host  # payload 1372 + 28 header bytes = 1400; adjust until it works
&lt;/code&gt;&lt;/pre&gt;
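&lt;p&gt;If &lt;code&gt;tracepath&lt;/code&gt; isn&apos;t available, the manual probing can be automated. A sketch that binary-searches for the largest DF-bit payload that gets through (Linux &lt;code&gt;ping&lt;/code&gt; options; the host name is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
HOST=remote-host
lo=1200; hi=1472                 # ICMP payload sizes to probe
while [ $lo -lt $hi ]; do
    mid=$(( (lo + hi + 1) / 2 ))
    if ping -c 1 -W 1 -M do -s $mid $HOST &amp;gt;/dev/null 2&amp;gt;&amp;amp;1; then
        lo=$mid                  # fits, try larger
    else
        hi=$((mid - 1))          # too big, shrink
    fi
done
echo &quot;Path MTU ≈ $((lo + 28)) bytes (payload $lo + 28 header)&quot;
&lt;/code&gt;&lt;/pre&gt;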
&lt;h2&gt;Common Scenarios&lt;/h2&gt;
&lt;h3&gt;Scenario 1: Site-to-Site VPN Users Can&apos;t Access Some Sites&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Problem: VPN tunnel MTU is 1400
         Client sends MSS 1460
         Large packets can&apos;t traverse tunnel

Solution:
set firewall options interface vti0 adjust-mss 1360
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 2: PPPoE Users Have Random Website Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Problem: PPPoE MTU 1492
         Some sites have PMTUD blackhole
         They never learn about lower MTU

Solution:
set firewall options interface pppoe0 adjust-mss 1452
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 3: GRE Tunnel Works for Ping, Not SCP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Problem: GRE overhead not accounted for
         Large SSH/SCP packets fragmented/dropped

Solution:
set interfaces tunnel tun0 mtu 1400
set firewall options interface tun0 adjust-mss 1360
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 4: Double-Tunnel (GRE over IPsec)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Problem: Outer tunnel already reduces MTU
         Inner tunnel reduces it more
         Need very low MSS

Solution:
# Outer IPsec: MTU ~1400
# GRE inside: MTU 1400 - 24 = 1376
# MSS: 1376 - 40 = 1336

set firewall options interface tun0 adjust-mss 1336
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Advanced Configuration&lt;/h2&gt;
&lt;h3&gt;IPv6 MSS Clamping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# IPv6 header is 40 bytes (vs 20 for IPv4)
# MSS = MTU - 60

set firewall options interface eth0 adjust-mss6 1340

# Or combined for both protocols
set firewall options interface eth0 adjust-mss 1360
set firewall options interface eth0 adjust-mss6 1340
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Asymmetric Clamping&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Different MSS for different directions
# Not directly supported, but can use firewall zones

# Traffic entering from Internet, leaving to tunnel
set firewall options interface eth0 adjust-mss 1360

# Traffic entering from tunnel, leaving to LAN
# (usually not needed - responses use same MSS)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Clamping with NAT&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# MSS clamping works with NAT
# Apply before or after NAT (usually doesn&apos;t matter)

set nat source rule 100 outbound-interface eth0
set nat source rule 100 translation address masquerade

set firewall options interface eth0 adjust-mss 1360

# Both NAT and MSS modification happen
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Why Not Just Lower MTU?&lt;/h2&gt;
&lt;p&gt;You could lower the MTU instead of MSS clamping:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Option A: Lower MTU on all devices
set interfaces ethernet eth0 mtu 1400
# Requires changing MTU on ALL hosts in network
# DHCP can help but not all clients respect it

# Option B: MSS clamping
set firewall options interface eth0 adjust-mss 1360
# Only affects TCP
# Transparent to endpoints
# No client changes needed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;MSS clamping advantages:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transparent to endpoints&lt;/li&gt;
&lt;li&gt;No client configuration needed&lt;/li&gt;
&lt;li&gt;Only affects problematic TCP path&lt;/li&gt;
&lt;li&gt;Works even when you don&apos;t control endpoints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MTU change advantages:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Affects all protocols (UDP, etc.)&lt;/li&gt;
&lt;li&gt;No packet modification needed&lt;/li&gt;
&lt;li&gt;More &quot;correct&quot; solution&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Best practice:&lt;/strong&gt; Use MSS clamping for TCP-heavy tunnels. Lower MTU for UDP-heavy applications or when you control all devices.&lt;/p&gt;
&lt;h2&gt;Quick Reference&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tunnel Type&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;th&gt;Safe MTU&lt;/th&gt;
&lt;th&gt;Safe MSS (IPv4)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PPPoE&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;1492&lt;/td&gt;
&lt;td&gt;1452&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GRE&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;1476&lt;/td&gt;
&lt;td&gt;1436&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPsec ESP&lt;/td&gt;
&lt;td&gt;50-80&lt;/td&gt;
&lt;td&gt;1400&lt;/td&gt;
&lt;td&gt;1360&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WireGuard&lt;/td&gt;
&lt;td&gt;60-80&lt;/td&gt;
&lt;td&gt;1420&lt;/td&gt;
&lt;td&gt;1380&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VXLAN&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;1450&lt;/td&gt;
&lt;td&gt;1410&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L2TP&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;1460&lt;/td&gt;
&lt;td&gt;1420&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
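&lt;p&gt;Every row follows from two constants: 20 bytes of IPv4 header and 20 bytes of TCP header. A quick check of the arithmetic (WireGuard shown at its worst-case 80-byte overhead, which is why its default MTU is 1420):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Safe MTU = 1500 - tunnel overhead
# Safe MSS = Safe MTU - 40       (20 IPv4 + 20 TCP)

# GRE:       1500 - 24 = 1476;   1476 - 40 = 1436
# WireGuard: 1500 - 80 = 1420;   1420 - 40 = 1380
# VXLAN:     1500 - 50 = 1450;   1450 - 40 = 1410

# For IPv6, subtract 60 instead  (40 IPv6 + 20 TCP)
&lt;/code&gt;&lt;/pre&gt;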
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;MSS clamping fixes problems MTU changes cannot.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When you don&apos;t control the endpoints, can&apos;t guarantee ICMP reaches them, and can&apos;t change their MTU — MSS clamping is your only option.&lt;/p&gt;
&lt;p&gt;The router intercepts TCP handshakes and modifies the MSS value. Endpoints never know it happened. They just use smaller segments that fit through your tunnel.&lt;/p&gt;
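&lt;p&gt;You can watch the rewrite happen. Capturing SYN packets on the router shows the MSS option before and after clamping (interface name and values here are examples; on VyOS this runs from the shell via sudo):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Show TCP SYNs and their options on the tunnel-facing interface
sudo tcpdump -ni eth0 &apos;tcp[tcpflags] &amp;amp; tcp-syn != 0&apos;

# Forwarded SYNs should show the clamped value, e.g. &quot;mss 1360&quot;
# The same SYN on the LAN side still carries the original &quot;mss 1460&quot;
&lt;/code&gt;&lt;/pre&gt;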
&lt;p&gt;Every tunnel should have MSS clamping configured. It costs nothing when not needed and saves hours of troubleshooting when it is.&lt;/p&gt;
&lt;p&gt;The symptoms are always vague: &quot;Large files fail, small ones work.&quot; The fix is always the same: calculate correct MSS, clamp it, done.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>troubleshooting</category><category>vpn</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>MTR, Tracepath, and PMTUD: Diagnosing Path Problems</title><link>https://ashimov.com/posts/vyos-mtr-pmtud/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-mtr-pmtud/</guid><description>Master network path diagnostics on VyOS. Covers MTR interpretation, traceroute variants, PMTUD troubleshooting, detecting packet loss patterns, and why ping alone is never enough.</description><pubDate>Fri, 12 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&quot;Can you ping it?&quot; Yes. &quot;Then why isn&apos;t it working?&quot; Because ping tests ICMP, not your application. Because one successful ping doesn&apos;t show the packet loss happening every 30 seconds. Because ping doesn&apos;t show which hop is the problem.&lt;/p&gt;
&lt;p&gt;Ping is a smoke test, not a diagnostic. Real troubleshooting needs tools that show the path, measure loss over time, and identify exactly where problems occur.&lt;/p&gt;
&lt;p&gt;Ping alone is never enough.&lt;/p&gt;
&lt;h2&gt;Understanding the Tools&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it shows&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ping&lt;/td&gt;
&lt;td&gt;Basic reachability&lt;/td&gt;
&lt;td&gt;Quick test&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;traceroute&lt;/td&gt;
&lt;td&gt;Path to destination&lt;/td&gt;
&lt;td&gt;Find route&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mtr&lt;/td&gt;
&lt;td&gt;Path + statistics over time&lt;/td&gt;
&lt;td&gt;Find where loss occurs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tracepath&lt;/td&gt;
&lt;td&gt;Path + MTU discovery&lt;/td&gt;
&lt;td&gt;Find MTU issues&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;MTR Deep Dive&lt;/h2&gt;
&lt;p&gt;MTR combines traceroute with continuous ping statistics.&lt;/p&gt;
&lt;h3&gt;Basic MTR&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# From VyOS operational mode
mtr 8.8.8.8

# Output:
# Host                   Loss%   Snt   Last   Avg  Best  Wrst StDev
# 1. gateway.local        0.0%   100    1.2   1.5   0.8   3.2   0.5
# 2. isp-router.net       0.0%   100    8.3   9.1   7.2  15.3   1.8
# 3. core-router.isp      0.5%   100   12.1  13.2  11.0  45.2   5.2
# 4. ???
# 5. google-peer.net      0.0%   100   15.3  16.1  14.2  22.1   1.2
# 6. 8.8.8.8              0.0%   100   14.8  15.5  14.0  21.3   1.1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MTR Options&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Specify count
mtr -c 100 8.8.8.8

# Report mode (non-interactive)
mtr -r 8.8.8.8

# Wide report (show full hostnames)
mtr -rw 8.8.8.8

# Use TCP instead of ICMP
mtr -T -P 443 8.8.8.8

# Use UDP
mtr -u 8.8.8.8

# Set packet size
mtr -s 1400 8.8.8.8

# No DNS resolution (faster)
mtr -n 8.8.8.8

# Show AS numbers
mtr -z 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Interpreting MTR Output&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Host                   Loss%   Snt   Last   Avg  Best  Wrst StDev
1. 192.168.1.1          0.0%   100    1.2   1.5   0.8   3.2   0.5
2. 10.0.0.1            15.0%   100    8.3   9.1   7.2  15.3   1.8  ← Problem here?
3. 172.16.0.1          15.0%   100   12.1  13.2  11.0  45.2   5.2  ← Or here?
4. 8.8.8.8             0.0%    100   14.8  15.5  14.0  21.3   1.1  ← Destination OK
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: loss that appears at hops 2 and 3 but clears by the destination means those routers are &lt;strong&gt;rate-limiting ICMP replies&lt;/strong&gt;, not dropping transit traffic. Real forwarding loss persists all the way to the destination.&lt;/p&gt;
&lt;h3&gt;Reading Loss Patterns&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Pattern 1: Real loss
Hop 3:  15% loss
Hop 4:  15% loss
Hop 5:  15% loss  ← Loss persists to destination = real problem at hop 3

# Pattern 2: ICMP rate limiting
Hop 3:  15% loss
Hop 4:  15% loss
Hop 5:   0% loss  ← Clears at destination = hop 3 rate-limits ICMP, not real loss

# Pattern 3: Asymmetric routing
Hop 3:  high latency spike
Hop 4:  normal
Hop 5:  normal   ← Return path different, not a problem
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;MTR for Different Protocols&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ICMP might be filtered, try TCP
mtr -T -P 80 example.com

# Test actual service port
mtr -T -P 443 example.com
mtr -T -P 22 example.com

# UDP services
mtr -u -P 53 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Traceroute Variants&lt;/h2&gt;
&lt;h3&gt;Standard Traceroute&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Default traceroute (UDP probes on Linux)
traceroute 8.8.8.8

# ICMP traceroute (same protocol as ping)
traceroute -I 8.8.8.8

# TCP traceroute (bypasses some filters)
traceroute -T -p 443 8.8.8.8

# Don&apos;t resolve hostnames
traceroute -n 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Traceroute Options&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Set max hops
traceroute -m 30 8.8.8.8

# Set packet size
traceroute 8.8.8.8 1400

# Wait time per probe
traceroute -w 2 8.8.8.8

# Probes per hop
traceroute -q 5 8.8.8.8

# Source interface
traceroute -i eth0 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Interpreting Traceroute&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Normal output
1  192.168.1.1 (192.168.1.1)  1.234 ms  1.123 ms  1.345 ms
2  10.0.0.1 (10.0.0.1)  8.234 ms  8.123 ms  8.345 ms
3  * * *                        ← Hop doesn&apos;t respond
4  8.8.8.8 (8.8.8.8)  15.234 ms  15.123 ms  15.345 ms

# Stars (*) don&apos;t always mean problem
# Many routers don&apos;t respond to traceroute probes
# Final destination matters most
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Tracepath and PMTUD&lt;/h2&gt;
&lt;p&gt;Path MTU Discovery finds the maximum packet size that can traverse a path without fragmentation.&lt;/p&gt;
&lt;h3&gt;Using Tracepath&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Tracepath discovers MTU along path
tracepath 8.8.8.8

# Output includes MTU at each hop
# 1:  192.168.1.1         1.234ms pmtu 1500
# 2:  10.0.0.1            8.234ms pmtu 1500
# 3:  tunnel-endpoint    12.234ms pmtu 1400  ← MTU drops here
# 4:  8.8.8.8            15.234ms reached
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Manual PMTUD&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Find MTU using ping with DF flag
# Start with 1500, decrease until works

# Linux/VyOS
ping -M do -s 1472 8.8.8.8  # 1472 + 28 = 1500

# If fails, decrease size
ping -M do -s 1400 8.8.8.8
ping -M do -s 1372 8.8.8.8  # For tunnels with 1400 MTU
&lt;/code&gt;&lt;/pre&gt;
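&lt;p&gt;Stepping through sizes by hand gets tedious. A short loop walks the payload size down until a DF-flagged probe gets through (a rough sketch; the destination and size list are examples):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Probe decreasing payload sizes until one passes with DF set
DEST=8.8.8.8
for SIZE in 1472 1452 1436 1412 1392 1372; do
    if ping -M do -c 1 -W 1 -s $SIZE $DEST &amp;gt;/dev/null 2&amp;gt;&amp;amp;1; then
        echo &quot;Path MTU is at least $((SIZE + 28))&quot;
        break
    fi
done
&lt;/code&gt;&lt;/pre&gt;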
&lt;h3&gt;Common MTU Values&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;MTU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ethernet&lt;/td&gt;
&lt;td&gt;1500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PPPoE&lt;/td&gt;
&lt;td&gt;1492&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GRE tunnel&lt;/td&gt;
&lt;td&gt;1476&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPsec (AES)&lt;/td&gt;
&lt;td&gt;~1400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WireGuard&lt;/td&gt;
&lt;td&gt;1420&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VXLAN&lt;/td&gt;
&lt;td&gt;1450&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;PMTUD Problems&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Symptoms of MTU problems:
# - Small packets work, large fail
# - SSH works, SCP hangs
# - Web pages partially load
# - TLS handshake fails

# Diagnose:
tracepath problematic-host.com

# Fix on VyOS:
# Option 1: Lower interface MTU
set interfaces ethernet eth0 mtu 1400

# Option 2: MSS clamping (better for VPN)
set firewall options interface eth0 adjust-mss 1360
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Diagnosing Common Problems&lt;/h2&gt;
&lt;h3&gt;Problem 1: Intermittent Packet Loss&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Run MTR for extended period
mtr -c 1000 destination.com

# Look for:
# - Consistent loss at specific hop
# - Loss that varies with time
# - Loss only during certain hours

# If loss at hop N continues to destination:
# Problem is at hop N or before
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 2: High Latency Spikes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check MTR StDev column
# High StDev = inconsistent latency

# Possible causes:
# - Congestion (check time of day)
# - Buffer bloat (test with different packet sizes)
# - Routing changes (check Wrst column)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problem 3: Path Changes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Run traceroute multiple times
for i in {1..10}; do traceroute -n 8.8.8.8; sleep 60; done

# Compare paths
# Different paths = load balancing or instability
&lt;/code&gt;&lt;/pre&gt;
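&lt;p&gt;Eyeballing ten traceroutes is error-prone. Saving just the hop addresses and diffing two runs makes path changes obvious (a minimal sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Record hop IPs from two runs, then compare
traceroute -n 8.8.8.8 | tail -n +2 | awk &apos;{print $2}&apos; &amp;gt; /tmp/path1
sleep 300
traceroute -n 8.8.8.8 | tail -n +2 | awk &apos;{print $2}&apos; &amp;gt; /tmp/path2

diff /tmp/path1 /tmp/path2 &amp;amp;&amp;amp; echo &quot;Path unchanged&quot;
&lt;/code&gt;&lt;/pre&gt;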
&lt;h3&gt;Problem 4: Blackhole&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Traceroute stops at specific hop, no destination
1  192.168.1.1
2  10.0.0.1
3  172.16.0.1
4  * * *
5  * * *

# Possible causes:
# - Firewall blocking
# - Routing problem (no return path)
# - MTU blackhole (try smaller packets)

# Test with different methods:
traceroute -T -p 80 destination   # TCP
traceroute -I destination         # ICMP
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VyOS Specific Diagnostics&lt;/h2&gt;
&lt;h3&gt;Check Local Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify route to destination
show ip route 8.8.8.8

# Check for multiple paths
show ip route 8.8.8.8 longer-prefixes

# BGP-specific path info
show ip bgp 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Interface Stats&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Look for errors
show interfaces ethernet eth0

# Key metrics:
# - RX/TX errors
# - RX/TX drops
# - Collisions (shouldn&apos;t happen on modern networks)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Firewall Drops&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable logging on drop rules
set firewall ipv4 name WAN-IN rule 999 log enable

# Check logs
show log | grep DROP

# Might reveal blocked traffic
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Scripting Diagnostics&lt;/h2&gt;
&lt;h3&gt;Continuous Monitoring Script&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/path-monitor.sh

DESTINATION=$1
LOG_FILE=&quot;/var/log/path-monitor.log&quot;

while true; do
    echo &quot;=== $(date) ===&quot; &amp;gt;&amp;gt; $LOG_FILE
    mtr -r -c 10 $DESTINATION &amp;gt;&amp;gt; $LOG_FILE
    sleep 300
done
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Alert on High Loss&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/check-loss.sh

DESTINATION=&quot;8.8.8.8&quot;
THRESHOLD=5

LOSS=$(mtr -r -c 100 &quot;$DESTINATION&quot; | tail -1 | awk &apos;{print $3}&apos; | sed &apos;s/%//&apos;)

if [ -n &quot;$LOSS&quot; ] &amp;amp;&amp;amp; [ &quot;$(echo &quot;$LOSS &amp;gt; $THRESHOLD&quot; | bc)&quot; -eq 1 ]; then
    echo &quot;High loss detected: ${LOSS}%&quot; | mail -s &quot;Network Alert&quot; admin@example.com
fi
&lt;/code&gt;&lt;/pre&gt;
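&lt;p&gt;Instead of running the check by hand, VyOS can schedule it with the built-in task scheduler (the interval and script path here are examples):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure
set system task-scheduler task check-loss executable path /config/scripts/check-loss.sh
set system task-scheduler task check-loss interval 15m
commit
&lt;/code&gt;&lt;/pre&gt;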
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Always Test Bidirectionally&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Problem might be in return path
# Test from both ends when possible

# From VyOS:
mtr remote-host

# From remote host:
mtr vyos-router
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Test the Actual Service&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ICMP might work when TCP doesn&apos;t
mtr -T -P 443 website.com
mtr -T -P 22 server.com
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Collect Data Over Time&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# One-time test might miss intermittent issues
# Run extended tests during problem periods
mtr -c 500 problematic-host.com
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Document Baseline&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Know what &quot;normal&quot; looks like
# Run MTR when everything works
# Compare during problems
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Ping alone is never enough.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ping tells you: &quot;Something responded to ICMP.&quot;&lt;/p&gt;
&lt;p&gt;MTR tells you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which hops are on the path&lt;/li&gt;
&lt;li&gt;Where packet loss occurs&lt;/li&gt;
&lt;li&gt;Whether loss is real or ICMP rate-limiting&lt;/li&gt;
&lt;li&gt;Latency variations and patterns&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When troubleshooting:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with MTR, not ping&lt;/li&gt;
&lt;li&gt;Run long enough to catch patterns&lt;/li&gt;
&lt;li&gt;Test with relevant protocol (TCP/UDP)&lt;/li&gt;
&lt;li&gt;Test bidirectionally&lt;/li&gt;
&lt;li&gt;Compare to known baseline&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The path is usually the problem, not the endpoint. MTR shows you the path.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>troubleshooting</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>RADIUS and TACACS+: Centralized Authentication for Network Devices</title><link>https://ashimov.com/posts/vyos-aaa/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-aaa/</guid><description>Configure VyOS with RADIUS and TACACS+ for centralized AAA. Covers server setup, failover configuration, command authorization, accounting, and why central auth is non-negotiable at scale.</description><pubDate>Tue, 09 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Managing users on 50 routers means changing passwords in 50 places. Someone leaves the company — 50 deletions. New hire — 50 accounts to create. Password policy change — 50 updates.&lt;/p&gt;
&lt;p&gt;RADIUS and TACACS+ solve this. Users authenticate against a central server. Create once, authenticate everywhere. Revoke once, locked out everywhere.&lt;/p&gt;
&lt;p&gt;At scale, central authentication is non-negotiable.&lt;/p&gt;
&lt;h2&gt;AAA Concepts&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Authentication&lt;/strong&gt;: Who are you? (username/password, keys)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authorization&lt;/strong&gt;: What can you do? (privilege levels, commands)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accounting&lt;/strong&gt;: What did you do? (logging, audit)&lt;/li&gt;
&lt;/ul&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;RADIUS&lt;/th&gt;
&lt;th&gt;TACACS+&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Port&lt;/td&gt;
&lt;td&gt;UDP 1812/1813&lt;/td&gt;
&lt;td&gt;TCP 49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encryption&lt;/td&gt;
&lt;td&gt;Password only&lt;/td&gt;
&lt;td&gt;Full packet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authorization&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Per-command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Network access&lt;/td&gt;
&lt;td&gt;Device management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For router management, TACACS+ is generally preferred because it supports per-command authorization.&lt;/p&gt;
&lt;h2&gt;RADIUS Configuration&lt;/h2&gt;
&lt;h3&gt;Basic RADIUS Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add RADIUS server
set system login radius server 10.0.0.100 key &quot;RadiusSecretKey123&quot;
set system login radius server 10.0.0.100 port 1812

# Optional: Set timeout and retries
set system login radius server 10.0.0.100 timeout 3

# Set the source address used for RADIUS requests
set system login radius source-address 192.168.1.1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Multiple RADIUS Servers&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Primary server
set system login radius server 10.0.0.100 key &quot;RadiusKey&quot;
set system login radius server 10.0.0.100 priority 1

# Backup server
set system login radius server 10.0.0.101 key &quot;RadiusKey&quot;
set system login radius server 10.0.0.101 priority 2

# Third server
set system login radius server 10.0.0.102 key &quot;RadiusKey&quot;
set system login radius server 10.0.0.102 priority 3

# VyOS tries servers in priority order
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;RADIUS with Local Fallback&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If RADIUS fails, fall back to local accounts
# Always keep at least one local admin account!

set system login user local-admin full-name &quot;Emergency Local Admin&quot;
set system login user local-admin authentication public-keys emergency key &quot;...&quot;
set system login user local-admin authentication public-keys emergency type ssh-ed25519

# Local accounts are tried after RADIUS fails
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;TACACS+ Configuration&lt;/h2&gt;
&lt;h3&gt;Basic TACACS+ Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add TACACS+ server
set system login tacacs server 10.0.0.100 key &quot;TacacsSecretKey456&quot;
set system login tacacs server 10.0.0.100 port 49

# Set source address
set system login tacacs source-address 192.168.1.1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Multiple TACACS+ Servers&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Primary
set system login tacacs server 10.0.0.100 key &quot;TacacsKey&quot;
set system login tacacs server 10.0.0.100 priority 1

# Backup
set system login tacacs server 10.0.0.101 key &quot;TacacsKey&quot;
set system login tacacs server 10.0.0.101 priority 2

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;TACACS+ Timeout&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Adjust timeout (default is usually fine)
set system login tacacs server 10.0.0.100 timeout 5
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;FreeRADIUS Server Setup&lt;/h2&gt;
&lt;h3&gt;Install FreeRADIUS (Ubuntu/Debian)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;apt update
apt install freeradius freeradius-utils
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure Client (VyOS Router)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/freeradius/3.0/clients.conf

client vyos-router {
    ipaddr = 192.168.1.1
    secret = RadiusSecretKey123
    shortname = vyos-main
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure Users&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/freeradius/3.0/users

# Admin user
admin-user Cleartext-Password := &quot;AdminPassword123&quot;
    Service-Type = Administrative-User,
    Cisco-AVPair = &quot;shell:priv-lvl=15&quot;

# Operator user
operator-user Cleartext-Password := &quot;OperatorPassword456&quot;
    Service-Type = NAS-Prompt-User,
    Cisco-AVPair = &quot;shell:priv-lvl=1&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Start FreeRADIUS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Test configuration
freeradius -X  # Debug mode

# Start service
systemctl enable freeradius
systemctl start freeradius

# Test authentication
radtest admin-user AdminPassword123 localhost 0 testing123
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;TACACS+ Server Setup (tac_plus)&lt;/h2&gt;
&lt;h3&gt;Install tac_plus&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;apt install tacacs+
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure tac_plus&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/tacacs+/tac_plus.conf

key = &quot;TacacsSecretKey456&quot;

accounting file = /var/log/tac_plus.acct

user = admin-user {
    member = admins
    login = cleartext &quot;AdminPassword123&quot;
}

user = operator-user {
    member = operators
    login = cleartext &quot;OperatorPassword456&quot;
}

group = admins {
    default service = permit
    service = exec {
        priv-lvl = 15
    }
}

group = operators {
    default service = deny
    service = exec {
        priv-lvl = 1
    }
    cmd = show {
        permit .*
    }
    cmd = ping {
        permit .*
    }
    cmd = traceroute {
        permit .*
    }
}
&lt;/code&gt;&lt;/pre&gt;
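&lt;p&gt;Before restarting the daemon, check that the file actually parses. The shrubbery tac_plus build has a parse-only mode; verify the flags against your packaged version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Parse the configuration and exit without serving requests
tac_plus -P -C /etc/tacacs+/tac_plus.conf
&lt;/code&gt;&lt;/pre&gt;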
&lt;h3&gt;Start tac_plus&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;systemctl enable tacacs+
systemctl start tacacs+
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VyOS User Levels via AAA&lt;/h2&gt;
&lt;h3&gt;RADIUS Attributes for VyOS&lt;/h3&gt;
&lt;p&gt;VyOS uses standard RADIUS attributes. To set privilege level:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# In FreeRADIUS users file
admin-user Cleartext-Password := &quot;password&quot;
    Service-Type = Administrative-User
# Maps to admin level

operator-user Cleartext-Password := &quot;password&quot;
    Service-Type = NAS-Prompt-User
# Maps to operator level
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;TACACS+ Privilege Levels&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# In tac_plus.conf
service = exec {
    priv-lvl = 15  # Admin access
}

service = exec {
    priv-lvl = 1   # Operator access
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Testing Authentication&lt;/h2&gt;
&lt;h3&gt;Test from VyOS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Try to SSH with RADIUS/TACACS user
ssh admin-user@vyos-router

# Check logs
show log | grep -i radius
show log | grep -i tacacs
show log | grep -i auth
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test from Server&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# RADIUS test
radtest admin-user AdminPassword123 localhost 0 testing123

# TACACS+ test (requires test tool)
# Connect to VyOS and try login
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Debug Authentication Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On VyOS, check logs
show log | grep -i pam
show log | grep -i auth

# On RADIUS server (debug mode)
freeradius -X

# Common issues:
# - Wrong shared secret
# - Firewall blocking ports
# - Source address mismatch
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Accounting Configuration&lt;/h2&gt;
&lt;h3&gt;RADIUS Accounting&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# RADIUS accounting sends session start/stop records
# Usually configured on RADIUS server side

# Check accounting logs on server
cat /var/log/radius/radacct/*/detail-*
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;TACACS+ Accounting&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# tac_plus logs commands
# Accounting file location in tac_plus.conf:
accounting file = /var/log/tac_plus.acct

# View accounting log
tail -f /var/log/tac_plus.acct
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;High Availability Setup&lt;/h2&gt;
&lt;h3&gt;Primary/Backup with Health Check&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Configure multiple servers
set system login radius server 10.0.0.100 priority 1
set system login radius server 10.0.0.101 priority 2

# VyOS automatically fails over if primary unavailable
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Geographic Distribution&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Datacenter 1
set system login radius server 10.1.0.100 priority 1

# Datacenter 2
set system login radius server 10.2.0.100 priority 2

# Local cache doesn&apos;t exist - ensure server availability
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Local Fallback (Critical)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# ALWAYS keep local emergency account
set system login user emergency-admin authentication public-keys emergency key &quot;...&quot;
set system login user emergency-admin authentication public-keys emergency type ssh-ed25519
set system login user emergency-admin level admin

# If ALL RADIUS/TACACS servers fail, local accounts work
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Integration with LDAP/AD&lt;/h2&gt;
&lt;p&gt;RADIUS/TACACS+ can proxy to LDAP/Active Directory:&lt;/p&gt;
&lt;h3&gt;FreeRADIUS with LDAP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/freeradius/3.0/mods-enabled/ldap

ldap {
    server = &apos;ldap.example.com&apos;
    port = 389
    identity = &apos;cn=radius,dc=example,dc=com&apos;
    password = ldap_password
    base_dn = &apos;dc=example,dc=com&apos;

    user {
        base_dn = &quot;ou=users,${..base_dn}&quot;
        filter = &quot;(uid=%{%{Stripped-User-Name}:-%{User-Name}})&quot;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Group-Based Access&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Map LDAP groups to VyOS levels
# In FreeRADIUS:

DEFAULT Ldap-Group == &quot;network-admins&quot;
    Service-Type = Administrative-User

DEFAULT Ldap-Group == &quot;network-operators&quot;
    Service-Type = NAS-Prompt-User
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Security Best Practices&lt;/h2&gt;
&lt;h3&gt;Secure Shared Secrets&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use strong secrets (32+ characters)
# Different secret per device (ideally)
# Store secrets in vault, not text files
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Network Security&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# TACACS+ port
set firewall ipv4 name MGMT-OUT rule 100 action accept
set firewall ipv4 name MGMT-OUT rule 100 destination port 49
set firewall ipv4 name MGMT-OUT rule 100 destination address 10.0.0.100
set firewall ipv4 name MGMT-OUT rule 100 protocol tcp

# RADIUS ports
set firewall ipv4 name MGMT-OUT rule 110 action accept
set firewall ipv4 name MGMT-OUT rule 110 destination port 1812-1813
set firewall ipv4 name MGMT-OUT rule 110 destination address 10.0.0.100
set firewall ipv4 name MGMT-OUT rule 110 protocol udp
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Encrypt Traffic&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# TACACS+ encrypts full packet (preferred)
# RADIUS only encrypts password (use with caution over untrusted networks)

# Consider:
# - VPN between router and AAA server
# - Dedicated management network
# - IPsec protected links
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Authentication Fails&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# 1. Verify connectivity
ping 10.0.0.100

# 2. Check ports
nc -zv 10.0.0.100 49      # TACACS+
nc -zvu 10.0.0.100 1812   # RADIUS (UDP probe is best-effort; no reply does not prove open)

# 3. Check VyOS logs
show log | grep -i auth

# 4. Check server logs
# RADIUS: /var/log/freeradius/radius.log
# TACACS+: /var/log/tac_plus.log

# 5. Test locally on server
radtest user pass localhost 0 testing123
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Server Unreachable&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check source address configuration
show configuration commands | grep source-address

# Verify routing
show ip route 10.0.0.100

# Check firewall rules
show firewall
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;At scale, central authentication is non-negotiable.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;With local accounts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Employee leaves → update 50 routers&lt;/li&gt;
&lt;li&gt;Password breach → rotate on 50 routers&lt;/li&gt;
&lt;li&gt;New hire → create on 50 routers&lt;/li&gt;
&lt;li&gt;Audit → check 50 routers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With central AAA:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Employee leaves → disable one account&lt;/li&gt;
&lt;li&gt;Password breach → one place to update&lt;/li&gt;
&lt;li&gt;New hire → one account creation&lt;/li&gt;
&lt;li&gt;Audit → one central log&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The setup takes a few hours. The ongoing management saves hundreds of hours per year.&lt;/p&gt;
&lt;p&gt;Requirements:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Redundancy&lt;/strong&gt;: Multiple AAA servers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fallback&lt;/strong&gt;: Local emergency account always&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logging&lt;/strong&gt;: Central accounting for audit&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;: Encrypted protocols, strong secrets&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Don&apos;t run production network devices with only local accounts. Central authentication is infrastructure, not luxury.&lt;/p&gt;
</content:encoded><category>vyos</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>User Management: Local Users, SSH Keys, and Access Control</title><link>https://ashimov.com/posts/vyos-users/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-users/</guid><description>Configure VyOS user management properly. Covers local user creation, SSH key authentication, privilege levels, password policies, and why root password should be disabled.</description><pubDate>Fri, 05 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Default VyOS has one user: &lt;code&gt;vyos&lt;/code&gt; with password &lt;code&gt;vyos&lt;/code&gt;. If that&apos;s still your production setup, you have a security problem. Every scanning bot knows those credentials.&lt;/p&gt;
&lt;p&gt;Proper user management means: individual accounts, SSH keys instead of passwords, privilege separation, and audit trails. When something breaks, you need to know who touched what.&lt;/p&gt;
&lt;p&gt;Shared accounts are an audit nightmare. Individual accounts with SSH keys are the baseline.&lt;/p&gt;
&lt;h2&gt;Default User Problem&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# Default credentials
Username: vyos
Password: vyos

# Every automated scanner knows this
# First thing to change on new installation
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Creating Users&lt;/h2&gt;
&lt;h3&gt;Basic User Creation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create admin user with full privileges
set system login user admin full-name &quot;System Administrator&quot;
set system login user admin authentication plaintext-password &quot;SecurePassword123!&quot;

# Create operator user with limited access
set system login user operator full-name &quot;Network Operator&quot;
set system login user operator authentication plaintext-password &quot;OperatorPass456!&quot;
set system login user operator level operator

commit
&lt;/code&gt;&lt;/pre&gt;
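&lt;p&gt;On commit, VyOS hashes the plaintext password; only the hash is stored in the running config. You can confirm this right after committing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# plaintext-password is converted to encrypted-password on commit
show configuration commands | grep encrypted-password

# set system login user admin authentication encrypted-password &apos;$6$...&apos;
&lt;/code&gt;&lt;/pre&gt;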
&lt;h3&gt;User Levels&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Access&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;admin&lt;/td&gt;
&lt;td&gt;Full configuration access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;operator&lt;/td&gt;
&lt;td&gt;Show commands, limited operational commands&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;# Admin level (default)
set system login user admin level admin

# Operator level (read-mostly)
set system login user operator level operator
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Operator Limitations&lt;/h3&gt;
&lt;p&gt;Operators can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;View configuration&lt;/li&gt;
&lt;li&gt;Run show commands&lt;/li&gt;
&lt;li&gt;Basic operational commands&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Operators cannot:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enter configuration mode&lt;/li&gt;
&lt;li&gt;Modify settings&lt;/li&gt;
&lt;li&gt;Restart services&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;SSH Key Authentication&lt;/h2&gt;
&lt;h3&gt;Generate Keys (Client Side)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On your workstation
ssh-keygen -t ed25519 -C &quot;admin@example.com&quot;
# Or RSA if ed25519 not supported
ssh-keygen -t rsa -b 4096 -C &quot;admin@example.com&quot;

# Get public key
cat ~/.ssh/id_ed25519.pub
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Add Key to VyOS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Add SSH key for user
set system login user admin authentication public-keys admin@workstation key &quot;AAAAC3NzaC1lZDI1NTE5AAAAIBxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&quot;
set system login user admin authentication public-keys admin@workstation type ssh-ed25519

# Or for RSA
set system login user admin authentication public-keys admin@workstation key &quot;AAAAB3NzaC1yc2EAAAADAQABAAACAQxxxxxxxxx&quot;
set system login user admin authentication public-keys admin@workstation type ssh-rsa

commit
&lt;/code&gt;&lt;/pre&gt;
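&lt;p&gt;A common stumble: the &lt;code&gt;key&lt;/code&gt; field takes only the base64 body of the public key, without the &lt;code&gt;ssh-ed25519&lt;/code&gt; prefix or the trailing comment. Extract just that column before pasting:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Print only the base64 key body for the VyOS &quot;key&quot; field
awk &apos;{print $2}&apos; ~/.ssh/id_ed25519.pub
&lt;/code&gt;&lt;/pre&gt;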
&lt;h3&gt;Multiple Keys Per User&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Work laptop
set system login user admin authentication public-keys work-laptop key &quot;...&quot;
set system login user admin authentication public-keys work-laptop type ssh-ed25519

# Home workstation
set system login user admin authentication public-keys home-desktop key &quot;...&quot;
set system login user admin authentication public-keys home-desktop type ssh-ed25519

# Emergency key (stored securely)
set system login user admin authentication public-keys emergency key &quot;...&quot;
set system login user admin authentication public-keys emergency type ssh-ed25519
&lt;/code&gt;&lt;/pre&gt;
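&lt;p&gt;Typing these commands by hand invites copy-paste errors in the key blob. A small helper can generate them from a standard &lt;code&gt;authorized_keys&lt;/code&gt;-style line; the function name and output format below are illustrative, not a VyOS tool:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: turn an OpenSSH public-key line into VyOS set commands.
# Assumes the key comment is usable as the public-keys identifier.
def pubkey_to_vyos(line, user):
    key_type, blob, comment = line.strip().split(None, 2)
    base = f&quot;set system login user {user} authentication public-keys {comment}&quot;
    return [f&apos;{base} key &quot;{blob}&quot;&apos;, f&apos;{base} type {key_type}&apos;]

for cmd in pubkey_to_vyos(&quot;ssh-ed25519 AAAATESTBLOB admin@workstation&quot;, &quot;admin&quot;):
    print(cmd)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Paste the printed lines into configure mode; the blob goes in without the &lt;code&gt;ssh-ed25519&lt;/code&gt; prefix, which is exactly what VyOS expects.&lt;/p&gt;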
&lt;h3&gt;Disable Password Authentication&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# After adding SSH keys, disable password login
set service ssh disable-password-authentication

commit

# Now only SSH key authentication works
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Remove Default User&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;configure

# First, ensure you can login with new user!
# Test SSH key login in another terminal before deleting vyos user

# Delete default user
delete system login user vyos

commit

# If you lock yourself out, you&apos;ll need console access
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;SSH Service Configuration&lt;/h2&gt;
&lt;h3&gt;Basic SSH Hardening&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Listen only on management interface
set service ssh listen-address 192.168.1.1

# Change port (optional, security through obscurity)
set service ssh port 22222

# Disable password authentication
set service ssh disable-password-authentication

# Set login timeout
set service ssh timeout 120

# Limit authentication attempts
set service ssh max-auth-retries 3

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Allowed Networks&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use firewall to limit SSH source IPs
set firewall ipv4 name MGMT-LOCAL rule 10 action accept
set firewall ipv4 name MGMT-LOCAL rule 10 destination port 22
set firewall ipv4 name MGMT-LOCAL rule 10 protocol tcp
set firewall ipv4 name MGMT-LOCAL rule 10 source address 10.0.0.0/24
set firewall ipv4 name MGMT-LOCAL rule 10 description &quot;SSH from admin network only&quot;

set firewall ipv4 name MGMT-LOCAL rule 999 action drop
set firewall ipv4 name MGMT-LOCAL rule 999 description &quot;Drop all other&quot;
&lt;/code&gt;&lt;/pre&gt;
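&lt;p&gt;It&apos;s worth sanity-checking which sources the rule actually covers before you commit. Python&apos;s &lt;code&gt;ipaddress&lt;/code&gt; module mirrors the prefix match; 10.0.0.0/24 is the admin network from rule 10 above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import ipaddress

ADMIN_NET = ipaddress.ip_network(&quot;10.0.0.0/24&quot;)  # source in rule 10

def ssh_allowed(src_ip):
    &quot;&quot;&quot;Would rule 10 match this source address?&quot;&quot;&quot;
    return ipaddress.ip_address(src_ip) in ADMIN_NET

print(ssh_allowed(&quot;10.0.0.42&quot;))  # True: inside the admin network
print(ssh_allowed(&quot;10.0.1.42&quot;))  # False: falls through to rule 999 and is dropped
&lt;/code&gt;&lt;/pre&gt;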
&lt;h3&gt;SSH Client Keepalive&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Keep connections alive
set service ssh client-keepalive-interval 60
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Per-User Restrictions&lt;/h2&gt;
&lt;h3&gt;Restrict User to Specific Source&lt;/h3&gt;
&lt;p&gt;VyOS can&apos;t restrict a login to a specific source address directly, but a firewall rule approximates it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create group for restricted user&apos;s source
set firewall group network-group OPERATOR-NETS network 192.168.10.0/24

# Firewall rule allowing operator SSH only from specific network
# Combined with per-user SSH keys for enforcement
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Login Tracking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# View current sessions
show users

# View login history
show log | grep -i ssh
show log | grep -i login

# Last logins
last
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Emergency Access&lt;/h2&gt;
&lt;h3&gt;Console Access&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Serial console always works
# Configure serial port
set system console device ttyS0 speed 115200
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Emergency User&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create break-glass account
set system login user emergency full-name &quot;Emergency Access&quot;
set system login user emergency authentication public-keys emergency key &quot;...&quot;
set system login user emergency authentication public-keys emergency type ssh-ed25519

# Store private key securely (safe, vault, etc.)
# Only use when normal access fails
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Password Policies&lt;/h2&gt;
&lt;p&gt;VyOS doesn&apos;t have built-in password policies, so enforce these practices by convention:&lt;/p&gt;
&lt;h3&gt;Strong Passwords&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# When setting passwords, enforce complexity
# Minimum 12 characters
# Mix of upper, lower, numbers, symbols

# Example (use a password manager to generate; single quotes stop the
# shell from expanding $ and &amp;amp;)
set system login user admin authentication plaintext-password &apos;K8#mP9$nL2@qR5&amp;amp;w&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Encrypted Password Storage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VyOS shows passwords encrypted in config
show configuration commands | grep authentication

# Output shows hash, not plaintext:
# set system login user admin authentication encrypted-password &apos;$6$rounds=xxx$...&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Regular Password Rotation&lt;/h3&gt;
&lt;p&gt;VyOS offers no automated rotation, so establish a manual process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Document rotation schedule&lt;/li&gt;
&lt;li&gt;Use calendar reminders&lt;/li&gt;
&lt;li&gt;Change all passwords&lt;/li&gt;
&lt;li&gt;Update documentation&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Service Accounts&lt;/h2&gt;
&lt;p&gt;For automation (Ansible, scripts):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create service account
set system login user ansible full-name &quot;Ansible Automation&quot;
set system login user ansible authentication public-keys ansible-server key &quot;...&quot;
set system login user ansible authentication public-keys ansible-server type ssh-ed25519

# Admin level needed for configuration
set system login user ansible level admin

# Consider: dedicated key per automation tool
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Audit Trail&lt;/h2&gt;
&lt;h3&gt;Enable Logging&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VyOS logs authentication events to syslog
show log | grep -i auth

# Send to remote syslog for retention
set system syslog host 10.0.0.100 facility auth level info
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What Gets Logged&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;SSH login success/failure&lt;/li&gt;
&lt;li&gt;Configuration changes&lt;/li&gt;
&lt;li&gt;Privilege escalation&lt;/li&gt;
&lt;li&gt;User source IP&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Review Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Recent auth events
show log | grep -i auth | tail -50

# Failed logins
show log | grep -i &quot;Failed password&quot;

# Configuration changes
show log | grep -i commit
&lt;/code&gt;&lt;/pre&gt;
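&lt;p&gt;When brute-force attempts show up, a quick tally per source IP tells you whether it&apos;s one scanner or many. A minimal sketch, assuming standard sshd syslog lines as shown by &lt;code&gt;show log&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re
from collections import Counter

# Sample sshd log lines (format assumed from a typical setup)
LOG = &quot;&quot;&quot;\
Jan  8 14:02:11 router sshd[913]: Failed password for invalid user test from 203.0.113.9 port 55110 ssh2
Jan  8 14:02:15 router sshd[913]: Failed password for root from 203.0.113.9 port 55111 ssh2
Jan  8 14:03:01 router sshd[921]: Failed password for admin from 198.51.100.7 port 40022 ssh2
&quot;&quot;&quot;

# Count failed logins per source IP
fails = Counter(re.findall(r&quot;Failed password for .*? from (\S+) port&quot;, LOG))
print(fails.most_common())  # [(&apos;203.0.113.9&apos;, 2), (&apos;198.51.100.7&apos;, 1)]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the same idea against exported syslog on your log server; anything with a high count is a candidate for a firewall block.&lt;/p&gt;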
&lt;h2&gt;Multi-User Setup Example&lt;/h2&gt;
&lt;h3&gt;Complete Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Admin users (full access)
set system login user admin1 full-name &quot;Alice Admin&quot;
set system login user admin1 authentication public-keys laptop key &quot;...&quot;
set system login user admin1 authentication public-keys laptop type ssh-ed25519
set system login user admin1 level admin

set system login user admin2 full-name &quot;Bob Admin&quot;
set system login user admin2 authentication public-keys laptop key &quot;...&quot;
set system login user admin2 authentication public-keys laptop type ssh-ed25519
set system login user admin2 level admin

# Operator users (limited access)
set system login user noc1 full-name &quot;NOC Operator 1&quot;
set system login user noc1 authentication public-keys workstation key &quot;...&quot;
set system login user noc1 authentication public-keys workstation type ssh-ed25519
set system login user noc1 level operator

# Service account (for automation)
set system login user ansible full-name &quot;Ansible Service&quot;
set system login user ansible authentication public-keys server key &quot;...&quot;
set system login user ansible authentication public-keys server type ssh-ed25519
set system login user ansible level admin

# Remove default user
delete system login user vyos

# SSH hardening
set service ssh disable-password-authentication

commit
save
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Shared accounts are an audit nightmare. Individual accounts with SSH keys are the baseline.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Minimum requirements:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Individual accounts&lt;/strong&gt;: One user = one person&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSH keys&lt;/strong&gt;: No password authentication&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Principle of least privilege&lt;/strong&gt;: Operators don&apos;t need admin&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Remove defaults&lt;/strong&gt;: Delete vyos user&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Log everything&lt;/strong&gt;: Remote syslog for audit&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When the next security incident happens:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;With shared accounts: &quot;Someone changed something&quot;&lt;/li&gt;
&lt;li&gt;With individual accounts: &quot;admin1 changed firewall rule 50 at 14:32 from IP 10.0.0.55&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The audit trail is the difference between &quot;we don&apos;t know&quot; and &quot;here&apos;s exactly what happened.&quot;&lt;/p&gt;
&lt;p&gt;Set up users properly from day one. Retrofitting access control during an incident is not fun.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Upgrade Playbook: Safe Upgrades, Rollback, and Migration Testing</title><link>https://ashimov.com/posts/vyos-upgrade/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-upgrade/</guid><description>Master VyOS upgrades without downtime or disasters. Covers image management, rollback procedures, pre-upgrade testing, migration paths, and why upgrades need a playbook, not improvisation.</description><pubDate>Tue, 02 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&quot;Just upgrade to the latest version&quot; sounds simple. Then you discover your config syntax changed, a feature was deprecated, and your BGP peers are down. The router is 200 km away. It&apos;s Friday at 6 PM.&lt;/p&gt;
&lt;p&gt;VyOS image-based upgrades are actually quite safe — if you follow a process. The system keeps multiple images. Rollback is one reboot away. But you need to test before production.&lt;/p&gt;
&lt;p&gt;Upgrades need a playbook, not improvisation.&lt;/p&gt;
&lt;h2&gt;VyOS Image System&lt;/h2&gt;
&lt;p&gt;VyOS runs from images. Multiple images can coexist:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# List installed images
show system image

# Output:
# The system currently has the following image(s) installed:
#   1: 1.4.0 (default boot)
#   2: 1.3.5 (running)
#   3: 1.3.4
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key concepts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Running image&lt;/strong&gt;: Currently booted&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Default boot&lt;/strong&gt;: Will boot on next restart&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multiple images&lt;/strong&gt;: Can have several installed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Pre-Upgrade Checklist&lt;/h2&gt;
&lt;h3&gt;1. Backup Everything&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Full config backup
show configuration commands &amp;gt; /config/backup-pre-upgrade.txt

# Save to local file
save /config/config.boot.backup-$(date +%Y%m%d)

# Copy offsite
scp /config/config.boot admin@backup-server:/backups/

# Also back up custom scripts and user data, if present
tar -czf /tmp/config-backup.tar.gz /config/scripts/ /config/user-data/
scp /tmp/config-backup.tar.gz admin@backup-server:/backups/
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Document Current State&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Version info
show version

# Running config
show configuration

# Interface status
show interfaces

# Routing state
show ip route
show ip bgp summary
show ip ospf neighbor

# Save all this output!
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Check Release Notes&lt;/h3&gt;
&lt;p&gt;Before upgrading, read:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Release notes for target version&lt;/li&gt;
&lt;li&gt;Migration notes&lt;/li&gt;
&lt;li&gt;Known issues&lt;/li&gt;
&lt;li&gt;Deprecated features&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# VyOS release notes
# https://docs.vyos.io/en/latest/changelog/

# Check what changed between versions
# Pay attention to:
# - Breaking changes
# - Syntax changes
# - Feature deprecations
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Verify Disk Space&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check space for new image
df -h

# Images typically need 1-2 GB
# If low on space, remove old images
delete system image 1.3.3
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Download and Add Image&lt;/h2&gt;
&lt;h3&gt;From Release Server&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Add image from URL
add system image https://github.com/vyos/vyos-rolling-nightly-builds/releases/download/1.4-rolling-YYYYMMDD/vyos-1.4-rolling-YYYYMMDD-amd64.iso

# Or from local file
add system image /tmp/vyos-1.4.0-amd64.iso
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify Download&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VyOS verifies image signature automatically
# Watch for verification messages during add

# After adding:
show system image
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Upgrade Process&lt;/h2&gt;
&lt;h3&gt;Step 1: Add New Image&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Download and add
add system image https://path/to/vyos-1.4.0-amd64.iso

# Follow prompts:
# - Confirm image signature
# - Keep or overwrite config
# - Set as default boot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Set Default Boot&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If not set during add:
set system image default-boot 1.4.0

# Verify
show system image
# Should show 1.4.0 as default boot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Reboot with Confirm Plan&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Before rebooting, ensure you have:
# - Console access (in case new image fails)
# - Rollback plan documented
# - Maintenance window scheduled

# Reboot
reboot

# Or schedule for off-hours
reboot at 02:00
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Verify After Boot&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check version
show version

# Verify config loaded
show configuration

# Check critical services
show interfaces
show ip bgp summary
show ip ospf neighbor

# Test connectivity
ping 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
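&lt;p&gt;Comparing the saved pre-upgrade output against the post-upgrade state catches interfaces that didn&apos;t come back. A rough sketch, assuming you captured &lt;code&gt;show interfaces&lt;/code&gt; output to plain text on both sides:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def up_interfaces(show_output):
    &quot;&quot;&quot;Names of interfaces reported up/up (u/u) in `show interfaces` text.&quot;&quot;&quot;
    return {line.split()[0] for line in show_output.splitlines()
            if line and not line.startswith(&quot; &quot;) and &quot;u/u&quot; in line}

pre  = &quot;eth0  192.0.2.1/30  u/u\neth1  10.0.0.1/24  u/u\n&quot;
post = &quot;eth0  192.0.2.1/30  u/u\neth1  -             A/D\n&quot;

missing = up_interfaces(pre) - up_interfaces(post)
print(missing)  # {&apos;eth1&apos;}: investigate before signing off the upgrade
&lt;/code&gt;&lt;/pre&gt;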
&lt;h2&gt;Rollback Procedures&lt;/h2&gt;
&lt;h3&gt;If New Image Fails to Boot&lt;/h3&gt;
&lt;p&gt;At GRUB menu:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Select previous image from boot menu&lt;/li&gt;
&lt;li&gt;System boots with old image&lt;/li&gt;
&lt;li&gt;Config is preserved&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;After Booting Bad Image&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Set old image as default
set system image default-boot 1.3.5

# Reboot to old image
reboot

# After reboot, optionally delete bad image
delete system image 1.4.0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;During Upgrade Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If config migration fails
# System usually keeps backup

# Check for backup configs
ls /config/

# Restore backup
cp /config/config.boot.backup-20250108 /config/config.boot
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Testing New Images&lt;/h2&gt;
&lt;h3&gt;Test in Lab First&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create test VM
# Same hardware profile as production

# Install current production version
# Apply production config (sanitized)
# Upgrade to new version
# Test everything

# Lab testing catches:
# - Config migration issues
# - Feature deprecations
# - Breaking changes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Staged Rollout&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Day 1: Upgrade lab
Day 2-3: Test all features in lab
Day 4: Upgrade least critical production router
Day 5-7: Monitor
Day 8: Upgrade remaining routers
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;## Upgrade Test Checklist

### Pre-Upgrade
- [ ] Config backup completed
- [ ] Release notes reviewed
- [ ] Lab test passed
- [ ] Maintenance window scheduled
- [ ] Console access verified
- [ ] Team notified

### Post-Upgrade Verification
- [ ] System booted successfully
- [ ] Correct version running
- [ ] All interfaces up
- [ ] BGP sessions established
- [ ] OSPF neighbors formed
- [ ] VPN tunnels up
- [ ] NAT working
- [ ] Firewall rules active
- [ ] DNS resolution working
- [ ] Monitoring connected

### Sign-off
- [ ] All tests passed
- [ ] No errors in logs
- [ ] Performance normal
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Version-Specific Migration&lt;/h2&gt;
&lt;h3&gt;1.3.x to 1.4.x Migration&lt;/h3&gt;
&lt;p&gt;Major syntax changes in VyOS 1.4:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Firewall syntax changed significantly

# 1.3.x style:
set firewall name WAN-IN rule 10 action accept

# 1.4.x style:
set firewall ipv4 name WAN-IN rule 10 action accept
# Note the &apos;ipv4&apos; addition

# VyOS migrates automatically, but verify
show firewall
&lt;/code&gt;&lt;/pre&gt;
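&lt;p&gt;You can preview the rename on a saved command backup before touching the router. This only previews the firewall naming change; the real migration is performed by VyOS&apos;s own scripts during the upgrade:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re

def preview_14_firewall(cmd):
    &quot;&quot;&quot;Rewrite a 1.3-style `set firewall name` command to the 1.4 ipv4 form.&quot;&quot;&quot;
    return re.sub(r&quot;^set firewall name &quot;, &quot;set firewall ipv4 name &quot;, cmd)

old = &quot;set firewall name WAN-IN rule 10 action accept&quot;
print(preview_14_firewall(old))
# set firewall ipv4 name WAN-IN rule 10 action accept
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this over &lt;code&gt;backup-commands.txt&lt;/code&gt; shows roughly how many lines the migration will touch, which is useful for estimating review effort.&lt;/p&gt;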
&lt;h3&gt;Check Migration Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# After upgrade, check for warnings
show log | grep -i migrat
show log | grep -i deprecat
show log | grep -i error

# Migration script output
cat /var/log/vyos-migrate.log
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Downgrade Considerations&lt;/h2&gt;
&lt;h3&gt;Can You Downgrade?&lt;/h3&gt;
&lt;p&gt;Generally yes, but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Config format may have changed&lt;/li&gt;
&lt;li&gt;New features won&apos;t exist in old version&lt;/li&gt;
&lt;li&gt;May need manual config adjustment&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# Downgrade process:
# 1. Keep old image installed
set system image default-boot 1.3.5

# 2. Reboot
reboot

# 3. Old config should load
# 4. Check for issues
show configuration
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Config Compatibility&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Before upgrade, save config in both formats

# As commands (works across versions)
show configuration commands &amp;gt; /config/backup-commands.txt

# As JSON (useful for parsing)
show configuration json &amp;gt; /config/backup-json.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Automation&lt;/h2&gt;
&lt;h3&gt;Ansible Upgrade Playbook&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;---
- name: VyOS Upgrade Playbook
  hosts: vyos_routers
  gather_facts: no

  vars:
    new_image_url: &quot;https://...&quot;
    backup_dir: &quot;/tmp/vyos-backups&quot;

  tasks:
    - name: Backup configuration
      vyos.vyos.vyos_command:
        commands:
          - show configuration commands
      register: config_backup

    - name: Save backup locally
      local_action:
        module: copy
        content: &quot;{{ config_backup.stdout[0] }}&quot;
        dest: &quot;{{ backup_dir }}/{{ inventory_hostname }}-pre-upgrade.txt&quot;

    - name: Download new image
      vyos.vyos.vyos_command:
        commands:
          - &quot;add system image {{ new_image_url }}&quot;
      register: add_result

    - name: Set new image as default
      vyos.vyos.vyos_command:
        commands:
          # default-boot needs an explicit image name or it prompts interactively
          - &quot;set system image default-boot 1.4.0&quot;

    - name: Reboot (async)
      vyos.vyos.vyos_command:
        commands:
          - reboot now
      async: 1
      poll: 0

    - name: Wait for reboot
      wait_for:
        host: &quot;{{ ansible_host }}&quot;
        port: 22
        delay: 60
        timeout: 300

    - name: Verify new version
      vyos.vyos.vyos_command:
        commands:
          - show version
      register: version_check

    - name: Display version
      debug:
        var: version_check.stdout_lines
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Emergency Recovery&lt;/h2&gt;
&lt;h3&gt;Boot from ISO&lt;/h3&gt;
&lt;p&gt;If all images fail:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Boot from VyOS ISO (USB/CD)&lt;/li&gt;
&lt;li&gt;Select &quot;Live&quot; option&lt;/li&gt;
&lt;li&gt;Mount existing config:&lt;pre&gt;&lt;code&gt;mount /dev/sda1 /mnt
cp /mnt/config/config.boot /config/
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Install fresh image:&lt;pre&gt;&lt;code&gt;install image
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Serial Console Recovery&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Connect via serial console
# Speed: 115200 8N1

# At GRUB, can select any installed image
# Even if network is misconfigured
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Upgrades need a playbook, not improvisation.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The playbook:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Backup&lt;/strong&gt; everything before touching anything&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test&lt;/strong&gt; in lab with production config&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read&lt;/strong&gt; release notes for breaking changes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stage&lt;/strong&gt; rollout across multiple days&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify&lt;/strong&gt; everything after upgrade&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rollback&lt;/strong&gt; plan ready before you start&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What makes VyOS upgrades safe:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multiple images coexist&lt;/li&gt;
&lt;li&gt;Rollback is one reboot&lt;/li&gt;
&lt;li&gt;Config usually migrates automatically&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What makes upgrades dangerous:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No backup&lt;/li&gt;
&lt;li&gt;No testing&lt;/li&gt;
&lt;li&gt;No rollback plan&lt;/li&gt;
&lt;li&gt;Friday at 5 PM&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Treat every upgrade as potentially breaking. Have the rollback plan ready. Test first. Then it&apos;s boring and safe — exactly how maintenance should be.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Configuration Standards: Naming, Comments, Structure That Scales</title><link>https://ashimov.com/posts/vyos-config-standards/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-config-standards/</guid><description>Build maintainable VyOS configurations with consistent naming, strategic comments, firewall groups, and policy structure. Learn standards that make configs readable years later.</description><pubDate>Fri, 28 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Look at a router config written three years ago. Can you understand what each rule does? Who added it? Why? If the answer is &quot;no,&quot; you have a maintenance problem.&lt;/p&gt;
&lt;p&gt;Configuration standards aren&apos;t bureaucracy. They&apos;re the difference between &quot;I can fix this in 5 minutes&quot; and &quot;I need to reverse-engineer 500 rules to understand what might break.&quot;&lt;/p&gt;
&lt;p&gt;Good config is self-documenting. Bad config is a puzzle box that only its creator could solve — and they left the company.&lt;/p&gt;
&lt;h2&gt;Naming Conventions&lt;/h2&gt;
&lt;h3&gt;Interface Descriptions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: No description
set interfaces ethernet eth0

# Bad: Useless description
set interfaces ethernet eth0 description &quot;interface 0&quot;

# Good: Purpose and destination
set interfaces ethernet eth0 description &quot;WAN: ISP-Acme-1Gbps-Circuit-12345&quot;
set interfaces ethernet eth1 description &quot;LAN: Server-VLAN-10.0.0.0/24&quot;
set interfaces ethernet eth2 description &quot;MGMT: OOB-Management-172.16.0.0/24&quot;

# Pattern: &amp;lt;ZONE&amp;gt;: &amp;lt;Purpose&amp;gt;-&amp;lt;Details&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VLAN Naming&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Include VLAN ID in description for clarity
set interfaces ethernet eth0 vif 100 description &quot;VLAN100: Production-Servers&quot;
set interfaces ethernet eth0 vif 200 description &quot;VLAN200: Development&quot;
set interfaces ethernet eth0 vif 999 description &quot;VLAN999: Quarantine-Untrusted&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Firewall Names&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Pattern: &amp;lt;FROM-ZONE&amp;gt;-&amp;lt;TO-ZONE&amp;gt; or &amp;lt;INTERFACE&amp;gt;-&amp;lt;DIRECTION&amp;gt;

# Zone-based naming
set firewall ipv4 name WAN-TO-LAN ...
set firewall ipv4 name LAN-TO-WAN ...
set firewall ipv4 name DMZ-TO-LAN ...

# Interface-based naming
set firewall ipv4 name ETH0-IN ...
set firewall ipv4 name ETH0-OUT ...
set firewall ipv4 name ETH0-LOCAL ...

# Pick one pattern, use it consistently
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Firewall Rule Numbering&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Reserve ranges for different purposes:
# 1-99:     Critical infrastructure rules
# 100-199:  Management access
# 200-499:  Application rules
# 500-899:  User rules
# 900-998:  Logging/monitoring rules
# 999:      Default deny (explicit)

set firewall ipv4 name WAN-IN rule 10 description &quot;Allow established&quot;
set firewall ipv4 name WAN-IN rule 100 description &quot;Management: SSH from admin nets&quot;
set firewall ipv4 name WAN-IN rule 200 description &quot;App: Web servers HTTP/HTTPS&quot;
set firewall ipv4 name WAN-IN rule 999 description &quot;Default: Deny and log&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP Peer Naming&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Include AS number and purpose
set protocols bgp neighbor 10.0.0.1 description &quot;AS64512: Transit-ISP-Primary&quot;
set protocols bgp neighbor 10.0.0.2 description &quot;AS64513: Transit-ISP-Backup&quot;
set protocols bgp neighbor 192.168.1.1 description &quot;AS65001: Customer-Acme-Corp&quot;
set protocols bgp neighbor 172.16.1.1 description &quot;AS65000: iBGP-RR-Client-DC2&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Comment Strategy&lt;/h2&gt;
&lt;h3&gt;What to Comment&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Comment WHY, not WHAT
# The config shows what. Comments explain why.

# Bad: States the obvious
set firewall ipv4 name WAN-IN rule 100 description &quot;Allow TCP 22&quot;

# Good: Explains the reason
set firewall ipv4 name WAN-IN rule 100 description &quot;SSH: Admin access per SEC-2025-001&quot;

# Good: References ticket/change
set firewall ipv4 name WAN-IN rule 150 description &quot;Temp: Vendor access until 2025-03-01 - TKT-4521&quot;

# Good: Explains non-obvious choice
set firewall ipv4 name WAN-IN rule 200 description &quot;HTTP: Redirect only, actual traffic via reverse proxy&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Temporary Rules&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Always mark temporary rules with expiration
set firewall ipv4 name WAN-IN rule 500 description &quot;TEMP: Contractor VPN until 2025-02-28 - Remove after project X&quot;

# Create reminder
# Add to monitoring/ticketing system
# Set calendar reminder

# Pattern: TEMP: &amp;lt;purpose&amp;gt; until &amp;lt;date&amp;gt; - &amp;lt;context&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
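&lt;p&gt;The &lt;code&gt;TEMP: ... until &amp;lt;date&amp;gt;&lt;/code&gt; pattern becomes enforceable once something scans for it. A sketch that flags expired entries in a command backup, assuming the YYYY-MM-DD format used above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re
from datetime import date

def expired_temp_rules(config_text, today):
    &quot;&quot;&quot;Flag TEMP descriptions whose &apos;until YYYY-MM-DD&apos; date has passed.&quot;&quot;&quot;
    hits = []
    for m in re.finditer(r&apos;description &quot;TEMP: .*?until (\d{4})-(\d{2})-(\d{2})&apos;, config_text):
        y, mo, d = map(int, m.groups())
        if date(y, mo, d) &amp;lt; today:
            hits.append(m.group(0))
    return hits

cfg = &apos;set firewall ipv4 name WAN-IN rule 500 description &quot;TEMP: Contractor VPN until 2025-02-28 - project X&quot;&apos;
print(expired_temp_rules(cfg, date(2025, 3, 15)))  # rule 500 is past its date
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Wire this into a nightly cron on the backup server and expired rules page you instead of lingering for years.&lt;/p&gt;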
&lt;h3&gt;Configuration Sections&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Use comments to mark logical sections

# === MANAGEMENT ACCESS ===
set firewall ipv4 name WAN-IN rule 100 ...
set firewall ipv4 name WAN-IN rule 101 ...

# === APPLICATION TRAFFIC ===
set firewall ipv4 name WAN-IN rule 200 ...

# VyOS can attach comments to nodes (&apos;comment firewall ipv4 name WAN-IN
# rule 99 &quot;text&quot;&apos;), but there are no free-standing section comments.
# A rule at a boundary number can serve as a marker instead:

set firewall ipv4 name WAN-IN rule 99 action accept
set firewall ipv4 name WAN-IN rule 99 description &quot;=== MANAGEMENT SECTION ===&quot;
set firewall ipv4 name WAN-IN rule 99 state established
# Duplicates the usual established-accept rule, so it changes nothing
# in practice; its description just marks where the section begins
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Firewall Groups&lt;/h2&gt;
&lt;p&gt;Groups are aliases that make rules readable and maintainable.&lt;/p&gt;
&lt;h3&gt;Network Groups&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Define once, use everywhere
set firewall group network-group ADMIN-NETS network 10.0.0.0/24
set firewall group network-group ADMIN-NETS network 192.168.100.0/24
set firewall group network-group ADMIN-NETS description &quot;Admin workstation networks&quot;

set firewall group network-group RFC1918 network 10.0.0.0/8
set firewall group network-group RFC1918 network 172.16.0.0/12
set firewall group network-group RFC1918 network 192.168.0.0/16
set firewall group network-group RFC1918 description &quot;Private address space&quot;

# Use in rules
set firewall ipv4 name WAN-IN rule 100 source group network-group ADMIN-NETS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Port Groups&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;set firewall group port-group WEB-PORTS port 80
set firewall group port-group WEB-PORTS port 443
set firewall group port-group WEB-PORTS description &quot;HTTP and HTTPS&quot;

set firewall group port-group MAIL-PORTS port 25
set firewall group port-group MAIL-PORTS port 465
set firewall group port-group MAIL-PORTS port 587
set firewall group port-group MAIL-PORTS port 993
set firewall group port-group MAIL-PORTS description &quot;Mail server ports&quot;

# Use in rules
set firewall ipv4 name WAN-IN rule 200 destination group port-group WEB-PORTS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Address Groups&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# For individual IPs
set firewall group address-group DNS-SERVERS address 8.8.8.8
set firewall group address-group DNS-SERVERS address 8.8.4.4
set firewall group address-group DNS-SERVERS address 1.1.1.1
set firewall group address-group DNS-SERVERS description &quot;Public DNS resolvers&quot;

set firewall group address-group NTP-SERVERS address 129.6.15.28
set firewall group address-group NTP-SERVERS address 129.6.15.29
set firewall group address-group NTP-SERVERS description &quot;NIST NTP servers&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Group Maintenance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# When a network changes, update the group - all rules update automatically

# Old approach (nightmare):
# Search every rule for 10.0.0.0/24, update each one

# Group approach:
show firewall group network-group ADMIN-NETS
# Update one place
set firewall group network-group ADMIN-NETS network 10.0.1.0/24
delete firewall group network-group ADMIN-NETS network 10.0.0.0/24
commit
# All rules using ADMIN-NETS now use new network
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Policy Structure&lt;/h2&gt;
&lt;h3&gt;Route Maps&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Consistent naming: &amp;lt;PURPOSE&amp;gt;-&amp;lt;DIRECTION&amp;gt;-&amp;lt;PEER-TYPE&amp;gt;

# For BGP
set policy route-map TRANSIT-IN-FILTER rule 10 ...
set policy route-map TRANSIT-OUT-FILTER rule 10 ...
set policy route-map CUSTOMER-IN-FILTER rule 10 ...
set policy route-map PEER-IN-FILTER rule 10 ...

# For redistribution
set policy route-map OSPF-TO-BGP rule 10 ...
set policy route-map BGP-TO-OSPF rule 10 ...
set policy route-map CONNECTED-TO-OSPF rule 10 ...
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prefix Lists&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Name clearly
set policy prefix-list BOGONS rule 10 action deny
set policy prefix-list BOGONS rule 10 prefix 0.0.0.0/8
set policy prefix-list BOGONS rule 10 le 32
set policy prefix-list BOGONS description &quot;Invalid source addresses&quot;

set policy prefix-list OUR-PREFIXES rule 10 action permit
set policy prefix-list OUR-PREFIXES rule 10 prefix 203.0.113.0/24
set policy prefix-list OUR-PREFIXES description &quot;Our announced address space&quot;

set policy prefix-list DEFAULT-ONLY rule 10 action permit
set policy prefix-list DEFAULT-ONLY rule 10 prefix 0.0.0.0/0
set policy prefix-list DEFAULT-ONLY description &quot;Match only default route&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;AS Path Lists&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# For BGP filtering
set policy as-path-list CUSTOMER-AS rule 10 action permit
set policy as-path-list CUSTOMER-AS rule 10 regex &quot;^65001$&quot;
set policy as-path-list CUSTOMER-AS description &quot;Customer AS65001 only&quot;

set policy as-path-list NO-TRANSIT rule 10 action permit
set policy as-path-list NO-TRANSIT rule 10 regex &quot;.*65000.*&quot;
set policy as-path-list NO-TRANSIT description &quot;Block routes through AS65000&quot;
# Match here; apply the deny in the route-map that references NO-TRANSIT
&lt;/code&gt;&lt;/pre&gt;
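&lt;p&gt;AS-path regexes match against the space-separated AS_PATH string. For patterns this simple, Python&apos;s &lt;code&gt;re&lt;/code&gt; behaves the same as FRR&apos;s POSIX engine (an assumption that holds here, not for every regex), which makes them easy to test offline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import re

# AS_PATH as a string of space-separated AS numbers
paths = [&quot;65001&quot;, &quot;65000 65001&quot;, &quot;64512 65000 64999&quot;]

customer_only = [p for p in paths if re.search(r&quot;^65001$&quot;, p)]
via_65000 = [p for p in paths if re.search(r&quot;.*65000.*&quot;, p)]

print(customer_only)  # [&apos;65001&apos;]: routes originated by the customer alone
print(via_65000)      # [&apos;65000 65001&apos;, &apos;64512 65000 64999&apos;]
# Caveat: a bare substring also matches AS 165000; FRR&apos;s &apos;_&apos; anchor avoids that
&lt;/code&gt;&lt;/pre&gt;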
&lt;h2&gt;Configuration Templates&lt;/h2&gt;
&lt;h3&gt;Standard Router Sections&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Order your config consistently:

# 1. System settings
set system host-name router01
set system domain-name example.com
set system time-zone UTC

# 2. Users and authentication
set system login user admin ...

# 3. Interfaces
set interfaces ethernet eth0 ...

# 4. Firewall groups (define before using)
set firewall group ...

# 5. Firewall rules
set firewall ipv4 name ...

# 6. NAT
set nat source ...

# 7. Routing protocols
set protocols static ...
set protocols bgp ...

# 8. Services (DHCP, DNS, etc)
set service dhcp-server ...

# 9. VPN
set vpn ...

# 10. Zone policy (1.4 syntax; &apos;set zone-policy&apos; in 1.3)
set firewall zone ...
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Change Documentation Template&lt;/h3&gt;
&lt;p&gt;When making changes, document:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Change: &amp;lt;Brief description&amp;gt;
Date: 2025-01-08
Ticket: TKT-12345
Author: admin

### Purpose
Why this change is needed.

### Changes
- Added firewall rule 150 for new application
- Updated ADMIN-NETS group with new subnet

### Testing
- Verified connectivity from admin network
- Confirmed application accessible

### Rollback
Commands to undo:
delete firewall ipv4 name WAN-IN rule 150
delete firewall group network-group ADMIN-NETS network 10.0.2.0/24
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Avoiding Common Mistakes&lt;/h2&gt;
&lt;h3&gt;Mistake 1: Magic Numbers&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: What is 10.5.32.15?
set firewall ipv4 name WAN-IN rule 100 source address 10.5.32.15

# Good: Use groups with descriptions
set firewall group address-group MONITORING-SERVERS address 10.5.32.15
set firewall group address-group MONITORING-SERVERS description &quot;Prometheus server&quot;
set firewall ipv4 name WAN-IN rule 100 source group address-group MONITORING-SERVERS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Mistake 2: No Rule Descriptions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: What does this do?
set firewall ipv4 name WAN-IN rule 47 action accept
set firewall ipv4 name WAN-IN rule 47 destination port 8443

# Good: Self-documenting
set firewall ipv4 name WAN-IN rule 200 action accept
set firewall ipv4 name WAN-IN rule 200 destination port 8443
set firewall ipv4 name WAN-IN rule 200 description &quot;App: Customer portal HTTPS&quot;
set firewall ipv4 name WAN-IN rule 200 protocol tcp
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Mistake 3: Inconsistent Numbering&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: Random rule numbers
set firewall ipv4 name WAN-IN rule 5 ...
set firewall ipv4 name WAN-IN rule 23 ...
set firewall ipv4 name WAN-IN rule 7 ...
set firewall ipv4 name WAN-IN rule 156 ...

# Good: Deliberate ranges
# 1-99: Infrastructure
# 100-199: Management
# 200-899: Applications
# 900-999: Cleanup/deny
&lt;/code&gt;&lt;/pre&gt;
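&lt;p&gt;Once you adopt ranges like these, a small lint script can enforce them before anything reaches the router. A minimal sketch in Python — the range map, the rule-line format, and the &lt;code&gt;lint&lt;/code&gt; helper are assumptions matching the convention above, not a VyOS feature:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# Hypothetical linter: flag firewall rule numbers outside the agreed ranges.
import re

RANGES = {
    &quot;infrastructure&quot;: range(1, 100),
    &quot;management&quot;: range(100, 200),
    &quot;application&quot;: range(200, 900),
    &quot;cleanup&quot;: range(900, 1000),
}

RULE_RE = re.compile(r&quot;set firewall ipv4 name \S+ rule (\d+)&quot;)

def classify(rule_number):
    &quot;&quot;&quot;Return the category a rule number belongs to, or None.&quot;&quot;&quot;
    for category, rng in RANGES.items():
        if rule_number in rng:
            return category
    return None

def lint(config_lines):
    &quot;&quot;&quot;Yield rule numbers that fall outside every agreed range.&quot;&quot;&quot;
    for line in config_lines:
        m = RULE_RE.match(line)
        if m and classify(int(m.group(1))) is None:
            yield int(m.group(1))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run it over &lt;code&gt;show configuration commands&lt;/code&gt; output in review; an out-of-range rule number fails the check before it becomes a mystery.&lt;/p&gt;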
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Good config is self-documenting. Bad config is a puzzle box.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Standards to adopt today:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Name things by purpose&lt;/strong&gt;: &lt;code&gt;WAN-TO-LAN&lt;/code&gt;, not &lt;code&gt;FIREWALL1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use groups&lt;/strong&gt;: Define once, maintain once&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number consistently&lt;/strong&gt;: Ranges for different rule types&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Describe everything&lt;/strong&gt;: Future you will thank present you&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reference tickets&lt;/strong&gt;: Link to change management&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The time spent on naming and comments pays back 100x when troubleshooting at 3 AM.&lt;/p&gt;
&lt;p&gt;A config you can read is a config you can fix. A config you can&apos;t read is a liability waiting to become an incident.&lt;/p&gt;
&lt;p&gt;Write configs for the next person. The next person might be you, six months from now, with no memory of why rule 47 exists.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Configuration Sessions: Parallel Work Without Conflicts</title><link>https://ashimov.com/posts/vyos-config-sessions/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-config-sessions/</guid><description>Master VyOS configuration sessions for team environments. Covers session isolation, concurrent editing, merge strategies, and why sessions prevent &quot;who changed what&quot; mysteries.</description><pubDate>Tue, 25 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Two engineers SSH into the same router. Both type &lt;code&gt;configure&lt;/code&gt;. Both make changes. Both commit. One overwrites the other&apos;s work. Nobody knows what happened until something breaks.&lt;/p&gt;
&lt;p&gt;VyOS configuration sessions solve this. Each session is isolated. Changes don&apos;t interfere. Conflicts are detected before commit, not after outage.&lt;/p&gt;
&lt;p&gt;Sessions prevent &quot;who changed what&quot; mysteries before they become incidents.&lt;/p&gt;
&lt;h2&gt;The Problem&lt;/h2&gt;
&lt;p&gt;Without sessions, configuration mode is a shared space:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Engineer A                    Engineer B
-----------                   -----------
configure                     configure
set firewall rule 10...      set interfaces eth1...
                              commit  ← B&apos;s changes saved
commit  ← A overwrites B&apos;s changes!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Engineer B&apos;s changes are silently lost. No warning, no merge, just gone.&lt;/p&gt;
&lt;h2&gt;How Sessions Work&lt;/h2&gt;
&lt;p&gt;VyOS 1.4+ supports configuration sessions. Each session:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Has a unique ID&lt;/li&gt;
&lt;li&gt;Works on an isolated copy of the config&lt;/li&gt;
&lt;li&gt;Can see other active sessions&lt;/li&gt;
&lt;li&gt;Detects conflicts at commit&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Engineer A (session abc123)    Engineer B (session def456)
--------------------------     --------------------------
configure                      configure
# Working on isolated copy     # Working on isolated copy
set firewall...                set interfaces...
# Changes only in abc123       # Changes only in def456
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Session Management&lt;/h2&gt;
&lt;h3&gt;View Active Sessions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# List all configuration sessions
show configuration sessions

# Example output:
# Session ID          User        Started
# abc123             admin        2025-01-08 10:15:03
# def456             engineer     2025-01-08 10:18:22
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Enter Named Session&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create or resume named session
configure session my-firewall-changes

# Session persists even if you disconnect
# SSH back in, resume:
configure session my-firewall-changes

# All your uncommitted changes are still there
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Discard Session&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Exit and discard changes
exit discard

# Or explicitly delete session
delete configuration session my-firewall-changes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conflict Detection&lt;/h2&gt;
&lt;h3&gt;What Happens on Conflict&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Engineer A in session abc123:
configure
set interfaces ethernet eth0 description &quot;WAN Link&quot;
commit  # Success

# Engineer B in session def456:
configure
set interfaces ethernet eth0 description &quot;Internet Uplink&quot;
commit  # ERROR: Configuration conflict detected

# Output:
# The following configuration conflicts were detected:
# interfaces ethernet eth0 description
#   Current: &quot;WAN Link&quot;
#   Your change: &quot;Internet Uplink&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Resolving Conflicts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Option 1: Refresh and redo
exit discard
configure
# See current config
show interfaces ethernet eth0
# Make decision, apply change
set interfaces ethernet eth0 description &quot;Internet Uplink&quot;
commit

# Option 2: Force your version
commit --force  # Overwrites conflicting values
# Use with caution - you&apos;re overwriting someone&apos;s work

# Option 3: Merge manually
compare  # See differences
# Adjust your changes to accommodate both
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Session Workflow Examples&lt;/h2&gt;
&lt;h3&gt;Example 1: Large Firewall Update&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Start named session for tracking
configure session firewall-2025-01-08

# Make many changes over time
set firewall ipv4 name WAN-IN rule 100 ...
set firewall ipv4 name WAN-IN rule 101 ...
set firewall ipv4 name WAN-IN rule 102 ...

# Save session, exit for lunch
exit

# Come back, continue work
configure session firewall-2025-01-08
set firewall ipv4 name WAN-IN rule 103 ...

# Review all changes
show | compare

# Commit when ready
commit-confirm 5
confirm

# Session automatically cleaned up after commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Example 2: Team Coordination&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Engineer A: Working on BGP
configure session bgp-changes
set protocols bgp neighbor 10.0.0.1 ...

# Engineer B: Working on firewall (different session)
configure session firewall-changes
set firewall ipv4 name ...

# Both can work simultaneously
# Both can commit (no conflicts - different config sections)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Example 3: Testing Before Commit&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create test session
configure session testing-nat

# Make changes
set nat source rule 100 ...

# Compare what would change
show | compare

# Show the commands that would be applied
show | commands

# Decide to discard
exit discard
# Or decide to apply
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Viewing Session Differences&lt;/h2&gt;
&lt;h3&gt;Compare Session to Running Config&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure session my-changes

# What would change if I commit?
show | compare

# Output shows:
# +set interfaces ethernet eth0 description &quot;New description&quot;
# -set interfaces ethernet eth0 description &quot;Old description&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Compare Sessions to Each Other&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Not directly supported, but you can:

# From session A, save diff
show | compare &amp;gt; /tmp/session-a-diff.txt

# From session B, save diff
show | compare &amp;gt; /tmp/session-b-diff.txt

# Compare files
diff /tmp/session-a-diff.txt /tmp/session-b-diff.txt
&lt;/code&gt;&lt;/pre&gt;
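&lt;p&gt;If you want something friendlier than raw &lt;code&gt;diff&lt;/code&gt; output, the same comparison can be scripted with Python&apos;s standard &lt;code&gt;difflib&lt;/code&gt; — a small sketch, with the session labels purely illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# Compare two saved &apos;show | compare&apos; captures as a unified diff.
import difflib

def compare_session_diffs(text_a, text_b):
    &quot;&quot;&quot;Return unified-diff lines between two session diff captures.&quot;&quot;&quot;
    return list(difflib.unified_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile=&quot;session-a&quot;,
        tofile=&quot;session-b&quot;,
        lineterm=&quot;&quot;,
    ))
&lt;/code&gt;&lt;/pre&gt;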
&lt;h2&gt;Session Timeout&lt;/h2&gt;
&lt;p&gt;Sessions don&apos;t persist forever:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Default: sessions expire after 30 minutes of inactivity

# Check current timeout
show configuration session-timeout

# Modify timeout (in minutes)
configure
set system session-timeout 60
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Persistent Sessions&lt;/h3&gt;
&lt;p&gt;For long-running work:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Keep session alive with periodic activity
# Or increase timeout significantly

# Check session status
show configuration sessions
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Use Named Sessions for Major Changes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: Anonymous session
configure  # What was I working on?

# Good: Descriptive name
configure session ticket-12345-add-bgp-peer
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. One Logical Change Per Session&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: Everything in one session
configure session all-my-stuff
set firewall ...
set bgp ...
set nat ...
# Huge commit, hard to rollback one thing

# Good: Separate sessions
configure session firewall-update
# ... only firewall changes

configure session bgp-peer-addition
# ... only BGP changes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Review Before Commit&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure session important-change

# Always review what you&apos;re committing
show | compare
show | commands

# Then commit
commit-confirm 5
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Clean Up Abandoned Sessions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# List old sessions
show configuration sessions

# Delete abandoned ones
delete configuration session old-test-session
delete configuration session johns-forgotten-work
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Limitations&lt;/h2&gt;
&lt;h3&gt;What Sessions Don&apos;t Do&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Don&apos;t provide history&lt;/strong&gt; - Sessions are temporary workspaces, not audit trails&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don&apos;t allow collaborative editing&lt;/strong&gt; - One user per session&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don&apos;t auto-merge&lt;/strong&gt; - Conflicts must be resolved manually&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don&apos;t persist across reboots&lt;/strong&gt; - Uncommitted sessions are lost&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;When Sessions Aren&apos;t Enough&lt;/h3&gt;
&lt;p&gt;For complex team workflows, consider:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Git-based configuration management&lt;/li&gt;
&lt;li&gt;Ansible/Terraform for deployments&lt;/li&gt;
&lt;li&gt;Separate staging environment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Sessions handle concurrent edits. Version control handles history and rollback.&lt;/p&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Can&apos;t Enter Configuration Mode&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Another session might be locked
show configuration sessions

# If stuck session exists
# Try to take ownership (if you&apos;re sure)
configure session stuck-session
exit discard
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Lost Session Changes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Session timed out or router rebooted?
# Uncommitted changes are gone

# Always commit before:
# - Long breaks
# - Router maintenance
# - End of work day
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Conflict on Every Commit&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Another process might be modifying config

# Check for automation
show log | grep -i commit

# Coordinate with team
# Pause automated deployments during manual work
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Sessions prevent &quot;who changed what&quot; mysteries before they become incidents.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;On a shared router:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Without sessions: Chaos, overwrites, mystery changes&lt;/li&gt;
&lt;li&gt;With sessions: Isolation, conflict detection, clear ownership&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The few seconds to type &lt;code&gt;configure session descriptive-name&lt;/code&gt; saves hours of debugging &quot;why did this config disappear?&quot;&lt;/p&gt;
&lt;p&gt;Every team environment should use sessions. Every major change should use a named session. Every commit should be preceded by a review.&lt;/p&gt;
&lt;p&gt;The router knows when you&apos;re about to overwrite someone&apos;s work. Let it tell you.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Commit-Confirm: Your Safety Net Against Self-Lockout</title><link>https://ashimov.com/posts/vyos-commit-confirm/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-commit-confirm/</guid><description>Master VyOS commit-confirm to prevent remote lockouts. Covers automatic rollback, confirmation workflow, timeout tuning, and why every remote change should use confirm.</description><pubDate>Fri, 21 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;You SSH into a remote router. Change firewall rules. Commit. Connection drops. You just locked yourself out. The router is 500 km away. Your Friday evening plans are cancelled.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;commit-confirm&lt;/code&gt; exists precisely for this scenario. It commits changes temporarily. If you don&apos;t confirm within a timeout, changes roll back automatically. The router saves itself from your mistakes.&lt;/p&gt;
&lt;p&gt;Every remote change should use &lt;code&gt;commit-confirm&lt;/code&gt;. No exceptions.&lt;/p&gt;
&lt;h2&gt;How Commit-Confirm Works&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;1. You run: commit-confirm 5
2. Changes are applied
3. Timer starts (5 minutes)
4. If you run &apos;confirm&apos; before timeout → changes persist
5. If timeout expires → automatic rollback to previous config
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The key insight: if your change breaks connectivity, you can&apos;t confirm. So the router reverts itself.&lt;/p&gt;
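&lt;p&gt;The mechanism is easy to model. A toy sketch in Python — this illustrates the state machine, not how VyOS actually implements it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# Toy model of commit-confirm: apply, then confirm or auto-revert.

class Router:
    def __init__(self, config):
        self.running = config   # active config
        self.saved = config     # last confirmed config
        self.pending = False

    def commit_confirm(self, new_config):
        &quot;&quot;&quot;Apply new config; it persists only if confirm() follows.&quot;&quot;&quot;
        self.saved = self.running
        self.running = new_config
        self.pending = True

    def confirm(self):
        &quot;&quot;&quot;Make the pending config permanent.&quot;&quot;&quot;
        self.saved = self.running
        self.pending = False

    def timeout(self):
        &quot;&quot;&quot;Timer expired without confirm: revert to the saved config.&quot;&quot;&quot;
        if self.pending:
            self.running = self.saved
            self.pending = False
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If your change cuts your own access, you never call &lt;code&gt;confirm()&lt;/code&gt;, the timer fires, and &lt;code&gt;running&lt;/code&gt; is the old config again.&lt;/p&gt;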
&lt;h2&gt;Basic Usage&lt;/h2&gt;
&lt;h3&gt;Standard Commit-Confirm&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enter configuration mode
configure

# Make your changes
set firewall ipv4 name WAN-IN rule 100 action drop
set firewall ipv4 name WAN-IN rule 100 source address 10.0.0.0/8

# Commit with 5-minute timeout
commit-confirm 5

# Test connectivity, verify everything works
# Then confirm:
confirm

# Or if something is wrong, just wait - it will rollback
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Different Timeout Values&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Quick change, confident it works (minimum 1 minute)
commit-confirm 1

# Complex change, need time to verify
commit-confirm 10

# Major change, need extensive testing
commit-confirm 30
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Choose timeout based on how long you need to verify the change works.&lt;/p&gt;
&lt;h3&gt;Check Remaining Time&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# See if there&apos;s a pending confirm
show system commit

# Output shows:
# Commit confirmed in X minutes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Real-World Scenarios&lt;/h2&gt;
&lt;h3&gt;Scenario 1: Firewall Rule Change&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Adding rule to allow new service
set firewall ipv4 name WAN-IN rule 50 action accept
set firewall ipv4 name WAN-IN rule 50 destination port 8443
set firewall ipv4 name WAN-IN rule 50 protocol tcp

# Use confirm because this touches firewall
commit-confirm 3

# Test from external client
# curl https://server:8443 - works!

# Confirm the change
confirm
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 2: Interface Address Change&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Changing management IP - very risky remotely
set interfaces ethernet eth0 address 192.168.1.100/24
delete interfaces ethernet eth0 address 192.168.1.50/24

# Critical: use confirm with enough time to reconnect
commit-confirm 5

# Immediately try to SSH to new IP
# ssh admin@192.168.1.100

# If you can connect:
confirm

# If you can&apos;t connect - wait 5 minutes, router reverts
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 3: Routing Change&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Changing default gateway
delete protocols static route 0.0.0.0/0
set protocols static route 0.0.0.0/0 next-hop 10.0.0.1

commit-confirm 3

# Verify connectivity
ping 8.8.8.8
traceroute 8.8.8.8

# Looks good
confirm
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What Happens on Rollback&lt;/h2&gt;
&lt;p&gt;When timeout expires without confirmation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1. Current session is terminated
# 2. Config reverts to pre-commit state
# 3. All services reload with old config
# 4. Log entry created

# Check what happened
show log | grep -i rollback
show log | grep -i commit

# View current config (should be pre-change)
show configuration
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Commit-Confirm vs Regular Commit&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;&lt;code&gt;commit&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;commit-confirm&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Immediate&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Only after confirm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Automatic on timeout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote safety&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use case&lt;/td&gt;
&lt;td&gt;Local changes&lt;/td&gt;
&lt;td&gt;Remote changes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;h3&gt;1. Always Use for Remote Sessions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Script wrapper for remote changes
#!/bin/bash
# safe-commit.sh

TIMEOUT=${1:-5}  # Default 5 minutes

source /opt/vyatta/etc/functions/script-template
configure
# ... your changes ...
commit-confirm $TIMEOUT

echo &quot;Changes applied. Confirm with &apos;confirm&apos; within $TIMEOUT minutes&quot;
echo &quot;Or changes will automatically rollback&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Test Connectivity Immediately&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;commit-confirm 3

# Immediately verify you can still reach the router
# From another terminal:
ping router-ip
ssh admin@router-ip

# Only then confirm
confirm
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Have Console Access Ready&lt;/h3&gt;
&lt;p&gt;For major changes, have out-of-band access ready:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;IPMI/iLO console&lt;/li&gt;
&lt;li&gt;Serial console&lt;/li&gt;
&lt;li&gt;Local access&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If commit-confirm somehow doesn&apos;t save you, console access will.&lt;/p&gt;
&lt;h3&gt;4. Document the Rollback Happened&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# After a rollback, document it
show log | grep -i rollback

# Add comment explaining what was attempted
configure
comment firewall ipv4 name WAN-IN &quot;Note: rule 100 attempt on 2025-01-08 caused lockout, rolled back&quot;
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Edge Cases&lt;/h2&gt;
&lt;h3&gt;Confirm from Different Session&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Session 1: commit-confirm 5
# Session 1: loses connectivity

# Session 2: SSH into router (if possible via different path)
confirm  # This confirms Session 1&apos;s changes

# Useful when you have multiple network paths
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Immediate Rollback&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Changed your mind before timeout?
configure
rollback 0  # Rollback to last saved config
commit

# Or simply exit without confirming
exit discard
# Wait for timeout
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Power Failure During Confirm Window&lt;/h3&gt;
&lt;p&gt;If router reboots during commit-confirm window:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Uncommitted changes are lost&lt;/li&gt;
&lt;li&gt;Router boots with last saved (pre-commit-confirm) config&lt;/li&gt;
&lt;li&gt;This is the safe behavior&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Automating Confirm&lt;/h2&gt;
&lt;p&gt;For automated deployments, you need programmatic confirm:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Ansible approach
- name: Apply VyOS config with confirm
  vyos.vyos.vyos_config:
    lines:
      - set firewall ipv4 name TEST rule 10 action accept
    save: no

- name: Verify connectivity
  wait_for:
    host: &quot;{{ inventory_hostname }}&quot;
    port: 22
    timeout: 60

- name: Confirm changes
  vyos.vyos.vyos_command:
    commands:
      - confirm

# If verify fails, Ansible doesn&apos;t reach confirm
# Timeout expires, config rolls back
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;API-Based Confirm&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Using VyOS API
curl -X POST https://router/configure \
  -H &quot;Content-Type: application/json&quot; \
  -d &apos;{
    &quot;op&quot;: &quot;set&quot;,
    &quot;path&quot;: [&quot;firewall&quot;, &quot;ipv4&quot;, &quot;name&quot;, &quot;TEST&quot;],
    &quot;value&quot;: &quot;...&quot;
  }&apos;

# Apply with timeout
curl -X POST https://router/config-file \
  -d &apos;{&quot;op&quot;: &quot;commit-confirm&quot;, &quot;minutes&quot;: 5}&apos;

# Verify connectivity, then:
curl -X POST https://router/config-file \
  -d &apos;{&quot;op&quot;: &quot;confirm&quot;}&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;h3&gt;Mistake 1: Timeout Too Short&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bad: 1 minute for complex change
commit-confirm 1
# You&apos;re still verifying when it rolls back

# Better: Give yourself time
commit-confirm 10
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Mistake 2: Forgetting to Confirm&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;commit-confirm 5
# Test, looks good
# Get distracted
# 5 minutes pass
# Changes gone!

# Tip: Set a timer on your phone
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Mistake 3: Not Using It At All&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# The classic mistake
configure
set firewall ...  # Typo somewhere
commit  # No safety net
# Locked out
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Every remote change should use &lt;code&gt;commit-confirm&lt;/code&gt;. No exceptions.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The cost of using &lt;code&gt;commit-confirm&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A few seconds to type the command&lt;/li&gt;
&lt;li&gt;Remembering to confirm&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The cost of not using it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Emergency drive to datacenter&lt;/li&gt;
&lt;li&gt;Out-of-band console access scramble&lt;/li&gt;
&lt;li&gt;Ruined weekend&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The router can save itself from your mistakes. Let it.&lt;/p&gt;
&lt;p&gt;I&apos;ve been saved by &lt;code&gt;commit-confirm&lt;/code&gt; more times than I care to admit. Every time I think &quot;this change is simple, I don&apos;t need confirm&quot; — that&apos;s exactly when I need it most.&lt;/p&gt;
&lt;p&gt;Use it. Always.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Automation &amp; GitOps for VyOS: Templates, Backups, Safe Deploy</title><link>https://ashimov.com/posts/vyos-automation/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-automation/</guid><description>Practical VyOS automation with Git, templates, and safe deployment practices. Covers config backup strategies, Jinja2 templates, Ansible integration, rollback procedures, and why automation reduces errors only if you have rules of the game.</description><pubDate>Tue, 18 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every network incident postmortem I&apos;ve read includes some variation of &quot;a configuration change was made.&quot; Manual changes on production routers are the leading cause of outages. We know this. We still do it.&lt;/p&gt;
&lt;p&gt;Automation isn&apos;t about being fancy. It&apos;s about reducing the blast radius of human error. When configs live in Git, changes are reviewed before deployment, and rollback is one command away — you still make mistakes, but they&apos;re smaller and recoverable.&lt;/p&gt;
&lt;p&gt;This is how to automate VyOS configuration management in a way that actually works.&lt;/p&gt;
&lt;h2&gt;The Problem with Manual Configuration&lt;/h2&gt;
&lt;p&gt;Picture this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Need to add a firewall rule&lt;/li&gt;
&lt;li&gt;SSH into router&lt;/li&gt;
&lt;li&gt;Type commands from memory&lt;/li&gt;
&lt;li&gt;Typo in IP address&lt;/li&gt;
&lt;li&gt;Commit&lt;/li&gt;
&lt;li&gt;Traffic drops&lt;/li&gt;
&lt;li&gt;Panic&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now picture:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Edit rule in Git&lt;/li&gt;
&lt;li&gt;PR reviewed by colleague (catches typo)&lt;/li&gt;
&lt;li&gt;Merge triggers automated deploy&lt;/li&gt;
&lt;li&gt;Change applied&lt;/li&gt;
&lt;li&gt;If wrong, &lt;code&gt;git revert&lt;/code&gt; and redeploy&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Same change. One is an incident, one is Tuesday.&lt;/p&gt;
&lt;h2&gt;Config Backup Strategy&lt;/h2&gt;
&lt;p&gt;Before automating changes, automate backups. You need to recover from whatever you&apos;re about to break.&lt;/p&gt;
&lt;h3&gt;Manual Backup Commands&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Full config as set commands
show configuration commands &amp;gt; /config/backup-$(date +%Y%m%d).txt

# Config as JSON (useful for parsing)
show configuration json &amp;gt; /config/backup-$(date +%Y%m%d).json
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Automated Backup Script&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;/config/scripts/backup-config.sh&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash

BACKUP_DIR=&quot;/config/backups&quot;
DATE=$(date +%Y%m%d-%H%M%S)
HOSTNAME=$(hostname)
BACKUP_FILE=&quot;${BACKUP_DIR}/${HOSTNAME}-${DATE}.cfg&quot;

# Create backup directory
mkdir -p &quot;${BACKUP_DIR}&quot;

# Export config
/opt/vyatta/sbin/vyatta-cfg-cmd-wrapper begin
/opt/vyatta/bin/cli-shell-api showCfg --show-active-only &amp;gt; &quot;${BACKUP_FILE}&quot;
/opt/vyatta/sbin/vyatta-cfg-cmd-wrapper end

# Compress
gzip &quot;${BACKUP_FILE}&quot;

# Keep last 30 days
find &quot;${BACKUP_DIR}&quot; -name &quot;*.cfg.gz&quot; -mtime +30 -delete

# Optional: Push to remote storage
# scp &quot;${BACKUP_FILE}.gz&quot; backup-server:/backups/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule via cron:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure
set system task-scheduler task backup-config cron-spec &apos;0 * * * *&apos;
set system task-scheduler task backup-config executable path &apos;/config/scripts/backup-config.sh&apos;
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Hourly backups, 30 days retention.&lt;/p&gt;
&lt;h3&gt;Off-Router Backup&lt;/h3&gt;
&lt;p&gt;Backups on the router die with the router. Push to external storage:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/backup-remote.sh

HOSTNAME=$(hostname)
DATE=$(date +%Y%m%d)
REMOTE=&quot;git@git.example.com:network/configs.git&quot;
WORK_DIR=&quot;/tmp/config-backup&quot;

# Clone repo
rm -rf &quot;${WORK_DIR}&quot;
git clone &quot;${REMOTE}&quot; &quot;${WORK_DIR}&quot;

# Export config
/opt/vyatta/bin/cli-shell-api showCfg --show-active-only &amp;gt; &quot;${WORK_DIR}/${HOSTNAME}.cfg&quot;

# Commit and push
cd &quot;${WORK_DIR}&quot;
git add &quot;${HOSTNAME}.cfg&quot;
git commit -m &quot;Automated backup: ${HOSTNAME} ${DATE}&quot; || true
git push

# Cleanup
rm -rf &quot;${WORK_DIR}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now every config change is version-controlled, even manual ones.&lt;/p&gt;
&lt;h2&gt;Configuration as Code&lt;/h2&gt;
&lt;p&gt;Store your configs in Git from the start, not just as backups.&lt;/p&gt;
&lt;h3&gt;Repository Structure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;vyos-configs/
├── README.md
├── inventory/
│   ├── production.yml
│   └── staging.yml
├── templates/
│   ├── base/
│   │   ├── system.j2
│   │   ├── interfaces.j2
│   │   └── firewall.j2
│   └── roles/
│       ├── edge-router.j2
│       └── core-router.j2
├── vars/
│   ├── common.yml
│   └── per-router/
│       ├── router1.yml
│       └── router2.yml
├── configs/
│   ├── router1.cfg
│   └── router2.cfg
└── scripts/
    ├── generate.py
    ├── deploy.sh
    └── validate.sh
&lt;/code&gt;&lt;/pre&gt;
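&lt;p&gt;The &lt;code&gt;validate.sh&lt;/code&gt; step is where generated configs get sanity-checked before deploy. A minimal sketch of such a check in Python — the specific checks here are assumptions; adapt them to your templates:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# Hypothetical pre-deploy check for generated VyOS configs.

ALLOWED_PREFIXES = (&quot;set &quot;, &quot;delete &quot;, &quot;comment &quot;, &quot;#&quot;)

def validate(config_text):
    &quot;&quot;&quot;Return (line_number, line) pairs that look wrong.&quot;&quot;&quot;
    problems = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are fine
        if &quot;{{&quot; in stripped or &quot;}}&quot; in stripped:
            problems.append((number, line))  # unrendered Jinja2 marker
        elif not stripped.startswith(ALLOWED_PREFIXES):
            problems.append((number, line))  # not a config command
    return problems
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;CI runs this over every generated file; a non-empty result fails the pipeline before anything touches a router.&lt;/p&gt;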
&lt;h2&gt;Jinja2 Templates&lt;/h2&gt;
&lt;p&gt;Templates let you define a config pattern once and instantiate it for each router.&lt;/p&gt;
&lt;h3&gt;Template Example&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;templates/base/interfaces.j2&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{# Interface configuration template #}

{% for iface in interfaces %}
set interfaces ethernet {{ iface.name }} address &apos;{{ iface.address }}&apos;
set interfaces ethernet {{ iface.name }} description &apos;{{ iface.description }}&apos;
{% if iface.vrrp is defined %}
set high-availability vrrp group {{ iface.vrrp.group }} interface &apos;{{ iface.name }}&apos;
set high-availability vrrp group {{ iface.vrrp.group }} virtual-address &apos;{{ iface.vrrp.vip }}&apos;
set high-availability vrrp group {{ iface.vrrp.group }} priority &apos;{{ iface.vrrp.priority }}&apos;
{% endif %}
{% endfor %}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Variables File&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;vars/per-router/router1.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hostname: router1
router_id: 10.255.255.1

interfaces:
  - name: eth0
    address: 10.0.0.2/24
    description: LAN
    vrrp:
      group: LAN
      vip: 10.0.0.1/24
      priority: 200
  - name: eth1
    address: 203.0.113.2/24
    description: WAN
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Generation Script&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;scripts/generate.py&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
import yaml
import jinja2
import sys
from pathlib import Path

def generate_config(router_name):
    # Load variables
    common = yaml.safe_load(open(&apos;vars/common.yml&apos;))
    router = yaml.safe_load(open(f&apos;vars/per-router/{router_name}.yml&apos;))

    # Merge variables
    variables = {**common, **router}

    # Load templates
    env = jinja2.Environment(
        loader=jinja2.FileSystemLoader(&apos;templates&apos;),
        undefined=jinja2.StrictUndefined
    )

    # Render each template
    output = []
    for template_file in sorted(Path(&apos;templates/base&apos;).glob(&apos;*.j2&apos;)):
        template = env.get_template(f&apos;base/{template_file.name}&apos;)
        output.append(template.render(**variables))

    return &apos;\n&apos;.join(output)

if __name__ == &apos;__main__&apos;:
    router = sys.argv[1]
    config = generate_config(router)
    print(config)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Generate config:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python scripts/generate.py router1 &amp;gt; configs/router1.cfg
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Ansible Integration&lt;/h2&gt;
&lt;p&gt;Ansible is the de facto standard for network automation, and VyOS ships an official collection, &lt;code&gt;vyos.vyos&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Inventory&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;inventory/production.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;all:
  children:
    vyos_routers:
      hosts:
        router1:
          ansible_host: 10.0.0.2
        router2:
          ansible_host: 10.0.0.3
      vars:
        ansible_user: vyos
        ansible_network_os: vyos.vyos.vyos
        ansible_connection: ansible.netcommon.network_cli
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Playbook: Apply Configuration&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;playbooks/apply-config.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
- name: Apply VyOS configuration
  hosts: vyos_routers
  gather_facts: no

  tasks:
    - name: Load configuration from file
      set_fact:
        config_lines: &quot;{{ lookup(&apos;file&apos;, &apos;configs/&apos; + inventory_hostname + &apos;.cfg&apos;).split(&apos;\n&apos;) }}&quot;

    - name: Apply configuration
      vyos.vyos.vyos_config:
        lines: &quot;{{ config_lines }}&quot;
        save: yes
      register: result

    - name: Show changes
      debug:
        var: result.commands
      when: result.changed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ansible-playbook -i inventory/production.yml playbooks/apply-config.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Playbook: Backup Before Change&lt;/h3&gt;
&lt;p&gt;Always backup before deploying:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
- name: Safe configuration deployment
  hosts: vyos_routers
  gather_facts: no

  tasks:
    - name: Backup current configuration
      vyos.vyos.vyos_config:
        backup: yes
        backup_options:
          filename: &quot;{{ inventory_hostname }}-{{ &apos;%Y-%m-%dT%H%M%S&apos; | strftime }}.cfg&quot;
          dir_path: ./backups/

    - name: Apply new configuration
      vyos.vyos.vyos_config:
        src: &quot;configs/{{ inventory_hostname }}.cfg&quot;
        save: yes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Safe Deployment Practices&lt;/h2&gt;
&lt;p&gt;Automation without safety is just faster mistakes.&lt;/p&gt;
&lt;h3&gt;1. Dry Run First&lt;/h3&gt;
&lt;p&gt;VyOS doesn&apos;t have a true dry-run, but you can compare:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# scripts/diff-config.sh

ROUTER=$1
NEW_CONFIG=$2

# Get current config
ssh vyos@${ROUTER} &apos;show configuration commands&apos; &amp;gt; /tmp/current.cfg

# Compare
diff -u /tmp/current.cfg &quot;${NEW_CONFIG}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Review the diff before deploying.&lt;/p&gt;
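&lt;p&gt;One caveat: &lt;code&gt;diff -u&lt;/code&gt; is line-order sensitive, and the ordering of &lt;code&gt;show configuration commands&lt;/code&gt; output may not match your generated file. A sketch of an order-insensitive comparison (a hypothetical helper, not a VyOS tool):&lt;/p&gt;

```python
# order-insensitive comparison of two "set ..." command dumps
# (hypothetical helper, not part of VyOS or the scripts above)
def config_delta(current_lines, new_lines):
    current = {line.strip() for line in current_lines if line.strip()}
    new = {line.strip() for line in new_lines if line.strip()}
    # commands to add on the router, commands to delete from it
    return sorted(new - current), sorted(current - new)

to_add, to_remove = config_delta(
    ["set system host-name 'router1'", "set service ssh port '22'"],
    ["set system host-name 'router1'", "set service ssh port '2222'"],
)
print(to_add)     # ["set service ssh port '2222'"]
print(to_remove)  # ["set service ssh port '22'"]
```

&lt;p&gt;Feed it the two files from the shell script above and you get a change list instead of a noisy positional diff.&lt;/p&gt;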
&lt;h3&gt;2. Staged Rollout&lt;/h3&gt;
&lt;p&gt;Don&apos;t deploy to all routers at once:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Deploy to staging first
- hosts: staging_routers
  tasks:
    - include_tasks: apply-config.yml

# Wait and validate
- hosts: staging_routers
  tasks:
    - name: Wait for convergence
      pause:
        minutes: 5

    - name: Validate connectivity
      vyos.vyos.vyos_command:
        commands:
          - ping 8.8.8.8 count 3
      register: ping_result
      failed_when: &quot;&apos;0 received&apos; in ping_result.stdout[0]&quot;

# Only then production
- hosts: production_routers
  tasks:
    - include_tasks: apply-config.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Rollback Procedure&lt;/h3&gt;
&lt;p&gt;When things go wrong (they will), rollback fast:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# scripts/rollback.sh

ROUTER=$1
BACKUP_FILE=$2

echo &quot;Rolling back ${ROUTER} to ${BACKUP_FILE}&quot;

# Load backup config
ssh vyos@${ROUTER} &quot;configure; load ${BACKUP_FILE}; commit; save; exit&quot;

echo &quot;Rollback complete&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or with Ansible:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;- name: Emergency rollback
  hosts: &quot;{{ target_router }}&quot;
  gather_facts: no

  tasks:
    - name: Load backup configuration
      vyos.vyos.vyos_config:
        src: &quot;backups/{{ inventory_hostname }}-{{ backup_date }}.cfg&quot;
        save: yes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Change Windows&lt;/h3&gt;
&lt;p&gt;Automate deployment timing, not just deployment:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Only deploy during change window
- name: Check change window
  hosts: localhost
  tasks:
    - name: Verify time is within change window
      assert:
        that:
          - ansible_date_time.weekday in [&apos;Saturday&apos;, &apos;Sunday&apos;]
          - ansible_date_time.hour | int &amp;gt;= 2
          - ansible_date_time.hour | int &amp;lt;= 6
        fail_msg: &quot;Outside change window (Sat-Sun 02:00-06:00)&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5. Validation After Deploy&lt;/h3&gt;
&lt;p&gt;Don&apos;t just deploy and hope:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;- name: Post-deployment validation
  hosts: vyos_routers
  tasks:
    - name: Check BGP sessions
      vyos.vyos.vyos_command:
        commands:
          - show ip bgp summary
      register: bgp_status

    - name: Verify BGP established
      assert:
        that:
          - &quot;&apos;Established&apos; in bgp_status.stdout[0]&quot;
        fail_msg: &quot;BGP session not established!&quot;

    - name: Check VRRP status
      vyos.vyos.vyos_command:
        commands:
          - show vrrp
      register: vrrp_status

    - name: Check route count
      vyos.vyos.vyos_command:
        commands:
          - show ip route summary
      register: route_count
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;GitOps Workflow&lt;/h2&gt;
&lt;p&gt;Full GitOps: Git is the source of truth. Changes go through Git, not directly to routers.&lt;/p&gt;
&lt;h3&gt;Workflow&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Engineer creates branch
2. Edits config in vars/ or templates/
3. Runs generate.py locally
4. Commits generated config
5. Opens PR
6. Colleague reviews diff
7. CI validates (syntax, linting)
8. PR merged
9. CD pipeline deploys to routers
10. Monitoring confirms success
&lt;/code&gt;&lt;/pre&gt;
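&lt;p&gt;The validation in step 7 can start very small. A sketch of a lint pass (my own illustration, not a standard tool) that rejects any line that isn&apos;t a &lt;code&gt;set&lt;/code&gt; command or a comment:&lt;/p&gt;

```python
# minimal lint for generated VyOS configs (illustration only):
# every non-blank, non-comment line must be a "set ..." command
def lint_config(text):
    bad = []
    for lineno, line in enumerate(text.splitlines(), 1):
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and not stripped.startswith("set "):
            bad.append((lineno, stripped))
    return bad  # list of (line number, offending line)

sample = "set system host-name 'r1'\n# comment\nste typo here\n"
print(lint_config(sample))  # [(3, 'ste typo here')]
```

&lt;p&gt;Run it in CI before the drift check; a typo like &lt;code&gt;ste&lt;/code&gt; fails the PR instead of the router.&lt;/p&gt;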
&lt;h3&gt;CI Pipeline (GitHub Actions Example)&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;.github/workflows/validate.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;name: Validate Config

on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: &apos;3.11&apos;

      - name: Install dependencies
        run: pip install jinja2 pyyaml

      - name: Generate configs
        run: |
          for router in vars/per-router/*.yml; do
            name=$(basename &quot;$router&quot; .yml)
            python scripts/generate.py &quot;$name&quot; &amp;gt; &quot;configs/$name.cfg&quot;
          done

      - name: Check for config drift
        run: |
          git diff --exit-code configs/
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;CD Pipeline&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;.github/workflows/deploy.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;name: Deploy Config

on:
  push:
    branches: [main]
    paths:
      - &apos;configs/**&apos;

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Ansible
        run: |
          pip install ansible
          ansible-galaxy collection install vyos.vyos

      - name: Deploy to staging
        run: |
          ansible-playbook -i inventory/staging.yml playbooks/apply-config.yml

      - name: Validate staging
        run: |
          ansible-playbook -i inventory/staging.yml playbooks/validate.yml

      - name: Deploy to production
        run: |
          ansible-playbook -i inventory/production.yml playbooks/apply-config.yml
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Automation reduces manual errors only if you set ground rules.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Automation without process is just automated mistakes. The value comes from:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Version control&lt;/strong&gt;: Every change tracked, reviewable, revertible&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code review&lt;/strong&gt;: Someone else catches your typos&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Testing&lt;/strong&gt;: Validate before production&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Staged rollout&lt;/strong&gt;: Break staging, not production&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast rollback&lt;/strong&gt;: Recover in minutes, not hours&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The router config should never be edited directly. Changes flow through Git. If it&apos;s not in Git, it didn&apos;t happen (or it shouldn&apos;t have).&lt;/p&gt;
&lt;p&gt;Start small. Automate backups first — that&apos;s pure upside. Then move to templated configs. Then add Ansible deployment. Then CI/CD. Each step reduces risk and increases confidence.&lt;/p&gt;
&lt;p&gt;The goal isn&apos;t to eliminate human involvement. It&apos;s to move humans from &quot;typing commands at 2 AM&quot; to &quot;reviewing diffs in daylight.&quot; That&apos;s where we make fewer mistakes.&lt;/p&gt;
</content:encoded><category>vyos</category><category>automation</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>High Availability: VRRP + State Sync (What You Can and Can&apos;t Do)</title><link>https://ashimov.com/posts/vyos-ha/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-ha/</guid><description>Honest guide to VyOS high availability using VRRP and conntrack sync. Covers failover configuration, state synchronization, what actually fails over and what doesn&apos;t, testing procedures, and why HA is a set of failure scenarios, not a checkbox.</description><pubDate>Fri, 14 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;High availability sounds simple: two routers, one fails, the other takes over. Users don&apos;t notice. Uptime maintained. Check the HA box and move on.&lt;/p&gt;
&lt;p&gt;Reality is messier. VRRP can fail over an IP address in seconds. But what about active connections? NAT state? BGP sessions? Firewall sessions? Some of this can be synchronized. Some can&apos;t. Some can, but with caveats that matter.&lt;/p&gt;
&lt;p&gt;This is an honest guide to VyOS HA. What works, what doesn&apos;t, and how to test so you find out before production does.&lt;/p&gt;
&lt;h2&gt;VRRP Basics&lt;/h2&gt;
&lt;p&gt;VRRP (Virtual Router Redundancy Protocol) provides a virtual IP (VIP) shared between two or more routers. One is master, others are backup. If the master fails, a backup takes the VIP.&lt;/p&gt;
&lt;p&gt;Clients point to the VIP. They don&apos;t care which physical router is answering.&lt;/p&gt;
&lt;h3&gt;Basic VRRP Configuration&lt;/h3&gt;
&lt;p&gt;Two VyOS routers: R1 (primary) and R2 (backup).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;R1 (Primary):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set interfaces ethernet eth0 address &apos;10.0.0.2/24&apos;
set interfaces ethernet eth0 description &apos;LAN&apos;

set high-availability vrrp group LAN vrid &apos;10&apos;
set high-availability vrrp group LAN interface &apos;eth0&apos;
set high-availability vrrp group LAN virtual-address &apos;10.0.0.1/24&apos;
set high-availability vrrp group LAN priority &apos;200&apos;
set high-availability vrrp group LAN preempt

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;R2 (Backup):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set interfaces ethernet eth0 address &apos;10.0.0.3/24&apos;
set interfaces ethernet eth0 description &apos;LAN&apos;

set high-availability vrrp group LAN vrid &apos;10&apos;
set high-availability vrrp group LAN interface &apos;eth0&apos;
set high-availability vrrp group LAN virtual-address &apos;10.0.0.1/24&apos;
set high-availability vrrp group LAN priority &apos;100&apos;
set high-availability vrrp group LAN preempt

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key settings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;vrid&lt;/strong&gt;: Virtual Router ID. Must match on both routers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;virtual-address&lt;/strong&gt;: The shared IP clients use (10.0.0.1).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;priority&lt;/strong&gt;: Higher wins. R1 at 200 beats R2 at 100.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;preempt&lt;/strong&gt;: If R1 recovers, it reclaims master status.&lt;/li&gt;
&lt;/ul&gt;
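&lt;p&gt;The election logic behind these settings is simple enough to sketch: the highest priority wins, and per the VRRP RFCs a priority tie breaks on the highest primary IP address. A toy model (my sketch, not keepalived code):&lt;/p&gt;

```python
# toy model of VRRP master election (sketch of the RFC 3768 rules):
# highest priority wins; ties break on the highest primary IP address
from ipaddress import ip_address

def elect_master(routers):
    # routers: iterable of (name, priority, primary_ip) tuples
    return max(routers, key=lambda r: (r[1], ip_address(r[2])))[0]

print(elect_master([("R1", 200, "10.0.0.2"), ("R2", 100, "10.0.0.3")]))  # R1
print(elect_master([("R1", 100, "10.0.0.2"), ("R2", 100, "10.0.0.3")]))  # R2 (tie on priority)
```

&lt;p&gt;This is why equal priorities are a configuration smell: the tie-break is deterministic but rarely what anyone intended.&lt;/p&gt;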
&lt;p&gt;Verify:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show vrrp
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Multiple VRRP Groups&lt;/h2&gt;
&lt;p&gt;Most routers have multiple interfaces. Each needs its own VRRP group:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# LAN side
set high-availability vrrp group LAN vrid &apos;10&apos;
set high-availability vrrp group LAN interface &apos;eth0&apos;
set high-availability vrrp group LAN virtual-address &apos;10.0.0.1/24&apos;
set high-availability vrrp group LAN priority &apos;200&apos;

# WAN side
set high-availability vrrp group WAN vrid &apos;20&apos;
set high-availability vrrp group WAN interface &apos;eth1&apos;
set high-availability vrrp group WAN virtual-address &apos;203.0.113.1/24&apos;
set high-availability vrrp group WAN priority &apos;200&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Sync Groups: Fail Together&lt;/h2&gt;
&lt;p&gt;If LAN interface fails but WAN is fine, you want BOTH to fail over. Otherwise, traffic enters one router and tries to exit another — asymmetric routing disaster.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set high-availability vrrp sync-group MAIN member &apos;LAN&apos;
set high-availability vrrp sync-group MAIN member &apos;WAN&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now if either interface fails, both VRRP groups fail over together.&lt;/p&gt;
&lt;h2&gt;What VRRP Does NOT Do&lt;/h2&gt;
&lt;p&gt;VRRP fails over IP addresses. That&apos;s it. It does NOT automatically:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transfer active TCP connections&lt;/li&gt;
&lt;li&gt;Sync NAT translation tables&lt;/li&gt;
&lt;li&gt;Maintain firewall connection state&lt;/li&gt;
&lt;li&gt;Preserve BGP sessions&lt;/li&gt;
&lt;li&gt;Sync DHCP leases&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For that, you need additional sync mechanisms.&lt;/p&gt;
&lt;h2&gt;Connection Tracking Sync (Conntrack)&lt;/h2&gt;
&lt;p&gt;VyOS can synchronize its connection tracking table between routers. This means established connections (TCP sessions, NAT translations) survive failover.&lt;/p&gt;
&lt;h3&gt;Conntrack Sync Configuration&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;On both routers:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define sync interface (dedicated link between routers)
set service conntrack-sync interface &apos;eth2&apos;
set service conntrack-sync failover-mechanism vrrp sync-group &apos;MAIN&apos;
set service conntrack-sync accept-protocol &apos;tcp,udp,icmp&apos;

# Optional: Exclude local traffic
set service conntrack-sync ignore-address ipv4-address &apos;10.0.0.2&apos;
set service conntrack-sync ignore-address ipv4-address &apos;10.0.0.3&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dedicated interface between routers (eth2 in this example)&lt;/li&gt;
&lt;li&gt;This interface should be a direct (crossover) link or on an isolated VLAN&lt;/li&gt;
&lt;li&gt;NOT through the same switch that might fail&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What Conntrack Sync Does&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Syncs NAT translation table (internal→external mappings)&lt;/li&gt;
&lt;li&gt;Syncs connection states (ESTABLISHED, RELATED)&lt;/li&gt;
&lt;li&gt;Allows TCP connections to survive failover (mostly)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What Conntrack Sync Does NOT Do&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Guarantee zero packet loss during failover&lt;/li&gt;
&lt;li&gt;Sync application-layer state&lt;/li&gt;
&lt;li&gt;Help with stateless protocols beyond basic tracking&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The &quot;Mostly&quot; Caveat&lt;/h3&gt;
&lt;p&gt;TCP connections &lt;em&gt;can&lt;/em&gt; survive, but:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Packets in flight are lost.&lt;/strong&gt; During failover (typically 1-3 seconds), packets are dropped. TCP will retransmit, but there&apos;s a gap.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sequence number issues.&lt;/strong&gt; Sometimes the new master&apos;s kernel disagrees about TCP sequence numbers. Connection may reset.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Asymmetric routing.&lt;/strong&gt; If return traffic goes to wrong router, connections break. Sync groups help, but network design matters.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Realistic expectation:&lt;/strong&gt; Long-lived connections (SSH sessions, database connections) usually survive. Short requests during failover may fail and retry. Users experience a brief hiccup, not a disconnect.&lt;/p&gt;
&lt;h2&gt;What You CAN&apos;T Sync&lt;/h2&gt;
&lt;h3&gt;BGP Sessions&lt;/h3&gt;
&lt;p&gt;BGP sessions are between your router and peer. When you fail over:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New master has different source IP (its physical IP)&lt;/li&gt;
&lt;li&gt;Peer sees different neighbor&lt;/li&gt;
&lt;li&gt;BGP session must re-establish&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This takes seconds to minutes depending on timers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Use aggressive BGP timers, BFD, and accept that BGP convergence is part of failover time.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Aggressive keepalive (3s) and hold (9s)
set protocols bgp neighbor 198.51.100.1 timers keepalive &apos;3&apos;
set protocols bgp neighbor 198.51.100.1 timers holdtime &apos;9&apos;

# BFD for faster detection
set protocols bgp neighbor 198.51.100.1 bfd

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;IPsec Tunnels&lt;/h3&gt;
&lt;p&gt;IPsec SAs are bound to specific IPs. On failover:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;IKE SAs re-negotiate&lt;/li&gt;
&lt;li&gt;Child SAs re-establish&lt;/li&gt;
&lt;li&gt;Tunnel is down for seconds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Use DPD (Dead Peer Detection) with short intervals. Accept brief tunnel downtime.&lt;/p&gt;
&lt;h3&gt;Routing Protocol State&lt;/h3&gt;
&lt;p&gt;OSPF neighbor relationships, BGP tables — none of this syncs. The new master starts fresh:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OSPF: Neighbors detect failure via dead interval, then re-elect&lt;/li&gt;
&lt;li&gt;BGP: Sessions reset, routes re-exchanged&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Application Sessions&lt;/h3&gt;
&lt;p&gt;If you&apos;re running services on VyOS (rare, but possible):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;DHCP leases: Can sync with ISC DHCP failover, but VyOS config is separate&lt;/li&gt;
&lt;li&gt;DNS cache: Lost&lt;/li&gt;
&lt;li&gt;Any local state: Lost&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Testing Failover&lt;/h2&gt;
&lt;p&gt;HA that isn&apos;t tested is HA that doesn&apos;t work. Test before production.&lt;/p&gt;
&lt;h3&gt;Test 1: Clean Failover&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On primary, simulate failure
sudo ip link set eth0 down

# Watch secondary take over
show vrrp

# Verify traffic flows
# From a client, ping the VIP continuously
ping 10.0.0.1

# Restore
sudo ip link set eth0 up
&lt;/code&gt;&lt;/pre&gt;
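&lt;p&gt;To turn the continuous ping into a number, log each probe and measure the longest run of losses. A minimal sketch, assuming a one-second probe interval and a &lt;code&gt;(timestamp, succeeded)&lt;/code&gt; log format of my own invention:&lt;/p&gt;

```python
# estimate failover downtime from a probe log (format is my invention):
# a list of (timestamp, succeeded) pairs taken at one-second intervals
def longest_gap(probes):
    gap = best = 0
    for _, ok in probes:
        gap = 0 if ok else gap + 1
        best = max(best, gap)
    return best  # longest run of lost probes = seconds of downtime

# simulated log: probes at t=10, 11, 12 were lost during failover
log = [(t, t not in (10, 11, 12)) for t in range(20)]
print(longest_gap(log))  # 3
```

&lt;p&gt;Record this number for every drill; a failover that used to take 2 seconds and now takes 10 is a regression worth investigating.&lt;/p&gt;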
&lt;h3&gt;Test 2: Primary Recovery (Preemption)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Ensure preempt is enabled
# Take down primary, let secondary take over
# Bring primary back up
# Verify primary reclaims master
show vrrp
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test 3: Connection Survival&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Start long-running connection through router
# SSH through the VIP to a server on the other side
ssh user@server-behind-router

# Fail over primary
# Check if SSH session survives
# It should pause briefly then continue
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test 4: Split Brain&lt;/h3&gt;
&lt;p&gt;What if the sync link fails but both routers are up?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Disconnect eth2 (sync interface)
# Both routers think they&apos;re alone
# Both might become master = split brain

# VyOS should still function, but conntrack sync stops
# This is a failure mode to document, not prevent
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test 5: Dual Failure&lt;/h3&gt;
&lt;p&gt;What if both routers fail?&lt;/p&gt;
&lt;p&gt;This isn&apos;t HA&apos;s job. HA handles single failures. Document that a dual failure means an outage, and set expectations accordingly.&lt;/p&gt;
&lt;h2&gt;VRRP Tuning&lt;/h2&gt;
&lt;h3&gt;Advertisement Interval&lt;/h3&gt;
&lt;p&gt;Default is 1 second. Faster detection = faster failover, but more network traffic.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set high-availability vrrp group LAN advertise-interval &apos;1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For sub-second failover, some use 0.1-0.5 second intervals (this requires VRRPv3, which times advertisements in centiseconds). Be careful — this is more sensitive to network jitter.&lt;/p&gt;
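&lt;p&gt;The worst-case detection time follows directly from the protocol timers. For VRRPv2 (RFC 3768), a backup declares the master down after three missed advertisements plus a priority-derived skew:&lt;/p&gt;

```python
# worst-case VRRP failure detection (VRRPv2, RFC 3768):
# master_down_interval = 3 * advertisement_interval + skew_time,
# where skew_time = (256 - priority) / 256 seconds
def master_down_interval(adv_interval_s, priority):
    skew = (256 - priority) / 256
    return 3 * adv_interval_s + skew

print(round(master_down_interval(1.0, 100), 3))   # 3.609  (default 1 s interval)
print(round(master_down_interval(0.25, 100), 3))  # 1.359  (aggressive tuning)
```

&lt;p&gt;Lowering the interval cuts detection time roughly in proportion, which is why sub-second intervals are tempting despite the jitter sensitivity.&lt;/p&gt;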
&lt;h3&gt;Preempt Delay&lt;/h3&gt;
&lt;p&gt;When primary recovers, don&apos;t immediately preempt. Let it stabilize:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set high-availability vrrp group LAN preempt-delay &apos;30&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Primary must be up for 30 seconds before reclaiming master. Prevents flapping.&lt;/p&gt;
&lt;h3&gt;Health Check Scripts&lt;/h3&gt;
&lt;p&gt;Fail over based on more than interface status:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set high-availability vrrp group LAN health-check script &apos;/config/scripts/check-uplink.sh&apos;
set high-availability vrrp group LAN health-check interval &apos;5&apos;
set high-availability vrrp group LAN health-check failure-count &apos;3&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example script (&lt;code&gt;/config/scripts/check-uplink.sh&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Check if upstream is reachable
ping -c 1 -W 1 198.51.100.1 &amp;gt; /dev/null 2&amp;gt;&amp;amp;1
exit $?
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If script returns non-zero 3 times, VRRP fails over.&lt;/p&gt;
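&lt;p&gt;The debounce behaviour is worth internalizing: a single failed probe does nothing; only a streak triggers failover. A toy model of the failure-count logic (my sketch, not keepalived&apos;s code):&lt;/p&gt;

```python
# toy model of the health-check debounce (illustration only):
# fail over only after failure_count consecutive non-zero script results
def should_failover(results, failure_count=3):
    streak = 0
    for rc in results:
        streak = streak + 1 if rc != 0 else 0
        if streak >= failure_count:
            return True
    return False

print(should_failover([0, 1, 1, 0, 1]))  # False: never 3 failures in a row
print(should_failover([0, 1, 1, 1]))     # True: 3 consecutive failures
```

&lt;p&gt;Tune &lt;code&gt;interval&lt;/code&gt; and &lt;code&gt;failure-count&lt;/code&gt; together: 5-second probes with a count of 3 means roughly 15 seconds before an uplink failure triggers failover.&lt;/p&gt;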
&lt;h2&gt;Realistic HA Architecture&lt;/h2&gt;
&lt;p&gt;A production VyOS HA setup:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                    ┌─────────────────┐
                    │    Internet     │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
        ┌─────┴─────┐               ┌───────┴───────┐
        │  R1 (Pri) │───sync link───│  R2 (Backup)  │
        │ eth1: WAN │     eth2      │  eth1: WAN    │
        │ eth0: LAN │               │  eth0: LAN    │
        └─────┬─────┘               └───────┬───────┘
              │ VIP: 10.0.0.1               │
              └──────────────┬──────────────┘
                             │
                    ┌────────┴────────┐
                    │   LAN Switch    │
                    └────────┬────────┘
                             │
                    ┌────────┴────────┐
                    │     Clients     │
                    └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dedicated sync link (eth2) — not through the LAN switch&lt;/li&gt;
&lt;li&gt;Both routers connect to the same LAN switch (a single point of failure, but usually acceptable)&lt;/li&gt;
&lt;li&gt;VIP is what clients use&lt;/li&gt;
&lt;li&gt;If the LAN switch fails, both routers are useless anyway&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;HA is not a checkbox. It&apos;s a set of failure scenarios and tests.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;VRRP gives you IP failover in seconds. Conntrack sync gives you connection state (mostly). But:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;BGP sessions reset&lt;/li&gt;
&lt;li&gt;IPsec tunnels re-establish&lt;/li&gt;
&lt;li&gt;Application state is lost&lt;/li&gt;
&lt;li&gt;Brief packet loss happens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;HA means: single router failure doesn&apos;t cause outage. It doesn&apos;t mean zero impact. Users may see a brief hiccup. Long connections survive but might stutter. This is acceptable for most use cases.&lt;/p&gt;
&lt;p&gt;What makes HA work isn&apos;t the configuration — it&apos;s the testing. Every failure scenario you test is one you understand. Every one you skip is one that surprises you at 3 AM.&lt;/p&gt;
&lt;p&gt;Document your failure modes. Test your failover. Know exactly what happens when the primary dies. That&apos;s HA.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>ha</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>VRF &amp; Segmentation: When VLANs Aren&apos;t Enough</title><link>https://ashimov.com/posts/vyos-vrf/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-vrf/</guid><description>Using VRF on VyOS for network isolation that goes beyond VLANs. Covers VRF creation, inter-VRF routing, route leaking, firewalling between VRFs, and maintaining a clear mental model of your segmentation.</description><pubDate>Tue, 11 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;VLANs give you Layer 2 separation. Different broadcast domains, different subnets. But they all share the same routing table. When a server in VLAN 10 wants to reach a server in VLAN 20, your router sees both networks, compares the destination to its single routing table, and forwards.&lt;/p&gt;
&lt;p&gt;VRF (Virtual Routing and Forwarding) gives you something VLANs can&apos;t: completely separate routing tables. Traffic in VRF &quot;Production&quot; has no idea that VRF &quot;Management&quot; exists. They&apos;re parallel universes on the same hardware.&lt;/p&gt;
&lt;p&gt;When do you need this? When VLAN isolation isn&apos;t enough. Multi-tenant environments, management plane separation, compliance requirements, or just wanting to ensure that a routing mistake in one segment can&apos;t affect another.&lt;/p&gt;
&lt;h2&gt;VRF Fundamentals&lt;/h2&gt;
&lt;p&gt;A VRF is an isolated routing instance. It has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Its own routing table&lt;/li&gt;
&lt;li&gt;Its own interfaces&lt;/li&gt;
&lt;li&gt;Its own routing protocols (BGP, OSPF, static routes)&lt;/li&gt;
&lt;li&gt;No visibility into other VRFs unless you explicitly leak routes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Think of it as running multiple virtual routers on one box.&lt;/p&gt;
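&lt;p&gt;A toy model makes the isolation concrete: each VRF is its own lookup table, consulted independently, so a destination that resolves in one table simply doesn&apos;t exist in another:&lt;/p&gt;

```python
# toy model of VRF isolation (illustration only): one routing table per
# VRF, each looked up independently
import ipaddress

TABLES = {
    "PRODUCTION": {"10.1.0.0/24": "eth1"},
    "MANAGEMENT": {"10.100.0.0/24": "eth0.100"},
}

def lookup(vrf, dst):
    for prefix, iface in TABLES[vrf].items():
        if ipaddress.ip_address(dst) in ipaddress.ip_network(prefix):
            return iface
    return None  # no route in this VRF's table

print(lookup("PRODUCTION", "10.1.0.5"))  # eth1
print(lookup("MANAGEMENT", "10.1.0.5"))  # None: the route exists only in PRODUCTION
```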
&lt;h2&gt;Creating VRFs on VyOS&lt;/h2&gt;
&lt;p&gt;Basic VRF setup:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create VRFs
set vrf name PRODUCTION description &apos;Production workloads&apos;
set vrf name PRODUCTION table &apos;100&apos;

set vrf name MANAGEMENT description &apos;Management and monitoring&apos;
set vrf name MANAGEMENT table &apos;200&apos;

set vrf name DMZ description &apos;Public-facing services&apos;
set vrf name DMZ table &apos;300&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;table&lt;/code&gt; parameter is the routing table ID. Each VRF needs a unique one.&lt;/p&gt;
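&lt;p&gt;It&apos;s worth validating table IDs before they reach a router. A small sanity check (my own sketch): IDs must be unique, and Linux reserves tables 253-255 (default, main, local):&lt;/p&gt;

```python
# sanity check for a VRF plan (illustration only): table IDs must be
# unique and avoid the tables Linux reserves (253 default, 254 main, 255 local)
RESERVED_TABLES = {253, 254, 255}

def validate_vrfs(vrfs):
    # vrfs: mapping of VRF name to routing table ID
    ids = list(vrfs.values())
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate table ID")
    clash = RESERVED_TABLES.intersection(ids)
    if clash:
        raise ValueError(f"reserved table ID used: {sorted(clash)}")

validate_vrfs({"PRODUCTION": 100, "MANAGEMENT": 200, "DMZ": 300})  # passes
```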
&lt;h2&gt;Assigning Interfaces to VRFs&lt;/h2&gt;
&lt;p&gt;Interfaces must be assigned to a VRF. An interface can only belong to one VRF at a time.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Physical interface in Production
set interfaces ethernet eth1 vrf &apos;PRODUCTION&apos;
set interfaces ethernet eth1 address &apos;10.1.0.1/24&apos;
set interfaces ethernet eth1 description &apos;Production LAN&apos;

# VLAN interfaces in different VRFs
set interfaces ethernet eth0 vif 100 vrf &apos;MANAGEMENT&apos;
set interfaces ethernet eth0 vif 100 address &apos;10.100.0.1/24&apos;

set interfaces ethernet eth0 vif 200 vrf &apos;DMZ&apos;
set interfaces ethernet eth0 vif 200 address &apos;10.200.0.1/24&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Once an interface is in a VRF, its routes only exist in that VRF&apos;s table. The main routing table won&apos;t see them.&lt;/p&gt;
&lt;h2&gt;Viewing VRF Status&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# List all VRFs
show vrf

# Show routes in a specific VRF
show ip route vrf PRODUCTION

# Show interfaces in a VRF
show vrf PRODUCTION

# Ping from a specific VRF
ping 10.1.0.5 vrf PRODUCTION
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Static Routes in VRFs&lt;/h2&gt;
&lt;p&gt;Static routes can be VRF-specific:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Default route in Production VRF
set protocols static route 0.0.0.0/0 next-hop 10.1.0.254 vrf &apos;PRODUCTION&apos;

# Specific route in Management VRF
set protocols static route 10.0.0.0/8 next-hop 10.100.0.254 vrf &apos;MANAGEMENT&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Running Routing Protocols in VRFs&lt;/h2&gt;
&lt;p&gt;Each VRF can run its own routing protocol instances:&lt;/p&gt;
&lt;h3&gt;OSPF per VRF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# OSPF in Production VRF
set protocols ospf vrf PRODUCTION area 0 network &apos;10.1.0.0/24&apos;
set protocols ospf vrf PRODUCTION parameters router-id &apos;10.1.0.1&apos;
set protocols ospf vrf PRODUCTION redistribute connected

# OSPF in Management VRF (completely separate)
set protocols ospf vrf MANAGEMENT area 0 network &apos;10.100.0.0/24&apos;
set protocols ospf vrf MANAGEMENT parameters router-id &apos;10.100.0.1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;BGP per VRF&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# BGP for Production (AS 65001)
set protocols bgp vrf PRODUCTION system-as &apos;65001&apos;
set protocols bgp vrf PRODUCTION neighbor 10.1.0.254 remote-as &apos;65000&apos;

# BGP for DMZ (different AS or same, depending on design)
set protocols bgp vrf DMZ system-as &apos;65002&apos;
set protocols bgp vrf DMZ neighbor 10.200.0.254 remote-as &apos;65000&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Inter-VRF Routing (Route Leaking)&lt;/h2&gt;
&lt;p&gt;Sometimes you need controlled communication between VRFs. A jump host in Management needs to reach Production. A monitoring server needs visibility into all VRFs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option 1: Static route leaking&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Leak Production network into Management VRF
set protocols static route 10.1.0.0/24 interface &apos;eth1&apos; vrf &apos;MANAGEMENT&apos;
set protocols static route 10.1.0.0/24 next-hop-vrf &apos;PRODUCTION&apos; vrf &apos;MANAGEMENT&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tells the Management VRF: &quot;to reach 10.1.0.0/24, look in the Production VRF&apos;s routing table.&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option 2: Import/Export with BGP (MP-BGP)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For complex scenarios, use BGP to import/export routes between VRFs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define route distinguisher and route targets
set protocols bgp vrf PRODUCTION address-family ipv4-unicast rd &apos;65000:100&apos;
set protocols bgp vrf PRODUCTION address-family ipv4-unicast route-target export &apos;65000:100&apos;
set protocols bgp vrf PRODUCTION address-family ipv4-unicast route-target import &apos;65000:200&apos;

set protocols bgp vrf MANAGEMENT address-family ipv4-unicast rd &apos;65000:200&apos;
set protocols bgp vrf MANAGEMENT address-family ipv4-unicast route-target export &apos;65000:200&apos;
set protocols bgp vrf MANAGEMENT address-family ipv4-unicast route-target import &apos;65000:100&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the MPLS VPN model applied to VRFs. Routes tagged with RT 65000:100 (from Production) get imported into Management, and vice versa.&lt;/p&gt;
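&lt;p&gt;The matching rule is mechanical: a VRF imports a route when any of the route&apos;s export RTs appears in the VRF&apos;s import list. A toy model:&lt;/p&gt;

```python
# toy model of BGP route-target matching (illustration only)
def imported_routes(routes, import_rts):
    # routes: list of (prefix, set of export route-targets)
    wanted = set(import_rts)
    return [prefix for prefix, rts in routes if rts.intersection(wanted)]

routes = [
    ("10.1.0.0/24", {"65000:100"}),    # exported by PRODUCTION
    ("10.200.0.0/24", {"65000:300"}),  # exported by DMZ
]
print(imported_routes(routes, ["65000:100"]))  # ['10.1.0.0/24']
```

&lt;p&gt;With the configuration above, Management imports RT 65000:100, so it sees Production&apos;s prefixes but not the DMZ&apos;s.&lt;/p&gt;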
&lt;h2&gt;Firewalling Between VRFs&lt;/h2&gt;
&lt;p&gt;Leaking routes doesn&apos;t mean allowing all traffic. Use firewall rules to control inter-VRF communication:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Zone-based approach
set firewall zone PRODUCTION member interface &apos;eth1&apos;
set firewall zone MANAGEMENT member interface &apos;eth0.100&apos;
set firewall zone DMZ member interface &apos;eth0.200&apos;

# Policy: Management can reach Production (SSH only)
set firewall ipv4 name MGMT-TO-PROD default-action &apos;drop&apos;
set firewall ipv4 name MGMT-TO-PROD rule 10 action &apos;accept&apos;
set firewall ipv4 name MGMT-TO-PROD rule 10 destination port &apos;22&apos;
set firewall ipv4 name MGMT-TO-PROD rule 10 protocol &apos;tcp&apos;
set firewall ipv4 name MGMT-TO-PROD rule 10 state &apos;new&apos;

# Policy: Production cannot initiate to Management
set firewall ipv4 name PROD-TO-MGMT default-action &apos;drop&apos;
set firewall ipv4 name PROD-TO-MGMT rule 10 action &apos;accept&apos;
set firewall ipv4 name PROD-TO-MGMT rule 10 state &apos;established&apos;
set firewall ipv4 name PROD-TO-MGMT rule 10 state &apos;related&apos;

# Apply zone policies
set firewall zone PRODUCTION from MANAGEMENT firewall name &apos;MGMT-TO-PROD&apos;
set firewall zone MANAGEMENT from PRODUCTION firewall name &apos;PROD-TO-MGMT&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VRF for Management Plane Separation&lt;/h2&gt;
&lt;p&gt;A common pattern: keep management traffic completely separate from data traffic.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Management VRF
set vrf name MGMT table &apos;999&apos;

# Management interface
set interfaces ethernet eth0 vrf &apos;MGMT&apos;
set interfaces ethernet eth0 address &apos;192.168.255.1/24&apos;
set interfaces ethernet eth0 description &apos;Out-of-band management&apos;

# SSH binds to Management VRF
set service ssh vrf &apos;MGMT&apos;

# SNMP in Management VRF
set service snmp vrf &apos;MGMT&apos;
# Use a non-default community string (never &apos;public&apos; in production)
set service snmp community monitor authorization &apos;ro&apos;

# NTP in Management VRF
set service ntp vrf &apos;MGMT&apos;
set service ntp server 192.168.255.10

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now SSH, SNMP, and NTP only work through the management interface. Someone on the production network can&apos;t SSH to the router&apos;s production-facing IP — that IP isn&apos;t listening.&lt;/p&gt;
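&lt;p&gt;A quick way to prove that claim to yourself (10.1.0.1 is a stand-in for the router&apos;s production-facing address; adjust for your addressing):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On the router: sshd should be bound inside the MGMT VRF only
ss -tlnp | grep sshd

# From a production-side host: the connection should be refused or
# time out, because nothing listens on the production-facing IP
ssh admin@10.1.0.1
&lt;/code&gt;&lt;/pre&gt;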
&lt;h2&gt;The Mental Model&lt;/h2&gt;
&lt;p&gt;VRF makes complex networks manageable, but only if you maintain a clear mental model. Here&apos;s how to think about it:&lt;/p&gt;
&lt;h3&gt;1. Draw Your VRF Topology&lt;/h3&gt;
&lt;p&gt;Before configuring, diagram:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What VRFs exist&lt;/li&gt;
&lt;li&gt;What interfaces belong to each&lt;/li&gt;
&lt;li&gt;What routes leak between them&lt;/li&gt;
&lt;li&gt;What firewall rules control inter-VRF traffic&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│                    VyOS Router                      │
├─────────────┬─────────────────┬────────────────────┤
│    MGMT     │   PRODUCTION    │        DMZ         │
│  Table 999  │    Table 100    │     Table 300      │
├─────────────┼─────────────────┼────────────────────┤
│ eth0        │ eth1            │ eth2               │
│ 192.168.x.x │ 10.1.0.0/16     │ 10.200.0.0/24      │
├─────────────┴─────────────────┴────────────────────┤
│ Route Leaking:                                      │
│ - MGMT → PROD: 10.1.0.0/24 (monitoring)            │
│ - PROD → DMZ: 10.200.0.0/24 (web backends)         │
│ - DMZ → PROD: NONE (DMZ can&apos;t initiate to prod)    │
└─────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Default to Isolation&lt;/h3&gt;
&lt;p&gt;Start with VRFs completely isolated. Only leak routes when there&apos;s a documented requirement. Every route leak is a potential security boundary crossing.&lt;/p&gt;
&lt;h3&gt;3. Document Why&lt;/h3&gt;
&lt;p&gt;For every inter-VRF route, document:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Why it&apos;s needed&lt;/li&gt;
&lt;li&gt;Who approved it&lt;/li&gt;
&lt;li&gt;What firewall rules protect it&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# In VyOS config, use descriptions
set protocols static route 10.1.0.0/24 next-hop-vrf &apos;PRODUCTION&apos; vrf &apos;MANAGEMENT&apos;
# Description: &quot;Monitoring servers in MGMT need to reach Prod for health checks. Approved: 2024-01. Protected by MGMT-TO-PROD firewall rules.&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Test Isolation&lt;/h3&gt;
&lt;p&gt;Regularly verify that isolation is working:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Ping the Management address sourced from the DMZ VRF (should fail)
ping 192.168.255.1 vrf DMZ
# Should timeout or be rejected

# Verify routes aren&apos;t leaking unexpectedly
show ip route vrf DMZ
# Should NOT see Management networks
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Real-World Example: Multi-Tenant Router&lt;/h2&gt;
&lt;p&gt;Service provider scenario: one VyOS router handles multiple customers, each in their own VRF.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Customer A
set vrf name CUSTOMER-A table &apos;1001&apos;
set interfaces ethernet eth1 vif 100 vrf &apos;CUSTOMER-A&apos;
set interfaces ethernet eth1 vif 100 address &apos;10.100.1.1/24&apos;
set protocols bgp vrf CUSTOMER-A system-as &apos;65100&apos;
set protocols bgp vrf CUSTOMER-A neighbor 10.100.1.2 remote-as &apos;65100&apos;

# Customer B
set vrf name CUSTOMER-B table &apos;1002&apos;
set interfaces ethernet eth1 vif 200 vrf &apos;CUSTOMER-B&apos;
set interfaces ethernet eth1 vif 200 address &apos;10.100.2.1/24&apos;
set protocols bgp vrf CUSTOMER-B system-as &apos;65200&apos;
set protocols bgp vrf CUSTOMER-B neighbor 10.100.2.2 remote-as &apos;65200&apos;

# Customers cannot see each other — complete isolation
# Each has their own BGP session, their own routes, their own world

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Customer A and B could both use 10.0.0.0/8 internally. No conflict — they&apos;re in different routing tables.&lt;/p&gt;
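&lt;p&gt;Verification is also per-VRF: each customer&apos;s table is inspected on its own (op-mode paths vary slightly by release):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show ip route vrf CUSTOMER-A
show bgp vrf CUSTOMER-A summary
show ip route vrf CUSTOMER-B
&lt;/code&gt;&lt;/pre&gt;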
&lt;h2&gt;Troubleshooting VRFs&lt;/h2&gt;
&lt;p&gt;Common issues:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Interface not in VRF&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show interfaces ethernet eth1
# Check &quot;VRF:&quot; field — should show your VRF name
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;2. Routes not appearing&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show ip route vrf PRODUCTION
# If empty, check interface addresses and that interface is up
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;3. Traffic not flowing between VRFs&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check routes exist in source VRF
show ip route vrf MANAGEMENT 10.1.0.0/24

# Check firewall isn&apos;t blocking
show firewall statistics

# Check return path — VRF routing must work both directions
show ip route vrf PRODUCTION 10.100.0.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;4. Services not binding to VRF&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Verify service is configured for VRF
show configuration commands | grep &quot;vrf&quot;

# Check socket bindings
ss -tlnp | grep sshd
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;VRF simplifies complex networks if you keep the model in your head.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;VLANs isolate at Layer 2. VRF isolates at Layer 3. Combined, they give you complete network segmentation. But VRF adds complexity — every interface, every route, every service needs to be VRF-aware.&lt;/p&gt;
&lt;p&gt;The key is maintaining a clear mental model. Know which interfaces are in which VRF. Know what routes leak between them. Know what firewall rules protect those leaks. When you have that model, VRF makes complex multi-tenant or highly-segmented networks manageable. When you lose track of that model, VRF becomes a debugging nightmare.&lt;/p&gt;
&lt;p&gt;Start simple. One or two VRFs. Get comfortable with VRF-aware commands. Expand when you need to. And always, always diagram your VRF topology before you build it.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>routing</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>RPKI/IRR Filtering Strategy: Practical, Not Academic</title><link>https://ashimov.com/posts/vyos-rpki-irr/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-rpki-irr/</guid><description>Real-world BGP route validation using RPKI and IRR on VyOS. Covers validator setup, policy storage, prefix validation workflow, and why filtering is a process, not a single configuration.</description><pubDate>Fri, 07 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;RPKI and IRR filtering aren&apos;t academic exercises. They&apos;re the difference between a stable network and being part of someone else&apos;s route hijack. Every year we see incidents where networks accept hijacked prefixes because they didn&apos;t validate. The tools exist. The question is whether you use them.&lt;/p&gt;
&lt;p&gt;This isn&apos;t a theoretical overview. This is how to actually implement route validation on VyOS and maintain it over time. Because filtering isn&apos;t something you configure once — it&apos;s an ongoing process.&lt;/p&gt;
&lt;h2&gt;The Problem We&apos;re Solving&lt;/h2&gt;
&lt;p&gt;BGP trusts what peers tell it. If your upstream sends you a route for 8.8.8.0/24, BGP will accept it (unless you filter). If that route is a hijack, you&apos;re now directing your users to an attacker. RPKI and IRR give us ways to validate that routes are legitimate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RPKI (Resource Public Key Infrastructure)&lt;/strong&gt;: Cryptographic validation. Route Origin Authorizations (ROAs) are signed by the prefix owner and published in repositories. You query a validator, get the validation state.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IRR (Internet Routing Registry)&lt;/strong&gt;: Database-driven. Networks register their routing policy (what they originate, what they accept). You build filters from these registrations.&lt;/p&gt;
&lt;p&gt;Neither is perfect. Use both.&lt;/p&gt;
&lt;h2&gt;RPKI Validation on VyOS&lt;/h2&gt;
&lt;h3&gt;Step 1: Run an RPKI Validator&lt;/h3&gt;
&lt;p&gt;VyOS connects to an RPKI validator via the RTR (RPKI-to-Router) protocol. You need a validator running somewhere — this can be on the VyOS router itself (limited resources) or preferably on a separate server.&lt;/p&gt;
&lt;p&gt;Popular validators:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Routinator&lt;/strong&gt; (NLnet Labs) — Rust, lightweight, recommended&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;FORT Validator&lt;/strong&gt; — C, LACNIC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;rpki-client&lt;/strong&gt; — OpenBSD team, very lightweight&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example: Running Routinator on a separate Linux server:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On your validator server (Debian/Ubuntu; the package ships in the
# NLnet Labs repository)
apt install routinator

# Configure /etc/routinator/routinator.conf (TOML, top-level keys)
rtr-listen = [&quot;0.0.0.0:3323&quot;]
http-listen = [&quot;0.0.0.0:8323&quot;]

# Start and enable
systemctl enable --now routinator
&lt;/code&gt;&lt;/pre&gt;
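&lt;p&gt;Before pointing routers at it, confirm the validator is healthy. Routinator&apos;s HTTP service exposes a status page and a validity lookup (the AS and prefix below are just examples):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Basic health check
curl -s http://localhost:8323/status

# Ask the validator about a specific announcement
curl -s &quot;http://localhost:8323/validity?asn=AS15169&amp;amp;prefix=8.8.8.0/24&quot;
&lt;/code&gt;&lt;/pre&gt;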
&lt;h3&gt;Step 2: Configure VyOS as RPKI Client&lt;/h3&gt;
&lt;p&gt;Connect VyOS to your validator:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define RPKI cache server
set protocols rpki cache VALIDATOR address &apos;10.0.0.50&apos;
set protocols rpki cache VALIDATOR port &apos;3323&apos;
set protocols rpki cache VALIDATOR preference &apos;1&apos;

# Optional: Set polling interval (default 300 seconds)
set protocols rpki polling-period &apos;300&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify the connection:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show rpki cache-connection
show rpki prefix-table
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You should see prefixes with their validation states: valid, invalid, or not-found.&lt;/p&gt;
&lt;h3&gt;Step 3: Create Validation Policy&lt;/h3&gt;
&lt;p&gt;Here&apos;s where RPKI becomes useful. Create route-maps that act on validation state:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Route-map for upstream: reject invalid, accept valid and unknown
set policy route-map UPSTREAM-RPKI rule 10 action &apos;deny&apos;
set policy route-map UPSTREAM-RPKI rule 10 match rpki &apos;invalid&apos;

set policy route-map UPSTREAM-RPKI rule 20 action &apos;permit&apos;
set policy route-map UPSTREAM-RPKI rule 20 match rpki &apos;valid&apos;

set policy route-map UPSTREAM-RPKI rule 30 action &apos;permit&apos;
set policy route-map UPSTREAM-RPKI rule 30 match rpki &apos;notfound&apos;
set policy route-map UPSTREAM-RPKI rule 30 set local-preference &apos;90&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Why accept notfound?&lt;/strong&gt; RPKI adoption isn&apos;t 100%. Rejecting unknown would break connectivity to many legitimate destinations. Instead, we lower preference — valid routes are preferred when available.&lt;/p&gt;
&lt;h3&gt;Step 4: Apply to BGP Sessions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast route-map import &apos;UPSTREAM-RPKI&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;IRR-Based Filtering&lt;/h2&gt;
&lt;p&gt;RPKI tells you if the origin AS is authorized. IRR tells you what prefixes a network &lt;em&gt;claims&lt;/em&gt; to originate and what their routing policy is. Use IRR to build prefix-lists.&lt;/p&gt;
&lt;h3&gt;Generating Filters from IRR&lt;/h3&gt;
&lt;p&gt;You don&apos;t manually maintain these filters. Tools like &lt;strong&gt;bgpq4&lt;/strong&gt; generate them:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Install bgpq4
apt install bgpq4

# Generate prefix-list for AS-EXAMPLE (an AS-SET)
bgpq4 -4 -l CUSTOMER-PREFIXES AS-EXAMPLE

# VyOS has no native bgpq4 output format, so emit plain prefixes
# (%n = network, %l = mask length) and number the rules with awk
bgpq4 -4 -F &apos;%n/%l\n&apos; AS-EXAMPLE | awk &apos;{n=NR*10;
  printf &quot;set policy prefix-list CUSTOMER-PREFIXES rule %d action permit\nset policy prefix-list CUSTOMER-PREFIXES rule %d prefix %s\n&quot;, n, n, $1}&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set policy prefix-list CUSTOMER-PREFIXES rule 10 action permit
set policy prefix-list CUSTOMER-PREFIXES rule 10 prefix 203.0.113.0/24
set policy prefix-list CUSTOMER-PREFIXES rule 20 action permit
set policy prefix-list CUSTOMER-PREFIXES rule 20 prefix 198.51.100.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Automation Is Required&lt;/h3&gt;
&lt;p&gt;IRR data changes. New prefixes get registered, old ones removed. If you generate filters once and forget, they become stale. This must be automated:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /opt/scripts/update-irr-filters.sh

CUSTOMER_AS=&quot;AS-EXAMPLE&quot;
OUTPUT=&quot;/tmp/customer-prefixes.vyos&quot;

# Generate VyOS commands
bgpq4 -4 -F &apos;%n/%l\n&apos; $CUSTOMER_AS | awk &apos;{n=NR*10;
  printf &quot;set policy prefix-list CUSTOMER-IN rule %d action permit\nset policy prefix-list CUSTOMER-IN rule %d prefix %s\n&quot;, n, n, $1}&apos; &amp;gt; $OUTPUT

# Add deny-all at the end (prefix and le are separate config nodes)
echo &quot;set policy prefix-list CUSTOMER-IN rule 9999 action deny&quot; &amp;gt;&amp;gt; $OUTPUT
echo &quot;set policy prefix-list CUSTOMER-IN rule 9999 prefix 0.0.0.0/0&quot; &amp;gt;&amp;gt; $OUTPUT
echo &quot;set policy prefix-list CUSTOMER-IN rule 9999 le 32&quot; &amp;gt;&amp;gt; $OUTPUT

# Apply to VyOS (the Vyatta config wrapper runs configure-mode
# commands from scripts)
/opt/vyatta/sbin/vyatta-cfg-cmd-wrapper begin
while read -r line; do
    /opt/vyatta/sbin/vyatta-cfg-cmd-wrapper $line
done &amp;lt; $OUTPUT
/opt/vyatta/sbin/vyatta-cfg-cmd-wrapper commit
/opt/vyatta/sbin/vyatta-cfg-cmd-wrapper end
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run this on a schedule (cron job) — daily or when you receive notification of policy changes.&lt;/p&gt;
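&lt;p&gt;A minimal cron entry for that (path and schedule are just a sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /etc/cron.d/irr-filters: regenerate filters daily at 04:10
10 4 * * * root /opt/scripts/update-irr-filters.sh &amp;gt;&amp;gt; /var/log/irr-filters.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;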
&lt;h2&gt;Where to Store Policy&lt;/h2&gt;
&lt;p&gt;Filtering configuration gets complex. Where should it live?&lt;/p&gt;
&lt;h3&gt;Option 1: On the Router (Simple but Limited)&lt;/h3&gt;
&lt;p&gt;For small deployments, keep everything in VyOS config. Works, but doesn&apos;t scale and no version control.&lt;/p&gt;
&lt;h3&gt;Option 2: Git Repository (Recommended)&lt;/h3&gt;
&lt;p&gt;Store all policies in Git:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;network-policy/
├── prefix-lists/
│   ├── customers/
│   │   ├── customer-a.txt
│   │   └── customer-b.txt
│   └── bogons.txt
├── as-sets/
│   └── customer-as-sets.txt
├── route-maps/
│   └── templates/
└── scripts/
    ├── generate-filters.sh
    └── deploy.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Version history (who changed what, when)&lt;/li&gt;
&lt;li&gt;Review process (PRs before deployment)&lt;/li&gt;
&lt;li&gt;Rollback capability&lt;/li&gt;
&lt;li&gt;Documentation alongside config&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Option 3: Automation Platform (Ansible/Nornir)&lt;/h3&gt;
&lt;p&gt;For larger networks, use configuration management:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Ansible playbook
- name: Update BGP filters
  hosts: routers
  tasks:
    - name: Generate prefix-lists from IRR
      ansible.builtin.command: &quot;bgpq4 -4 -l {{ customer.name }}-IN {{ customer.as_set }}&quot;
      delegate_to: localhost
      register: prefix_list

    - name: Apply prefix-lists
      vyos.vyos.vyos_config:
        lines: &quot;{{ prefix_list.stdout_lines }}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Validation Workflow&lt;/h2&gt;
&lt;p&gt;Here&apos;s a practical workflow that actually gets maintained:&lt;/p&gt;
&lt;h3&gt;1. Onboarding New Peers/Customers&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Customer request → Verify AS/prefix ownership →
Add to IRR monitoring → Generate initial filters →
Apply with soft-reconfiguration → Monitor logs
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Ongoing Maintenance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Weekly:
  - Regenerate IRR-based filters (automated)
  - Review RPKI invalid count (should be zero or near-zero)
  - Check validator health

Monthly:
  - Review filter hit counts (unused rules?)
  - Verify customer IRR registrations current
  - Test failover to backup validators
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Incident Response&lt;/h3&gt;
&lt;p&gt;When you see RPKI invalid or unexpected routes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check specific prefix
show rpki prefix 203.0.113.0/24

# See what the validator says
show rpki cache-server

# Check where route is coming from
show ip bgp 203.0.113.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Combined Policy Example&lt;/h2&gt;
&lt;p&gt;Real-world inbound filter combining RPKI and prefix-list:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Bogon prefix-list (never accept these)
set policy prefix-list BOGONS rule 10 action &apos;permit&apos;
set policy prefix-list BOGONS rule 10 prefix &apos;0.0.0.0/8&apos;
set policy prefix-list BOGONS rule 10 le &apos;32&apos;

set policy prefix-list BOGONS rule 20 action &apos;permit&apos;
set policy prefix-list BOGONS rule 20 prefix &apos;10.0.0.0/8&apos;
set policy prefix-list BOGONS rule 20 le &apos;32&apos;

set policy prefix-list BOGONS rule 30 action &apos;permit&apos;
set policy prefix-list BOGONS rule 30 prefix &apos;127.0.0.0/8&apos;
set policy prefix-list BOGONS rule 30 le &apos;32&apos;

# ... add remaining bogons

# Combined route-map
set policy route-map UPSTREAM-IN rule 10 action &apos;deny&apos;
set policy route-map UPSTREAM-IN rule 10 match ip address prefix-list &apos;BOGONS&apos;
set policy route-map UPSTREAM-IN rule 10 description &apos;Reject bogons&apos;

set policy route-map UPSTREAM-IN rule 20 action &apos;deny&apos;
set policy route-map UPSTREAM-IN rule 20 match rpki &apos;invalid&apos;
set policy route-map UPSTREAM-IN rule 20 description &apos;Reject RPKI invalid&apos;

set policy route-map UPSTREAM-IN rule 30 action &apos;permit&apos;
set policy route-map UPSTREAM-IN rule 30 match rpki &apos;valid&apos;
set policy route-map UPSTREAM-IN rule 30 set local-preference &apos;110&apos;
set policy route-map UPSTREAM-IN rule 30 description &apos;Prefer RPKI valid&apos;

set policy route-map UPSTREAM-IN rule 100 action &apos;permit&apos;
set policy route-map UPSTREAM-IN rule 100 description &apos;Accept remaining&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
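&lt;p&gt;Don&apos;t forget to attach the route-map and re-evaluate routes that were already learned (the soft-clear command name varies slightly by VyOS release; FRR&apos;s vtysh equivalent is &lt;code&gt;clear ip bgp ... soft in&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast route-map import &apos;UPSTREAM-IN&apos;
commit

# From op mode: refresh inbound routes without bouncing the session
reset ip bgp 198.51.100.1 soft in
&lt;/code&gt;&lt;/pre&gt;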
&lt;h2&gt;Monitoring and Alerting&lt;/h2&gt;
&lt;p&gt;Filtering without monitoring is incomplete. Track:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Routes accepted despite being RPKI-invalid (should be none if the
# import policy rejects them)
show bgp ipv4 unicast rpki invalid

# Prefix counts per neighbor, plus validator connection status
show ip bgp summary
show rpki cache-server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set up alerts when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;RPKI validator connection drops&lt;/li&gt;
&lt;li&gt;Number of invalid routes increases suddenly&lt;/li&gt;
&lt;li&gt;Prefix count changes dramatically&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Filtering is a process, not a config.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You can&apos;t configure BGP filters once and forget them. Policies change, customers add prefixes, hijacks happen. Effective filtering requires:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Automated generation (IRR → filters)&lt;/li&gt;
&lt;li&gt;Cryptographic validation (RPKI)&lt;/li&gt;
&lt;li&gt;Version control (Git)&lt;/li&gt;
&lt;li&gt;Regular review (is this still correct?)&lt;/li&gt;
&lt;li&gt;Monitoring (are filters working?)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The networks that avoid hijack incidents aren&apos;t lucky — they have processes that keep their filters current. The ones that become headlines treated filtering as a one-time task.&lt;/p&gt;
&lt;p&gt;Start with RPKI (it&apos;s the lowest effort for significant protection), add IRR-based filters for customers and peers, automate the maintenance, and review regularly. That&apos;s the practical path to route validation that actually works.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>routing</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>BGP on VyOS: Filters Are Not Optional</title><link>https://ashimov.com/posts/vyos-bgp/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-bgp/</guid><description>BGP fundamentals on VyOS using FRR. Covers eBGP/iBGP setup, prefix-lists, route-maps, communities, max-prefix protection, and why BGP without filtering is an incident waiting to happen.</description><pubDate>Tue, 04 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;BGP is the protocol that runs the internet. It&apos;s also the protocol that can take down the internet — one misconfiguration, one missing filter, and you&apos;re either leaking routes or accepting someone else&apos;s garbage. Every major BGP incident (and there are many) comes down to the same thing: missing or inadequate filtering.&lt;/p&gt;
&lt;p&gt;VyOS uses FRR (Free Range Routing) as its BGP implementation. It&apos;s battle-tested and feature-complete. But FRR, like any BGP implementation, will happily accept and advertise whatever you tell it to. It&apos;s your job to tell it the right things.&lt;/p&gt;
&lt;h2&gt;BGP Fundamentals&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;eBGP (External BGP)&lt;/strong&gt;: Between different Autonomous Systems (AS). Your connection to ISPs, IXPs, or other organizations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;iBGP (Internal BGP)&lt;/strong&gt;: Within the same AS. Distributes external routes internally.&lt;/p&gt;
&lt;p&gt;Key difference: an eBGP speaker prepends its own AS to the AS path when advertising a route. iBGP doesn&apos;t — routes learned via iBGP keep their original AS path.&lt;/p&gt;
&lt;h2&gt;Basic eBGP Configuration&lt;/h2&gt;
&lt;p&gt;Typical scenario: You have AS 65000, connecting to an upstream provider (AS 12345).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define your AS and router ID
set protocols bgp system-as &apos;65000&apos;
set protocols bgp parameters router-id &apos;203.0.113.1&apos;

# Define neighbor (your upstream)
set protocols bgp neighbor 198.51.100.1 remote-as &apos;12345&apos;
set protocols bgp neighbor 198.51.100.1 description &apos;Upstream Provider&apos;
set protocols bgp neighbor 198.51.100.1 update-source &apos;203.0.113.1&apos;

# Announce your prefix
set protocols bgp address-family ipv4-unicast network 203.0.113.0/24

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Never run this in production.&lt;/strong&gt; This configuration has no filters — you&apos;ll accept whatever your upstream sends (including their full table if they misconfigure) and potentially leak routes.&lt;/p&gt;
&lt;h2&gt;The Golden Rule: Always Filter&lt;/h2&gt;
&lt;p&gt;BGP without filters is an incident waiting to happen. Every BGP session needs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Inbound filter&lt;/strong&gt;: What routes you accept&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Outbound filter&lt;/strong&gt;: What routes you advertise&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;max-prefix limit&lt;/strong&gt;: Protection against route leaks from peers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No exceptions.&lt;/p&gt;
&lt;h2&gt;Prefix Lists&lt;/h2&gt;
&lt;p&gt;Prefix lists define which IP prefixes to match:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# What you own and want to advertise
set policy prefix-list MY-PREFIXES rule 10 action &apos;permit&apos;
set policy prefix-list MY-PREFIXES rule 10 prefix &apos;203.0.113.0/24&apos;

# What you accept from upstream (example: allow all but filter specifics)
set policy prefix-list UPSTREAM-IN rule 10 action &apos;deny&apos;
set policy prefix-list UPSTREAM-IN rule 10 prefix &apos;0.0.0.0/0&apos;
set policy prefix-list UPSTREAM-IN rule 10 description &apos;Deny default unless explicitly wanted&apos;

set policy prefix-list UPSTREAM-IN rule 20 action &apos;deny&apos;
set policy prefix-list UPSTREAM-IN rule 20 prefix &apos;10.0.0.0/8&apos;
set policy prefix-list UPSTREAM-IN rule 20 le &apos;32&apos;
set policy prefix-list UPSTREAM-IN rule 20 description &apos;Deny RFC1918&apos;

set policy prefix-list UPSTREAM-IN rule 30 action &apos;deny&apos;
set policy prefix-list UPSTREAM-IN rule 30 prefix &apos;172.16.0.0/12&apos;
set policy prefix-list UPSTREAM-IN rule 30 le &apos;32&apos;

set policy prefix-list UPSTREAM-IN rule 40 action &apos;deny&apos;
set policy prefix-list UPSTREAM-IN rule 40 prefix &apos;192.168.0.0/16&apos;
set policy prefix-list UPSTREAM-IN rule 40 le &apos;32&apos;

set policy prefix-list UPSTREAM-IN rule 1000 action &apos;permit&apos;
set policy prefix-list UPSTREAM-IN rule 1000 prefix &apos;0.0.0.0/0&apos;
set policy prefix-list UPSTREAM-IN rule 1000 le &apos;24&apos;
set policy prefix-list UPSTREAM-IN rule 1000 description &apos;Accept /24 and larger&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prefix List Syntax&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;prefix&lt;/strong&gt;: The IP prefix to match&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;le&lt;/strong&gt; (less than or equal): Match prefixes with length up to this value&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ge&lt;/strong&gt; (greater than or equal): Match prefixes with length at least this value&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example: &lt;code&gt;prefix 10.0.0.0/8 le 24 ge 16&lt;/code&gt; matches any prefix within 10.0.0.0/8 that is /16 to /24.&lt;/p&gt;
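&lt;p&gt;Two more worked cases to cement the semantics:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# No le/ge: matches exactly 10.0.0.0/8 and nothing else
prefix 10.0.0.0/8

# ge 9: any more-specific of 10.0.0.0/8 (that is, /9 through /32),
# excluding the /8 itself
prefix 10.0.0.0/8 ge 9
&lt;/code&gt;&lt;/pre&gt;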
&lt;h2&gt;Route Maps&lt;/h2&gt;
&lt;p&gt;Route maps combine conditions with actions. They&apos;re the Swiss Army knife of BGP policy:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Outbound: Only advertise my prefixes (note: local-preference is
# local to your AS and is not sent to eBGP peers)
set policy route-map TO-UPSTREAM rule 10 action &apos;permit&apos;
set policy route-map TO-UPSTREAM rule 10 match ip address prefix-list &apos;MY-PREFIXES&apos;

set policy route-map TO-UPSTREAM rule 1000 action &apos;deny&apos;
set policy route-map TO-UPSTREAM rule 1000 description &apos;Deny everything else&apos;

# Inbound: Filter bad routes, set local-pref based on source
set policy route-map FROM-UPSTREAM rule 10 action &apos;deny&apos;
set policy route-map FROM-UPSTREAM rule 10 match ip address prefix-list &apos;BOGONS&apos;
set policy route-map FROM-UPSTREAM rule 10 description &apos;Drop bogons&apos;

set policy route-map FROM-UPSTREAM rule 100 action &apos;permit&apos;
set policy route-map FROM-UPSTREAM rule 100 match ip address prefix-list &apos;UPSTREAM-IN&apos;
set policy route-map FROM-UPSTREAM rule 100 set local-preference &apos;200&apos;

set policy route-map FROM-UPSTREAM rule 1000 action &apos;deny&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Apply Route Maps to Neighbors&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast route-map import &apos;FROM-UPSTREAM&apos;
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast route-map export &apos;TO-UPSTREAM&apos;
&lt;/code&gt;&lt;/pre&gt;
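&lt;p&gt;To audit filters after applying them, enable soft-reconfiguration so VyOS keeps an unfiltered copy of what the peer sent, then compare it with what was accepted:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast soft-reconfiguration inbound

# What the peer sent (pre-policy) vs. what survived the filters
show ip bgp neighbors 198.51.100.1 received-routes
show ip bgp neighbors 198.51.100.1 routes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Soft-reconfiguration keeps a second copy of received routes in memory, which is significant for full-table peers.&lt;/p&gt;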
&lt;h2&gt;Max-Prefix Protection&lt;/h2&gt;
&lt;p&gt;This is your safety valve. If a peer sends more prefixes than expected, the session tears down:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Expect ~10 prefixes, warn at 8, shut down at 10
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast maximum-prefix &apos;10&apos;
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast maximum-prefix threshold &apos;80&apos;

# For full table peers (the full IPv4 table is close to one million prefixes)
set protocols bgp neighbor 198.51.100.2 address-family ipv4-unicast maximum-prefix &apos;1000000&apos;
set protocols bgp neighbor 198.51.100.2 address-family ipv4-unicast maximum-prefix threshold &apos;90&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When max-prefix triggers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Session goes down&lt;/li&gt;
&lt;li&gt;Log entry created&lt;/li&gt;
&lt;li&gt;Manual intervention required (or restart-timer)&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# Auto-restart after 30 minutes (use carefully)
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast maximum-prefix restart &apos;30&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;BGP Communities&lt;/h2&gt;
&lt;p&gt;Communities are tags attached to routes. Use them to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Signal intent to upstreams (traffic engineering)&lt;/li&gt;
&lt;li&gt;Control route propagation&lt;/li&gt;
&lt;li&gt;Implement policy at scale&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Standard Communities&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define community list
set policy community-list BLACKHOLE rule 10 action &apos;permit&apos;
set policy community-list BLACKHOLE rule 10 regex &apos;65000:666&apos;

# Set community in route-map
set policy route-map SET-COMMUNITY rule 10 action &apos;permit&apos;
set policy route-map SET-COMMUNITY rule 10 set community &apos;65000:100&apos;

# Match community in route-map
set policy route-map MATCH-COMMUNITY rule 10 action &apos;permit&apos;
set policy route-map MATCH-COMMUNITY rule 10 match community &apos;BLACKHOLE&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Well-Known Communities&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Community&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;no-export&lt;/td&gt;
&lt;td&gt;Don&apos;t advertise outside the AS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no-advertise&lt;/td&gt;
&lt;td&gt;Don&apos;t advertise to any peer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;local-as&lt;/td&gt;
&lt;td&gt;Don&apos;t advertise outside the local confederation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no-peer&lt;/td&gt;
&lt;td&gt;Don&apos;t advertise to eBGP peers (RFC 3765)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre&gt;&lt;code&gt;# Example: Mark routes as no-export
set policy route-map INTERNAL-ONLY rule 10 action &apos;permit&apos;
set policy route-map INTERNAL-ONLY rule 10 set community &apos;no-export&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Blackhole Communities&lt;/h3&gt;
&lt;p&gt;Most transit providers support blackhole communities. When you&apos;re under DDoS attack:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Advertise the attacked IP with blackhole community
# (community varies by provider — check their documentation)
set policy route-map BLACKHOLE-ANNOUNCE rule 10 action &apos;permit&apos;
set policy route-map BLACKHOLE-ANNOUNCE rule 10 match ip address prefix-list &apos;ATTACKED-IP&apos;
set policy route-map BLACKHOLE-ANNOUNCE rule 10 set community &apos;12345:666&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The upstream drops traffic to that prefix at their edge. You stop receiving the DDoS.&lt;/p&gt;
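&lt;p&gt;The prefix-list it matches is just the attacked host as a /32 (the address below is an example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set policy prefix-list ATTACKED-IP rule 10 action &apos;permit&apos;
set policy prefix-list ATTACKED-IP rule 10 prefix &apos;203.0.113.45/32&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Providers typically accept a blackhole /32 only when it falls inside prefixes you&apos;re authorized to announce.&lt;/p&gt;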
&lt;h2&gt;iBGP Configuration&lt;/h2&gt;
&lt;p&gt;iBGP distributes routes within your AS. Different rules apply:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No AS path modification&lt;/li&gt;
&lt;li&gt;Full mesh required (or use route reflectors)&lt;/li&gt;
&lt;li&gt;Next-hop often needs adjustment&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Basic iBGP&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

set protocols bgp neighbor 10.0.0.2 remote-as &apos;65000&apos;
set protocols bgp neighbor 10.0.0.2 description &apos;iBGP peer - Router 2&apos;
set protocols bgp neighbor 10.0.0.2 update-source &apos;lo&apos;

# Next-hop-self for eBGP routes advertised via iBGP
set protocols bgp neighbor 10.0.0.2 address-family ipv4-unicast nexthop-self

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Why Next-Hop-Self?&lt;/h3&gt;
&lt;p&gt;When you learn a route via eBGP, the next-hop is the eBGP peer&apos;s IP. If you advertise this to iBGP neighbors, they need a route to that external IP — which they might not have.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;next-hop-self&lt;/code&gt; rewrites the next-hop to your own address, which iBGP peers definitely can reach.&lt;/p&gt;
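&lt;p&gt;The rewrite is easy to picture as a function. A minimal Python sketch of the simplified rule (names and addresses are illustrative, not VyOS internals; real BGP has further next-hop exceptions):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def outgoing_next_hop(route_next_hop, session_type, my_address, next_hop_self=False):
    &quot;&quot;&quot;Next-hop to place in an UPDATE sent on this session (simplified).&quot;&quot;&quot;
    if session_type == &apos;ebgp&apos;:
        return my_address      # eBGP rewrites to the local address by default
    if next_hop_self:
        return my_address      # iBGP with next-hop-self configured
    return route_next_hop      # plain iBGP: the external next-hop leaks through

# Route learned from an eBGP peer at 198.51.100.1, re-advertised over iBGP:
print(outgoing_next_hop(&apos;198.51.100.1&apos;, &apos;ibgp&apos;, &apos;10.0.0.1&apos;))
# 198.51.100.1: iBGP peers need a route to this external IP
print(outgoing_next_hop(&apos;198.51.100.1&apos;, &apos;ibgp&apos;, &apos;10.0.0.1&apos;, next_hop_self=True))
# 10.0.0.1
&lt;/code&gt;&lt;/pre&gt;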
&lt;h3&gt;Route Reflectors&lt;/h3&gt;
&lt;p&gt;Full mesh iBGP doesn&apos;t scale. With N routers, you need N*(N-1)/2 sessions. Route reflectors solve this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On route reflector
set protocols bgp neighbor 10.0.0.2 address-family ipv4-unicast route-reflector-client
set protocols bgp neighbor 10.0.0.3 address-family ipv4-unicast route-reflector-client
set protocols bgp neighbor 10.0.0.4 address-family ipv4-unicast route-reflector-client

# Clients just peer with reflector, not each other
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A route reflector relaxes the normal iBGP split-horizon rule: routes learned from one client are re-advertised to the other clients, so clients never need to peer with each other.&lt;/p&gt;
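&lt;p&gt;The scaling difference is easy to quantify. A Python sketch of the session-count arithmetic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def full_mesh_sessions(n):
    # Every router peers with every other router: n*(n-1)/2 sessions
    return n * (n - 1) // 2

def rr_sessions(n_clients, n_reflectors=1):
    # Each client peers only with each reflector; reflectors mesh among themselves
    return n_clients * n_reflectors + full_mesh_sessions(n_reflectors)

print(full_mesh_sessions(10))   # 45 sessions
print(rr_sessions(9))           # 9 sessions: same 10 routers, one as reflector
print(full_mesh_sessions(50))   # 1225 sessions
print(rr_sessions(48, 2))       # 97 sessions with two reflectors for redundancy
&lt;/code&gt;&lt;/pre&gt;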
&lt;h2&gt;Local Preference&lt;/h2&gt;
&lt;p&gt;Local preference determines which path to use when you have multiple options. Higher is preferred.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Prefer primary upstream (local-pref 200) over backup (local-pref 100)
set policy route-map FROM-PRIMARY rule 10 set local-preference &apos;200&apos;
set policy route-map FROM-BACKUP rule 10 set local-preference &apos;100&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Local preference is only meaningful within your AS — it&apos;s not advertised to eBGP peers.&lt;/p&gt;
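&lt;p&gt;Local preference sits near the top of the best-path algorithm, ahead of AS-path length. A much-simplified sketch of those first two tie-breakers (the real algorithm has many more steps):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def best_path(paths):
    &quot;&quot;&quot;Pick the preferred path: highest local-pref, then shortest AS path.&quot;&quot;&quot;
    return min(paths, key=lambda p: (-p[&apos;local_pref&apos;], len(p[&apos;as_path&apos;])))

primary = {&apos;local_pref&apos;: 200, &apos;as_path&apos;: [12345, 3356]}
backup  = {&apos;local_pref&apos;: 100, &apos;as_path&apos;: [64500]}

# Local-pref is compared first, so the backup&apos;s shorter AS path never matters:
print(best_path([primary, backup]) is primary)  # True
&lt;/code&gt;&lt;/pre&gt;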
&lt;h2&gt;AS Path Prepending&lt;/h2&gt;
&lt;p&gt;Make your routes less attractive to specific upstreams (traffic engineering):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Prepend your AS twice when advertising to backup upstream
set policy route-map TO-BACKUP rule 10 action &apos;permit&apos;
set policy route-map TO-BACKUP rule 10 match ip address prefix-list &apos;MY-PREFIXES&apos;
set policy route-map TO-BACKUP rule 10 set as-path prepend &apos;65000 65000&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The longer AS path makes this route less preferred. Traffic should prefer the primary path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: Prepending more than 3x is usually pointless. Some networks filter very long AS paths.&lt;/p&gt;
&lt;h2&gt;BGP Timers&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# Keepalive every 30 seconds, hold time 90 seconds
set protocols bgp neighbor 198.51.100.1 timers keepalive &apos;30&apos;
set protocols bgp neighbor 198.51.100.1 timers holdtime &apos;90&apos;

# For faster failover (aggressive)
set protocols bgp neighbor 198.51.100.1 timers keepalive &apos;10&apos;
set protocols bgp neighbor 198.51.100.1 timers holdtime &apos;30&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For sub-second failover, use BFD instead of aggressive BGP timers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols bgp neighbor 198.51.100.1 bfd
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Bogon Filtering&lt;/h2&gt;
&lt;p&gt;Always filter bogons — prefixes that should never appear in the global routing table:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# IPv4 Bogons (not exhaustive — use a maintained list)
set policy prefix-list BOGONS-V4 rule 10 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 10 prefix &apos;0.0.0.0/8&apos;
set policy prefix-list BOGONS-V4 rule 10 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 20 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 20 prefix &apos;10.0.0.0/8&apos;
set policy prefix-list BOGONS-V4 rule 20 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 30 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 30 prefix &apos;100.64.0.0/10&apos;
set policy prefix-list BOGONS-V4 rule 30 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 40 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 40 prefix &apos;127.0.0.0/8&apos;
set policy prefix-list BOGONS-V4 rule 40 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 50 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 50 prefix &apos;169.254.0.0/16&apos;
set policy prefix-list BOGONS-V4 rule 50 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 60 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 60 prefix &apos;172.16.0.0/12&apos;
set policy prefix-list BOGONS-V4 rule 60 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 70 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 70 prefix &apos;192.0.0.0/24&apos;
set policy prefix-list BOGONS-V4 rule 70 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 80 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 80 prefix &apos;192.0.2.0/24&apos;
set policy prefix-list BOGONS-V4 rule 80 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 90 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 90 prefix &apos;192.168.0.0/16&apos;
set policy prefix-list BOGONS-V4 rule 90 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 100 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 100 prefix &apos;198.18.0.0/15&apos;
set policy prefix-list BOGONS-V4 rule 100 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 110 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 110 prefix &apos;198.51.100.0/24&apos;
set policy prefix-list BOGONS-V4 rule 110 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 120 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 120 prefix &apos;203.0.113.0/24&apos;
set policy prefix-list BOGONS-V4 rule 120 le &apos;32&apos;

set policy prefix-list BOGONS-V4 rule 130 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 130 prefix &apos;224.0.0.0/3&apos;
set policy prefix-list BOGONS-V4 rule 130 le &apos;32&apos;

# Apply in route-map
set policy route-map FROM-PEER rule 10 action &apos;deny&apos;
set policy route-map FROM-PEER rule 10 match ip address prefix-list &apos;BOGONS-V4&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For production, use Team Cymru&apos;s bogon reference or automate prefix list updates from IRR.&lt;/p&gt;
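&lt;p&gt;If you track the bogon list out of band, generating the prefix-list mechanically avoids rule-numbering mistakes. A hedged sketch (the output format mirrors the commands above; the list is still not exhaustive):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;BOGONS_V4 = [
    &apos;0.0.0.0/8&apos;, &apos;10.0.0.0/8&apos;, &apos;100.64.0.0/10&apos;, &apos;127.0.0.0/8&apos;,
    &apos;169.254.0.0/16&apos;, &apos;172.16.0.0/12&apos;, &apos;192.0.0.0/24&apos;, &apos;192.0.2.0/24&apos;,
    &apos;192.168.0.0/16&apos;, &apos;198.18.0.0/15&apos;, &apos;198.51.100.0/24&apos;,
    &apos;203.0.113.0/24&apos;, &apos;224.0.0.0/3&apos;,
]

def vyos_prefix_list(name, prefixes, step=10):
    &quot;&quot;&quot;Emit &apos;set policy prefix-list&apos; commands, one rule per prefix.&quot;&quot;&quot;
    lines = []
    for i, prefix in enumerate(prefixes):
        rule = step * (i + 1)
        lines.append(f&quot;set policy prefix-list {name} rule {rule} action &apos;permit&apos;&quot;)
        lines.append(f&quot;set policy prefix-list {name} rule {rule} prefix &apos;{prefix}&apos;&quot;)
        lines.append(f&quot;set policy prefix-list {name} rule {rule} le &apos;32&apos;&quot;)
    return &apos;\n&apos;.join(lines)

print(vyos_prefix_list(&apos;BOGONS-V4&apos;, BOGONS_V4))
&lt;/code&gt;&lt;/pre&gt;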
&lt;h2&gt;Debugging BGP&lt;/h2&gt;
&lt;h3&gt;Check BGP Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Summary of all BGP neighbors
show ip bgp summary

# Detailed neighbor info
show ip bgp neighbors 198.51.100.1

# Advertised routes
show ip bgp neighbors 198.51.100.1 advertised-routes

# Received routes
show ip bgp neighbors 198.51.100.1 received-routes

# Routes in BGP table
show ip bgp
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Common Issues&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;State: Idle&lt;/td&gt;
&lt;td&gt;No route to peer, firewall&lt;/td&gt;
&lt;td&gt;Check connectivity, allow TCP 179&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State: Active&lt;/td&gt;
&lt;td&gt;TCP connect failing&lt;/td&gt;
&lt;td&gt;Firewall, wrong IP, peer not configured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State: OpenSent&lt;/td&gt;
&lt;td&gt;Waiting for peer&apos;s OPEN&lt;/td&gt;
&lt;td&gt;Peer might be filtering, AS mismatch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No routes received&lt;/td&gt;
&lt;td&gt;Inbound filter too strict&lt;/td&gt;
&lt;td&gt;Check route-map, prefix-list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routes not advertised&lt;/td&gt;
&lt;td&gt;Outbound filter, route not in BGP&lt;/td&gt;
&lt;td&gt;Check network statement, route-map&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;BGP Messages&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable BGP debugging (careful in production)
debug bgp neighbor-events
debug bgp updates

# View logs
show log | match bgp
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Complete eBGP Configuration&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# === BGP Core ===
set protocols bgp system-as &apos;65000&apos;
set protocols bgp parameters router-id &apos;203.0.113.1&apos;
set protocols bgp parameters log-neighbor-changes

# === Prefix Lists ===
set policy prefix-list MY-PREFIXES rule 10 action &apos;permit&apos;
set policy prefix-list MY-PREFIXES rule 10 prefix &apos;203.0.113.0/24&apos;

set policy prefix-list BOGONS-V4 rule 10 action &apos;permit&apos;
set policy prefix-list BOGONS-V4 rule 10 prefix &apos;10.0.0.0/8&apos;
set policy prefix-list BOGONS-V4 rule 10 le &apos;32&apos;
# ... (other bogon entries)

set policy prefix-list INBOUND-FILTER rule 10 action &apos;deny&apos;
set policy prefix-list INBOUND-FILTER rule 10 prefix &apos;0.0.0.0/0&apos;
set policy prefix-list INBOUND-FILTER rule 10 ge &apos;25&apos;
set policy prefix-list INBOUND-FILTER rule 10 description &apos;Deny too-specific prefixes&apos;

set policy prefix-list INBOUND-FILTER rule 1000 action &apos;permit&apos;
set policy prefix-list INBOUND-FILTER rule 1000 prefix &apos;0.0.0.0/0&apos;
set policy prefix-list INBOUND-FILTER rule 1000 le &apos;24&apos;

# === Route Maps ===
set policy route-map TO-UPSTREAM rule 10 action &apos;permit&apos;
set policy route-map TO-UPSTREAM rule 10 match ip address prefix-list &apos;MY-PREFIXES&apos;
set policy route-map TO-UPSTREAM rule 1000 action &apos;deny&apos;

set policy route-map FROM-UPSTREAM rule 10 action &apos;deny&apos;
set policy route-map FROM-UPSTREAM rule 10 match ip address prefix-list &apos;BOGONS-V4&apos;
set policy route-map FROM-UPSTREAM rule 100 action &apos;permit&apos;
set policy route-map FROM-UPSTREAM rule 100 match ip address prefix-list &apos;INBOUND-FILTER&apos;
set policy route-map FROM-UPSTREAM rule 100 set local-preference &apos;200&apos;
set policy route-map FROM-UPSTREAM rule 1000 action &apos;deny&apos;

# === Neighbor Configuration ===
set protocols bgp neighbor 198.51.100.1 remote-as &apos;12345&apos;
set protocols bgp neighbor 198.51.100.1 description &apos;Primary Upstream&apos;
set protocols bgp neighbor 198.51.100.1 update-source &apos;203.0.113.1&apos;
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast route-map import &apos;FROM-UPSTREAM&apos;
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast route-map export &apos;TO-UPSTREAM&apos;
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast maximum-prefix &apos;1000000&apos;
set protocols bgp neighbor 198.51.100.1 address-family ipv4-unicast soft-reconfiguration inbound

# === Networks to Advertise ===
set protocols bgp address-family ipv4-unicast network 203.0.113.0/24
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;BGP without filters is an incident waiting to happen. Every route leak, every hijack, every accidental full-table advertisement — they all trace back to missing or inadequate filtering.&lt;/p&gt;
&lt;p&gt;The essentials:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Prefix lists&lt;/strong&gt;: Define exactly what you advertise and accept&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Route maps&lt;/strong&gt;: Apply policy consistently&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Max-prefix&lt;/strong&gt;: Protect against route leaks from peers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bogon filtering&lt;/strong&gt;: Never accept or announce garbage prefixes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Community discipline&lt;/strong&gt;: Use communities for consistent policy&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The internet runs on trust and filtering. BGP peers trust that you advertise only what you should. Filters ensure that even when mistakes happen, damage is limited.&lt;/p&gt;
&lt;p&gt;Before any BGP session goes live, ask: &quot;What&apos;s the worst that happens if I misconfigure this?&quot; Then add filters to prevent that worst case.&lt;/p&gt;
</content:encoded><category>vyos</category><category>bgp</category><category>networking</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>OSPF on VyOS: When Details Break Everything</title><link>https://ashimov.com/posts/vyos-ospf/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-ospf/</guid><description>Practical OSPF configuration on VyOS. Covers areas, passive interfaces, authentication, MTU issues, and the small details that cause OSPF adjacencies to fail silently.</description><pubDate>Fri, 31 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OSPF is deceptively simple to configure. Two routers, same area, same subnet — they should just work. And then they don&apos;t. The adjacency sticks at EXSTART, or neighbors appear and disappear, or routes mysteriously vanish.&lt;/p&gt;
&lt;p&gt;The problem is always in the details. OSPF has strict requirements that must match between neighbors: MTU, hello/dead timers, area type, authentication. Miss one, and the adjacency fails — often silently.&lt;/p&gt;
&lt;h2&gt;OSPF Fundamentals&lt;/h2&gt;
&lt;p&gt;OSPF (Open Shortest Path First) is a link-state protocol. Each router maintains a complete map of the network topology and calculates shortest paths independently.&lt;/p&gt;
&lt;p&gt;Key concepts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Area&lt;/strong&gt;: Logical grouping of routers. Area 0 is the backbone — all other areas must connect to it&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Router ID&lt;/strong&gt;: Unique identifier, usually an IP address&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Adjacency&lt;/strong&gt;: Full neighbor relationship where routers exchange LSAs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LSA&lt;/strong&gt;: Link State Advertisement — the building blocks of the topology database&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Basic OSPF Configuration&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;configure

# Set router ID (use a loopback IP if you have one)
set protocols ospf parameters router-id &apos;10.255.0.1&apos;

# Enable OSPF on interfaces
set protocols ospf area 0 network &apos;10.0.0.0/24&apos;
set protocols ospf area 0 network &apos;10.0.1.0/24&apos;
set protocols ospf area 0 network &apos;10.255.0.1/32&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This enables OSPF on all interfaces matching those networks in area 0.&lt;/p&gt;
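&lt;p&gt;The matching is simple containment: an interface participates in OSPF if its address falls inside any network statement. The stdlib &lt;code&gt;ipaddress&lt;/code&gt; module shows the logic (the interface addresses here are made up):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import ipaddress

network_statements = [&apos;10.0.0.0/24&apos;, &apos;10.0.1.0/24&apos;, &apos;10.255.0.1/32&apos;]
interfaces = {&apos;eth0&apos;: &apos;10.0.0.1&apos;, &apos;eth1&apos;: &apos;10.0.1.1&apos;,
              &apos;eth2&apos;: &apos;192.168.5.1&apos;, &apos;lo&apos;: &apos;10.255.0.1&apos;}

def ospf_enabled(addr, statements):
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(s) for s in statements)

for name, addr in interfaces.items():
    print(name, ospf_enabled(addr, network_statements))
# eth2 stays out of OSPF: no network statement covers 192.168.5.1
&lt;/code&gt;&lt;/pre&gt;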
&lt;h3&gt;Interface-Based Configuration&lt;/h3&gt;
&lt;p&gt;More explicit approach — configure OSPF per interface:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set protocols ospf parameters router-id &apos;10.255.0.1&apos;

# Enable on specific interfaces
set protocols ospf interface eth0 area &apos;0&apos;
set protocols ospf interface eth1 area &apos;0&apos;
set protocols ospf interface lo area &apos;0&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Interface-based is clearer and preferred for complex setups.&lt;/p&gt;
&lt;h2&gt;Passive Interfaces: The Silent Killer&lt;/h2&gt;
&lt;p&gt;Passive interfaces don&apos;t send or receive OSPF hello packets. Use them on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LAN segments with no OSPF neighbors&lt;/li&gt;
&lt;li&gt;Internet-facing interfaces&lt;/li&gt;
&lt;li&gt;Management networks&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# Option 1: mark individual interfaces passive
set protocols ospf passive-interface &apos;eth2&apos;

# Option 2: make every interface passive by default...
set protocols ospf passive-interface &apos;default&apos;

# ...then explicitly un-passive the interfaces with OSPF neighbors
set protocols ospf passive-interface-exclude &apos;eth0&apos;
set protocols ospf passive-interface-exclude &apos;eth1&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;The trap&lt;/strong&gt;: Forgetting to exclude an interface means no neighbors form. OSPF just sits there, advertising the network but never receiving hellos. No errors, no warnings — just silence.&lt;/p&gt;
&lt;h3&gt;Debugging Passive Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show ip ospf neighbor

# Empty? Check if interface is passive
show ip ospf interface eth0
# Look for &quot;Passive interface&quot; in output
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;MTU Mismatch: The Classic OSPF Failure&lt;/h2&gt;
&lt;p&gt;OSPF includes MTU in Database Description packets. If MTU doesn&apos;t match between neighbors, adjacency sticks at EXSTART/EXCHANGE state.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check current MTU
show interfaces ethernet eth0

# Symptoms of MTU mismatch
show ip ospf neighbor
# Neighbor stuck in EXSTART or EXCHANGE state
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Fixing MTU Issues&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Option 1&lt;/strong&gt;: Match MTU on both sides (preferred)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set interfaces ethernet eth0 mtu &apos;1500&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Option 2&lt;/strong&gt;: Ignore MTU check (workaround)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols ospf interface eth0 mtu-ignore
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use &lt;code&gt;mtu-ignore&lt;/code&gt; only when you can&apos;t control the other side&apos;s MTU. It hides the problem rather than fixing it.&lt;/p&gt;
&lt;h3&gt;Common MTU Scenarios&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Typical MTU&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard Ethernet&lt;/td&gt;
&lt;td&gt;1500&lt;/td&gt;
&lt;td&gt;Default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jumbo frames&lt;/td&gt;
&lt;td&gt;9000&lt;/td&gt;
&lt;td&gt;Must match on all devices in path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GRE tunnel&lt;/td&gt;
&lt;td&gt;1476&lt;/td&gt;
&lt;td&gt;24 bytes overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPsec tunnel&lt;/td&gt;
&lt;td&gt;1400-1438&lt;/td&gt;
&lt;td&gt;Varies by encryption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VXLAN&lt;/td&gt;
&lt;td&gt;1450&lt;/td&gt;
&lt;td&gt;50 bytes overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Tunnel interfaces are the usual suspects. Always check MTU when OSPF over tunnels fails.&lt;/p&gt;
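&lt;p&gt;The table values follow from simple header arithmetic. A sketch (overheads are the typical values; IPsec varies with cipher and mode, so it is omitted):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Encapsulation overhead in bytes
OVERHEAD = {
    &apos;gre&apos;: 24,     # 20-byte outer IPv4 header + 4-byte GRE header
    &apos;vxlan&apos;: 50,   # 14 Ethernet + 20 IPv4 + 8 UDP + 8 VXLAN
}

def tunnel_mtu(underlay_mtu, encap):
    &quot;&quot;&quot;Largest packet the tunnel carries without fragmenting the underlay.&quot;&quot;&quot;
    return underlay_mtu - OVERHEAD[encap]

print(tunnel_mtu(1500, &apos;gre&apos;))    # 1476
print(tunnel_mtu(1500, &apos;vxlan&apos;))  # 1450
&lt;/code&gt;&lt;/pre&gt;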
&lt;h2&gt;Hello and Dead Timers&lt;/h2&gt;
&lt;p&gt;OSPF sends hello packets at regular intervals. Miss too many, and the neighbor is declared dead.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hello interval&lt;/strong&gt;: How often to send hellos (default: 10 seconds)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dead interval&lt;/strong&gt;: How long to wait before declaring neighbor dead (default: 40 seconds)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;These must match between neighbors.&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check current timers
show ip ospf interface eth0

# Modify timers (both sides must match)
set protocols ospf interface eth0 hello-interval &apos;10&apos;
set protocols ospf interface eth0 dead-interval &apos;40&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Fast Failure Detection&lt;/h3&gt;
&lt;p&gt;For faster convergence, reduce timers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Aggressive timers (1 second hello, 4 second dead)
set protocols ospf interface eth0 hello-interval &apos;1&apos;
set protocols ospf interface eth0 dead-interval &apos;4&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Trade-off: faster detection, but more CPU and more sensitivity to packet loss. With a 4-second dead interval, a brief burst of loss that eats four consecutive hellos tears the adjacency down.&lt;/p&gt;
&lt;h3&gt;BFD for Sub-Second Failover&lt;/h3&gt;
&lt;p&gt;For true fast failover, use BFD (Bidirectional Forwarding Detection) instead of aggressive OSPF timers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable BFD on interface
set protocols ospf interface eth0 bfd

# Configure BFD parameters
set protocols bfd peer 10.0.0.2 source address &apos;10.0.0.1&apos;
set protocols bfd peer 10.0.0.2 interval transmit &apos;300&apos;
set protocols bfd peer 10.0.0.2 interval receive &apos;300&apos;
set protocols bfd peer 10.0.0.2 interval multiplier &apos;3&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these intervals, BFD declares the peer down after 900 ms (300 ms receive interval times a multiplier of 3), without the CPU overhead of fast OSPF hellos.&lt;/p&gt;
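&lt;p&gt;Detection time is just the negotiated receive interval times the multiplier. In Python:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def bfd_detection_ms(receive_interval_ms, multiplier):
    # The session is declared down after multiplier consecutive
    # missed packets at the negotiated receive interval
    return receive_interval_ms * multiplier

print(bfd_detection_ms(300, 3))  # 900 ms with the configuration above
&lt;/code&gt;&lt;/pre&gt;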
&lt;h2&gt;OSPF Areas&lt;/h2&gt;
&lt;p&gt;Large OSPF networks need multiple areas to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reduce SPF calculations (changes in one area don&apos;t affect others)&lt;/li&gt;
&lt;li&gt;Limit LSA flooding&lt;/li&gt;
&lt;li&gt;Summarize routes at area boundaries&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Multi-Area Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Backbone area (always area 0)
set protocols ospf interface eth0 area &apos;0&apos;

# Other areas connect through ABR (Area Border Router)
set protocols ospf interface eth1 area &apos;1&apos;
set protocols ospf interface eth2 area &apos;2&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The router with interfaces in multiple areas is an ABR (Area Border Router).&lt;/p&gt;
&lt;h3&gt;Stub Areas&lt;/h3&gt;
&lt;p&gt;Stub areas don&apos;t receive external routes (Type 5 LSAs). Useful for areas that only need a default route to the rest of the network:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Configure area as stub
set protocols ospf area 1 area-type stub

# On ABR, optionally set default route cost
set protocols ospf area 1 area-type stub default-cost &apos;10&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All routers in the area must agree on stub configuration.&lt;/p&gt;
&lt;h3&gt;Totally Stubby Areas&lt;/h3&gt;
&lt;p&gt;Block both external routes AND inter-area routes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On ABR only
set protocols ospf area 1 area-type stub no-summary
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Routers in the area only see a default route. Simplest routing table, least flexibility.&lt;/p&gt;
&lt;h3&gt;NSSA (Not-So-Stubby Area)&lt;/h3&gt;
&lt;p&gt;Like stub, but allows local external routes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols ospf area 1 area-type nssa
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Useful when the area has an ASBR (redistributing from another protocol) but you don&apos;t want external routes from other areas.&lt;/p&gt;
&lt;h2&gt;OSPF Authentication&lt;/h2&gt;
&lt;h3&gt;MD5 Authentication (Recommended)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Set authentication for interface
set protocols ospf interface eth0 authentication md5 key-id 1 md5-key &apos;YourSecretKey123&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both neighbors must have identical key-id and key.&lt;/p&gt;
&lt;h3&gt;Rotating Keys&lt;/h3&gt;
&lt;p&gt;OSPF supports multiple keys for hitless rotation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Add new key
set protocols ospf interface eth0 authentication md5 key-id 2 md5-key &apos;NewSecretKey456&apos;

# Both keys active — neighbors using either key will authenticate
# After all neighbors updated, remove old key
delete protocols ospf interface eth0 authentication md5 key-id 1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Plain Text Authentication (Don&apos;t Use)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Exists but insecure — anyone can sniff the password
set protocols ospf interface eth0 authentication plaintext-password &apos;visible-password&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use MD5 or no authentication. Plain text is false security.&lt;/p&gt;
&lt;h2&gt;Network Types&lt;/h2&gt;
&lt;p&gt;OSPF behavior changes based on network type:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;DR/BDR&lt;/th&gt;
&lt;th&gt;Multicast&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;broadcast&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Ethernet, default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;point-to-point&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;P2P links, tunnels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;point-to-multipoint&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;NBMA with full connectivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;non-broadcast&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Frame Relay (legacy)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Point-to-Point Links&lt;/h3&gt;
&lt;p&gt;For direct router-to-router links, use point-to-point:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set protocols ospf interface eth0 network &apos;point-to-point&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No DR/BDR election delay&lt;/li&gt;
&lt;li&gt;Faster adjacency formation&lt;/li&gt;
&lt;li&gt;Works over unnumbered interfaces&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use for: GRE tunnels, VTI interfaces, WireGuard tunnels, direct fiber links.&lt;/p&gt;
&lt;h2&gt;Route Redistribution&lt;/h2&gt;
&lt;p&gt;Import routes from other sources into OSPF:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Redistribute connected routes
set protocols ospf redistribute connected

# Redistribute static routes
set protocols ospf redistribute static

# Redistribute with metric
set protocols ospf redistribute connected metric &apos;100&apos;
set protocols ospf redistribute connected metric-type &apos;2&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Metric types&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Type 1 (E1): External metric added to internal path cost&lt;/li&gt;
&lt;li&gt;Type 2 (E2): External metric only, internal cost ignored (default)&lt;/li&gt;
&lt;/ul&gt;
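&lt;p&gt;The difference matters when routers sit at different distances from the ASBR. A small sketch of the cost calculation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def external_route_cost(metric_type, external_metric, cost_to_asbr):
    &quot;&quot;&quot;Effective cost of an external OSPF route at a given router.&quot;&quot;&quot;
    if metric_type == 1:                        # E1
        return external_metric + cost_to_asbr   # internal path cost is added
    return external_metric                      # E2: external metric only

# Same redistributed route (metric 100), seen from a router 30 cost units
# away from the ASBR:
print(external_route_cost(1, 100, 30))  # 130: nearer ASBRs win
print(external_route_cost(2, 100, 30))  # 100: distance to ASBR ignored
&lt;/code&gt;&lt;/pre&gt;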
&lt;h3&gt;Filtering Redistributed Routes&lt;/h3&gt;
&lt;p&gt;Use route-maps to control what gets redistributed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Define prefix list
set policy prefix-list OSPF-EXPORT rule 10 action &apos;permit&apos;
set policy prefix-list OSPF-EXPORT rule 10 prefix &apos;10.10.0.0/16&apos;
set policy prefix-list OSPF-EXPORT rule 10 le &apos;24&apos;

# Define route-map
set policy route-map OSPF-REDISTRIBUTE rule 10 action &apos;permit&apos;
set policy route-map OSPF-REDISTRIBUTE rule 10 match ip address prefix-list &apos;OSPF-EXPORT&apos;
set policy route-map OSPF-REDISTRIBUTE rule 10 set metric &apos;50&apos;

# Apply to redistribution
set protocols ospf redistribute connected route-map &apos;OSPF-REDISTRIBUTE&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting OSPF&lt;/h2&gt;
&lt;h3&gt;Check Neighbor State&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show ip ospf neighbor

# Expected: FULL state for all neighbors
# Problem states:
# - INIT: Receiving hellos, but they don&apos;t see us
# - 2-WAY: Seen each other, waiting for DR election (normal on broadcast)
# - EXSTART/EXCHANGE: Database sync starting (often MTU mismatch)
# - LOADING: Receiving LSAs
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Interface Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show ip ospf interface eth0

# Verify:
# - Correct area
# - Hello/Dead intervals match
# - Not passive when shouldn&apos;t be
# - Network type appropriate
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check OSPF Database&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show all LSAs
show ip ospf database

# Show specific LSA type
show ip ospf database router
show ip ospf database network
show ip ospf database external
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Routes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# OSPF routes
show ip route ospf

# Why isn&apos;t a route showing?
# 1. LSA not received (neighbor issue)
# 2. Better route exists
# 3. Filtering applied
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Common Problems and Solutions&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No neighbors&lt;/td&gt;
&lt;td&gt;Passive interface, ACL blocking&lt;/td&gt;
&lt;td&gt;Check passive config, firewall rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stuck at INIT&lt;/td&gt;
&lt;td&gt;One-way communication&lt;/td&gt;
&lt;td&gt;Check firewall, routing back to us&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stuck at EXSTART&lt;/td&gt;
&lt;td&gt;MTU mismatch&lt;/td&gt;
&lt;td&gt;Match MTU or use mtu-ignore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neighbors flapping&lt;/td&gt;
&lt;td&gt;Timer mismatch, unstable link&lt;/td&gt;
&lt;td&gt;Match timers, check link quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routes missing&lt;/td&gt;
&lt;td&gt;Area mismatch, summarization&lt;/td&gt;
&lt;td&gt;Verify area config, check ABR&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Complete OSPF Configuration&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# === OSPF Core ===
set protocols ospf parameters router-id &apos;10.255.0.1&apos;
set protocols ospf log-adjacency-changes

# === Interfaces ===
set protocols ospf interface eth0 area &apos;0&apos;
set protocols ospf interface eth0 network &apos;point-to-point&apos;
set protocols ospf interface eth0 authentication md5 key-id 1 md5-key &apos;SecureKey123&apos;
set protocols ospf interface eth0 bfd

set protocols ospf interface eth1 area &apos;0&apos;
set protocols ospf interface eth1 network &apos;broadcast&apos;
set protocols ospf interface eth1 priority &apos;100&apos;

# === Passive Interfaces ===
set protocols ospf passive-interface &apos;eth2&apos;
set protocols ospf passive-interface &apos;lo&apos;

# === Area Configuration ===
set protocols ospf area 1 area-type stub

# === Redistribution ===
set protocols ospf redistribute connected metric &apos;100&apos;
set protocols ospf redistribute connected route-map &apos;OSPF-EXPORT&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;OSPF fails on details:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MTU&lt;/strong&gt;: Must match. When adjacency sticks at EXSTART, check MTU first.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Timers&lt;/strong&gt;: Hello and dead intervals must be identical. Mismatched timers = no adjacency.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Passive interfaces&lt;/strong&gt;: A passive interface that should be active produces no errors — just silence.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Authentication&lt;/strong&gt;: Both sides need identical keys and key-ids.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Network type&lt;/strong&gt;: Point-to-point for tunnels and direct links. Broadcast for Ethernet LANs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The pattern: OSPF is strict about requirements but quiet about failures. When something doesn&apos;t work, methodically check each parameter. The problem is always a mismatch somewhere.&lt;/p&gt;
&lt;p&gt;Debug OSPF by elimination: Can you ping the neighbor? Is the interface passive? Does MTU match? Do timers match? Is authentication correct? Work through the list, and you&apos;ll find it.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>ospf</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Observability on VyOS: Logs, Metrics, and Backups That Matter</title><link>https://ashimov.com/posts/vyos-observability/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-observability/</guid><description>Setting up proper logging, monitoring, and backup strategies for VyOS. What to log, where to send it, how to back up configurations, and why a router without logs is like production without monitoring.</description><pubDate>Tue, 28 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Your router is infrastructure. It deserves the same observability as any production system. When something breaks at 3 AM, &quot;I don&apos;t know what happened&quot; isn&apos;t an acceptable answer. Logs, metrics, and configuration backups turn mysterious failures into diagnosable incidents.&lt;/p&gt;
&lt;p&gt;This guide covers practical observability for VyOS — what to capture, where to store it, and how to use it when things go wrong.&lt;/p&gt;
&lt;h2&gt;The Logging Strategy&lt;/h2&gt;
&lt;h3&gt;What to Log&lt;/h3&gt;
&lt;p&gt;Not all logs are equal. High-value logs for troubleshooting:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Log Type&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Firewall drops&lt;/td&gt;
&lt;td&gt;Shows blocked traffic, attack attempts, misconfigurations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interface state changes&lt;/td&gt;
&lt;td&gt;Link up/down events, carrier changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BGP/routing changes&lt;/td&gt;
&lt;td&gt;Route flaps, peer state changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Authentication&lt;/td&gt;
&lt;td&gt;SSH login attempts, successful and failed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration changes&lt;/td&gt;
&lt;td&gt;Who changed what, when&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System errors&lt;/td&gt;
&lt;td&gt;Kernel messages, service failures&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Basic Logging Setup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable system logging
set system syslog global facility all level &apos;info&apos;
set system syslog global facility protocols level &apos;debug&apos;

# Log to local file
set system syslog file messages facility all level &apos;notice&apos;
set system syslog file auth facility auth level &apos;info&apos;
set system syslog file firewall facility all level &apos;debug&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Remote Logging (Recommended)&lt;/h3&gt;
&lt;p&gt;Local logs disappear when the router dies. Send logs to a remote syslog server:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Remote syslog server
set system syslog host 10.0.0.100 facility all level &apos;info&apos;
set system syslog host 10.0.0.100 protocol &apos;udp&apos;
set system syslog host 10.0.0.100 port &apos;514&apos;

# TCP syslog on 6514 (pair with a TLS-terminating rsyslog relay)
set system syslog host logs.example.com protocol &apos;tcp&apos;
set system syslog host logs.example.com port &apos;6514&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Popular syslog receivers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;rsyslog&lt;/strong&gt;: Standard Linux syslog daemon&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Graylog&lt;/strong&gt;: Full log management platform&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Loki&lt;/strong&gt;: Lightweight, Prometheus-style logs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vector&lt;/strong&gt;: Modern log aggregation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Firewall Logging&lt;/h3&gt;
&lt;p&gt;Firewall logs are crucial. Log all drops, and selectively log accepts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Log dropped packets
set firewall ipv4 name WAN-TO-LAN default-action &apos;drop&apos;
set firewall ipv4 name WAN-TO-LAN default-log

# Log specific rules
set firewall ipv4 name WAN-TO-LAN rule 100 action &apos;drop&apos;
set firewall ipv4 name WAN-TO-LAN rule 100 log
set firewall ipv4 name WAN-TO-LAN rule 100 description &apos;Log and drop invalid&apos;
set firewall ipv4 name WAN-TO-LAN rule 100 state &apos;invalid&apos;

# Log successful SSH (for audit)
set firewall ipv4 name LAN-LOCAL rule 50 action &apos;accept&apos;
set firewall ipv4 name LAN-LOCAL rule 50 protocol &apos;tcp&apos;
set firewall ipv4 name LAN-LOCAL rule 50 destination port &apos;22&apos;
set firewall ipv4 name LAN-LOCAL rule 50 log
set firewall ipv4 name LAN-LOCAL rule 50 description &apos;Log SSH access&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: Don&apos;t log every accept. Logging on high-traffic rules can overwhelm storage and CPU.&lt;/p&gt;
&lt;h3&gt;Reading Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Recent logs
show log

# Filtered logs
show log | match firewall
show log | match -i error

# Tail logs in real-time
monitor log

# Specific log file
show log file firewall
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Metrics and Monitoring&lt;/h2&gt;
&lt;h3&gt;SNMP for Traditional Monitoring&lt;/h3&gt;
&lt;p&gt;If you have Zabbix, PRTG, LibreNMS, or similar:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# SNMP v2c (simple but less secure)
set service snmp community public authorization &apos;ro&apos;
set service snmp community public network &apos;10.0.0.0/24&apos;
set service snmp listen-address 10.0.0.1

# SNMP v3 (recommended)
set service snmp v3 user monitor auth plaintext-password &apos;authpassword&apos;
set service snmp v3 user monitor auth type &apos;sha&apos;
set service snmp v3 user monitor privacy plaintext-password &apos;privpassword&apos;
set service snmp v3 user monitor privacy type &apos;aes&apos;
set service snmp v3 user monitor group &apos;monitor-group&apos;

set service snmp v3 group monitor-group mode &apos;ro&apos;
set service snmp v3 group monitor-group view &apos;monitor-view&apos;
set service snmp v3 view monitor-view oid &apos;.1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prometheus/Node Exporter&lt;/h3&gt;
&lt;p&gt;For modern monitoring stacks:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# VyOS doesn&apos;t have native Prometheus exporter, but you can:
# 1. Install node_exporter via container
# 2. Use SNMP exporter with Prometheus
# 3. Script custom metrics export

# Example: expose metrics via simple script
# Create /config/scripts/metrics.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A simple metrics approach:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/metrics.sh - run via cron or http server

echo &quot;# HELP vyos_interface_rx_bytes Interface received bytes&quot;
echo &quot;# TYPE vyos_interface_rx_bytes counter&quot;
for iface in eth0 eth1 eth2; do
    rx=$(cat /sys/class/net/$iface/statistics/rx_bytes 2&amp;gt;/dev/null || echo 0)
    echo &quot;vyos_interface_rx_bytes{interface=\&quot;$iface\&quot;} $rx&quot;
done

echo &quot;# HELP vyos_firewall_dropped Firewall dropped packets&quot;
echo &quot;# TYPE vyos_firewall_dropped counter&quot;
# Parse drop counters from the firewall ruleset (e.g. nft list ruleset)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Health Checks&lt;/h3&gt;
&lt;p&gt;Monitor critical functions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Interface status
show interfaces

# Routing table
show ip route

# Firewall counters
show firewall

# VPN status
show vpn ipsec sa
show wireguard peers

# System resources
show system memory
show system cpu
show system storage
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Automate these checks and alert on anomalies.&lt;/p&gt;
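&lt;p&gt;A minimal sketch of such automation; the script path and thresholds are assumptions, and the alert hook is left to whatever mechanism you already use:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/health-check.sh - example sketch; adjust thresholds

# Compare a numeric metric against warning/critical thresholds
check_threshold() {
    local name=&quot;$1&quot; value=&quot;$2&quot; warn=&quot;$3&quot; crit=&quot;$4&quot;
    if [ &quot;$value&quot; -ge &quot;$crit&quot; ]; then
        echo &quot;CRITICAL $name=$value&quot;
    elif [ &quot;$value&quot; -ge &quot;$warn&quot; ]; then
        echo &quot;WARNING $name=$value&quot;
    else
        echo &quot;OK $name=$value&quot;
    fi
}

# Disk usage percentage of the root filesystem
disk_pct=$(df -P / | awk &apos;NR==2 {gsub(&quot;%&quot;,&quot;&quot;,$5); print $5}&apos;)
check_threshold disk &quot;$disk_pct&quot; 80 95
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pipe the non-OK lines into your alert script from cron or the task scheduler.&lt;/p&gt;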
&lt;h2&gt;Configuration Backup&lt;/h2&gt;
&lt;h3&gt;Manual Backup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Show configuration (can be piped to file)
show configuration commands

# Save to file
show configuration commands &amp;gt; /config/backup/config-$(date +%Y%m%d).txt

# Compare configuration files
diff /config/backup/config-old.txt /config/backup/config-new.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Automated Backup Script&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/backup-config.sh

BACKUP_DIR=&quot;/config/backup&quot;
DATE=$(date +%Y%m%d-%H%M)
KEEP_DAYS=30

# Create backup
/opt/vyatta/bin/vyatta-op-cmd-wrapper show configuration commands &amp;gt; &quot;$BACKUP_DIR/vyos-config-$DATE.txt&quot;

# Clean old backups
find &quot;$BACKUP_DIR&quot; -name &quot;vyos-config-*.txt&quot; -mtime +$KEEP_DAYS -delete

# Optional: copy to remote server
# scp &quot;$BACKUP_DIR/vyos-config-$DATE.txt&quot; backup@server:/backups/vyos/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set system task-scheduler task backup-config executable path &apos;/config/scripts/backup-config.sh&apos;
set system task-scheduler task backup-config interval &apos;1d&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Remote Backup&lt;/h3&gt;
&lt;p&gt;Send backups off-device:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Backup to remote server via SCP

REMOTE_USER=&quot;backup&quot;
REMOTE_HOST=&quot;10.0.0.100&quot;
REMOTE_PATH=&quot;/backups/vyos&quot;

CONFIG_FILE=&quot;/tmp/vyos-config-$(date +%Y%m%d).txt&quot;

# Generate config
/opt/vyatta/bin/vyatta-op-cmd-wrapper show configuration commands &amp;gt; &quot;$CONFIG_FILE&quot;

# Send to remote
scp -i /config/auth/backup_key &quot;$CONFIG_FILE&quot; &quot;${REMOTE_USER}@${REMOTE_HOST}:${REMOTE_PATH}/&quot;

# Cleanup
rm &quot;$CONFIG_FILE&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Git-Based Configuration Management&lt;/h3&gt;
&lt;p&gt;For infrastructure-as-code approach:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/git-backup.sh

cd /config
git add -A
git commit -m &quot;Config backup $(date +%Y%m%d-%H%M)&quot;
git push origin main
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Initialize git in /config:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cd /config
git init
git remote add origin git@github.com:yourorg/vyos-config.git
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Full version history&lt;/li&gt;
&lt;li&gt;Diff between any versions&lt;/li&gt;
&lt;li&gt;Blame to see who changed what&lt;/li&gt;
&lt;li&gt;Rollback to any point&lt;/li&gt;
&lt;/ul&gt;
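&lt;p&gt;Once history accumulates, the day-to-day commands look like this (the commit hash is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# What changed in the last backup?
git log --oneline -5
git diff HEAD~1 -- config.boot

# Roll a file back to a known-good commit
git checkout abc1234 -- config.boot
&lt;/code&gt;&lt;/pre&gt;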
&lt;h2&gt;Configuration Diff&lt;/h2&gt;
&lt;p&gt;Always diff before committing changes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Make some changes
set interfaces ethernet eth0 description &apos;NEW-WAN&apos;

# See what would change
compare

# Discard if wrong
discard

# Or commit if correct
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For historical comparison:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Compare running config with saved boot config (in configure mode)
configure
compare saved
exit

# Compare two backup files
diff /config/backup/config-old.txt /config/backup/config-new.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Alerting&lt;/h2&gt;
&lt;h3&gt;Simple Email Alerts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/alert.sh

SUBJECT=&quot;$1&quot;
MESSAGE=&quot;$2&quot;
RECIPIENT=&quot;admin@example.com&quot;

echo &quot;$MESSAGE&quot; | mail -s &quot;$SUBJECT&quot; &quot;$RECIPIENT&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Integrate with monitoring:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/wan-monitor.sh

if ! ping -c 3 -W 5 8.8.8.8 &amp;gt; /dev/null 2&amp;gt;&amp;amp;1; then
    /config/scripts/alert.sh &quot;WAN DOWN&quot; &quot;Primary WAN unreachable at $(date)&quot;
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Webhook Alerts (Slack, Discord, PagerDuty)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/webhook-alert.sh

WEBHOOK_URL=&quot;https://hooks.slack.com/services/xxx&quot;
MESSAGE=&quot;$1&quot;

curl -X POST -H &apos;Content-type: application/json&apos; \
    --data &quot;{\&quot;text\&quot;:\&quot;$MESSAGE\&quot;}&quot; \
    &quot;$WEBHOOK_URL&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What to Monitor&lt;/h2&gt;
&lt;p&gt;Essential metrics for router health:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Warning Threshold&lt;/th&gt;
&lt;th&gt;Critical Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU usage&lt;/td&gt;
&lt;td&gt;&amp;gt; 70%&lt;/td&gt;
&lt;td&gt;&amp;gt; 90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory usage&lt;/td&gt;
&lt;td&gt;&amp;gt; 80%&lt;/td&gt;
&lt;td&gt;&amp;gt; 95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interface errors&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.1%&lt;/td&gt;
&lt;td&gt;&amp;gt; 1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firewall drops/sec&lt;/td&gt;
&lt;td&gt;Depends on baseline&lt;/td&gt;
&lt;td&gt;Sudden 10x increase&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BGP peer state&lt;/td&gt;
&lt;td&gt;Any change&lt;/td&gt;
&lt;td&gt;Down&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VPN tunnel state&lt;/td&gt;
&lt;td&gt;Flapping&lt;/td&gt;
&lt;td&gt;Down&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk usage&lt;/td&gt;
&lt;td&gt;&amp;gt; 80%&lt;/td&gt;
&lt;td&gt;&amp;gt; 95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config changes&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Unexpected&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
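&lt;p&gt;The interface error-rate thresholds are a ratio of error to packet counters; a sketch with sample values (on the router, read &lt;code&gt;/sys/class/net/eth0/statistics/rx_errors&lt;/code&gt; and &lt;code&gt;rx_packets&lt;/code&gt; instead):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Interface error rate as a percentage of received packets (sample values)
rx_packets=2000000
rx_errors=1500
rate=$(awk -v e=&quot;$rx_errors&quot; -v p=&quot;$rx_packets&quot; &apos;BEGIN {printf &quot;%.3f&quot;, e*100/p}&apos;)
echo &quot;rx error rate: ${rate}%&quot;
&lt;/code&gt;&lt;/pre&gt;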
&lt;h2&gt;Disaster Recovery Checklist&lt;/h2&gt;
&lt;p&gt;When everything fails, you need:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Configuration backup&lt;/strong&gt; (tested restore)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Firmware/image backup&lt;/strong&gt; (same VyOS version)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documented procedure&lt;/strong&gt; (how to restore)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Out-of-band access&lt;/strong&gt; (console, IPMI, if available)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Test Your Backups&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Periodically test restore on a test instance.
# A backup made with &quot;show configuration commands&quot; is a list of set
# commands: replay it with &quot;source&quot; in configure mode (&quot;load&quot; expects
# config.boot format, not commands)
configure
source /config/backup/vyos-config-latest.txt
compare
# Review changes
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A backup you&apos;ve never tested restoring is not a backup.&lt;/p&gt;
&lt;h2&gt;Complete Observability Setup&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# === Syslog ===
set system syslog global facility all level &apos;info&apos;
set system syslog host 10.0.0.100 facility all level &apos;info&apos;
set system syslog host 10.0.0.100 protocol &apos;udp&apos;
set system syslog file messages facility all level &apos;notice&apos;

# === SNMP ===
set service snmp community monitoring authorization &apos;ro&apos;
set service snmp community monitoring network &apos;10.0.0.0/24&apos;
set service snmp listen-address 10.0.0.1
set service snmp location &apos;Network Closet&apos;
set service snmp contact &apos;admin@example.com&apos;

# === Firewall Logging ===
set firewall ipv4 name WAN-TO-LAN default-log
set firewall ipv4 name WAN-LOCAL default-log

# === Scheduled Tasks ===
set system task-scheduler task backup-config executable path &apos;/config/scripts/backup-config.sh&apos;
set system task-scheduler task backup-config interval &apos;1d&apos;
set system task-scheduler task wan-monitor executable path &apos;/config/scripts/wan-monitor.sh&apos;
set system task-scheduler task wan-monitor interval &apos;5m&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;A router without observability is like running production without monitoring — you&apos;ll only know something&apos;s wrong when users complain, and you&apos;ll have no data to diagnose it.&lt;/p&gt;
&lt;p&gt;The minimum viable observability:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Remote syslog&lt;/strong&gt;: Logs survive device failure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Firewall logging&lt;/strong&gt;: See what&apos;s being blocked&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configuration backups&lt;/strong&gt;: Automated, tested, off-device&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Health monitoring&lt;/strong&gt;: Alert before users notice&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Everything else builds on this foundation. Start simple, add complexity as needed. The goal isn&apos;t comprehensive monitoring — it&apos;s having the data you need when things break.&lt;/p&gt;
</content:encoded><category>vyos</category><category>backup</category><category>monitoring</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>QoS on VyOS: Making Latency Feel Better</title><link>https://ashimov.com/posts/vyos-qos/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-qos/</guid><description>Practical traffic shaping and QoS configuration on VyOS. Covers queue disciplines, traffic prioritization, fighting bufferbloat, and understanding where the actual bottleneck is.</description><pubDate>Fri, 24 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;QoS (Quality of Service) is often misunderstood. People expect it to &quot;make the internet faster.&quot; It doesn&apos;t. QoS is about managing scarcity — when there&apos;s not enough bandwidth for everyone, QoS decides who gets priority.&lt;/p&gt;
&lt;p&gt;The key insight: &lt;strong&gt;QoS only works when you control the bottleneck&lt;/strong&gt;. If your ISP is the bottleneck, traffic shaping on your router shapes what leaves your network, not what your ISP does. Understanding this is crucial for effective QoS.&lt;/p&gt;
&lt;h2&gt;Understanding the Problem: Bufferbloat&lt;/h2&gt;
&lt;p&gt;Modern networks have a hidden enemy: bufferbloat. Network devices have large buffers that queue packets when congested. Large buffers = high latency during congestion.&lt;/p&gt;
&lt;p&gt;Scenario: You&apos;re on a video call. Someone starts a large download. Suddenly your call has 500ms latency because packets are stuck in buffers behind download packets.&lt;/p&gt;
&lt;p&gt;QoS solves this by:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Shaping traffic below the actual link speed (to control where queuing happens)&lt;/li&gt;
&lt;li&gt;Prioritizing latency-sensitive traffic&lt;/li&gt;
&lt;li&gt;Using smart queue disciplines that prevent buffer buildup&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Measuring the Problem&lt;/h2&gt;
&lt;p&gt;Before configuring QoS, measure your baseline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Test bufferbloat (from client, while running a speed test)
ping 8.8.8.8

# Watch for latency increase during upload/download
# Normal: ~20ms, Bufferbloat: 200-1000ms
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or use the &lt;a href=&quot;https://www.waveform.com/tools/bufferbloat&quot;&gt;Bufferbloat test&lt;/a&gt; — it specifically measures latency under load.&lt;/p&gt;
&lt;h2&gt;VyOS Traffic Shaping Basics&lt;/h2&gt;
&lt;p&gt;VyOS uses Linux tc (traffic control) under the hood. Two main components:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Shaper&lt;/strong&gt;: Limits overall bandwidth
&lt;strong&gt;Classes/Queues&lt;/strong&gt;: Divide bandwidth among traffic types&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Basic traffic shaping on WAN interface
set traffic-policy shaper WAN-OUT bandwidth &apos;95mbit&apos;
set traffic-policy shaper WAN-OUT default bandwidth &apos;50%&apos;
set traffic-policy shaper WAN-OUT default ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT default queue-type &apos;fq-codel&apos;

# Apply to outbound on WAN interface
set interfaces ethernet eth0 traffic-policy out &apos;WAN-OUT&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;bandwidth &apos;95mbit&apos;&lt;/strong&gt;: Total shaper bandwidth. Set this ~95% of your actual upload speed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;fq-codel&lt;/strong&gt;: Fair Queue with Controlled Delay — fights bufferbloat&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ceiling &apos;100%&apos;&lt;/strong&gt;: Can burst to full bandwidth if available&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Why 95% of Your Actual Speed?&lt;/h2&gt;
&lt;p&gt;If your upload is 100Mbps and you shape at 100Mbps, congestion still happens at your ISP&apos;s edge. You&apos;re not controlling the bottleneck.&lt;/p&gt;
&lt;p&gt;Shape at 95Mbps (or even 90Mbps for very stable latency), and congestion happens at your router, where you control the queue. Your router&apos;s smart queue (fq-codel) manages latency instead of your ISP&apos;s dumb FIFO buffer.&lt;/p&gt;
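&lt;p&gt;The rule is easy to script: derive the shaper bandwidth from a measured upload speed (the measured figure below is an example).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Derive a shaper rate as 95% of the measured upload (example figure)
measured_kbit=102400   # 100 Mbit/s measured upload, in kbit/s
shape_kbit=$(( measured_kbit * 95 / 100 ))
echo &quot;set traffic-policy shaper WAN-OUT bandwidth &apos;${shape_kbit}kbit&apos;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Re-measure after your ISP plan changes; a stale shaper rate either wastes bandwidth or stops controlling the bottleneck.&lt;/p&gt;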
&lt;h2&gt;Traffic Classes: Prioritizing Different Traffic&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create shaper with classes
set traffic-policy shaper WAN-OUT bandwidth &apos;95mbit&apos;

# Voice/Video - highest priority, guaranteed bandwidth
set traffic-policy shaper WAN-OUT class 10 bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT class 10 ceiling &apos;50%&apos;
set traffic-policy shaper WAN-OUT class 10 priority &apos;0&apos;
set traffic-policy shaper WAN-OUT class 10 queue-type &apos;fq-codel&apos;
set traffic-policy shaper WAN-OUT class 10 match VOIP ip dscp &apos;ef&apos;

# Interactive (SSH, gaming) - high priority
set traffic-policy shaper WAN-OUT class 20 bandwidth &apos;10%&apos;
set traffic-policy shaper WAN-OUT class 20 ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT class 20 priority &apos;1&apos;
set traffic-policy shaper WAN-OUT class 20 queue-type &apos;fq-codel&apos;
set traffic-policy shaper WAN-OUT class 20 match SSH ip protocol &apos;tcp&apos;
set traffic-policy shaper WAN-OUT class 20 match SSH ip destination port &apos;22&apos;

# Web browsing - normal priority
set traffic-policy shaper WAN-OUT class 30 bandwidth &apos;30%&apos;
set traffic-policy shaper WAN-OUT class 30 ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT class 30 priority &apos;3&apos;
set traffic-policy shaper WAN-OUT class 30 queue-type &apos;fq-codel&apos;
set traffic-policy shaper WAN-OUT class 30 match HTTP ip protocol &apos;tcp&apos;
set traffic-policy shaper WAN-OUT class 30 match HTTP ip destination port &apos;80,443&apos;

# Bulk downloads - lowest priority
set traffic-policy shaper WAN-OUT class 40 bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT class 40 ceiling &apos;90%&apos;
set traffic-policy shaper WAN-OUT class 40 priority &apos;5&apos;
set traffic-policy shaper WAN-OUT class 40 queue-type &apos;fq-codel&apos;

# Default for unclassified traffic
set traffic-policy shaper WAN-OUT default bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT default ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT default priority &apos;4&apos;
set traffic-policy shaper WAN-OUT default queue-type &apos;fq-codel&apos;

set interfaces ethernet eth0 traffic-policy out &apos;WAN-OUT&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Understanding the Parameters&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;bandwidth&lt;/strong&gt;: Guaranteed minimum bandwidth for this class
&lt;strong&gt;ceiling&lt;/strong&gt;: Maximum bandwidth when other classes aren&apos;t using theirs
&lt;strong&gt;priority&lt;/strong&gt;: Lower number = higher priority (0 is highest)
&lt;strong&gt;queue-type&lt;/strong&gt;: Algorithm for managing the queue&lt;/p&gt;
&lt;p&gt;Bandwidth percentages should roughly add up to 100%. The ceiling allows classes to borrow unused bandwidth.&lt;/p&gt;
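&lt;p&gt;A quick sanity check can sum the guarantees from a saved command dump; the inline sample mirrors the classes above, and in practice you would feed it &lt;code&gt;show configuration commands | match WAN-OUT&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Sum guaranteed bandwidth percentages for one shaper (sample data inline)
config=&quot;set traffic-policy shaper WAN-OUT class 10 bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT class 20 bandwidth &apos;10%&apos;
set traffic-policy shaper WAN-OUT class 30 bandwidth &apos;30%&apos;
set traffic-policy shaper WAN-OUT class 40 bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT default bandwidth &apos;20%&apos;&quot;

total=$(echo &quot;$config&quot; | grep -o &quot;[0-9]*%&quot; | tr -d &apos;%&apos; | awk &apos;{s+=$1} END {print s}&apos;)
echo &quot;guaranteed total: ${total}%&quot;
&lt;/code&gt;&lt;/pre&gt;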
&lt;h2&gt;Queue Types: fq-codel vs Others&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;fq-codel (Fair Queue Controlled Delay)&lt;/strong&gt;: Best for most cases. Maintains low latency, fair sharing between flows. Use this unless you have specific needs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;sfq (Stochastic Fair Queue)&lt;/strong&gt;: Simpler, less effective at latency control. Legacy option.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;pfifo/bfifo&lt;/strong&gt;: Simple FIFO queues. Don&apos;t fight bufferbloat. Avoid.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;cake&lt;/strong&gt;: Advanced shaper (may need additional packages). Even better than fq-codel for some scenarios.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# If cake is available
set traffic-policy shaper WAN-OUT default queue-type &apos;cake&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Inbound Shaping: The Hard Problem&lt;/h2&gt;
&lt;p&gt;You can&apos;t directly control inbound traffic — it&apos;s already at your doorstep. But you can:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Police incoming traffic&lt;/strong&gt;: Drop/mark packets exceeding rate&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use ingress shaping&lt;/strong&gt;: Shape traffic after it arrives&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rely on TCP feedback&lt;/strong&gt;: Shaping outbound ACKs affects inbound TCP rate&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;configure

# Ingress policing on WAN interface
set traffic-policy limiter WAN-IN class 10 bandwidth &apos;95mbit&apos;
set traffic-policy limiter WAN-IN class 10 match ALL ip source address &apos;0.0.0.0/0&apos;
set traffic-policy limiter WAN-IN default bandwidth &apos;95mbit&apos;

set interfaces ethernet eth0 traffic-policy in &apos;WAN-IN&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is less precise than outbound shaping. For better download QoS, shape slightly below your download speed and let fq-codel manage queuing.&lt;/p&gt;
&lt;h2&gt;Practical Examples&lt;/h2&gt;
&lt;h3&gt;Home Office: Prioritize Video Calls&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Identify video call traffic (Zoom, Teams, etc use UDP on various ports)
set traffic-policy shaper WAN-OUT class 10 match VIDEO-UDP ip protocol &apos;udp&apos;
set traffic-policy shaper WAN-OUT class 10 match VIDEO-UDP ip destination port &apos;3478-3481,8801-8810,19302-19309&apos;

# Give video 30% guaranteed, can burst to 60%
set traffic-policy shaper WAN-OUT class 10 bandwidth &apos;30%&apos;
set traffic-policy shaper WAN-OUT class 10 ceiling &apos;60%&apos;
set traffic-policy shaper WAN-OUT class 10 priority &apos;0&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Gaming: Low Latency&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Gaming often uses specific ports (varies by game)
set traffic-policy shaper WAN-OUT class 15 match GAMING ip protocol &apos;udp&apos;
set traffic-policy shaper WAN-OUT class 15 match GAMING ip source port &apos;1024-65535&apos;
set traffic-policy shaper WAN-OUT class 15 bandwidth &apos;15%&apos;
set traffic-policy shaper WAN-OUT class 15 ceiling &apos;50%&apos;
set traffic-policy shaper WAN-OUT class 15 priority &apos;0&apos;

# Also prioritize small packets (often game state updates)
set traffic-policy shaper WAN-OUT class 15 match SMALL ip max-length &apos;256&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Torrent/Backup Deprioritization&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Bulk traffic class - low priority
set traffic-policy shaper WAN-OUT class 50 bandwidth &apos;10%&apos;
set traffic-policy shaper WAN-OUT class 50 ceiling &apos;80%&apos;
set traffic-policy shaper WAN-OUT class 50 priority &apos;7&apos;

# Match by ports commonly used by bulk transfers
set traffic-policy shaper WAN-OUT class 50 match TORRENT ip destination port &apos;6881-6889&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;DSCP Marking&lt;/h2&gt;
&lt;p&gt;DSCP (Differentiated Services Code Point) is a field in the IP header used to classify traffic. Many applications set DSCP themselves; you can match on it for classification:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Match on DSCP values set by applications
set traffic-policy shaper WAN-OUT class 10 match VOICE ip dscp &apos;ef&apos;
set traffic-policy shaper WAN-OUT class 20 match VIDEO ip dscp &apos;af41&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Common DSCP values:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;EF (46)&lt;/strong&gt;: Expedited Forwarding - voice&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AF41 (34)&lt;/strong&gt;: Assured Forwarding - video&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AF21 (18)&lt;/strong&gt;: Assured Forwarding - business critical&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CS1 (8)&lt;/strong&gt;: Scavenger - bulk/background&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can also mark traffic yourself:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Mark VoIP traffic with EF (DSCP 46)
set firewall ipv4 name MARK-QOS rule 10 action &apos;accept&apos;
set firewall ipv4 name MARK-QOS rule 10 protocol &apos;udp&apos;
set firewall ipv4 name MARK-QOS rule 10 destination port &apos;5060&apos;
set firewall ipv4 name MARK-QOS rule 10 set dscp &apos;46&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring QoS&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# Show current traffic policy statistics
show queueing ethernet eth0

# Show class statistics
tc -s class show dev eth0

# Watch queue lengths
watch tc -s qdisc show dev eth0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Look for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;drops&lt;/strong&gt;: Some drops are normal (fq-codel drops to signal congestion)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;backlog&lt;/strong&gt;: Should stay low; a growing backlog means buffers are building up&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;overlimits&lt;/strong&gt;: Traffic exceeding class bandwidth (borrowing from ceiling)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Debugging QoS Issues&lt;/h2&gt;
&lt;h3&gt;Traffic Not Being Classified&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Most traffic in default class? Check your matches
show queueing ethernet eth0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If priority traffic isn&apos;t getting classified, verify:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Port/protocol matches are correct&lt;/li&gt;
&lt;li&gt;Traffic isn&apos;t using unexpected ports (HTTPS multiplexes everything over 443)&lt;/li&gt;
&lt;li&gt;Match rules are specific enough&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Latency Still High&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Shaper bandwidth too high&lt;/strong&gt;: Lower it (try 90% of link speed)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Not using fq-codel&lt;/strong&gt;: Change queue-type&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inbound is the problem&lt;/strong&gt;: Need ingress shaping too&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ISP QoS&lt;/strong&gt;: Your ISP might have their own queuing&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;VoIP Quality Still Poor&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Jitter buffer&lt;/strong&gt;: Some jitter is handled by endpoints&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Packet loss&lt;/strong&gt;: Check &lt;code&gt;show queueing&lt;/code&gt; for excessive drops&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Misclassified traffic&lt;/strong&gt;: Verify VoIP is hitting the right class&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Complete QoS Configuration&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# === Traffic Policy ===
set traffic-policy shaper WAN-OUT bandwidth &apos;95mbit&apos;

# Voice/Video - highest priority
set traffic-policy shaper WAN-OUT class 10 bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT class 10 ceiling &apos;50%&apos;
set traffic-policy shaper WAN-OUT class 10 priority &apos;0&apos;
set traffic-policy shaper WAN-OUT class 10 queue-type &apos;fq-codel&apos;
set traffic-policy shaper WAN-OUT class 10 match VOIP ip dscp &apos;ef&apos;
set traffic-policy shaper WAN-OUT class 10 match REALTIME ip protocol &apos;udp&apos;
set traffic-policy shaper WAN-OUT class 10 match REALTIME ip destination port &apos;3478-3481,5060,16384-32767&apos;

# Interactive
set traffic-policy shaper WAN-OUT class 20 bandwidth &apos;15%&apos;
set traffic-policy shaper WAN-OUT class 20 ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT class 20 priority &apos;1&apos;
set traffic-policy shaper WAN-OUT class 20 queue-type &apos;fq-codel&apos;
set traffic-policy shaper WAN-OUT class 20 match SSH ip protocol &apos;tcp&apos;
set traffic-policy shaper WAN-OUT class 20 match SSH ip destination port &apos;22&apos;
set traffic-policy shaper WAN-OUT class 20 match DNS ip protocol &apos;udp&apos;
set traffic-policy shaper WAN-OUT class 20 match DNS ip destination port &apos;53&apos;

# Web
set traffic-policy shaper WAN-OUT class 30 bandwidth &apos;35%&apos;
set traffic-policy shaper WAN-OUT class 30 ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT class 30 priority &apos;3&apos;
set traffic-policy shaper WAN-OUT class 30 queue-type &apos;fq-codel&apos;
set traffic-policy shaper WAN-OUT class 30 match WEB ip protocol &apos;tcp&apos;
set traffic-policy shaper WAN-OUT class 30 match WEB ip destination port &apos;80,443&apos;

# Bulk
set traffic-policy shaper WAN-OUT class 40 bandwidth &apos;10%&apos;
set traffic-policy shaper WAN-OUT class 40 ceiling &apos;80%&apos;
set traffic-policy shaper WAN-OUT class 40 priority &apos;6&apos;
set traffic-policy shaper WAN-OUT class 40 queue-type &apos;fq-codel&apos;

# Default
set traffic-policy shaper WAN-OUT default bandwidth &apos;20%&apos;
set traffic-policy shaper WAN-OUT default ceiling &apos;100%&apos;
set traffic-policy shaper WAN-OUT default priority &apos;4&apos;
set traffic-policy shaper WAN-OUT default queue-type &apos;fq-codel&apos;

# === Apply ===
set interfaces ethernet eth0 traffic-policy out &apos;WAN-OUT&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;QoS works when you understand the bottleneck:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Shape below link speed&lt;/strong&gt;: This moves the bottleneck to your router where you control queuing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use smart queues (fq-codel)&lt;/strong&gt;: They maintain low latency automatically&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prioritize appropriately&lt;/strong&gt;: Not everything can be high priority — that&apos;s just no priority&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The goal isn&apos;t faster internet — it&apos;s &lt;strong&gt;predictable&lt;/strong&gt; internet. Video calls that don&apos;t stutter when someone starts a download. SSH that stays responsive during backups. Gaming that doesn&apos;t spike during updates.&lt;/p&gt;
&lt;p&gt;Test before and after. Measure latency under load. If it doesn&apos;t improve, you haven&apos;t identified the real bottleneck yet.&lt;/p&gt;
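&lt;p&gt;For the before/after comparison, the average RTT can be pulled straight out of the ping summary; the summary line below is a sample, and in practice you would pipe &lt;code&gt;ping -c 20 8.8.8.8 | tail -n1&lt;/code&gt; into the same awk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Extract avg RTT from a ping summary line (sample shown for illustration)
summary=&quot;rtt min/avg/max/mdev = 18.1/22.4/35.9/4.2 ms&quot;
avg=$(echo &quot;$summary&quot; | awk -F&apos;/&apos; &apos;{print $5}&apos;)
echo &quot;avg RTT: ${avg} ms&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run it idle, then again under full upload; the delta is your bufferbloat.&lt;/p&gt;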
</content:encoded><category>vyos</category><category>networking</category><category>troubleshooting</category><category>tuning</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Multi-WAN on VyOS: Failover That Actually Works</title><link>https://ashimov.com/posts/vyos-multi-wan/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-multi-wan/</guid><description>Configuring reliable multi-WAN failover on VyOS with proper health checking. Covers dual ISP setup, weighted load balancing, SLA monitoring, and why failover without tracking is false confidence.</description><pubDate>Tue, 21 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Having two internet connections means nothing if your router doesn&apos;t know when one fails. I&apos;ve seen setups where the &quot;failover&quot; just meant two default routes with different metrics — the primary could be completely dead, and the router would happily keep trying to send traffic through it until the metrics were manually adjusted.&lt;/p&gt;
&lt;p&gt;Real failover requires active health checking. VyOS provides this, but it needs proper configuration. Let&apos;s build multi-WAN that actually works.&lt;/p&gt;
&lt;h2&gt;The Multi-WAN Architecture&lt;/h2&gt;
&lt;p&gt;Typical setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;eth0&lt;/strong&gt;: Primary ISP (faster, preferred)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;eth1&lt;/strong&gt;: Secondary ISP (backup)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;eth2&lt;/strong&gt;: LAN&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Goals:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use primary when healthy&lt;/li&gt;
&lt;li&gt;Failover to secondary when primary fails&lt;/li&gt;
&lt;li&gt;Fail back when primary recovers&lt;/li&gt;
&lt;li&gt;All of this automatically&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Basic Interface Setup&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;configure

# Primary WAN
set interfaces ethernet eth0 description &apos;WAN-PRIMARY&apos;
set interfaces ethernet eth0 address dhcp

# Secondary WAN
set interfaces ethernet eth1 description &apos;WAN-SECONDARY&apos;
set interfaces ethernet eth1 address dhcp

# LAN
set interfaces ethernet eth2 description &apos;LAN&apos;
set interfaces ethernet eth2 address &apos;10.0.0.1/24&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Wrong Way: Static Metrics&lt;/h2&gt;
&lt;p&gt;You might think this works:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# DON&apos;T DO THIS (or at least, don&apos;t rely only on this)
set protocols static route 0.0.0.0/0 next-hop 192.168.1.1 distance 10
set protocols static route 0.0.0.0/0 next-hop 192.168.2.1 distance 20
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates two default routes. The lower distance (10) is preferred. But here&apos;s the problem: if the primary ISP goes down at layer 3 (routing issue, ISP outage, etc.), the interface might still be up. The router keeps using the &quot;preferred&quot; route that goes nowhere.&lt;/p&gt;
&lt;h2&gt;The Right Way: Health Checking&lt;/h2&gt;
&lt;p&gt;VyOS gives you three ways to do this, from weakest to strongest: static routes with interface tracking, scripted health probes, and the built-in WAN load-balancing subsystem with active health checks. Each is covered below.&lt;/p&gt;
&lt;h3&gt;Option 1: Interface State Tracking&lt;/h3&gt;
&lt;p&gt;Basic tracking — failover when interface goes down:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Primary route with interface tracking
set protocols static route 0.0.0.0/0 next-hop 192.168.1.1 distance 10
set protocols static route 0.0.0.0/0 next-hop 192.168.1.1 interface &apos;eth0&apos;

# Secondary route - used when primary interface is down
set protocols static route 0.0.0.0/0 next-hop 192.168.2.1 distance 20
set protocols static route 0.0.0.0/0 next-hop 192.168.2.1 interface &apos;eth1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This helps but only detects link failure, not upstream issues.&lt;/p&gt;
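&lt;p&gt;You can see the gap from the CLI: the link can report up/up while the path beyond the gateway is dead, so test end-to-end reachability, not link state:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Link state says nothing about the ISP&apos;s upstream
show interfaces ethernet eth0

# The real test: can this WAN actually reach the internet?
ping 8.8.8.8 interface eth0 count 3
&lt;/code&gt;&lt;/pre&gt;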
&lt;h3&gt;Option 2: Scripted Health Checks&lt;/h3&gt;
&lt;p&gt;For proper SLA monitoring, create a health check script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# /config/scripts/wan-health-check.sh
# Note: this edits the kernel routing table directly, outside the VyOS
# config system — changes do not persist across reboots and can be
# overwritten by a commit. The built-in load balancing (Option 3) avoids this.

PRIMARY_GW=&quot;192.168.1.1&quot;
SECONDARY_GW=&quot;192.168.2.1&quot;
CHECK_TARGET=&quot;8.8.8.8&quot;
PRIMARY_METRIC=10
FAILOVER_METRIC=5

# Check primary WAN by pinging through it
if ping -c 3 -W 2 -I eth0 $CHECK_TARGET &amp;gt; /dev/null 2&amp;gt;&amp;amp;1; then
    # Primary is healthy - remove stale failover routes, restore preference.
    # (ip route replace matches on prefix+metric, so routes installed
    # with other metrics must be deleted explicitly.)
    ip route del default via $SECONDARY_GW metric $FAILOVER_METRIC 2&amp;gt;/dev/null
    ip route del default via $PRIMARY_GW metric 100 2&amp;gt;/dev/null
    ip route replace default via $PRIMARY_GW metric $PRIMARY_METRIC
    ip route replace default via $SECONDARY_GW metric 20
else
    # Primary is down - make secondary preferred
    ip route replace default via $SECONDARY_GW metric $FAILOVER_METRIC
    ip route replace default via $PRIMARY_GW metric 100
    logger &quot;WAN Failover: Primary down, using secondary&quot;
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule it with the task scheduler (a bare interval value is interpreted as minutes, so &apos;30&apos; would mean every 30 minutes; &apos;1&apos; gives the fastest cron-backed cadence of once per minute):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set system task-scheduler task wan-check executable path &apos;/config/scripts/wan-health-check.sh&apos;
set system task-scheduler task wan-check interval &apos;1&apos;
&lt;/code&gt;&lt;/pre&gt;
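&lt;p&gt;Two practical notes: the script must be executable, and keeping it under &lt;code&gt;/config&lt;/code&gt; means it survives image upgrades. Run it once by hand before trusting the scheduler:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;chmod +x /config/scripts/wan-health-check.sh

# Dry run: execute manually and inspect the routes it installed
sudo /config/scripts/wan-health-check.sh
ip route show default
&lt;/code&gt;&lt;/pre&gt;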
&lt;h3&gt;Option 3: VyOS WAN Load Balancing (Recommended)&lt;/h3&gt;
&lt;p&gt;VyOS has built-in WAN load balancing with health checks:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Define WAN interfaces for load balancing
set load-balancing wan interface-health eth0 failure-count &apos;3&apos;
set load-balancing wan interface-health eth0 nexthop &apos;192.168.1.1&apos;
set load-balancing wan interface-health eth0 success-count &apos;3&apos;
set load-balancing wan interface-health eth0 test 10 resp-time &apos;5&apos;
set load-balancing wan interface-health eth0 test 10 target &apos;8.8.8.8&apos;
set load-balancing wan interface-health eth0 test 10 type &apos;ping&apos;

set load-balancing wan interface-health eth1 failure-count &apos;3&apos;
set load-balancing wan interface-health eth1 nexthop &apos;192.168.2.1&apos;
set load-balancing wan interface-health eth1 success-count &apos;3&apos;
set load-balancing wan interface-health eth1 test 10 resp-time &apos;5&apos;
set load-balancing wan interface-health eth1 test 10 target &apos;8.8.4.4&apos;
set load-balancing wan interface-health eth1 test 10 type &apos;ping&apos;

# Define load balancing rule
set load-balancing wan rule 10 inbound-interface &apos;eth2&apos;
set load-balancing wan rule 10 interface eth0 weight &apos;100&apos;
set load-balancing wan rule 10 interface eth1 weight &apos;1&apos;
set load-balancing wan rule 10 failover

# Sticky connections (optional - keeps sessions on same WAN)
set load-balancing wan sticky-connections inbound
set load-balancing wan enable-local-traffic

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key parameters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;failure-count&lt;/strong&gt;: How many failed tests before marking down&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;success-count&lt;/strong&gt;: How many successes before marking up&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;weight&lt;/strong&gt;: Higher = more traffic (100:1 means primary gets almost all traffic)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;failover&lt;/strong&gt;: Enable failover mode (not just load balancing)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Understanding the Health Check&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;set load-balancing wan interface-health eth0 test 10 target &apos;8.8.8.8&apos;
set load-balancing wan interface-health eth0 test 10 type &apos;ping&apos;
set load-balancing wan interface-health eth0 test 10 resp-time &apos;5&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pings 8.8.8.8 through eth0. If response takes &amp;gt;5 seconds or fails, it counts as a failure. After 3 failures (failure-count), the interface is marked down.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose your test target wisely:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Public DNS (8.8.8.8, 1.1.1.1) - highly available&lt;/li&gt;
&lt;li&gt;Your ISP&apos;s gateway - tests only first hop&lt;/li&gt;
&lt;li&gt;Multiple targets for more confidence&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# Multiple tests for more confidence - evaluated in ascending order
set load-balancing wan interface-health eth0 test 10 target &apos;8.8.8.8&apos;
set load-balancing wan interface-health eth0 test 10 type &apos;ping&apos;
set load-balancing wan interface-health eth0 test 20 target &apos;1.1.1.1&apos;
set load-balancing wan interface-health eth0 test 20 type &apos;ping&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;NAT for Multi-WAN&lt;/h2&gt;
&lt;p&gt;Each WAN needs its own NAT rule:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# NAT for primary WAN
set nat source rule 100 outbound-interface name &apos;eth0&apos;
set nat source rule 100 source address &apos;10.0.0.0/24&apos;
set nat source rule 100 translation address &apos;masquerade&apos;

# NAT for secondary WAN
set nat source rule 110 outbound-interface name &apos;eth1&apos;
set nat source rule 110 source address &apos;10.0.0.0/24&apos;
set nat source rule 110 translation address &apos;masquerade&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;masquerade&lt;/code&gt; automatically uses the correct outbound IP based on which interface traffic exits.&lt;/p&gt;
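&lt;p&gt;A quick way to confirm per-WAN NAT is to ask an external echo service which address each path presents (assumes outbound HTTPS is allowed; ifconfig.me is one example of such a service):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Should print each ISP&apos;s public IP respectively
sudo curl --interface eth0 https://ifconfig.me
sudo curl --interface eth1 https://ifconfig.me
&lt;/code&gt;&lt;/pre&gt;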
&lt;h2&gt;Monitoring WAN Status&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# Check WAN health status
show wan-load-balance

# Check current routing
show ip route

# Check NAT sessions
show nat source translations
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Sticky Sessions: Why They Matter&lt;/h2&gt;
&lt;p&gt;Without sticky sessions, a TCP connection might start on WAN1, then mid-connection failover happens, and packets go out WAN2 with a different source IP. The remote server sees packets from a different IP and drops them.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set load-balancing wan sticky-connections inbound
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Sticky connections track existing connections and keep them on the same WAN until they complete. New connections go to whichever WAN is preferred at that moment.&lt;/p&gt;
&lt;h2&gt;Exclude Certain Traffic from Load Balancing&lt;/h2&gt;
&lt;p&gt;Some traffic should always use a specific WAN:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# VPN traffic always uses primary (to maintain stable VPN connection)
set load-balancing wan rule 5 inbound-interface &apos;eth2&apos;
set load-balancing wan rule 5 destination port &apos;51820&apos;
set load-balancing wan rule 5 protocol &apos;udp&apos;
set load-balancing wan rule 5 interface eth0 weight &apos;100&apos;

# VoIP traffic uses secondary (more stable latency)
set load-balancing wan rule 6 inbound-interface &apos;eth2&apos;
set load-balancing wan rule 6 destination port &apos;5060-5061&apos;
set load-balancing wan rule 6 protocol &apos;udp&apos;
set load-balancing wan rule 6 interface eth1 weight &apos;100&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rules are processed in order. Rule 5 and 6 handle specific traffic, rule 10 (from earlier) handles everything else.&lt;/p&gt;
&lt;h2&gt;Active-Active vs Active-Passive&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Active-Passive&lt;/strong&gt; (Failover):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set load-balancing wan rule 10 interface eth0 weight &apos;100&apos;
set load-balancing wan rule 10 interface eth1 weight &apos;1&apos;
set load-balancing wan rule 10 failover
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Primary handles all traffic. Secondary only used when primary fails.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Active-Active&lt;/strong&gt; (Load Sharing):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set load-balancing wan rule 10 interface eth0 weight &apos;70&apos;
set load-balancing wan rule 10 interface eth1 weight &apos;30&apos;
# Remove &apos;failover&apos; flag
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Traffic is distributed across both links: roughly 70% to primary, 30% to secondary.&lt;/p&gt;
&lt;p&gt;Active-Active provides more bandwidth but complicates troubleshooting and may cause issues with services that expect consistent source IP.&lt;/p&gt;
&lt;h2&gt;Testing Failover&lt;/h2&gt;
&lt;p&gt;Before relying on failover, test it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify both WANs work independently&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Test via primary
ping -I eth0 8.8.8.8

# Test via secondary
ping -I eth1 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Simulate primary failure&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Temporarily block test target from primary using output filter
set firewall ipv4 name TEST rule 1 action &apos;drop&apos;
set firewall ipv4 name TEST rule 1 destination address &apos;8.8.8.8&apos;
set firewall ipv4 output filter rule 100 outbound-interface name &apos;eth0&apos;
set firewall ipv4 output filter rule 100 action &apos;jump&apos;
set firewall ipv4 output filter rule 100 jump-target &apos;TEST&apos;
commit

# Watch failover happen
show wan-load-balance

# Remove test firewall
delete firewall ipv4 name TEST
delete firewall ipv4 output filter rule 100
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Physically disconnect primary&lt;/strong&gt;
Unplug eth0. Verify traffic continues via eth1. Reconnect and verify fail-back.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Complete Multi-WAN Configuration&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# === Interfaces ===
set interfaces ethernet eth0 description &apos;WAN-PRIMARY&apos;
set interfaces ethernet eth0 address dhcp
set interfaces ethernet eth1 description &apos;WAN-SECONDARY&apos;
set interfaces ethernet eth1 address dhcp
set interfaces ethernet eth2 description &apos;LAN&apos;
set interfaces ethernet eth2 address &apos;10.0.0.1/24&apos;

# === NAT ===
set nat source rule 100 outbound-interface name &apos;eth0&apos;
set nat source rule 100 source address &apos;10.0.0.0/24&apos;
set nat source rule 100 translation address &apos;masquerade&apos;
set nat source rule 110 outbound-interface name &apos;eth1&apos;
set nat source rule 110 source address &apos;10.0.0.0/24&apos;
set nat source rule 110 translation address &apos;masquerade&apos;

# === WAN Load Balancing with Health Check ===
set load-balancing wan interface-health eth0 failure-count &apos;3&apos;
set load-balancing wan interface-health eth0 nexthop &apos;dhcp&apos;
set load-balancing wan interface-health eth0 success-count &apos;3&apos;
set load-balancing wan interface-health eth0 test 10 resp-time &apos;5&apos;
set load-balancing wan interface-health eth0 test 10 target &apos;8.8.8.8&apos;
set load-balancing wan interface-health eth0 test 10 type &apos;ping&apos;

set load-balancing wan interface-health eth1 failure-count &apos;3&apos;
set load-balancing wan interface-health eth1 nexthop &apos;dhcp&apos;
set load-balancing wan interface-health eth1 success-count &apos;3&apos;
set load-balancing wan interface-health eth1 test 10 resp-time &apos;5&apos;
set load-balancing wan interface-health eth1 test 10 target &apos;8.8.4.4&apos;
set load-balancing wan interface-health eth1 test 10 type &apos;ping&apos;

set load-balancing wan rule 10 inbound-interface &apos;eth2&apos;
set load-balancing wan rule 10 interface eth0 weight &apos;100&apos;
set load-balancing wan rule 10 interface eth1 weight &apos;1&apos;
set load-balancing wan rule 10 failover

set load-balancing wan sticky-connections inbound
set load-balancing wan enable-local-traffic
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;Multi-WAN without proper health checking is false confidence. Your router might report two routes while happily sending traffic into a black hole.&lt;/p&gt;
&lt;p&gt;Real failover requires:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Active health checks&lt;/strong&gt; that test actual connectivity, not just link state&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reasonable timers&lt;/strong&gt; - fast enough to detect failures quickly, slow enough to avoid flapping&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Testing&lt;/strong&gt; - verify failover actually works before you need it&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring&lt;/strong&gt; - alerts when failover happens so you know to investigate&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;VyOS&apos;s WAN load balancing provides all of this out of the box. Configure it, test it, and trust it — but verify with monitoring.&lt;/p&gt;
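&lt;p&gt;For the monitoring piece, a small scheduled script can raise a syslog alert whenever a health check reports failure. This is a sketch — the script path is arbitrary and the exact &lt;code&gt;show wan-load-balance&lt;/code&gt; wording varies by VyOS version, so adjust the grep pattern to your output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/vbash
# /config/scripts/wan-alert.sh (example path)
source /opt/vyatta/etc/functions/script-template

# Flag any interface the health checker currently considers failed
if run show wan-load-balance | grep -qi &apos;fail&apos;; then
    logger -p user.warning &quot;Multi-WAN: a WAN interface is failing health checks&quot;
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule it with &lt;code&gt;task-scheduler&lt;/code&gt; the same way as the health-check script, and have your log pipeline alert on the message.&lt;/p&gt;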
</content:encoded><category>vyos</category><category>ha</category><category>networking</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>IPsec on VyOS: Site-to-Site Tunnels That Survive Reality</title><link>https://ashimov.com/posts/vyos-ipsec/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-ipsec/</guid><description>Configuring reliable IPsec site-to-site VPNs on VyOS. Covers IKEv2 setup, NAT traversal, dead peer detection, rekeying, and systematic debugging when things go wrong.</description><pubDate>Fri, 17 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;IPsec has a reputation for being complex and fragile. There&apos;s some truth to that — it has more moving parts than WireGuard, more states to manage, more things that can go wrong. But IPsec is also universal. It works with nearly any vendor&apos;s equipment and is often required for corporate connectivity.&lt;/p&gt;
&lt;p&gt;The key to reliable IPsec: understand the timers and states. When an IPsec tunnel fails, it&apos;s almost always timer mismatch or phase state issues. This guide covers practical IPsec configuration and the debugging skills to diagnose problems.&lt;/p&gt;
&lt;h2&gt;IPsec Fundamentals&lt;/h2&gt;
&lt;p&gt;IPsec has two phases:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IKE Phase 1 (IKE SA)&lt;/strong&gt;: Negotiate encryption, authenticate peers, establish secure channel for Phase 2 negotiations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IKE Phase 2 (IPsec SA / Child SA)&lt;/strong&gt;: Negotiate the actual tunnel parameters, establish encryption for user traffic.&lt;/p&gt;
&lt;p&gt;Both phases have lifetimes. When they expire, rekeying occurs. Mismatched timers between peers cause tunnels to drop.&lt;/p&gt;
&lt;h2&gt;Site-to-Site Configuration: IKEv2&lt;/h2&gt;
&lt;p&gt;IKEv2 is preferred over IKEv1 — better NAT traversal, faster failover, simpler configuration. Use IKEv1 only when the peer doesn&apos;t support IKEv2.&lt;/p&gt;
&lt;h3&gt;Site A Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# IKE group (Phase 1 parameters)
set vpn ipsec ike-group IKE-SITE-B close-action &apos;none&apos;
set vpn ipsec ike-group IKE-SITE-B dead-peer-detection action &apos;restart&apos;
set vpn ipsec ike-group IKE-SITE-B dead-peer-detection interval &apos;30&apos;
set vpn ipsec ike-group IKE-SITE-B dead-peer-detection timeout &apos;120&apos;
set vpn ipsec ike-group IKE-SITE-B key-exchange &apos;ikev2&apos;
set vpn ipsec ike-group IKE-SITE-B lifetime &apos;28800&apos;
set vpn ipsec ike-group IKE-SITE-B proposal 1 dh-group &apos;14&apos;
set vpn ipsec ike-group IKE-SITE-B proposal 1 encryption &apos;aes256&apos;
set vpn ipsec ike-group IKE-SITE-B proposal 1 hash &apos;sha256&apos;

# ESP group (Phase 2 parameters)
set vpn ipsec esp-group ESP-SITE-B lifetime &apos;3600&apos;
set vpn ipsec esp-group ESP-SITE-B pfs &apos;dh-group14&apos;
set vpn ipsec esp-group ESP-SITE-B proposal 1 encryption &apos;aes256&apos;
set vpn ipsec esp-group ESP-SITE-B proposal 1 hash &apos;sha256&apos;

# Interface binding
set vpn ipsec interface &apos;eth0&apos;

# Site-to-Site connection
set vpn ipsec site-to-site peer SITE-B authentication local-id &apos;site-a@example.com&apos;
set vpn ipsec site-to-site peer SITE-B authentication mode &apos;pre-shared-secret&apos;
set vpn ipsec site-to-site peer SITE-B authentication pre-shared-secret &apos;YourVeryStrongPSKHere123!&apos;
set vpn ipsec site-to-site peer SITE-B authentication remote-id &apos;site-b@example.com&apos;
set vpn ipsec site-to-site peer SITE-B connection-type &apos;initiate&apos;
set vpn ipsec site-to-site peer SITE-B default-esp-group &apos;ESP-SITE-B&apos;
set vpn ipsec site-to-site peer SITE-B ike-group &apos;IKE-SITE-B&apos;
set vpn ipsec site-to-site peer SITE-B local-address &apos;203.0.113.1&apos;
set vpn ipsec site-to-site peer SITE-B remote-address &apos;198.51.100.1&apos;

# Traffic selectors (what traffic goes through the tunnel)
set vpn ipsec site-to-site peer SITE-B tunnel 1 local prefix &apos;10.1.0.0/24&apos;
set vpn ipsec site-to-site peer SITE-B tunnel 1 remote prefix &apos;10.2.0.0/24&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Site B Configuration&lt;/h3&gt;
&lt;p&gt;Mirror configuration with swapped addresses and IDs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# IKE group - MUST MATCH Site A
set vpn ipsec ike-group IKE-SITE-A close-action &apos;none&apos;
set vpn ipsec ike-group IKE-SITE-A dead-peer-detection action &apos;restart&apos;
set vpn ipsec ike-group IKE-SITE-A dead-peer-detection interval &apos;30&apos;
set vpn ipsec ike-group IKE-SITE-A dead-peer-detection timeout &apos;120&apos;
set vpn ipsec ike-group IKE-SITE-A key-exchange &apos;ikev2&apos;
set vpn ipsec ike-group IKE-SITE-A lifetime &apos;28800&apos;
set vpn ipsec ike-group IKE-SITE-A proposal 1 dh-group &apos;14&apos;
set vpn ipsec ike-group IKE-SITE-A proposal 1 encryption &apos;aes256&apos;
set vpn ipsec ike-group IKE-SITE-A proposal 1 hash &apos;sha256&apos;

# ESP group - MUST MATCH Site A
set vpn ipsec esp-group ESP-SITE-A lifetime &apos;3600&apos;
set vpn ipsec esp-group ESP-SITE-A pfs &apos;dh-group14&apos;
set vpn ipsec esp-group ESP-SITE-A proposal 1 encryption &apos;aes256&apos;
set vpn ipsec esp-group ESP-SITE-A proposal 1 hash &apos;sha256&apos;

set vpn ipsec interface &apos;eth0&apos;

set vpn ipsec site-to-site peer SITE-A authentication local-id &apos;site-b@example.com&apos;
set vpn ipsec site-to-site peer SITE-A authentication mode &apos;pre-shared-secret&apos;
set vpn ipsec site-to-site peer SITE-A authentication pre-shared-secret &apos;YourVeryStrongPSKHere123!&apos;
set vpn ipsec site-to-site peer SITE-A authentication remote-id &apos;site-a@example.com&apos;
set vpn ipsec site-to-site peer SITE-A connection-type &apos;initiate&apos;
set vpn ipsec site-to-site peer SITE-A default-esp-group &apos;ESP-SITE-A&apos;
set vpn ipsec site-to-site peer SITE-A ike-group &apos;IKE-SITE-A&apos;
set vpn ipsec site-to-site peer SITE-A local-address &apos;198.51.100.1&apos;
set vpn ipsec site-to-site peer SITE-A remote-address &apos;203.0.113.1&apos;

set vpn ipsec site-to-site peer SITE-A tunnel 1 local prefix &apos;10.2.0.0/24&apos;
set vpn ipsec site-to-site peer SITE-A tunnel 1 remote prefix &apos;10.1.0.0/24&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Critical: Parameter Matching&lt;/h2&gt;
&lt;p&gt;Both peers MUST have identical:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Key exchange version (ikev2)&lt;/li&gt;
&lt;li&gt;IKE lifetime&lt;/li&gt;
&lt;li&gt;ESP lifetime&lt;/li&gt;
&lt;li&gt;DH group&lt;/li&gt;
&lt;li&gt;Encryption algorithm&lt;/li&gt;
&lt;li&gt;Hash algorithm&lt;/li&gt;
&lt;li&gt;PFS settings&lt;/li&gt;
&lt;li&gt;Traffic selectors (swapped local/remote)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Mismatch in any of these = tunnel won&apos;t establish or will randomly fail.&lt;/p&gt;
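&lt;p&gt;When in doubt about what each side is actually offering, the strongSwan backend that VyOS uses can print the loaded connection with its proposals, ready to compare line-by-line against the peer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo swanctl --list-conns
&lt;/code&gt;&lt;/pre&gt;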
&lt;h2&gt;NAT Traversal (NAT-T)&lt;/h2&gt;
&lt;p&gt;When either peer is behind NAT, IPsec encapsulates packets in UDP 4500 instead of raw ESP (protocol 50). VyOS enables NAT-T automatically, but you may need firewall rules:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set firewall ipv4 name WAN-LOCAL rule 70 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 70 protocol &apos;udp&apos;
set firewall ipv4 name WAN-LOCAL rule 70 destination port &apos;500&apos;

set firewall ipv4 name WAN-LOCAL rule 71 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 71 protocol &apos;udp&apos;
set firewall ipv4 name WAN-LOCAL rule 71 destination port &apos;4500&apos;

# Also allow ESP protocol for non-NAT scenarios
set firewall ipv4 name WAN-LOCAL rule 72 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 72 protocol &apos;esp&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Dead Peer Detection (DPD)&lt;/h2&gt;
&lt;p&gt;DPD detects when the remote peer becomes unreachable. Without it, your router won&apos;t know the tunnel is dead until traffic fails.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set vpn ipsec ike-group IKE-SITE-B dead-peer-detection action &apos;restart&apos;
set vpn ipsec ike-group IKE-SITE-B dead-peer-detection interval &apos;30&apos;
set vpn ipsec ike-group IKE-SITE-B dead-peer-detection timeout &apos;120&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;interval&lt;/strong&gt;: Send DPD request every 30 seconds&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;timeout&lt;/strong&gt;: Declare peer dead after 120 seconds without response&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;action&lt;/strong&gt;: restart = try to re-establish, clear = delete SA, none = do nothing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For stable connections, &lt;code&gt;restart&lt;/code&gt; is usually best. It automatically recovers from transient outages.&lt;/p&gt;
&lt;h2&gt;Rekeying: The Lifetime Dance&lt;/h2&gt;
&lt;p&gt;IKE and ESP SAs have separate lifetimes. When they expire, rekeying occurs. Problems happen when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Both peers try to rekey simultaneously&lt;/li&gt;
&lt;li&gt;Timers differ slightly, causing race conditions&lt;/li&gt;
&lt;li&gt;Rekey fails and tunnel drops&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Best practices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Make lifetimes identical on both peers&lt;/li&gt;
&lt;li&gt;IKE lifetime should be longer than ESP lifetime&lt;/li&gt;
&lt;li&gt;Common values: IKE 28800s (8h), ESP 3600s (1h)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# Site A and Site B MUST match
set vpn ipsec ike-group IKE-SITE-B lifetime &apos;28800&apos;
set vpn ipsec esp-group ESP-SITE-B lifetime &apos;3600&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If rekeying causes drops, increase lifetimes. If security policy requires short lifetimes, ensure DPD is configured to recover quickly.&lt;/p&gt;
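&lt;p&gt;To watch rekeying as it happens rather than inferring it from drops, follow the strongSwan logs and inspect the remaining SA lifetimes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Live view of rekey negotiations
sudo journalctl -u strongswan -f | grep -i rekey

# Installed SAs with remaining lifetimes
sudo swanctl --list-sas
&lt;/code&gt;&lt;/pre&gt;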
&lt;h2&gt;Debugging IPsec&lt;/h2&gt;
&lt;p&gt;When IPsec fails, check in order:&lt;/p&gt;
&lt;h3&gt;1. Check SA Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show vpn ipsec sa
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shows established Security Associations. You want to see both IKE SA and Child SA (ESP).&lt;/p&gt;
&lt;h3&gt;2. Check Connection State&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show vpn ipsec connections
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shows connection status: ESTABLISHED, CONNECTING, INSTALLED, etc.&lt;/p&gt;
&lt;h3&gt;3. Check Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show log | match ipsec
# or
sudo journalctl -u strongswan -f
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Common errors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;NO_PROPOSAL_CHOSEN&lt;/code&gt;: Algorithm mismatch&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AUTHENTICATION_FAILED&lt;/code&gt;: Wrong PSK or ID mismatch&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TS_UNACCEPTABLE&lt;/code&gt;: Traffic selector mismatch&lt;/li&gt;
&lt;li&gt;&lt;code&gt;INVALID_IKE_SPI&lt;/code&gt;: Stale SA, restart connection&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Reset the Connection&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;reset vpn ipsec site-to-site peer SITE-B
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clears SAs and re-initiates. Often fixes &quot;stuck&quot; tunnels.&lt;/p&gt;
&lt;h3&gt;5. Verify Traffic Selectors&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show vpn ipsec sa detail
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shows exactly what traffic selectors are installed. Mismatch here = traffic bypasses tunnel.&lt;/p&gt;
&lt;h3&gt;Common Issues&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No SA established&lt;/td&gt;
&lt;td&gt;Firewall blocking 500/4500/ESP&lt;/td&gt;
&lt;td&gt;Open firewall ports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth failed&lt;/td&gt;
&lt;td&gt;PSK mismatch or wrong local/remote-id&lt;/td&gt;
&lt;td&gt;Verify both match exactly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connects then drops&lt;/td&gt;
&lt;td&gt;Timer mismatch or rekey failure&lt;/td&gt;
&lt;td&gt;Match lifetimes, check DPD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic doesn&apos;t flow&lt;/td&gt;
&lt;td&gt;Traffic selector mismatch&lt;/td&gt;
&lt;td&gt;Verify local/remote prefix match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works then stops&lt;/td&gt;
&lt;td&gt;NAT timeout (if behind NAT)&lt;/td&gt;
&lt;td&gt;Ensure NAT-T is working&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Debug Commands Summary&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show vpn ipsec sa              # SA status
show vpn ipsec connections     # Connection state
show vpn ipsec sa detail       # Detailed SA info including traffic selectors
show vpn ipsec status          # Overall IPsec status
reset vpn ipsec site-to-site peer &amp;lt;name&amp;gt;  # Reset specific peer
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Route-Based vs Policy-Based IPsec&lt;/h2&gt;
&lt;p&gt;VyOS supports both:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Policy-based&lt;/strong&gt; (shown above): Traffic selectors define what goes through tunnel. Configured via &lt;code&gt;tunnel X local/remote prefix&lt;/code&gt;. Simpler, but less flexible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Route-based&lt;/strong&gt;: Virtual tunnel interface (vti), routes determine what traffic enters. More flexible, better for dynamic routing.&lt;/p&gt;
&lt;h3&gt;Route-Based Example&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# Virtual tunnel interface
set interfaces vti vti0 address &apos;10.255.0.1/30&apos;
set interfaces vti vti0 description &apos;IPsec to Site B&apos;

# IPsec connection using vti
set vpn ipsec site-to-site peer SITE-B tunnel 1 local prefix &apos;0.0.0.0/0&apos;
set vpn ipsec site-to-site peer SITE-B tunnel 1 remote prefix &apos;0.0.0.0/0&apos;
set vpn ipsec site-to-site peer SITE-B vti bind &apos;vti0&apos;

# Route traffic to remote network through vti
set protocols static route 10.2.0.0/24 interface vti0

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Route-based is better when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need dynamic routing (OSPF/BGP over IPsec)&lt;/li&gt;
&lt;li&gt;Multiple networks with complex routing&lt;/li&gt;
&lt;li&gt;You want firewall rules on the tunnel interface&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Firewall for IPsec Traffic&lt;/h2&gt;
&lt;p&gt;Traffic through IPsec still needs firewall consideration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# If using VTI, apply firewall via forward filter
# Define what the remote site can access
set firewall ipv4 name IPSEC-IN default-action &apos;drop&apos;
set firewall ipv4 name IPSEC-IN rule 10 action &apos;accept&apos;
set firewall ipv4 name IPSEC-IN rule 10 state &apos;established&apos;
set firewall ipv4 name IPSEC-IN rule 10 state &apos;related&apos;
set firewall ipv4 name IPSEC-IN rule 20 action &apos;accept&apos;
set firewall ipv4 name IPSEC-IN rule 20 destination address &apos;10.1.0.0/24&apos;

# Apply to forward filter
set firewall ipv4 forward filter rule 30 inbound-interface name &apos;vti0&apos;
set firewall ipv4 forward filter rule 30 action &apos;jump&apos;
set firewall ipv4 forward filter rule 30 jump-target &apos;IPSEC-IN&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Production Checklist&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] IKE and ESP parameters match on both peers&lt;/li&gt;
&lt;li&gt;[ ] Lifetimes match exactly&lt;/li&gt;
&lt;li&gt;[ ] DPD configured with appropriate action&lt;/li&gt;
&lt;li&gt;[ ] Firewall allows UDP 500, 4500, and ESP&lt;/li&gt;
&lt;li&gt;[ ] Traffic selectors match (local/remote swapped)&lt;/li&gt;
&lt;li&gt;[ ] PSK is strong (20+ random characters)&lt;/li&gt;
&lt;li&gt;[ ] Local and remote IDs match configuration&lt;/li&gt;
&lt;li&gt;[ ] Monitoring/alerting for tunnel status&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;IPsec reliability comes down to discipline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Timer discipline&lt;/strong&gt;: Both peers must have identical lifetimes. IKE &amp;gt; ESP lifetime. Configure DPD to detect and recover from failures.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SA verification&lt;/strong&gt;: Regularly check &lt;code&gt;show vpn ipsec sa&lt;/code&gt;. If IKE SA exists but Child SA doesn&apos;t, you have a Phase 2 problem. If neither exists, Phase 1 isn&apos;t completing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Methodical debugging&lt;/strong&gt;: Check SA status → check logs for specific error → verify matching configuration → reset and try again.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
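&lt;p&gt;Those three steps map to a handful of operational commands (names vary slightly between VyOS releases, so treat this as a sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1. Where does the state machine stop?
show vpn ipsec sa

# 2. Find the specific negotiation error
show log vpn

# 3. After fixing the mismatch, re-initiate
restart vpn
&lt;/code&gt;&lt;/pre&gt;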
&lt;p&gt;IPsec is more complex than WireGuard, but it&apos;s deterministic. When you understand the state machine (IKE SA → Child SA → traffic flows), you can diagnose any issue by figuring out where in that sequence things break.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>security</category><category>vpn</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>WireGuard on VyOS: Production Configuration for Site-to-Site and Road Warriors</title><link>https://ashimov.com/posts/vyos-wireguard/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-wireguard/</guid><description>Complete WireGuard setup on VyOS covering site-to-site tunnels, mobile clients, kill switches, split vs full tunnel, and the two things that make WireGuard stable: MTU and routing policy.</description><pubDate>Tue, 14 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;WireGuard is simple by design — a few keys, some IP addresses, and you&apos;re connected. But &quot;connected&quot; and &quot;production-ready&quot; are different things. Intermittent disconnections, mysterious packet loss, traffic leaking outside the tunnel — these happen when you skip the fundamentals.&lt;/p&gt;
&lt;p&gt;The two things that make WireGuard stable: &lt;strong&gt;correct MTU&lt;/strong&gt; and &lt;strong&gt;clear routing policy&lt;/strong&gt;. Get these right, and WireGuard becomes boring (in the best way).&lt;/p&gt;
&lt;h2&gt;WireGuard Basics on VyOS&lt;/h2&gt;
&lt;p&gt;VyOS treats WireGuard as a first-class interface. Configuration is straightforward:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Generate keypair (or use existing)
run generate pki wireguard key-pair

# Save the output:
# Private-key: &amp;lt;base64 private key&amp;gt;
# Public-key: &amp;lt;base64 public key&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Store the private key securely. The public key is what you share with peers.&lt;/p&gt;
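&lt;p&gt;For peers that aren&apos;t VyOS boxes, the standard &lt;code&gt;wg&lt;/code&gt; tool generates the same kind of keypair (sketch for a Linux client; the file names are arbitrary):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Restrict permissions before writing key material
umask 077

# Generate a private key and derive its public key in one pass
wg genkey | tee client-private.key | wg pubkey &amp;gt; client-public.key
&lt;/code&gt;&lt;/pre&gt;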
&lt;h2&gt;Site-to-Site Configuration&lt;/h2&gt;
&lt;p&gt;Two VyOS routers connecting their networks. Site A (10.1.0.0/24) connects to Site B (10.2.0.0/24).&lt;/p&gt;
&lt;h3&gt;Site A Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

# WireGuard interface
set interfaces wireguard wg0 address &apos;10.255.255.1/30&apos;
set interfaces wireguard wg0 description &apos;Site-to-Site to Site B&apos;
set interfaces wireguard wg0 port &apos;51820&apos;
set interfaces wireguard wg0 private-key &apos;&amp;lt;site-a-private-key&amp;gt;&apos;

# Peer configuration (Site B)
set interfaces wireguard wg0 peer site-b public-key &apos;&amp;lt;site-b-public-key&amp;gt;&apos;
set interfaces wireguard wg0 peer site-b allowed-ips &apos;10.255.255.2/32&apos;
set interfaces wireguard wg0 peer site-b allowed-ips &apos;10.2.0.0/24&apos;
set interfaces wireguard wg0 peer site-b endpoint &apos;site-b.example.com:51820&apos;
set interfaces wireguard wg0 peer site-b persistent-keepalive &apos;25&apos;

# Route to Site B&apos;s network
set protocols static route 10.2.0.0/24 interface wg0

# Firewall: allow WireGuard traffic
set firewall ipv4 name WAN-LOCAL rule 60 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 60 protocol &apos;udp&apos;
set firewall ipv4 name WAN-LOCAL rule 60 destination port &apos;51820&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Site B Configuration&lt;/h3&gt;
&lt;p&gt;Mirror configuration with swapped keys and addresses:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

set interfaces wireguard wg0 address &apos;10.255.255.2/30&apos;
set interfaces wireguard wg0 description &apos;Site-to-Site to Site A&apos;
set interfaces wireguard wg0 port &apos;51820&apos;
set interfaces wireguard wg0 private-key &apos;&amp;lt;site-b-private-key&amp;gt;&apos;

set interfaces wireguard wg0 peer site-a public-key &apos;&amp;lt;site-a-public-key&amp;gt;&apos;
set interfaces wireguard wg0 peer site-a allowed-ips &apos;10.255.255.1/32&apos;
set interfaces wireguard wg0 peer site-a allowed-ips &apos;10.1.0.0/24&apos;
set interfaces wireguard wg0 peer site-a endpoint &apos;site-a.example.com:51820&apos;
set interfaces wireguard wg0 peer site-a persistent-keepalive &apos;25&apos;

set protocols static route 10.1.0.0/24 interface wg0

set firewall ipv4 name WAN-LOCAL rule 60 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 60 protocol &apos;udp&apos;
set firewall ipv4 name WAN-LOCAL rule 60 destination port &apos;51820&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check interface status
show interfaces wireguard wg0

# Check peer status
show wireguard peers

# Test connectivity
ping 10.255.255.2    # Tunnel endpoint
ping 10.2.0.1        # Remote LAN (from Site A)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Road Warrior Configuration (Mobile Clients)&lt;/h2&gt;
&lt;p&gt;Roaming clients that connect from anywhere. The VyOS router acts as the VPN server.&lt;/p&gt;
&lt;h3&gt;VyOS Server Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure

set interfaces wireguard wg0 address &apos;10.10.0.1/24&apos;
set interfaces wireguard wg0 description &apos;Road Warrior VPN&apos;
set interfaces wireguard wg0 port &apos;51820&apos;
set interfaces wireguard wg0 private-key &apos;&amp;lt;server-private-key&amp;gt;&apos;

# Client 1 (laptop)
set interfaces wireguard wg0 peer laptop public-key &apos;&amp;lt;laptop-public-key&amp;gt;&apos;
set interfaces wireguard wg0 peer laptop allowed-ips &apos;10.10.0.10/32&apos;

# Client 2 (phone)
set interfaces wireguard wg0 peer phone public-key &apos;&amp;lt;phone-public-key&amp;gt;&apos;
set interfaces wireguard wg0 peer phone allowed-ips &apos;10.10.0.11/32&apos;

# Allow VPN clients to access LAN and internet
set firewall group network-group VPN-CLIENTS network &apos;10.10.0.0/24&apos;

# NAT for VPN clients going to internet
set nat source rule 200 outbound-interface name &apos;eth0&apos;
set nat source rule 200 source address &apos;10.10.0.0/24&apos;
set nat source rule 200 translation address &apos;masquerade&apos;

# Firewall: allow WireGuard
set firewall ipv4 name WAN-LOCAL rule 60 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 60 protocol &apos;udp&apos;
set firewall ipv4 name WAN-LOCAL rule 60 destination port &apos;51820&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Client Configuration (wg0.conf)&lt;/h3&gt;
&lt;p&gt;For laptop/phone using standard WireGuard client:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Interface]
PrivateKey = &amp;lt;laptop-private-key&amp;gt;
Address = 10.10.0.10/32
DNS = 10.0.0.1

[Peer]
PublicKey = &amp;lt;server-public-key&amp;gt;
AllowedIPs = 0.0.0.0/0
Endpoint = vpn.example.com:51820
PersistentKeepalive = 25
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AllowedIPs = 0.0.0.0/0&lt;/code&gt; means full tunnel — all traffic through VPN.&lt;/p&gt;
&lt;h2&gt;Split Tunnel vs Full Tunnel&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Full tunnel&lt;/strong&gt;: All client traffic goes through VPN&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AllowedIPs = 0.0.0.0/0, ::/0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Pros: All traffic protected, consistent IP&lt;/li&gt;
&lt;li&gt;Cons: Higher latency, more bandwidth on VPN server&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Split tunnel&lt;/strong&gt;: Only specific traffic through VPN&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AllowedIPs = 10.0.0.0/8, 192.168.0.0/16&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Pros: Better performance, less server load&lt;/li&gt;
&lt;li&gt;Cons: Some traffic exposed, DNS leaks possible&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Split Tunnel Client Example&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[Interface]
PrivateKey = &amp;lt;private-key&amp;gt;
Address = 10.10.0.10/32

[Peer]
PublicKey = &amp;lt;server-public-key&amp;gt;
AllowedIPs = 10.0.0.0/8, 10.10.0.0/24
Endpoint = vpn.example.com:51820
PersistentKeepalive = 25
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only traffic to 10.x.x.x goes through VPN. Everything else uses local internet.&lt;/p&gt;
&lt;h2&gt;The MTU Problem&lt;/h2&gt;
&lt;p&gt;WireGuard encapsulates packets, adding overhead. If your MTU is too high, packets get fragmented or dropped. Symptoms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SSH works, HTTPS fails&lt;/li&gt;
&lt;li&gt;Small requests work, large transfers hang&lt;/li&gt;
&lt;li&gt;Intermittent &quot;connection reset&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Calculate Correct MTU&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Standard Ethernet MTU: 1500&lt;/li&gt;
&lt;li&gt;WireGuard overhead: 60 bytes (IPv4 outer) or 80 bytes (IPv6 outer)&lt;/li&gt;
&lt;li&gt;Safe WireGuard MTU: &lt;strong&gt;1420&lt;/strong&gt; (IPv4) or &lt;strong&gt;1400&lt;/strong&gt; (IPv6)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;configure

set interfaces wireguard wg0 mtu &apos;1420&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test MTU&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# From client, test path MTU to a host through the tunnel
ping -M do -s 1392 10.0.0.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;-M do&lt;/code&gt; prevents fragmentation. &lt;code&gt;-s 1392&lt;/code&gt; = 1392 payload + 28 header = 1420. If it works, MTU is correct. If not, lower it.&lt;/p&gt;
&lt;p&gt;For connections through multiple NATs or tunnels, you might need 1380 or even lower.&lt;/p&gt;
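&lt;p&gt;If you can&apos;t control MTU on every client, MSS clamping on the router caps TCP segment size regardless of what clients believe the path looks like. On VyOS 1.4 this is an interface option (a sketch; verify the node exists on your release, and use tunnel MTU minus 40 for IPv4 TCP):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Clamp TCP MSS on the tunnel interface (1420 - 40 = 1380)
set interfaces wireguard wg0 ip adjust-mss &apos;1380&apos;

commit
&lt;/code&gt;&lt;/pre&gt;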
&lt;h2&gt;Kill Switch: Preventing Leaks&lt;/h2&gt;
&lt;p&gt;A kill switch ensures traffic can&apos;t leak if the VPN disconnects. On VyOS server, you control routing. On clients, configure the client app or OS firewall.&lt;/p&gt;
&lt;h3&gt;Server-Side: Ensure Clients Use VPN&lt;/h3&gt;
&lt;p&gt;If clients should only access internet through VPN:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Already covered by NAT rule - VPN clients are masqueraded
# No direct route from VPN subnet to internet except through NAT
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Client-Side Kill Switch (Linux)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Allow only WireGuard and local traffic
iptables -A OUTPUT -o wg0 -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -p udp --dport 51820 -j ACCEPT
iptables -A OUTPUT -j DROP
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or use WireGuard&apos;s PostUp/PostDown scripts in the config.&lt;/p&gt;
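&lt;p&gt;For wg-quick clients, the wg-quick man page ships a kill switch built on exactly this idea: PostUp installs a reject rule for anything not destined to the tunnel or a local address, PreDown removes it. It relies on the fwmark wg-quick sets for full-tunnel (0.0.0.0/0) configs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Interface]
PrivateKey = &amp;lt;private-key&amp;gt;
Address = 10.10.0.10/32
PostUp = iptables -I OUTPUT ! -o %i -m mark ! --mark $(wg show %i fwmark) -m addrtype ! --dst-type LOCAL -j REJECT
PreDown = iptables -D OUTPUT ! -o %i -m mark ! --mark $(wg show %i fwmark) -m addrtype ! --dst-type LOCAL -j REJECT
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;%i&lt;/code&gt; expands to the interface name at runtime.&lt;/p&gt;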
&lt;h2&gt;Persistent Keepalive: When to Use It&lt;/h2&gt;
&lt;p&gt;WireGuard is silent when idle — if there&apos;s nothing to send, nothing goes on the wire. This causes problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;NAT mappings expire (typically 30-60 seconds)&lt;/li&gt;
&lt;li&gt;Stateful firewalls drop the &quot;connection&quot;&lt;/li&gt;
&lt;li&gt;You can&apos;t initiate connections TO the client&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;persistent-keepalive &apos;25&apos;&lt;/code&gt; sends a keepalive every 25 seconds, keeping NAT/firewall state alive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use it when&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Client is behind NAT&lt;/li&gt;
&lt;li&gt;Either side has stateful firewall&lt;/li&gt;
&lt;li&gt;You need to reach the client from the server&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Skip it when&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Both sides have static public IPs&lt;/li&gt;
&lt;li&gt;No NAT involved&lt;/li&gt;
&lt;li&gt;Saving minimal bandwidth matters&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Debugging WireGuard&lt;/h2&gt;
&lt;h3&gt;Check Interface Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show interfaces wireguard wg0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Should show UP state and assigned address.&lt;/p&gt;
&lt;h3&gt;Check Peer Handshakes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show wireguard peers
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shows last handshake time. If &quot;never&quot; or very old, tunnel isn&apos;t working.&lt;/p&gt;
&lt;h3&gt;Check Keys Match&lt;/h3&gt;
&lt;p&gt;Most common issue: public/private key mismatch. Verify:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Server has client&apos;s public key&lt;/li&gt;
&lt;li&gt;Client has server&apos;s public key&lt;/li&gt;
&lt;li&gt;No copy-paste errors (check for trailing spaces)&lt;/li&gt;
&lt;/ul&gt;
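&lt;p&gt;Rather than eyeballing base64 strings, prove the pairing: deriving the public key from the private key in use must reproduce exactly what the other side has configured (sketch; &lt;code&gt;client-private.key&lt;/code&gt; is a hypothetical file holding the client&apos;s private key):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On the client: derive the public key from the private key actually in use
wg pubkey &amp;lt; client-private.key

# On VyOS: what the server thinks the client&apos;s public key is
show configuration commands | match &apos;peer laptop public-key&apos;
&lt;/code&gt;&lt;/pre&gt;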
&lt;h3&gt;Check Firewall&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show firewall ipv4 name WAN-LOCAL
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ensure UDP 51820 is allowed inbound.&lt;/p&gt;
&lt;h3&gt;Check Routing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show ip route
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Routes through wg0 should exist for peer&apos;s allowed-ips.&lt;/p&gt;
&lt;h3&gt;Monitor Traffic&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;sudo tcpdump -i wg0 -n
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Should see traffic when peers communicate.&lt;/p&gt;
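&lt;p&gt;Also watch the encrypted side. If cleartext enters wg0 but no UDP 51820 leaves the WAN, the problem is local routing or NAT; if encrypted packets leave but nothing comes back, suspect the path or the far end:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo tcpdump -i eth0 -n udp port 51820
&lt;/code&gt;&lt;/pre&gt;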
&lt;h3&gt;Common Issues&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No handshake&lt;/td&gt;
&lt;td&gt;Key mismatch or blocked port&lt;/td&gt;
&lt;td&gt;Verify keys, check firewall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handshake but no traffic&lt;/td&gt;
&lt;td&gt;Routing or allowed-ips wrong&lt;/td&gt;
&lt;td&gt;Check routes match allowed-ips&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works then dies&lt;/td&gt;
&lt;td&gt;NAT timeout&lt;/td&gt;
&lt;td&gt;Enable persistent-keepalive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large transfers fail&lt;/td&gt;
&lt;td&gt;MTU too high&lt;/td&gt;
&lt;td&gt;Lower MTU to 1420 or less&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One direction works&lt;/td&gt;
&lt;td&gt;Asymmetric allowed-ips&lt;/td&gt;
&lt;td&gt;Both sides need matching allowed-ips&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Production Checklist&lt;/h2&gt;
&lt;p&gt;Before calling it production-ready:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;[ ] MTU set correctly (1420 or tested value)&lt;/li&gt;
&lt;li&gt;[ ] Persistent keepalive enabled if behind NAT&lt;/li&gt;
&lt;li&gt;[ ] Firewall allows WireGuard port (UDP 51820)&lt;/li&gt;
&lt;li&gt;[ ] Routes exist for all allowed-ips&lt;/li&gt;
&lt;li&gt;[ ] Keys are backed up securely&lt;/li&gt;
&lt;li&gt;[ ] NAT configured if clients need internet access&lt;/li&gt;
&lt;li&gt;[ ] DNS configured for full-tunnel clients&lt;/li&gt;
&lt;li&gt;[ ] Kill switch configured if leak prevention needed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Complete Road Warrior Example&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# === WireGuard Interface ===
set interfaces wireguard wg0 address &apos;10.10.0.1/24&apos;
set interfaces wireguard wg0 mtu &apos;1420&apos;
set interfaces wireguard wg0 port &apos;51820&apos;
set interfaces wireguard wg0 private-key &apos;&amp;lt;server-private-key&amp;gt;&apos;

# === Peers ===
set interfaces wireguard wg0 peer laptop public-key &apos;&amp;lt;key&amp;gt;&apos;
set interfaces wireguard wg0 peer laptop allowed-ips &apos;10.10.0.10/32&apos;

set interfaces wireguard wg0 peer phone public-key &apos;&amp;lt;key&amp;gt;&apos;
set interfaces wireguard wg0 peer phone allowed-ips &apos;10.10.0.11/32&apos;

# === NAT for VPN clients ===
set nat source rule 200 outbound-interface name &apos;eth0&apos;
set nat source rule 200 source address &apos;10.10.0.0/24&apos;
set nat source rule 200 translation address &apos;masquerade&apos;

# === Firewall ===
set firewall ipv4 name WAN-LOCAL rule 60 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 60 protocol &apos;udp&apos;
set firewall ipv4 name WAN-LOCAL rule 60 destination port &apos;51820&apos;

# VPN clients to LAN (apply via forward filter)
set firewall ipv4 name VPN-TO-LAN default-action &apos;accept&apos;
set firewall ipv4 forward filter rule 50 inbound-interface name &apos;wg0&apos;
set firewall ipv4 forward filter rule 50 action &apos;jump&apos;
set firewall ipv4 forward filter rule 50 jump-target &apos;VPN-TO-LAN&apos;

# === DNS for VPN clients ===
set service dns forwarding listen-address &apos;10.10.0.1&apos;
set service dns forwarding allow-from &apos;10.10.0.0/24&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;WireGuard becomes stable after addressing two things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MTU&lt;/strong&gt;: Set it explicitly to 1420 or lower. Don&apos;t rely on automatic MTU discovery — it often fails through NAT and firewalls.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Routing policy&lt;/strong&gt;: Be explicit about what traffic goes where. &lt;code&gt;allowed-ips&lt;/code&gt; controls both routing AND cryptographic acceptance. If it&apos;s not in allowed-ips, it won&apos;t be accepted even if routed correctly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Everything else — keepalives, firewall rules, NAT — follows logically from the use case. But MTU and routing policy are where most WireGuard problems live. Fix those, and the rest falls into place.&lt;/p&gt;
</content:encoded><category>vyos</category><category>networking</category><category>security</category><category>vpn</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Policy-Based Routing on VyOS: Practical Patterns for Split Routing</title><link>https://ashimov.com/posts/vyos-pbr/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-pbr/</guid><description>How to route specific traffic through different gateways on VyOS. Covers routing by source, destination, domain, and application with real-world examples like split-tunnel VPN.</description><pubDate>Fri, 10 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Standard routing is simple: packets go to the destination via the best route in the table. But what if you need specific traffic to take a different path? Work traffic through the VPN, streaming through the ISP, certain devices always through a specific gateway?&lt;/p&gt;
&lt;p&gt;That&apos;s Policy-Based Routing (PBR). VyOS implements PBR through policy routes and routing tables. It sounds complex, but the pattern is simple: match the traffic, then route it to a specific table.&lt;/p&gt;
&lt;h2&gt;The PBR Mental Model&lt;/h2&gt;
&lt;p&gt;Two components work together:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Routing table&lt;/strong&gt;: Alternative routes for specific traffic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Policy route&lt;/strong&gt;: Match traffic and direct to the appropriate table&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;Packet arrives → Policy route matches → Routed via alternate table
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;VyOS 1.4 &lt;code&gt;policy route&lt;/code&gt; can match traffic directly (by source, destination, protocol, etc.) and route it to a specific table — no firewall marks needed for most cases.&lt;/p&gt;
&lt;h2&gt;Scenario 1: Route Specific Subnet Through VPN&lt;/h2&gt;
&lt;p&gt;Let&apos;s say you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Default internet via eth0 (ISP)&lt;/li&gt;
&lt;li&gt;WireGuard VPN on wg0&lt;/li&gt;
&lt;li&gt;Want 10.0.0.0/24 (work devices) to use VPN&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create a separate routing table for VPN traffic
set protocols static table 10 route 0.0.0.0/0 interface wg0

# Policy route: match source and set table
set policy route PBR rule 10 source address &apos;10.0.0.0/24&apos;
set policy route PBR rule 10 set table &apos;10&apos;

# Apply policy to LAN interface
set policy route PBR interface &apos;eth1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check routing table 10 exists
show ip route table 10

# From a work device, check public IP
# Should show VPN exit IP, not ISP IP
curl ifconfig.me
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Scenario 2: Route by Destination (Specific Sites Through VPN)&lt;/h2&gt;
&lt;p&gt;Route only certain destinations through VPN, everything else direct.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Routing table for VPN
set protocols static table 20 route 0.0.0.0/0 interface wg0

# Define destinations (IP ranges of services you want through VPN)
set firewall group network-group VPN-DESTINATIONS network &apos;203.0.113.0/24&apos;
set firewall group network-group VPN-DESTINATIONS network &apos;198.51.100.0/24&apos;

# Policy route: match destination and set table
set policy route PBR-DEST rule 10 destination group network-group &apos;VPN-DESTINATIONS&apos;
set policy route PBR-DEST rule 10 set table &apos;20&apos;
set policy route PBR-DEST interface &apos;eth1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Scenario 3: Route by Domain (Using DNS-Based Groups)&lt;/h2&gt;
&lt;p&gt;VyOS 1.4+ supports domain groups. Traffic to specific domains can be routed differently.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create domain group
set firewall group domain-group STREAMING domain &apos;netflix.com&apos;
set firewall group domain-group STREAMING domain &apos;nflxvideo.net&apos;
set firewall group domain-group STREAMING domain &apos;hulu.com&apos;

# Route streaming through ISP (not VPN) even if VPN is default
set protocols static table 30 route 0.0.0.0/0 next-hop 192.168.1.1

# Policy route: match domain group and set table
set policy route PBR-DOMAIN rule 10 destination group domain-group &apos;STREAMING&apos;
set policy route PBR-DOMAIN rule 10 set table &apos;30&apos;
set policy route PBR-DOMAIN interface &apos;eth1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Domain groups rely on DNS resolution. VyOS maintains a cache of IP addresses for the domains. This isn&apos;t perfect — CDNs change IPs, some services use many domains. But for common use cases, it works well.&lt;/p&gt;
&lt;h2&gt;Scenario 4: Combined Rules (Source + Destination)&lt;/h2&gt;
&lt;p&gt;Real-world setups often need combinations: &quot;Work devices accessing work servers go through VPN, but their general browsing goes direct.&quot;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Groups
set firewall group network-group WORK-DEVICES network &apos;10.0.0.0/24&apos;
set firewall group network-group WORK-SERVERS network &apos;10.100.0.0/16&apos;

# Table for VPN
set protocols static table 40 route 0.0.0.0/0 interface wg0

# Policy route: match source AND destination, set table
set policy route PBR-WORK rule 10 source group network-group &apos;WORK-DEVICES&apos;
set policy route PBR-WORK rule 10 destination group network-group &apos;WORK-SERVERS&apos;
set policy route PBR-WORK rule 10 set table &apos;40&apos;

# Everything else from work devices goes direct (no matching rule = main table)

set policy route PBR-WORK interface &apos;eth1&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Debugging PBR&lt;/h2&gt;
&lt;p&gt;When PBR doesn&apos;t work as expected, debug systematically:&lt;/p&gt;
&lt;h3&gt;1. Verify Policy Route is Matching&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check policy route statistics
show policy route statistics
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If rules show zero packets, the match criteria aren&apos;t hitting. Check the source/destination groups.&lt;/p&gt;
&lt;h3&gt;2. Verify the Routing Table Exists&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show ip route table 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Should show the route (e.g., default via wg0). If empty, the table wasn&apos;t configured correctly.&lt;/p&gt;
&lt;h3&gt;3. Verify Policy Route is Applied&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show policy route
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Confirms which interfaces have policy routing and what rules exist.&lt;/p&gt;
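&lt;p&gt;Under the hood, VyOS compiles &lt;code&gt;set table&lt;/code&gt; into packet marks plus kernel routing rules. Inspecting that state directly confirms the plumbing exists (diagnostic sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Rules created from the policy route config
sudo ip rule show

# Contents of the alternate table
sudo ip route show table 10
&lt;/code&gt;&lt;/pre&gt;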
&lt;h3&gt;4. Trace a Specific Packet&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# From VyOS, simulate the routing decision for a given source
sudo ip route get 8.8.8.8 from 10.0.0.50 iif eth1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shows which route the kernel would actually choose for that source, with policy rules applied. (Here 10.0.0.50 stands in for a host your policy should match.)&lt;/p&gt;
&lt;h3&gt;5. Check Actual Traffic Flow&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Monitor traffic on interfaces
sudo tcpdump -i wg0 -n host 8.8.8.8
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If traffic appears on the expected interface, PBR is working.&lt;/p&gt;
&lt;h3&gt;Common Issues&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traffic not matching&lt;/td&gt;
&lt;td&gt;Source/dest mismatch&lt;/td&gt;
&lt;td&gt;Verify group contents, check rule order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Matched but wrong route&lt;/td&gt;
&lt;td&gt;Table number mismatch&lt;/td&gt;
&lt;td&gt;Ensure table exists with correct routes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works then fails&lt;/td&gt;
&lt;td&gt;Gateway down in alternate table&lt;/td&gt;
&lt;td&gt;Add gateway monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS traffic bypasses PBR&lt;/td&gt;
&lt;td&gt;DNS resolves before routing&lt;/td&gt;
&lt;td&gt;Use domain groups or DNS on VPN&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Best Practices&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use meaningful table numbers&lt;/strong&gt;: 10 for VPN, 20 for backup ISP, etc. Document what each table is for.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Keep firewall groups organized&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set firewall group network-group VPN-CLIENTS description &apos;Devices that always use VPN&apos;
set firewall group network-group BYPASS-VPN description &apos;Devices that never use VPN&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Test each component separately&lt;/strong&gt;: First verify the routing table works (manually add a route and test), then verify policy rules are matching, then check traffic flows correctly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Have a fallback&lt;/strong&gt;: If the VPN goes down, policy-routed traffic will blackhole, because the alternate table&apos;s only route points at a dead interface. Consider adding:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Fallback route in VPN table
set protocols static table 10 route 0.0.0.0/0 next-hop 192.168.1.1 distance 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Lower distance = preferred. If wg0 route (default distance 1) fails, traffic falls back to ISP.&lt;/p&gt;
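&lt;p&gt;Drill the failover before relying on it: take the tunnel down deliberately and confirm the fallback route takes over (sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure
set interfaces wireguard wg0 disable
commit

# Table 10 should now resolve via 192.168.1.1
run show ip route table 10

# Restore the tunnel
delete interfaces wireguard wg0 disable
commit
&lt;/code&gt;&lt;/pre&gt;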
&lt;h2&gt;Complete Example: Split-Tunnel VPN&lt;/h2&gt;
&lt;p&gt;Here&apos;s a realistic full configuration. Certain devices always use VPN, streaming services bypass VPN, everything else goes direct.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# === Routing Tables ===
set protocols static table 10 route 0.0.0.0/0 interface wg0

# === Firewall Groups ===
set firewall group network-group VPN-CLIENTS network &apos;10.0.0.50/32&apos;
set firewall group network-group VPN-CLIENTS network &apos;10.0.0.51/32&apos;
set firewall group domain-group STREAMING domain &apos;netflix.com&apos;
set firewall group domain-group STREAMING domain &apos;nflxvideo.net&apos;

# === Policy Route Rules ===
# Rule order matters - exceptions first!

# Streaming from VPN clients goes direct (not through VPN)
set policy route PBR rule 5 source group network-group &apos;VPN-CLIENTS&apos;
set policy route PBR rule 5 destination group domain-group &apos;STREAMING&apos;
# No table set = uses main routing table

# VPN clients use VPN for everything else
set policy route PBR rule 10 source group network-group &apos;VPN-CLIENTS&apos;
set policy route PBR rule 10 set table &apos;10&apos;

# === Apply ===
set policy route PBR interface &apos;eth1&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rule order matters: rule 5 (streaming exception) is checked before rule 10 (VPN routing). Streaming traffic matches rule 5 with no table override, uses default routing. Everything else from VPN clients matches rule 10, uses VPN table.&lt;/p&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;PBR isn&apos;t magic incantations. It&apos;s two clear steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Define where traffic should go (routing tables)&lt;/li&gt;
&lt;li&gt;Define what traffic to affect (policy route rules with matching criteria)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When debugging, check each step independently. Are rules matching traffic? Does the table have the right routes? Is the policy applied to the correct interface?&lt;/p&gt;
&lt;p&gt;Clear criteria + systematic debugging = PBR that works reliably.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>networking</category><category>routing</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>IPv6 at Home: RA, DHCPv6, and Why Your Firewall Keeps Breaking It</title><link>https://ashimov.com/posts/vyos-ipv6/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-ipv6/</guid><description>Practical IPv6 configuration on VyOS for home networks. Covers Router Advertisements, DHCPv6, stateless vs stateful addressing, firewall rules, and debugging ND/RA issues.</description><pubDate>Tue, 07 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;IPv6 breaks differently than IPv4. There&apos;s no NAT hiding your mistakes, no single DHCP server controlling everything, and a whole new set of protocols (RA, ND, DHCPv6) that must work together. When IPv6 stops working, it&apos;s rarely &quot;magic&quot; — it&apos;s almost always Router Advertisements or firewall rules.&lt;/p&gt;
&lt;p&gt;This guide covers practical IPv6 deployment on VyOS, including the debugging steps that will save you hours of frustration.&lt;/p&gt;
&lt;h2&gt;IPv6 Addressing: What You Actually Get&lt;/h2&gt;
&lt;p&gt;Most ISPs provide one of these:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Static prefix&lt;/strong&gt;: You get a /48 or /56 that doesn&apos;t change&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DHCPv6-PD (Prefix Delegation)&lt;/strong&gt;: Router requests a prefix dynamically&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Single /64&lt;/strong&gt;: Bare minimum, limits your options&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For home use, DHCPv6-PD is most common. Let&apos;s configure for that scenario.&lt;/p&gt;
&lt;h2&gt;WAN Configuration: Getting Your Prefix&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;configure

# Request address via DHCPv6 for WAN interface
set interfaces ethernet eth0 ipv6 address autoconf

# Request delegated prefix for LAN
set interfaces ethernet eth0 dhcpv6-options pd 0 interface eth1 sla-id &apos;0&apos;
set interfaces ethernet eth0 dhcpv6-options pd 0 length &apos;56&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;sla-id&lt;/code&gt; lets you create multiple /64s from your delegated prefix. If you get a /56, you have 256 possible /64 subnets. sla-id &apos;0&apos; means use the first one on eth1.&lt;/p&gt;
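&lt;p&gt;With a /56 delegation you can hand a distinct /64 to each LAN segment by repeating the pd block with a different &lt;code&gt;sla-id&lt;/code&gt; (sketch, assuming a second LAN on a hypothetical eth2):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;set interfaces ethernet eth0 dhcpv6-options pd 0 interface eth2 sla-id &apos;1&apos;
&lt;/code&gt;&lt;/pre&gt;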
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show interfaces ethernet eth0
show interfaces ethernet eth1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;eth0 should have a global IPv6 address. eth1 should have an address from your delegated prefix (something like 2001:db8:1234::/64 depending on your ISP).&lt;/p&gt;
&lt;h2&gt;LAN Configuration: Router Advertisements&lt;/h2&gt;
&lt;p&gt;Unlike IPv4, where DHCP does everything, IPv6 clients learn about the network through Router Advertisements (RA). The router periodically multicasts &quot;I exist, here&apos;s the prefix, here&apos;s how to get addresses.&quot;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Enable router advertisements on LAN
set service router-advert interface eth1 prefix ::/64
set service router-advert interface eth1 name-server 2606:4700:4700::1111
set service router-advert interface eth1 name-server 2001:4860:4860::8888

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;::/64&lt;/code&gt; prefix means &quot;use whatever prefix is assigned to this interface.&quot; VyOS automatically advertises the correct prefix.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;: On a LAN client, check for IPv6 address.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Linux
ip -6 addr show

# macOS
ifconfig | grep inet6

# Windows
ipconfig | findstr IPv6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clients should have a global IPv6 address (not just fe80:: link-local).&lt;/p&gt;
&lt;h2&gt;SLAAC vs DHCPv6: Understanding the Options&lt;/h2&gt;
&lt;p&gt;Two ways for clients to get IPv6 addresses:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SLAAC (Stateless Address Autoconfiguration)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Client generates its own address from the prefix&lt;/li&gt;
&lt;li&gt;No server tracks who has what address&lt;/li&gt;
&lt;li&gt;Simple, but no central lease database&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;DHCPv6 (Stateful)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Server assigns specific addresses&lt;/li&gt;
&lt;li&gt;Tracks leases like IPv4 DHCP&lt;/li&gt;
&lt;li&gt;More control, more complexity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For home networks, SLAAC is usually sufficient. The RA configuration above uses SLAAC by default.&lt;/p&gt;
&lt;p&gt;If you need DHCPv6 (for address tracking or specific assignments):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Tell clients to also use DHCPv6 for addresses
set service router-advert interface eth1 managed-flag

# DHCPv6 server
set service dhcpv6-server shared-network-name LAN subnet 2001:db8:1234::/64 range 0 start 2001:db8:1234::100
set service dhcpv6-server shared-network-name LAN subnet 2001:db8:1234::/64 range 0 stop 2001:db8:1234::1ff
set service dhcpv6-server shared-network-name LAN subnet 2001:db8:1234::/64 subnet-id 1

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: Replace &lt;code&gt;2001:db8:1234::/64&lt;/code&gt; with your actual delegated prefix.&lt;/p&gt;
&lt;h2&gt;The Firewall Problem&lt;/h2&gt;
&lt;p&gt;Here&apos;s where most IPv6 setups break. IPv4 NAT accidentally provided security — nothing could reach internal hosts without explicit port forwards. IPv6 has no NAT (normally), so every device is directly addressable from the internet.&lt;/p&gt;
&lt;p&gt;You MUST have proper firewall rules.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# IPv6 firewall: WAN to LAN
set firewall ipv6 name WANv6-TO-LANv6 default-action &apos;drop&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 10 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 10 state &apos;established&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 10 state &apos;related&apos;

# ICMPv6 is REQUIRED for IPv6 to function
set firewall ipv6 name WANv6-TO-LANv6 rule 20 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 20 protocol &apos;ipv6-icmp&apos;

# IPv6 firewall: WAN to router (local)
set firewall ipv6 name WANv6-LOCAL default-action &apos;drop&apos;
set firewall ipv6 name WANv6-LOCAL rule 10 action &apos;accept&apos;
set firewall ipv6 name WANv6-LOCAL rule 10 state &apos;established&apos;
set firewall ipv6 name WANv6-LOCAL rule 10 state &apos;related&apos;
set firewall ipv6 name WANv6-LOCAL rule 20 action &apos;accept&apos;
set firewall ipv6 name WANv6-LOCAL rule 20 protocol &apos;ipv6-icmp&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 action &apos;accept&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 protocol &apos;udp&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 destination port &apos;546&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 source port &apos;547&apos;

# LAN to WAN: allow outbound
set firewall ipv6 name LANv6-TO-WANv6 default-action &apos;accept&apos;

# Apply to forward/input chains
set firewall ipv6 forward filter default-action &apos;accept&apos;
set firewall ipv6 forward filter rule 10 inbound-interface name &apos;eth0&apos;
set firewall ipv6 forward filter rule 10 action &apos;jump&apos;
set firewall ipv6 forward filter rule 10 jump-target &apos;WANv6-TO-LANv6&apos;

set firewall ipv6 input filter default-action &apos;drop&apos;
set firewall ipv6 input filter rule 10 inbound-interface name &apos;eth0&apos;
set firewall ipv6 input filter rule 10 action &apos;jump&apos;
set firewall ipv6 input filter rule 10 jump-target &apos;WANv6-LOCAL&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Critical: ICMPv6 Must Be Allowed&lt;/h3&gt;
&lt;p&gt;Unlike IPv4 where you could (unwisely) block all ICMP, IPv6 &lt;em&gt;requires&lt;/em&gt; ICMPv6 for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Neighbor Discovery (ND)&lt;/strong&gt;: IPv6&apos;s replacement for ARP&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Router Advertisements&lt;/strong&gt;: How clients find the gateway&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Path MTU Discovery&lt;/strong&gt;: Essential for connectivity&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Duplicate Address Detection&lt;/strong&gt;: Prevents IP conflicts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Blocking ICMPv6 = broken IPv6. Rule 20 in the firewall above allows all ICMPv6. You can be more restrictive:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# More restrictive ICMPv6 (still functional)
set firewall ipv6 name WANv6-TO-LANv6 rule 20 icmpv6 type &apos;echo-request&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 21 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 21 protocol &apos;ipv6-icmp&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 21 icmpv6 type &apos;destination-unreachable&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 22 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 22 protocol &apos;ipv6-icmp&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 22 icmpv6 type &apos;packet-too-big&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 23 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 23 protocol &apos;ipv6-icmp&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 23 icmpv6 type &apos;time-exceeded&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;NAT66: When You Actually Need It&lt;/h2&gt;
&lt;p&gt;Pure IPv6 doesn&apos;t need NAT. But sometimes you&apos;re stuck:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ISP only gives you a /64 and you need multiple subnets&lt;/li&gt;
&lt;li&gt;Privacy concerns about exposing internal addressing&lt;/li&gt;
&lt;li&gt;Translating between different IPv6 ranges&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;NAT66 (IPv6-to-IPv6 NAT) is available but should be a last resort:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# NAT66 - use only if absolutely necessary
set nat66 source rule 100 outbound-interface name &apos;eth0&apos;
set nat66 source rule 100 source prefix &apos;fd00::/64&apos;
set nat66 source rule 100 translation address &apos;masquerade&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This would NAT your ULA (fd00::/64) internal addresses to your public prefix. Again, avoid this if possible — it defeats IPv6&apos;s end-to-end connectivity benefits.&lt;/p&gt;
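&lt;p&gt;For completeness: the rule above assumes the LAN actually uses ULA addressing internally. A minimal sketch of that side (prefix and address illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Give the LAN a ULA prefix alongside (or instead of) the delegated one
set interfaces ethernet eth1 address &apos;fd00::1/64&apos;
set service router-advert interface eth1 prefix fd00::/64

commit
&lt;/code&gt;&lt;/pre&gt;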
&lt;h2&gt;Debugging IPv6 Issues&lt;/h2&gt;
&lt;p&gt;When IPv6 breaks, here&apos;s the diagnostic flow:&lt;/p&gt;
&lt;h3&gt;1. Check Interface Addressing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show interfaces
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both WAN and LAN need global IPv6 addresses (not just fe80:: link-local).&lt;/p&gt;
&lt;h3&gt;2. Verify Router Advertisements&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On VyOS
show ipv6 route

# On Linux client
rdisc6 eth0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If RA isn&apos;t working, clients won&apos;t get addresses or know the default gateway.&lt;/p&gt;
&lt;h3&gt;3. Check Neighbor Discovery&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On VyOS
show ipv6 neighbors

# On Linux client
ip -6 neigh show
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;ND is like ARP for IPv6. Missing entries mean L2 connectivity issues or firewall blocking.&lt;/p&gt;
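&lt;p&gt;To watch ND and RA traffic directly, a capture sketch on a Linux client (interface name illustrative; the &lt;code&gt;ip6[40]&lt;/code&gt; byte match assumes no extension headers, which is the normal case for ND):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Router Advertisements (ICMPv6 type 134)
sudo tcpdump -i eth0 -vv &apos;icmp6 and ip6[40] == 134&apos;

# Neighbor Solicitation/Advertisement (types 135/136)
sudo tcpdump -i eth0 &apos;icmp6 and (ip6[40] == 135 or ip6[40] == 136)&apos;
&lt;/code&gt;&lt;/pre&gt;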
&lt;h3&gt;4. Test Connectivity Layer by Layer&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# From VyOS: can we reach the internet?
ping 2600:: count 3

# From client: can we reach the gateway?
# (substitute the router&apos;s actual link-local, shown by &apos;ip -6 route show default&apos;)
ping6 fe80::1%eth0

# From client: can we reach the internet?
ping6 google.com
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5. Check Firewall Counters&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;show firewall ipv6 name WANv6-TO-LANv6
show firewall ipv6 name WANv6-LOCAL
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;High drop counts on specific rules indicate what&apos;s being blocked.&lt;/p&gt;
&lt;h3&gt;Common Issues and Fixes&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No global address on LAN client&lt;/td&gt;
&lt;td&gt;RA not working&lt;/td&gt;
&lt;td&gt;Check router-advert config, verify eth1 has global address&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can ping gateway but not internet&lt;/td&gt;
&lt;td&gt;Missing default route or firewall&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;show ipv6 route&lt;/code&gt;, verify firewall allows outbound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intermittent connectivity&lt;/td&gt;
&lt;td&gt;ICMPv6 blocked&lt;/td&gt;
&lt;td&gt;Allow ICMPv6 in firewall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works then stops after minutes&lt;/td&gt;
&lt;td&gt;DAD failure or RA timeout&lt;/td&gt;
&lt;td&gt;Check RA interval, look for duplicate addresses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DHCPv6 not assigning addresses&lt;/td&gt;
&lt;td&gt;Missing managed-flag in RA&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;managed-flag&lt;/code&gt; on router-advert interface&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Privacy Extensions&lt;/h2&gt;
&lt;p&gt;By default, SLAAC creates addresses based on MAC address — potentially trackable. Modern systems use Privacy Extensions (RFC 4941) to generate random addresses.&lt;/p&gt;
&lt;p&gt;VyOS can control this via RA:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Suggest clients use privacy addresses
set service router-advert interface eth1 prefix ::/64 preferred-lifetime &apos;14400&apos;
set service router-advert interface eth1 prefix ::/64 valid-lifetime &apos;86400&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Shorter lifetimes encourage address rotation. Client OS controls whether to actually use privacy extensions.&lt;/p&gt;
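&lt;p&gt;Whether temporary addresses are actually used is a client-side setting. On Linux, a sketch (interface name illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 2 = generate temporary addresses and prefer them for outgoing traffic
sudo sysctl -w net.ipv6.conf.eth0.use_tempaddr=2

# Persist across reboots
echo &apos;net.ipv6.conf.eth0.use_tempaddr = 2&apos; | sudo tee /etc/sysctl.d/90-ipv6-privacy.conf
&lt;/code&gt;&lt;/pre&gt;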
&lt;h2&gt;Complete IPv6 Configuration&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# WAN: DHCPv6-PD
set interfaces ethernet eth0 ipv6 address autoconf
set interfaces ethernet eth0 dhcpv6-options pd 0 interface eth1 sla-id &apos;0&apos;
set interfaces ethernet eth0 dhcpv6-options pd 0 length &apos;56&apos;

# LAN: Router Advertisements
set service router-advert interface eth1 prefix ::/64
set service router-advert interface eth1 name-server 2606:4700:4700::1111
set service router-advert interface eth1 name-server 2001:4860:4860::8888

# Firewall: WAN inbound
set firewall ipv6 name WANv6-TO-LANv6 default-action &apos;drop&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 10 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 10 state &apos;established&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 10 state &apos;related&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 20 action &apos;accept&apos;
set firewall ipv6 name WANv6-TO-LANv6 rule 20 protocol &apos;ipv6-icmp&apos;

# Firewall: WAN to local
set firewall ipv6 name WANv6-LOCAL default-action &apos;drop&apos;
set firewall ipv6 name WANv6-LOCAL rule 10 action &apos;accept&apos;
set firewall ipv6 name WANv6-LOCAL rule 10 state &apos;established&apos;
set firewall ipv6 name WANv6-LOCAL rule 10 state &apos;related&apos;
set firewall ipv6 name WANv6-LOCAL rule 20 action &apos;accept&apos;
set firewall ipv6 name WANv6-LOCAL rule 20 protocol &apos;ipv6-icmp&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 action &apos;accept&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 protocol &apos;udp&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 destination port &apos;546&apos;
set firewall ipv6 name WANv6-LOCAL rule 30 source port &apos;547&apos;

# Firewall: LAN outbound
set firewall ipv6 name LANv6-TO-WANv6 default-action &apos;accept&apos;

# Apply firewall to forward/input chains
set firewall ipv6 forward filter default-action &apos;accept&apos;
set firewall ipv6 forward filter rule 10 inbound-interface name &apos;eth0&apos;
set firewall ipv6 forward filter rule 10 action &apos;jump&apos;
set firewall ipv6 forward filter rule 10 jump-target &apos;WANv6-TO-LANv6&apos;

set firewall ipv6 input filter default-action &apos;drop&apos;
set firewall ipv6 input filter rule 10 inbound-interface name &apos;eth0&apos;
set firewall ipv6 input filter rule 10 action &apos;jump&apos;
set firewall ipv6 input filter rule 10 jump-target &apos;WANv6-LOCAL&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;IPv6 doesn&apos;t break mysteriously. When it fails, check in order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;RA configuration&lt;/strong&gt;: Is the router advertising the prefix?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Firewall rules&lt;/strong&gt;: Is ICMPv6 allowed? Is DHCPv6 (port 546/547) allowed for WAN-local?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prefix delegation&lt;/strong&gt;: Did the router actually receive a prefix from the ISP?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once you understand that RA replaces much of what DHCP does in IPv4, and that ICMPv6 is mandatory (not optional), IPv6 becomes predictable. The debugging is different, but the methodology is the same: verify each layer, check what&apos;s being blocked, and read the counters.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>homelab</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>VyOS Isn&apos;t Scary: Building Your First Production-Ready Router</title><link>https://ashimov.com/posts/vyos-basics/</link><guid isPermaLink="true">https://ashimov.com/posts/vyos-basics/</guid><description>A practical guide to setting up VyOS from scratch. Covers WAN/LAN configuration, NAT, DHCP, DNS forwarding, and basic firewall rules with validation at every step.</description><pubDate>Fri, 03 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;VyOS has a reputation for being intimidating. The CLI-only interface, the commit/save model, the sheer number of configuration options — it can feel overwhelming. But here&apos;s the thing: VyOS isn&apos;t complicated, it&apos;s just explicit. Every setting you&apos;d configure through a consumer router&apos;s web UI exists here too, just visible and version-controllable.&lt;/p&gt;
&lt;p&gt;This guide walks through building a basic but production-ready router configuration. We&apos;ll validate each piece before moving to the next. By the end, you&apos;ll have a working router and the confidence to extend it.&lt;/p&gt;
&lt;h2&gt;The Mental Model&lt;/h2&gt;
&lt;p&gt;Before touching any commands, understand VyOS&apos;s configuration model:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Configuration tree&lt;/strong&gt;: Settings are organized hierarchically (like a filesystem)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Edit sessions&lt;/strong&gt;: Changes are staged, then committed atomically&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rollback capability&lt;/strong&gt;: Bad commit? Roll back instantly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Show vs Configure mode&lt;/strong&gt;: &lt;code&gt;show&lt;/code&gt; reads running state, &lt;code&gt;configure&lt;/code&gt; modifies it&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# Enter configuration mode
configure

# Make changes (staged, not active yet)
set interfaces ethernet eth0 address 192.168.1.1/24

# See what would change
compare

# Apply changes atomically
commit

# Persist across reboots
save

# Exit configuration mode
exit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This model prevents half-applied configurations. Either everything commits successfully, or nothing changes.&lt;/p&gt;
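&lt;p&gt;The same model protects you from locking yourself out when configuring remotely. A sketch using &lt;code&gt;commit-confirm&lt;/code&gt;: the commit reverts automatically unless confirmed within the given number of minutes.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure
set interfaces ethernet eth0 address 192.168.1.1/24

# Apply, but auto-revert in 10 minutes unless confirmed
commit-confirm 10

# Still have access? Make the change permanent:
run confirm
save
&lt;/code&gt;&lt;/pre&gt;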
&lt;h2&gt;Initial Setup: Interfaces&lt;/h2&gt;
&lt;p&gt;Let&apos;s assume a typical setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;eth0&lt;/code&gt;: WAN (gets address via DHCP from ISP)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;eth1&lt;/code&gt;: LAN (our internal network, 10.0.0.0/24)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;configure

# WAN interface - DHCP from ISP
set interfaces ethernet eth0 description &apos;WAN&apos;
set interfaces ethernet eth0 address dhcp

# LAN interface - static address, this router is the gateway
set interfaces ethernet eth1 description &apos;LAN&apos;
set interfaces ethernet eth1 address &apos;10.0.0.1/24&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;: Check interfaces are up and addressed correctly.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show interfaces
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You should see &lt;code&gt;eth0&lt;/code&gt; with an address from your ISP and &lt;code&gt;eth1&lt;/code&gt; with &lt;code&gt;10.0.0.1/24&lt;/code&gt;. If &lt;code&gt;eth0&lt;/code&gt; shows no address, check cable and ISP connectivity.&lt;/p&gt;
&lt;h2&gt;NAT: Making LAN Traffic Reach the Internet&lt;/h2&gt;
&lt;p&gt;Without NAT, your LAN devices can&apos;t reach the internet — their private IPs aren&apos;t routable. We need source NAT (masquerade) on outbound traffic.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Source NAT for outbound traffic
set nat source rule 100 outbound-interface name &apos;eth0&apos;
set nat source rule 100 source address &apos;10.0.0.0/24&apos;
set nat source rule 100 translation address &apos;masquerade&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;: From a LAN device with manual IP (10.0.0.100, gateway 10.0.0.1), try pinging 8.8.8.8.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On VyOS, check NAT is working
show nat source statistics
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If pings fail, verify:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;LAN device has correct gateway (10.0.0.1)&lt;/li&gt;
&lt;li&gt;VyOS can ping 8.8.8.8 itself (routing works)&lt;/li&gt;
&lt;li&gt;NAT rule matches your LAN subnet&lt;/li&gt;
&lt;/ol&gt;
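&lt;p&gt;You can also watch the translation happen. &lt;code&gt;monitor traffic&lt;/code&gt; wraps tcpdump on VyOS; while a LAN client pings 8.8.8.8, outbound packets on eth0 should carry the WAN address as source:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;monitor traffic interface eth0 filter &apos;icmp and host 8.8.8.8&apos;

# Inspect the active translation table
show nat source translations
&lt;/code&gt;&lt;/pre&gt;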
&lt;h2&gt;DHCP: Automatic Addressing for LAN Clients&lt;/h2&gt;
&lt;p&gt;Manual IPs work for testing, but clients need DHCP.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# DHCP server for LAN
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 subnet-id 1
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 range 0 start &apos;10.0.0.100&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 range 0 stop &apos;10.0.0.254&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 default-router &apos;10.0.0.1&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 name-server &apos;10.0.0.1&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 lease &apos;86400&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;: Release/renew DHCP on a LAN client, verify it gets an address in the 10.0.0.100-254 range.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;show dhcp server leases
&lt;/code&gt;&lt;/pre&gt;
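&lt;p&gt;Devices that should always get the same address (a NAS, a printer) can use a static mapping instead of a pool lease. A sketch with illustrative name, MAC, and IP; the exact option names vary slightly between VyOS releases:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Always hand 10.0.0.10 to this MAC (VyOS 1.4-style syntax)
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 static-mapping nas mac-address &apos;52:54:00:12:34:56&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 static-mapping nas ip-address &apos;10.0.0.10&apos;

commit
&lt;/code&gt;&lt;/pre&gt;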
&lt;h2&gt;DNS Forwarding: Local Resolution&lt;/h2&gt;
&lt;p&gt;Clients point to 10.0.0.1 for DNS. VyOS needs to forward those queries upstream.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# DNS forwarding
set service dns forwarding cache-size &apos;1000&apos;
set service dns forwarding listen-address &apos;10.0.0.1&apos;
set service dns forwarding allow-from &apos;10.0.0.0/24&apos;

# Use ISP&apos;s DNS or public resolvers
set service dns forwarding name-server &apos;1.1.1.1&apos;
set service dns forwarding name-server &apos;8.8.8.8&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;: From LAN client, resolve a domain.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On VyOS
show dns forwarding statistics
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this point, LAN clients should have full internet access with automatic addressing. Test by browsing from a client device.&lt;/p&gt;
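&lt;p&gt;The forwarder can also answer for local names. VyOS serves its static host mappings through the DNS forwarder, so LAN devices become reachable by name (hostname and address here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Local name for a LAN host, resolvable by any client using 10.0.0.1 for DNS
set system static-host-mapping host-name nas.lan inet &apos;10.0.0.10&apos;

commit
&lt;/code&gt;&lt;/pre&gt;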
&lt;h2&gt;Firewall: The Foundation&lt;/h2&gt;
&lt;p&gt;The VyOS firewall mental model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Traffic is classified by direction: &lt;code&gt;forward&lt;/code&gt; (through the router), &lt;code&gt;input&lt;/code&gt; (to the router itself), &lt;code&gt;output&lt;/code&gt; (from the router)&lt;/li&gt;
&lt;li&gt;Named rule sets describe what is allowed between interface pairs (WAN to LAN, LAN to router)&lt;/li&gt;
&lt;li&gt;Default policy should be drop; allow only what is explicit&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Modern VyOS (1.4+) attaches these named rule sets to the forward and input chains with jump rules (a separate zone-based mode also exists). Here&apos;s a clean baseline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Create firewall groups for organization
set firewall group network-group LAN-NETS network &apos;10.0.0.0/24&apos;

# WAN to LAN: only established/related connections (no inbound initiation)
set firewall ipv4 name WAN-TO-LAN default-action &apos;drop&apos;
set firewall ipv4 name WAN-TO-LAN rule 10 action &apos;accept&apos;
set firewall ipv4 name WAN-TO-LAN rule 10 state &apos;established&apos;
set firewall ipv4 name WAN-TO-LAN rule 10 state &apos;related&apos;

# WAN to router itself (local): very restrictive
set firewall ipv4 name WAN-LOCAL default-action &apos;drop&apos;
set firewall ipv4 name WAN-LOCAL rule 10 action &apos;accept&apos;
set firewall ipv4 name WAN-LOCAL rule 10 state &apos;established&apos;
set firewall ipv4 name WAN-LOCAL rule 10 state &apos;related&apos;

# LAN to WAN: allow all outbound (NAT handles the rest)
set firewall ipv4 name LAN-TO-WAN default-action &apos;accept&apos;

# LAN to router: allow DHCP, DNS, SSH
set firewall ipv4 name LAN-LOCAL default-action &apos;drop&apos;
set firewall ipv4 name LAN-LOCAL rule 10 action &apos;accept&apos;
set firewall ipv4 name LAN-LOCAL rule 10 state &apos;established&apos;
set firewall ipv4 name LAN-LOCAL rule 10 state &apos;related&apos;
set firewall ipv4 name LAN-LOCAL rule 20 action &apos;accept&apos;
set firewall ipv4 name LAN-LOCAL rule 20 protocol &apos;udp&apos;
set firewall ipv4 name LAN-LOCAL rule 20 destination port &apos;67,68&apos;
set firewall ipv4 name LAN-LOCAL rule 30 action &apos;accept&apos;
set firewall ipv4 name LAN-LOCAL rule 30 protocol &apos;udp&apos;
set firewall ipv4 name LAN-LOCAL rule 30 destination port &apos;53&apos;
set firewall ipv4 name LAN-LOCAL rule 40 action &apos;accept&apos;
set firewall ipv4 name LAN-LOCAL rule 40 protocol &apos;tcp&apos;
set firewall ipv4 name LAN-LOCAL rule 40 destination port &apos;53&apos;
set firewall ipv4 name LAN-LOCAL rule 50 action &apos;accept&apos;
set firewall ipv4 name LAN-LOCAL rule 50 protocol &apos;tcp&apos;
set firewall ipv4 name LAN-LOCAL rule 50 destination port &apos;22&apos;
set firewall ipv4 name LAN-LOCAL rule 50 source group network-group &apos;LAN-NETS&apos;

# Apply firewall to forward chain (traffic passing through router)
set firewall ipv4 forward filter default-action &apos;accept&apos;
set firewall ipv4 forward filter rule 10 inbound-interface name &apos;eth0&apos;
set firewall ipv4 forward filter rule 10 action &apos;jump&apos;
set firewall ipv4 forward filter rule 10 jump-target &apos;WAN-TO-LAN&apos;

# Apply firewall to input chain (traffic TO the router)
set firewall ipv4 input filter default-action &apos;drop&apos;
set firewall ipv4 input filter rule 10 inbound-interface name &apos;eth0&apos;
set firewall ipv4 input filter rule 10 action &apos;jump&apos;
set firewall ipv4 input filter rule 10 jump-target &apos;WAN-LOCAL&apos;
set firewall ipv4 input filter rule 20 inbound-interface name &apos;eth1&apos;
set firewall ipv4 input filter rule 20 action &apos;jump&apos;
set firewall ipv4 input filter rule 20 jump-target &apos;LAN-LOCAL&apos;

commit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Validation step&lt;/strong&gt;: Test each service still works after firewall is applied.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check firewall rule hit counts
show firewall

# From LAN: verify DNS, DHCP, internet access
# From WAN: verify no response to unsolicited connections
&lt;/code&gt;&lt;/pre&gt;
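&lt;p&gt;When you eventually need to expose a service, two pieces are required: a destination NAT rule and a matching accept in the WAN-TO-LAN chain (the firewall sees post-DNAT addresses). A sketch with illustrative addresses and ports:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;configure

# Forward WAN TCP 8443 to an internal web server
set nat destination rule 10 inbound-interface name &apos;eth0&apos;
set nat destination rule 10 protocol &apos;tcp&apos;
set nat destination rule 10 destination port &apos;8443&apos;
set nat destination rule 10 translation address &apos;10.0.0.80&apos;
set nat destination rule 10 translation port &apos;443&apos;

# Accept the translated traffic in the forward path
set firewall ipv4 name WAN-TO-LAN rule 20 action &apos;accept&apos;
set firewall ipv4 name WAN-TO-LAN rule 20 protocol &apos;tcp&apos;
set firewall ipv4 name WAN-TO-LAN rule 20 destination address &apos;10.0.0.80&apos;
set firewall ipv4 name WAN-TO-LAN rule 20 destination port &apos;443&apos;

commit
&lt;/code&gt;&lt;/pre&gt;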
&lt;h2&gt;Systematic Validation Checklist&lt;/h2&gt;
&lt;p&gt;Before calling this &quot;done&quot;, verify each component:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Expected Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;WAN connectivity&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ping 8.8.8.8&lt;/code&gt; from VyOS&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS on VyOS&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ping google.com&lt;/code&gt; from VyOS&lt;/td&gt;
&lt;td&gt;Resolves and pings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LAN addressing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;show interfaces&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;eth1 has 10.0.0.1/24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DHCP&lt;/td&gt;
&lt;td&gt;Client gets address&lt;/td&gt;
&lt;td&gt;IP in 10.0.0.100-254&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NAT&lt;/td&gt;
&lt;td&gt;Client pings 8.8.8.8&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS forwarding&lt;/td&gt;
&lt;td&gt;Client resolves domains&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firewall (LAN→WAN)&lt;/td&gt;
&lt;td&gt;Client browses internet&lt;/td&gt;
&lt;td&gt;Success&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firewall (WAN→LAN)&lt;/td&gt;
&lt;td&gt;External port scan&lt;/td&gt;
&lt;td&gt;All filtered&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;The Complete Configuration&lt;/h2&gt;
&lt;p&gt;Here&apos;s everything in one block for reference:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Interfaces
set interfaces ethernet eth0 description &apos;WAN&apos;
set interfaces ethernet eth0 address dhcp
set interfaces ethernet eth1 description &apos;LAN&apos;
set interfaces ethernet eth1 address &apos;10.0.0.1/24&apos;

# NAT
set nat source rule 100 outbound-interface name &apos;eth0&apos;
set nat source rule 100 source address &apos;10.0.0.0/24&apos;
set nat source rule 100 translation address &apos;masquerade&apos;

# DHCP
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 subnet-id 1
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 range 0 start &apos;10.0.0.100&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 range 0 stop &apos;10.0.0.254&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 default-router &apos;10.0.0.1&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 name-server &apos;10.0.0.1&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 lease &apos;86400&apos;

# DNS
set service dns forwarding cache-size &apos;1000&apos;
set service dns forwarding listen-address &apos;10.0.0.1&apos;
set service dns forwarding allow-from &apos;10.0.0.0/24&apos;
set service dns forwarding name-server &apos;1.1.1.1&apos;
set service dns forwarding name-server &apos;8.8.8.8&apos;

# Firewall (see detailed rules above)
# ... firewall rules ...

# System basics
set system host-name &apos;router&apos;
set system name-server &apos;1.1.1.1&apos;
set system time-zone &apos;UTC&apos;

# SSH access (LAN only via firewall)
set service ssh port &apos;22&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What&apos;s Next&lt;/h2&gt;
&lt;p&gt;This configuration handles basic routing, but there&apos;s more to explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;IPv6&lt;/strong&gt;: Modern networks should support it (covered in the next article)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VLANs&lt;/strong&gt;: Segment your network further&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VPN&lt;/strong&gt;: WireGuard or IPsec for remote access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;: Logs and metrics for visibility&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key lesson from this exercise: &lt;strong&gt;build the foundation first, validate each piece, then extend&lt;/strong&gt;. VyOS rewards methodical configuration. When something breaks later, you&apos;ll know exactly which commit introduced the problem.&lt;/p&gt;
&lt;p&gt;Save your configuration, export it (&lt;code&gt;show configuration commands &amp;gt; config.txt&lt;/code&gt;), and version control it. Your router is now reproducible.&lt;/p&gt;
</content:encoded><category>vyos</category><category>firewall</category><category>homelab</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>GPU / PCI Passthrough: The Path That Works (and What Breaks It)</title><link>https://ashimov.com/posts/proxmox-gpu-passthrough/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-gpu-passthrough/</guid><description>Complete guide to GPU and PCI passthrough on Proxmox. Covers IOMMU setup, ACS override, VFIO configuration, driver binding, common issues, and why passthrough is hardware compatibility plus attention to detail.</description><pubDate>Tue, 30 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;GPU passthrough lets a VM directly access a physical GPU. No emulation, no virtualization overhead — the VM sees real hardware and gets real performance. It&apos;s the only way to run GPU workloads (gaming, machine learning, transcoding) in VMs without massive performance loss.&lt;/p&gt;
&lt;p&gt;It&apos;s also one of the most finicky things to configure. Hardware compatibility, IOMMU groups, driver issues — any of these can break passthrough completely. When it works, it&apos;s magical. When it doesn&apos;t, debugging is painful.&lt;/p&gt;
&lt;p&gt;Passthrough is hardware compatibility plus attention to detail.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;h3&gt;Hardware Requirements&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;CPU:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Intel: VT-d (IOMMU support)&lt;/li&gt;
&lt;li&gt;AMD: AMD-Vi (IOMMU support)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Check BIOS for &quot;VT-d,&quot; &quot;AMD-Vi,&quot; &quot;IOMMU,&quot; or &quot;Virtualization Technology for Directed I/O.&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Motherboard:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Must support IOMMU&lt;/li&gt;
&lt;li&gt;Consumer boards often have poor IOMMU groups&lt;/li&gt;
&lt;li&gt;Server/workstation boards usually work better&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;GPU:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Most discrete GPUs work&lt;/li&gt;
&lt;li&gt;NVIDIA consumer cards historically hit &quot;Code 43&quot; in VMs (we&apos;ll address this below)&lt;/li&gt;
&lt;li&gt;AMD cards generally work well&lt;/li&gt;
&lt;li&gt;Intel integrated graphics: limited support&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Check IOMMU Support&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Intel
dmesg | grep -e DMAR -e IOMMU

# AMD
dmesg | grep AMD-Vi

# Should see messages like:
# DMAR: IOMMU enabled
# AMD-Vi: Enabling IOMMU
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If no messages, enable IOMMU in BIOS.&lt;/p&gt;
&lt;h2&gt;Enable IOMMU&lt;/h2&gt;
&lt;h3&gt;Edit GRUB Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Edit GRUB config
nano /etc/default/grub
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For Intel:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GRUB_CMDLINE_LINUX_DEFAULT=&quot;quiet intel_iommu=on iommu=pt&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For AMD:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GRUB_CMDLINE_LINUX_DEFAULT=&quot;quiet amd_iommu=on iommu=pt&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply changes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;update-grub   # on ZFS / systemd-boot installs use: proxmox-boot-tool refresh
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify IOMMU Enabled&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;dmesg | grep -e DMAR -e IOMMU -e AMD-Vi

# Should see:
# DMAR: IOMMU enabled
# or
# AMD-Vi: AMD IOMMUv2 loaded
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;IOMMU Groups&lt;/h2&gt;
&lt;p&gt;IOMMU groups are sets of devices that must be passed through together. You can&apos;t pass a single device if it&apos;s in a group with other devices.&lt;/p&gt;
&lt;h3&gt;View IOMMU Groups&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Save as /root/iommu-groups.sh
shopt -s nullglob
for g in $(find /sys/kernel/iommu_groups/* -maxdepth 0 -type d | sort -V); do
    echo &quot;IOMMU Group ${g##*/}:&quot;
    for d in $g/devices/*; do
        echo -e &quot;\t$(lspci -nns ${d##*/})&quot;
    done
done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;IOMMU Group 1:
    00:01.0 PCI bridge [0604]: Intel Corporation...
    01:00.0 VGA compatible controller [0300]: NVIDIA Corporation... [10de:2204]
    01:00.1 Audio device [0403]: NVIDIA Corporation... [10de:1aef]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The GPU (01:00.0) and its audio device (01:00.1) are in the same group. You must pass both.&lt;/p&gt;
&lt;h3&gt;ACS Override (If Needed)&lt;/h3&gt;
&lt;p&gt;Poor IOMMU grouping (everything lumped into one group) can sometimes be worked around with the ACS override patch, which the Proxmox kernel ships with. &lt;strong&gt;Use with caution&lt;/strong&gt; — it tells the kernel to treat devices as isolated even when the hardware doesn&apos;t guarantee it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Add to GRUB
GRUB_CMDLINE_LINUX_DEFAULT=&quot;quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction&quot;

update-grub
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After reboot, check groups again. They should be smaller.&lt;/p&gt;
&lt;h2&gt;VFIO Configuration&lt;/h2&gt;
&lt;p&gt;VFIO (Virtual Function I/O) binds devices for passthrough.&lt;/p&gt;
&lt;h3&gt;Identify Device IDs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;lspci -nn | grep -i nvidia
# 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204]
# 01:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Device IDs: &lt;code&gt;10de:2204&lt;/code&gt; (GPU), &lt;code&gt;10de:1aef&lt;/code&gt; (Audio)&lt;/p&gt;
&lt;h3&gt;Configure VFIO&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Add VFIO modules
echo &quot;vfio&quot; &amp;gt;&amp;gt; /etc/modules
echo &quot;vfio_iommu_type1&quot; &amp;gt;&amp;gt; /etc/modules
echo &quot;vfio_pci&quot; &amp;gt;&amp;gt; /etc/modules
# vfio_virqfd was merged into the core vfio module in kernel 6.2+;
# skip this line on Proxmox 8 and later
echo &quot;vfio_virqfd&quot; &amp;gt;&amp;gt; /etc/modules

# Bind devices to VFIO (use YOUR device IDs)
echo &quot;options vfio-pci ids=10de:2204,10de:1aef disable_vga=1&quot; &amp;gt; /etc/modprobe.d/vfio.conf

# Blacklist host drivers (so host doesn&apos;t grab GPU)
echo &quot;blacklist nouveau&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf
echo &quot;blacklist nvidia&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf
echo &quot;blacklist nvidia_drm&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf
echo &quot;blacklist nvidiafb&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf

# For AMD
echo &quot;blacklist amdgpu&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf
echo &quot;blacklist radeon&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf

# Update initramfs
update-initramfs -u

# Reboot
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify VFIO Binding&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;lspci -nnk -s 01:00
# Should show:
# Kernel driver in use: vfio-pci
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If it shows nvidia or nouveau, the blacklist didn&apos;t work. Check modprobe configuration.&lt;/p&gt;
&lt;h2&gt;Create VM with GPU Passthrough&lt;/h2&gt;
&lt;h3&gt;VM Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create VM
qm create 100 --name gpu-vm --memory 16384 --cores 8 --sockets 1 \
  --bios ovmf --machine q35 \
  --net0 virtio,bridge=vmbr0 \
  --scsihw virtio-scsi-pci

# Add EFI disk
qm set 100 --efidisk0 local-zfs:1,format=raw

# Add main disk
qm set 100 --scsi0 local-zfs:100,ssd=1

# CPU settings (important for passthrough)
qm set 100 --cpu host,hidden=1,flags=+pcid

# Add the GPU; omitting the function digit (01:00) passes through all
# functions of the device, including the audio controller at 01:00.1
qm set 100 --hostpci0 01:00,pcie=1,x-vga=1

# Or pass the functions individually:
# qm set 100 --hostpci0 01:00.0,pcie=1,x-vga=1
# qm set 100 --hostpci1 01:00.1,pcie=1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Important Settings&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;BIOS: OVMF (UEFI)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Required for modern GPUs&lt;/li&gt;
&lt;li&gt;Enables PCI passthrough features&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Machine: q35&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Modern chipset with proper PCIe support&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;CPU: host,hidden=1&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;host&lt;/code&gt;: Pass through all CPU features&lt;/li&gt;
&lt;li&gt;&lt;code&gt;hidden=1&lt;/code&gt;: Hide hypervisor from VM (needed for NVIDIA)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;hostpci: pcie=1,x-vga=1&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pcie=1&lt;/code&gt;: Use PCIe mode (required)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;x-vga=1&lt;/code&gt;: Primary graphics device&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;NVIDIA-Specific Fixes&lt;/h2&gt;
&lt;p&gt;Older NVIDIA drivers detect virtualization and refuse to initialize (Windows reports &quot;Code 43&quot; in Device Manager). Several workarounds:&lt;/p&gt;
&lt;h3&gt;Hide Hypervisor&lt;/h3&gt;
&lt;p&gt;Already done with &lt;code&gt;cpu: host,hidden=1&lt;/code&gt;. For older NVIDIA drivers, you may also need:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# In /etc/pve/qemu-server/100.conf, add to args (if Error 43 persists):
args: -cpu &apos;host,hv_vendor_id=NV43FIX,+kvm_pv_unhalt,+kvm_pv_eoi&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: Modern NVIDIA drivers (535+) usually work with just &lt;code&gt;hidden=1&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Vendor ID Spoofing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Add to VM config
qm set 100 --args &quot;-cpu &apos;host,hv_vendor_id=randomid&apos;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;ROM File (Sometimes Needed)&lt;/h3&gt;
&lt;p&gt;Some GPUs need their VBIOS dumped and passed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Dump ROM (from another system or Windows)
# Or download from TechPowerUp

# Add to VM config
qm set 100 --hostpci0 01:00,pcie=1,x-vga=1,romfile=gpu.rom
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Place ROM file in &lt;code&gt;/usr/share/kvm/&lt;/code&gt;.&lt;/p&gt;
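&lt;p&gt;If the GPU is not the host&apos;s boot display device, the VBIOS can often be dumped from the running host via sysfs (the PCI address 01:00.0 comes from the example above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable ROM access, copy it out, then disable access again
cd /sys/bus/pci/devices/0000:01:00.0
echo 1 &amp;gt; rom
cat rom &amp;gt; /usr/share/kvm/gpu.rom
echo 0 &amp;gt; rom
&lt;/code&gt;&lt;/pre&gt;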
&lt;h2&gt;AMD GPU Passthrough&lt;/h2&gt;
&lt;p&gt;AMD is generally easier:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# VFIO config (use AMD device IDs)
echo &quot;options vfio-pci ids=1002:xxxx,1002:xxxx&quot; &amp;gt; /etc/modprobe.d/vfio.conf

# Blacklist
echo &quot;blacklist amdgpu&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf
echo &quot;blacklist radeon&quot; &amp;gt;&amp;gt; /etc/modprobe.d/blacklist.conf

update-initramfs -u
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;AMD GPUs usually work without additional tweaks.&lt;/p&gt;
&lt;h3&gt;AMD Reset Bug&lt;/h3&gt;
&lt;p&gt;Some AMD GPUs (Polaris, Navi) have reset bugs — VM shutdown leaves GPU in bad state, requiring host reboot.&lt;/p&gt;
&lt;p&gt;Workaround:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Install build dependencies
apt install pve-headers-$(uname -r) git build-essential dkms

# Clone and build vendor-reset module
git clone https://github.com/gnif/vendor-reset.git
cd vendor-reset
dkms install .

# Verify module loads
modprobe vendor-reset

# Make persistent after successful test
echo &quot;vendor-reset&quot; &amp;gt;&amp;gt; /etc/modules
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Other PCI Devices&lt;/h2&gt;
&lt;p&gt;Passthrough works for any PCI device, not just GPUs:&lt;/p&gt;
&lt;h3&gt;USB Controller&lt;/h3&gt;
&lt;p&gt;Pass entire USB controller for low-latency USB:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Find USB controller
lspci | grep USB

# Pass through
qm set 100 --hostpci2 00:14.0,pcie=1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;NVMe Controller&lt;/h3&gt;
&lt;p&gt;Pass NVMe for direct storage access:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Find NVMe
lspci | grep NVMe

# Pass through
qm set 100 --hostpci3 03:00.0,pcie=1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Network Card&lt;/h3&gt;
&lt;p&gt;Pass dedicated NIC:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm set 100 --hostpci4 04:00.0,pcie=1
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Device Not Bound to VFIO&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check current driver
lspci -nnk -s 01:00

# If not vfio-pci:
# 1. Check blacklist
cat /etc/modprobe.d/blacklist.conf

# 2. Check VFIO config
cat /etc/modprobe.d/vfio.conf

# 3. Rebuild initramfs
update-initramfs -u -k all

# 4. Reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VM Won&apos;t Start&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check QEMU log
cat /var/log/pve/qemu-server/100.log

# Common issues:
# - IOMMU not enabled
# - Device in use by host
# - Wrong device ID
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;No Display Output&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify x-vga=1 is set
grep hostpci /etc/pve/qemu-server/100.conf

# Try different video output
# Monitor on GPU should show VM boot

# Check if VM is actually running
qm status 100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;NVIDIA Code 43&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify hidden flag
grep cpu /etc/pve/qemu-server/100.conf
# Should include hidden=1

# Try vendor ID spoof
# Add to args: hv_vendor_id=NV43FIX

# Ensure BIOS is OVMF, not SeaBIOS
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Poor Performance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Inside VM:
# Check if GPU is using correct driver
nvidia-smi  # Should show GPU

# Check PCIe link speed
lspci -vv -s 01:00 | grep -i width
# Should show x16 or at least x8

# Ensure cpu type is &apos;host&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Single GPU Passthrough&lt;/h2&gt;
&lt;p&gt;Using your only GPU in a VM (no display on host):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Host boots headless
# VM gets GPU
# Reconnect display to VM

# Challenges:
# - Host has no display
# - Must manage via SSH/remote
# - GPU must unbind from host console

# Scripts needed for:
# 1. Unbind GPU from host
# 2. Start VM
# 3. Stop VM
# 4. Rebind GPU to host
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is complex. Search for &quot;single GPU passthrough scripts&quot; for examples.&lt;/p&gt;
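&lt;p&gt;As a starting point, here is a sketch of such a hook script (Proxmox can run one via &lt;code&gt;qm set 100 --hookscript local:snippets/gpu-hook.sh&lt;/code&gt;). The sysfs paths are illustrative assumptions for a typical EFI host, so the actual unbind commands are left commented:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# Sketch: /var/lib/vz/snippets/gpu-hook.sh
# Proxmox invokes hookscripts as: script &amp;lt;vmid&amp;gt; &amp;lt;phase&amp;gt;
gpu_hook() {
    local vmid=$1 phase=$2
    case $phase in
        pre-start)
            echo VM $vmid: releasing GPU from host
            # echo 0 &amp;gt; /sys/class/vtconsole/vtcon0/bind    # detach virtual console (vtcon0 assumed)
            # echo efi-framebuffer.0 &amp;gt; /sys/bus/platform/drivers/efi-framebuffer/unbind
            ;;
        post-stop)
            echo VM $vmid: returning GPU to host
            # echo 1 &amp;gt; /sys/class/vtconsole/vtcon0/bind
            ;;
    esac
}
gpu_hook $1 $2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The hard part is finding the right unbind sequence for your hardware; the structure above just shows where it goes.&lt;/p&gt;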
&lt;h2&gt;Live Migration Limitations&lt;/h2&gt;
&lt;p&gt;VMs with passthrough &lt;strong&gt;cannot live migrate&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hardware is physically on one host&lt;/li&gt;
&lt;li&gt;Must stop VM, move, start on new host&lt;/li&gt;
&lt;li&gt;Not compatible with HA auto-failover&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Plan accordingly: critical passthrough VMs can&apos;t be HA.&lt;/p&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Passthrough is hardware compatibility plus attention to detail.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When passthrough doesn&apos;t work, it&apos;s usually:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;IOMMU not enabled&lt;/strong&gt;: Check BIOS and kernel parameters&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bad IOMMU groups&lt;/strong&gt;: ACS override or different hardware&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Driver conflict&lt;/strong&gt;: Host driver grabs device before VFIO&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NVIDIA detection&lt;/strong&gt;: Hidden flags and vendor ID spoof&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reset bugs&lt;/strong&gt;: AMD GPUs need vendor-reset module&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The debugging process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Verify IOMMU enabled (&lt;code&gt;dmesg | grep IOMMU&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Check IOMMU groups (all needed devices in one group)&lt;/li&gt;
&lt;li&gt;Verify VFIO binding (&lt;code&gt;lspci -nnk&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Check VM logs (&lt;code&gt;/var/log/pve/qemu-server/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Try minimal config, add features one at a time&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Passthrough isn&apos;t guaranteed to work with all hardware. Some motherboards have terrible IOMMU groups. Some GPUs have bugs. Do research before buying hardware specifically for passthrough.&lt;/p&gt;
&lt;p&gt;When it works, you get bare-metal GPU performance in a VM. When it doesn&apos;t, you need patience and systematic debugging. There&apos;s no magic fix — just working through each requirement methodically.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>passthrough</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Performance Clinic: CPU Pinning, Hugepages, VirtIO, and Storage Tuning</title><link>https://ashimov.com/posts/proxmox-performance/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-performance/</guid><description>Proxmox performance optimization guide. Covers VirtIO drivers, cache modes, IO threads, NUMA awareness, hugepages, and why optimization starts with measurement, not tweaking.</description><pubDate>Fri, 26 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Performance tuning is seductive. Forums are full of &quot;enable this setting for 20% more speed.&quot; Most of it is cargo culting — copying settings without understanding why.&lt;/p&gt;
&lt;p&gt;Real performance optimization follows a process: measure, identify bottleneck, address bottleneck, measure again. Tweaking random settings without measuring is just superstition.&lt;/p&gt;
&lt;p&gt;Optimization starts with measurement, not with tweaks.&lt;/p&gt;
&lt;h2&gt;Measure First&lt;/h2&gt;
&lt;p&gt;Before changing anything, understand your current performance.&lt;/p&gt;
&lt;h3&gt;Host Metrics&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Overall system performance
htop

# CPU usage per core
mpstat -P ALL 1

# Memory usage
free -h
vmstat 1

# Disk I/O
iostat -xz 1

# Network
iftop -i vmbr0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VM Performance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Inside VM: Check for virtualization overhead
# CPU steal time (other VMs taking your CPU)
top  # Look at %st column

# Disk latency
iostat -x 1

# From host: VM-specific metrics
qm monitor 100
info cpus
info block
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benchmark Tools&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# CPU benchmark
apt install sysbench
sysbench cpu run

# Disk benchmark
apt install fio

# Random 4K (database-like)
fio --name=rand --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=30 --group_reporting

# Sequential (large file)
fio --name=seq --ioengine=libaio --iodepth=1 --rw=read --bs=1m --direct=1 --size=4G --numjobs=1 --runtime=30 --group_reporting

# Network benchmark (between VMs)
apt install iperf3
# Server: iperf3 -s
# Client: iperf3 -c &amp;lt;server-ip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;VirtIO Drivers&lt;/h2&gt;
&lt;p&gt;VirtIO is paravirtualized I/O. Instead of emulating real hardware, the VM knows it&apos;s virtualized and uses optimized drivers.&lt;/p&gt;
&lt;h3&gt;Performance Impact&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Device&lt;/th&gt;
&lt;th&gt;Emulated&lt;/th&gt;
&lt;th&gt;VirtIO&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;E1000: ~1 Gbps&lt;/td&gt;
&lt;td&gt;virtio-net: 10+ Gbps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk&lt;/td&gt;
&lt;td&gt;IDE: slow, high CPU&lt;/td&gt;
&lt;td&gt;virtio-blk: fast, low CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Display&lt;/td&gt;
&lt;td&gt;VGA: basic&lt;/td&gt;
&lt;td&gt;virtio-gpu: better&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Configuring VirtIO&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Disk: Use virtio-scsi controller
qm set 100 --scsihw virtio-scsi-pci
qm set 100 --scsi0 local-zfs:vm-100-disk-0

# Network: Use virtio
qm set 100 --net0 virtio,bridge=vmbr0

# Display: Use virtio (Linux VMs)
qm set 100 --vga virtio
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Windows VirtIO Drivers&lt;/h3&gt;
&lt;p&gt;Windows doesn&apos;t include VirtIO drivers. Install them:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Download ISO from Fedora: https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/&lt;/li&gt;
&lt;li&gt;Attach ISO to VM&lt;/li&gt;
&lt;li&gt;During Windows install: Load driver from ISO&lt;/li&gt;
&lt;li&gt;After install: Run &lt;code&gt;virtio-win-gt-x64.msi&lt;/code&gt; for guest tools&lt;/li&gt;
&lt;/ol&gt;
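&lt;p&gt;Step 2 from the CLI might look like this (the ISO filename depends on the version you downloaded):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Attach the VirtIO driver ISO as a second CD-ROM drive
qm set 100 --ide2 local:iso/virtio-win.iso,media=cdrom
&lt;/code&gt;&lt;/pre&gt;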
&lt;h2&gt;Storage Cache Modes&lt;/h2&gt;
&lt;p&gt;Cache mode affects performance vs. data safety:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Safety&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Safe&lt;/td&gt;
&lt;td&gt;Production (default)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writeback&lt;/td&gt;
&lt;td&gt;Fastest&lt;/td&gt;
&lt;td&gt;Less safe&lt;/td&gt;
&lt;td&gt;Benchmarks, non-critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;writethrough&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;td&gt;Safest&lt;/td&gt;
&lt;td&gt;Critical data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;directsync&lt;/td&gt;
&lt;td&gt;Slowest&lt;/td&gt;
&lt;td&gt;Safest&lt;/td&gt;
&lt;td&gt;Maximum safety&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Configure Cache&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# No cache (recommended for production)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=none

# Writeback (faster, less safe)
qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback

# With ZFS, cache=none is usually best
# ZFS has its own caching (ARC)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use Writeback&lt;/h3&gt;
&lt;p&gt;Only with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Battery-backed write cache (enterprise storage)&lt;/li&gt;
&lt;li&gt;Non-critical VMs (dev, test)&lt;/li&gt;
&lt;li&gt;Understanding that power loss = potential data loss&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;IO Threads&lt;/h2&gt;
&lt;p&gt;By default, all VM disk I/O goes through one QEMU thread. With IO threads, each disk gets its own thread.&lt;/p&gt;
&lt;h3&gt;Enable IO Threads&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable iothread for disk
qm set 100 --scsi0 local-zfs:vm-100-disk-0,iothread=1

# For multiple disks, each can have its own thread
qm set 100 --scsi1 local-zfs:vm-100-disk-1,iothread=1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When IO Threads Help&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Multiple disks per VM&lt;/li&gt;
&lt;li&gt;High IOPS workloads&lt;/li&gt;
&lt;li&gt;VMs with concurrent disk access&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;CPU Configuration&lt;/h2&gt;
&lt;h3&gt;CPU Type&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Host passthrough (best performance, limits migration)
qm set 100 --cpu host

# Specific type (allows migration between similar CPUs)
qm set 100 --cpu kvm64

# With flags (enable specific features)
qm set 100 --cpu host,flags=+aes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;host&lt;/strong&gt; gives best performance but limits live migration to identical CPUs.&lt;/p&gt;
&lt;h3&gt;NUMA Awareness&lt;/h3&gt;
&lt;p&gt;NUMA (Non-Uniform Memory Access) matters on multi-socket systems. Memory attached to one socket is faster for CPUs on that socket.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check host NUMA topology
numactl --hardware

# Example output:
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 0 size: 32768 MB
# node 1 size: 32768 MB
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure NUMA for VMs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable NUMA for VM
qm set 100 --numa 1

# Pin VM to specific NUMA node
qm set 100 --numa0 cpus=0-3,memory=8192

# For large VMs spanning nodes
qm set 100 --numa0 cpus=0-3,memory=4096
qm set 100 --numa1 cpus=8-11,memory=4096
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;CPU Pinning&lt;/h3&gt;
&lt;p&gt;Dedicate specific CPUs to a VM (reduces context switching):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Pin VM to CPUs 0-3
qm set 100 --affinity 0-3

# Or via NUMA config
qm set 100 --numa0 cpus=0-3,memory=8192
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Caution&lt;/strong&gt;: Over-pinning leaves other VMs fighting for remaining CPUs.&lt;/p&gt;
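&lt;p&gt;To confirm pinning took effect, check the affinity of the VM&apos;s QEMU process (Proxmox writes its PID to /var/run/qemu-server/&amp;lt;vmid&amp;gt;.pid):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Show which CPUs the QEMU process for VM 100 may run on
taskset -cp $(cat /var/run/qemu-server/100.pid)
&lt;/code&gt;&lt;/pre&gt;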
&lt;h2&gt;Hugepages&lt;/h2&gt;
&lt;p&gt;Normal memory pages are 4KB. Hugepages (2MB or 1GB) reduce TLB misses for memory-intensive workloads.&lt;/p&gt;
&lt;h3&gt;Enable Hugepages&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Reserve hugepages on host
echo 4096 &amp;gt; /proc/sys/vm/nr_hugepages  # 4096 × 2MB = 8GB

# Make persistent
echo &quot;vm.nr_hugepages = 4096&quot; &amp;gt;&amp;gt; /etc/sysctl.conf

# Verify
grep Huge /proc/meminfo
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure VM for Hugepages&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable hugepages for VM
qm set 100 --hugepages 2

# Values: 2 (2MB pages), 1024 (1GB pages), any (auto)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When Hugepages Help&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Large VMs (8GB+ RAM)&lt;/li&gt;
&lt;li&gt;Memory-intensive workloads (databases)&lt;/li&gt;
&lt;li&gt;Many VMs with significant memory&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Memory Ballooning&lt;/h2&gt;
&lt;p&gt;Balloon driver lets host reclaim unused VM memory.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable ballooning
qm set 100 --balloon 2048  # Minimum memory
qm set 100 --memory 8192   # Maximum memory

# VM starts with 8GB, can shrink to 2GB if host needs RAM
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Ballooning Trade-offs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Pro: Better memory utilization across VMs&lt;/li&gt;
&lt;li&gt;Con: Performance impact when balloon inflates&lt;/li&gt;
&lt;li&gt;Con: Swap inside VM if balloon too aggressive&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For latency-sensitive VMs, disable ballooning:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm set 100 --balloon 0
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Network Performance&lt;/h2&gt;
&lt;h3&gt;Multiqueue&lt;/h3&gt;
&lt;p&gt;Enable multiple queues for virtio-net:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable multiqueue (match to VM vCPUs, max 8)
qm set 100 --net0 virtio,bridge=vmbr0,queues=4
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Inside VM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Set queues on interface
ethtool -L eth0 combined 4
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Vhost-net&lt;/h3&gt;
&lt;p&gt;Offload network processing to kernel (usually enabled by default):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Verify vhost-net is loaded
lsmod | grep vhost_net

# If not loaded
modprobe vhost_net
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Storage Performance&lt;/h2&gt;
&lt;h3&gt;ZFS Tuning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check ARC size
arc_summary | grep &quot;ARC size&quot;

# Increase ARC max (if you have RAM)
echo &quot;options zfs zfs_arc_max=8589934592&quot; &amp;gt; /etc/modprobe.d/zfs.conf  # 8GB
# Takes effect after update-initramfs -u and a reboot; for an immediate
# change, write the value to /sys/module/zfs/parameters/zfs_arc_max

# For SSDs, a shorter transaction group timeout can lower write latency
# (the default is already 5 seconds; measure before and after changing it)
echo 2 &amp;gt; /sys/module/zfs/parameters/zfs_txg_timeout
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;LVM-thin Tuning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check thin pool status
lvs -o+data_percent

# Zeroing (disable for SSD, faster provisioning)
lvchange --zero n pve/data
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Ceph Tuning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check pool settings
ceph osd pool get vmpool all

# Increase pg_num if needed
ceph osd pool set vmpool pg_num 256

# Adjust recovery (if impacting production)
ceph config set osd osd_recovery_max_active 1
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Bottlenecks&lt;/h2&gt;
&lt;h3&gt;CPU Bottleneck&lt;/h3&gt;
&lt;p&gt;Symptoms: High CPU usage, steal time in VMs&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check host CPU
mpstat -P ALL 1

# Check VM steal time
top  # %st column

# Solutions:
# - Reduce VM count
# - Pin VMs to specific CPUs
# - Upgrade host CPU
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Memory Bottleneck&lt;/h3&gt;
&lt;p&gt;Symptoms: Swapping, OOM, balloon activity&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check host memory
free -h
grep -E &quot;MemTotal|MemFree|Buffers|Cached|SwapTotal|SwapFree&quot; /proc/meminfo

# Check ZFS ARC (consuming RAM)
arc_summary | head -20

# Solutions:
# - Reduce ZFS ARC max
# - Reduce VM memory
# - Add more host RAM
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Storage Bottleneck&lt;/h3&gt;
&lt;p&gt;Symptoms: High I/O wait, slow disk operations&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check disk latency
iostat -x 1

# Look for:
# - await &amp;gt; 10ms (spinning disk) or &amp;gt; 1ms (SSD)
# - %util &amp;gt; 80%

# Solutions:
# - Move to faster storage
# - Enable IO threads
# - Reduce concurrent I/O (fewer VMs)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Network Bottleneck&lt;/h3&gt;
&lt;p&gt;Symptoms: Low throughput, high latency&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check interface utilization
iftop -i vmbr0

# Check for errors
ip -s link show vmbr0

# Solutions:
# - Enable virtio multiqueue
# - Bond multiple NICs
# - Upgrade to faster network
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Performance Testing Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Baseline&lt;/strong&gt;: Measure current performance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Identify&lt;/strong&gt;: Find the bottleneck (CPU, RAM, disk, network)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Change&lt;/strong&gt;: Make ONE change&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measure&lt;/strong&gt;: Test the same workload&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compare&lt;/strong&gt;: Did it improve?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iterate&lt;/strong&gt;: Repeat until satisfied&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Never change multiple things at once. You won&apos;t know what helped.&lt;/p&gt;
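&lt;p&gt;Step 5 is plain arithmetic: compare the same metric before and after the change. A tiny shell sketch with illustrative fio IOPS numbers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Percent change between two fio runs (integer math; numbers are examples)
baseline=52000   # IOPS before the change
after=61000      # IOPS after the change
delta=$(( (after - baseline) * 100 / baseline ))
echo delta: ${delta}%
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the delta is marginal or negative, revert and try the next candidate.&lt;/p&gt;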
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Optimization starts with measurement, not with tweaks.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Random performance settings from forums:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Might help your workload&lt;/li&gt;
&lt;li&gt;Might hurt your workload&lt;/li&gt;
&lt;li&gt;Might do nothing&lt;/li&gt;
&lt;li&gt;You won&apos;t know which without measuring&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Measure baseline&lt;/li&gt;
&lt;li&gt;Identify bottleneck&lt;/li&gt;
&lt;li&gt;Research solutions for THAT bottleneck&lt;/li&gt;
&lt;li&gt;Apply change&lt;/li&gt;
&lt;li&gt;Measure again&lt;/li&gt;
&lt;li&gt;Keep or revert&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Performance tuning isn&apos;t about knowing magic settings. It&apos;s about understanding your workload, measuring it, and systematically removing bottlenecks.&lt;/p&gt;
&lt;p&gt;The best optimization is often avoiding the problem: use VirtIO, use SSDs, have enough RAM. The tweaks come after the fundamentals are right.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>tuning</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Observability: Metrics, Logs, Alerts — What I Monitor on Proxmox</title><link>https://ashimov.com/posts/proxmox-observability/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-observability/</guid><description>Complete Proxmox monitoring setup. Covers node metrics, storage health, ZFS/Ceph monitoring, log aggregation, alerting rules, and why you cannot manage what you cannot see.</description><pubDate>Tue, 23 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The Proxmox web UI shows current state. It doesn&apos;t show trends. It doesn&apos;t show &quot;disk was filling up for weeks before it failed.&quot; It doesn&apos;t wake you up at 3 AM when something is about to break.&lt;/p&gt;
&lt;p&gt;Observability means knowing what&apos;s happening before users tell you. Metrics show trends. Logs show context. Alerts notify you before failures become outages.&lt;/p&gt;
&lt;p&gt;You can&apos;t manage what you can&apos;t see. And the Proxmox UI isn&apos;t enough to see.&lt;/p&gt;
&lt;h2&gt;What to Monitor&lt;/h2&gt;
&lt;h3&gt;Host Metrics&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;th&gt;Alert Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU usage&lt;/td&gt;
&lt;td&gt;Overloaded host&lt;/td&gt;
&lt;td&gt;&amp;gt;90% for 5 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory usage&lt;/td&gt;
&lt;td&gt;OOM risk&lt;/td&gt;
&lt;td&gt;&amp;gt;85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load average&lt;/td&gt;
&lt;td&gt;System stress&lt;/td&gt;
&lt;td&gt;&amp;gt;cores×2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk I/O&lt;/td&gt;
&lt;td&gt;Storage bottleneck&lt;/td&gt;
&lt;td&gt;Latency &amp;gt;50ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network I/O&lt;/td&gt;
&lt;td&gt;Bandwidth saturation&lt;/td&gt;
&lt;td&gt;&amp;gt;80% capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
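&lt;p&gt;As a sketch, the CPU and memory thresholds above translate into Prometheus alert rules like this (metric names come from node_exporter):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /etc/prometheus/rules/hosts.yml
groups:
  - name: proxmox-hosts
    rules:
      - alert: HostCpuHigh
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=&quot;idle&quot;}[5m])) * 100) &amp;gt; 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: &quot;CPU on {{ $labels.instance }} above 90% for 5 minutes&quot;

      - alert: HostMemoryHigh
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 &amp;gt; 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: &quot;Memory on {{ $labels.instance }} above 85%&quot;
&lt;/code&gt;&lt;/pre&gt;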
&lt;h3&gt;Storage Metrics&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;th&gt;Alert Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Disk space&lt;/td&gt;
&lt;td&gt;Running out&lt;/td&gt;
&lt;td&gt;&amp;gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZFS pool health&lt;/td&gt;
&lt;td&gt;Data integrity&lt;/td&gt;
&lt;td&gt;Any non-ONLINE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZFS ARC hit rate&lt;/td&gt;
&lt;td&gt;Cache efficiency&lt;/td&gt;
&lt;td&gt;Below 80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ceph health&lt;/td&gt;
&lt;td&gt;Cluster state&lt;/td&gt;
&lt;td&gt;Any non-HEALTH_OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SMART status&lt;/td&gt;
&lt;td&gt;Disk failure prediction&lt;/td&gt;
&lt;td&gt;Any warning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;VM Metrics&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;th&gt;Alert Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VM count&lt;/td&gt;
&lt;td&gt;Capacity planning&lt;/td&gt;
&lt;td&gt;Depends&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Running vs stopped&lt;/td&gt;
&lt;td&gt;Unexpected states&lt;/td&gt;
&lt;td&gt;Any unexpected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU steal time&lt;/td&gt;
&lt;td&gt;Overcommit&lt;/td&gt;
&lt;td&gt;&amp;gt;10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Balloon memory&lt;/td&gt;
&lt;td&gt;Memory pressure&lt;/td&gt;
&lt;td&gt;Significant deflation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Prometheus + Grafana Setup&lt;/h2&gt;
&lt;p&gt;The standard stack: Prometheus scrapes metrics, Grafana visualizes.&lt;/p&gt;
&lt;h3&gt;Install on Separate VM&lt;/h3&gt;
&lt;p&gt;Don&apos;t monitor Proxmox from Proxmox. If the host dies, monitoring dies.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On monitoring VM - install Prometheus
apt update
apt install -y prometheus prometheus-node-exporter

# Add Grafana repository
apt install -y apt-transport-https software-properties-common
wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo &quot;deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main&quot; | tee /etc/apt/sources.list.d/grafana.list
apt update
apt install -y grafana

systemctl enable --now prometheus prometheus-node-exporter grafana-server
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Proxmox PVE Exporter&lt;/h3&gt;
&lt;p&gt;Prometheus exporter specifically for Proxmox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Install
pip install prometheus-pve-exporter

# Create config
cat &amp;gt; /etc/pve-exporter.yml &amp;lt;&amp;lt; &apos;EOF&apos;
default:
  user: monitoring@pve
  token_name: prometheus
  token_value: &quot;xxxx-xxxx-xxxx&quot;
  verify_ssl: false
EOF

# Create systemd service
cat &amp;gt; /etc/systemd/system/pve-exporter.service &amp;lt;&amp;lt; &apos;EOF&apos;
[Unit]
Description=Prometheus PVE Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/pve_exporter /etc/pve-exporter.yml
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now pve-exporter
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Node Exporter on Proxmox Hosts&lt;/h3&gt;
&lt;p&gt;Install on each Proxmox node:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apt install prometheus-node-exporter
systemctl enable --now prometheus-node-exporter
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prometheus Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus self
  - job_name: &apos;prometheus&apos;
    static_configs:
      - targets: [&apos;localhost:9090&apos;]

  # Node exporters on Proxmox hosts
  - job_name: &apos;proxmox-nodes&apos;
    static_configs:
      - targets:
          - &apos;pve1:9100&apos;
          - &apos;pve2:9100&apos;
          - &apos;pve3:9100&apos;

  # PVE exporter
  - job_name: &apos;proxmox-pve&apos;
    static_configs:
      - targets:
          - &apos;localhost:9221&apos;
    metrics_path: /pve
    params:
      module: [default]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Grafana Dashboards&lt;/h3&gt;
&lt;p&gt;Import community dashboards:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Proxmox VE&lt;/strong&gt;: Dashboard ID 10347&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Node Exporter Full&lt;/strong&gt;: Dashboard ID 1860&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ZFS&lt;/strong&gt;: Dashboard ID 11337&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ceph&lt;/strong&gt;: Dashboard ID 2842&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Or create custom dashboards for your specific needs.&lt;/p&gt;
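&lt;p&gt;Imports can also be scripted, which keeps a rebuilt Grafana reproducible. A hedged sketch (host and token are placeholders; dashboards with datasource inputs may need a populated &quot;inputs&quot; array):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Download a community dashboard by ID (1860 = Node Exporter Full)
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download -o dash.json

# Import it via the Grafana HTTP API
curl -s -X POST http://grafana:3000/api/dashboards/import \
  -H &quot;Authorization: Bearer $GRAFANA_TOKEN&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &quot;{\&quot;dashboard\&quot;: $(cat dash.json), \&quot;overwrite\&quot;: true, \&quot;inputs\&quot;: []}&quot;
&lt;/code&gt;&lt;/pre&gt;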
&lt;h2&gt;ZFS Monitoring&lt;/h2&gt;
&lt;h3&gt;ZFS Exporter&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Install
pip install prometheus-zfs-exporter

# Run
zfs_exporter --port 9134
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key ZFS Metrics&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Prometheus queries

# Pool capacity
zfs_pool_allocated_bytes / zfs_pool_size_bytes * 100

# ARC hit rate
rate(zfs_arc_hits_total[5m]) / (rate(zfs_arc_hits_total[5m]) + rate(zfs_arc_misses_total[5m])) * 100

# Scrub errors
zfs_pool_scrub_errors_total

# Pool state (1 = ONLINE)
zfs_pool_health == 1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;ZFS Alerts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/prometheus/rules/zfs.yml
groups:
  - name: zfs
    rules:
      - alert: ZFSPoolDegraded
        expr: zfs_pool_health != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: &quot;ZFS pool {{ $labels.pool }} is degraded&quot;

      - alert: ZFSPoolSpaceLow
        expr: (zfs_pool_allocated_bytes / zfs_pool_size_bytes) * 100 &amp;gt; 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: &quot;ZFS pool {{ $labels.pool }} is {{ $value }}% full&quot;

      - alert: ZFSScrubErrors
        expr: zfs_pool_scrub_errors_total &amp;gt; 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: &quot;ZFS pool {{ $labels.pool }} has scrub errors&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Ceph Monitoring&lt;/h2&gt;
&lt;h3&gt;Built-in Ceph Metrics&lt;/h3&gt;
&lt;p&gt;Ceph exposes Prometheus metrics natively:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On Ceph manager node
ceph mgr module enable prometheus

# Metrics at
# http://ceph-mgr:9283/metrics
&lt;/code&gt;&lt;/pre&gt;
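&lt;p&gt;The port can differ, so confirm where the module is listening before pointing Prometheus at it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Lists active manager service endpoints, e.g.
# {&quot;prometheus&quot;: &quot;http://pve1:9283/&quot;}
ceph mgr services
&lt;/code&gt;&lt;/pre&gt;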
&lt;h3&gt;Prometheus Config for Ceph&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Add to prometheus.yml
- job_name: &apos;ceph&apos;
  static_configs:
    - targets:
        - &apos;pve1:9283&apos;  # Ceph manager
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Ceph Alerts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/prometheus/rules/ceph.yml
groups:
  - name: ceph
    rules:
      - alert: CephHealthWarning
        expr: ceph_health_status == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: &quot;Ceph cluster health is WARN&quot;

      - alert: CephHealthCritical
        expr: ceph_health_status == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: &quot;Ceph cluster health is ERROR (HEALTH_ERR)&quot;

      - alert: CephOSDDown
        expr: ceph_osd_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: &quot;Ceph OSD {{ $labels.osd }} is down&quot;

      - alert: CephPGsUnclean
        expr: ceph_pg_total - ceph_pg_active_clean &amp;gt; 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: &quot;Ceph has {{ $value }} unclean PGs&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;SMART Monitoring&lt;/h2&gt;
&lt;p&gt;Predict disk failures before they happen:&lt;/p&gt;
&lt;h3&gt;Install smartmontools&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On each Proxmox node
apt install smartmontools

# Enable SMART on disks
smartctl --smart=on /dev/sda
&lt;/code&gt;&lt;/pre&gt;
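&lt;p&gt;Before wiring up an exporter, spot-check a drive directly. A sketch using smartctl&apos;s JSON output (device path is an example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Overall health verdict (true = passed)
smartctl -j -H /dev/sda | jq .smart_status.passed

# Kick off a short self-test; results show up in the log a few minutes later
smartctl -t short /dev/sda
smartctl -l selftest /dev/sda
&lt;/code&gt;&lt;/pre&gt;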
&lt;h3&gt;Prometheus SMART Exporter&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Install
pip install prometheus-smart-exporter

# Run
smart_exporter --port 9110
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;SMART Alerts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/prometheus/rules/smart.yml
groups:
  - name: smart
    rules:
      - alert: DiskSMARTWarning
        expr: smart_device_health != 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: &quot;Disk {{ $labels.device }} SMART health warning&quot;

      - alert: DiskReallocationCount
        expr: smart_raw_value{attribute_name=&quot;Reallocated_Sector_Ct&quot;} &amp;gt; 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: &quot;Disk {{ $labels.device }} has reallocated sectors&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Log Aggregation&lt;/h2&gt;
&lt;h3&gt;Loki for Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# docker-compose.yml for Loki
version: &quot;3&quot;
services:
  loki:
    image: grafana/loki:latest
    ports:
      - &quot;3100:3100&quot;
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki-data:
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Promtail on Proxmox Nodes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# /etc/promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: proxmox
    static_configs:
      - targets:
          - localhost
        labels:
          job: proxmox
          host: pve1
          __path__: /var/log/*.log

  - job_name: pve-cluster
    static_configs:
      - targets:
          - localhost
        labels:
          job: pve-cluster
          host: pve1
          __path__: /var/log/pve/tasks/*
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Logs to Monitor&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Proxmox-specific logs
/var/log/pveproxy.log      # Web UI access
/var/log/pve/tasks/        # Task logs
/var/log/pve-firewall.log  # Firewall logs

# System logs
/var/log/syslog            # General system
/var/log/auth.log          # Authentication
/var/log/kern.log          # Kernel messages

# Ceph logs (if using)
/var/log/ceph/*.log
&lt;/code&gt;&lt;/pre&gt;
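&lt;p&gt;Once these files land in Loki, they become queryable from Grafana&apos;s Explore view. Two LogQL sketches (labels match the Promtail config above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# All lines from pve1 mentioning errors
{job=&quot;proxmox&quot;, host=&quot;pve1&quot;} |= &quot;error&quot;

# Per-host rate of error lines over 5 minutes
sum by (host) (rate({job=&quot;proxmox&quot;} |= &quot;error&quot; [5m]))
&lt;/code&gt;&lt;/pre&gt;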
&lt;h2&gt;Alerting Rules Summary&lt;/h2&gt;
&lt;h3&gt;Critical Alerts (Page Immediately)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;groups:
  - name: critical
    rules:
      - alert: HostDown
        expr: up{job=&quot;proxmox-nodes&quot;} == 0
        for: 1m
        labels:
          severity: critical

      - alert: StorageCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 &amp;lt; 10
        for: 1m
        labels:
          severity: critical

      - alert: MemoryExhausted
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 &amp;lt; 5
        for: 1m
        labels:
          severity: critical

      - alert: ZFSPoolFailed
        expr: zfs_pool_health &amp;gt;= 2  # DEGRADED or worse
        for: 1m
        labels:
          severity: critical
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Warning Alerts (Check Soon)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;groups:
  - name: warnings
    rules:
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=&quot;idle&quot;}[5m])) * 100) &amp;gt; 90
        for: 15m
        labels:
          severity: warning

      - alert: StorageWarning
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 &amp;lt; 20
        for: 5m
        labels:
          severity: warning

      - alert: BackupFailed
        expr: pve_storage_backup_last_success_time &amp;lt; (time() - 86400)
        for: 1h
        labels:
          severity: warning
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Alertmanager Configuration&lt;/h2&gt;
&lt;p&gt;Route alerts appropriately:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: &apos;smtp.example.com:587&apos;
  smtp_from: &apos;alerts@example.com&apos;

route:
  group_by: [&apos;alertname&apos;, &apos;severity&apos;]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: &apos;default&apos;

  routes:
    - match:
        severity: critical
      receiver: &apos;pagerduty&apos;
      repeat_interval: 1h

    - match:
        severity: warning
      receiver: &apos;slack&apos;
      repeat_interval: 4h

receivers:
  - name: &apos;default&apos;
    email_configs:
      - to: &apos;admin@example.com&apos;

  - name: &apos;pagerduty&apos;
    pagerduty_configs:
      - service_key: &apos;xxx&apos;

  - name: &apos;slack&apos;
    slack_configs:
      - api_url: &apos;https://hooks.slack.com/services/xxx&apos;
        channel: &apos;#alerts&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Dashboard Overview&lt;/h2&gt;
&lt;p&gt;My Grafana home dashboard shows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Row 1: Cluster Overview
├── Total nodes (up/down)
├── Total VMs (running/stopped)
├── Cluster storage usage
└── Active alerts

Row 2: Per-Node Resources
├── CPU usage per node
├── Memory usage per node
├── Network I/O per node
└── Disk I/O per node

Row 3: Storage Health
├── ZFS pool status
├── Ceph health (if using)
├── Storage capacity trends
└── SMART warnings

Row 4: Backups
├── Last backup time
├── Backup success rate
├── Backup storage usage
└── Restore test status
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;You can&apos;t manage what you can&apos;t see.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Proxmox UI shows the present. Observability shows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What happened (logs)&lt;/li&gt;
&lt;li&gt;How things are trending (metrics)&lt;/li&gt;
&lt;li&gt;What&apos;s about to break (alerts)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The investment in monitoring pays off when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Disk fills up → you knew 2 weeks ago&lt;/li&gt;
&lt;li&gt;Host overloaded → you saw the trend&lt;/li&gt;
&lt;li&gt;Ceph degraded → alerted immediately&lt;/li&gt;
&lt;li&gt;Backup failed → notified same day&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without monitoring, you find out when users complain. With monitoring, you find out before users notice. That&apos;s the difference between reactive and proactive operations.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>monitoring</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>IP Management: Getting VM IPs Reliably (DHCP, MAC Mapping, Integrations)</title><link>https://ashimov.com/posts/proxmox-ip-management/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-ip-management/</guid><description>Reliable IP address management for Proxmox VMs. Covers DHCP strategies, MAC-to-IP mapping, router integrations, inventory collection, and why IP addresses are data that must be collected automatically.</description><pubDate>Fri, 19 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&quot;What&apos;s the IP of that VM?&quot; shouldn&apos;t require logging into Proxmox, checking DHCP leases, or guessing. IP addresses are infrastructure data. They should be queryable, predictable, and documented automatically.&lt;/p&gt;
&lt;p&gt;Manual IP tracking breaks at scale: spreadsheets go stale, DHCP hands out a different IP after a reboot or lease expiry, and static IPs mean manual per-VM configuration.&lt;/p&gt;
&lt;p&gt;IP addresses are data. They need to be collected automatically.&lt;/p&gt;
&lt;h2&gt;The IP Problem&lt;/h2&gt;
&lt;p&gt;VMs need IP addresses. Getting them reliably is harder than it looks:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DHCP challenges:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;IP changes on reboot (unless reserved)&lt;/li&gt;
&lt;li&gt;Lease expires, new IP assigned&lt;/li&gt;
&lt;li&gt;&quot;What IP did that VM get?&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Static IP challenges:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Manual configuration per VM&lt;/li&gt;
&lt;li&gt;Easy to have conflicts&lt;/li&gt;
&lt;li&gt;Doesn&apos;t work with templates (need customization)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cloud-init challenges:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works great for initial setup&lt;/li&gt;
&lt;li&gt;Changing IP requires VM recreation&lt;/li&gt;
&lt;li&gt;Need to track assigned IPs somewhere&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Strategy 1: DHCP with MAC Reservations&lt;/h2&gt;
&lt;p&gt;The most reliable approach for dynamic environments: the DHCP server reserves an IP based on the VM&apos;s MAC address.&lt;/p&gt;
&lt;h3&gt;How It Works&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. VM created with specific MAC address
2. MAC registered in DHCP server with reserved IP
3. VM boots, requests DHCP
4. DHCP server gives reserved IP
5. IP is consistent across reboots
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Proxmox Side&lt;/h3&gt;
&lt;p&gt;Specify MAC address when creating VM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create VM with specific MAC
qm create 100 --name web-server --net0 virtio=BC:24:11:00:01:00,bridge=vmbr0

# Or update existing
qm set 100 --net0 virtio=BC:24:11:00:01:00,bridge=vmbr0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use a MAC address scheme:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;BC:24:11:XX:YY:ZZ
         │  │  └─ Sequence (00-FF)
         │  └──── VM ID low byte
         └─────── VM ID high byte

Example:
VM 100: BC:24:11:00:64:00 (0x64 = 100)
VM 101: BC:24:11:00:65:00
VM 256: BC:24:11:01:00:00
&lt;/code&gt;&lt;/pre&gt;
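&lt;p&gt;The scheme is easy to generate programmatically, which keeps MAC assignment out of human hands. A minimal sketch (hypothetical helper name):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def vmid_to_mac(vmid, seq=0):
    # Encode the VM ID into octets 4 and 5 (high byte, low byte),
    # with a per-NIC sequence number in the last octet.
    high = (vmid // 256) % 256
    low = vmid % 256
    return "BC:24:11:{:02X}:{:02X}:{:02X}".format(high, low, seq)

print(vmid_to_mac(100))  # BC:24:11:00:64:00
&lt;/code&gt;&lt;/pre&gt;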
&lt;h3&gt;Router Side (MikroTik Example)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Add DHCP reservation
/ip dhcp-server lease add address=10.0.0.100 mac-address=BC:24:11:00:64:00 server=dhcp1 comment=&quot;web-server&quot;

# Or via script for bulk
:foreach mac,ip in={
  &quot;BC:24:11:00:64:00&quot;=&quot;10.0.0.100&quot;;
  &quot;BC:24:11:00:65:00&quot;=&quot;10.0.0.101&quot;;
  &quot;BC:24:11:00:66:00&quot;=&quot;10.0.0.102&quot;
} do={
  /ip dhcp-server lease add address=$ip mac-address=$mac server=dhcp1
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Router Side (OPNsense/pfSense)&lt;/h3&gt;
&lt;p&gt;Services → DHCPv4 → [Interface] → DHCP Static Mappings&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;MAC address: BC:24:11:00:64:00
IP address: 10.0.0.100
Hostname: web-server
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Router Side (VyOS)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;configure
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 static-mapping web-server mac-address &apos;BC:24:11:00:64:00&apos;
set service dhcp-server shared-network-name LAN subnet 10.0.0.0/24 static-mapping web-server ip-address &apos;10.0.0.100&apos;
commit
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Strategy 2: Cloud-Init Static IPs&lt;/h2&gt;
&lt;p&gt;For immutable VMs where IP is set at creation.&lt;/p&gt;
&lt;h3&gt;Terraform Example&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;resource &quot;proxmox_vm_qemu&quot; &quot;server&quot; {
  name   = &quot;web-server&quot;
  clone  = &quot;ubuntu-template&quot;

  ipconfig0 = &quot;ip=10.0.0.100/24,gw=10.0.0.1&quot;

  # IP is set via cloud-init at first boot
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Manual Cloud-Init&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;qm set 100 --ipconfig0 ip=10.0.0.100/24,gw=10.0.0.1
&lt;/code&gt;&lt;/pre&gt;
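&lt;p&gt;Before booting, you can verify exactly what cloud-init will receive:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Dump the generated network config for VM 100
qm cloudinit dump 100 network
&lt;/code&gt;&lt;/pre&gt;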
&lt;h3&gt;Tracking Static IPs&lt;/h3&gt;
&lt;p&gt;Maintain IP allocation in code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# inventory/ip-allocation.yml
networks:
  production:
    subnet: 10.0.0.0/24
    gateway: 10.0.0.1
    allocated:
      10.0.0.10: proxmox-host
      10.0.0.100: web-server-1
      10.0.0.101: web-server-2
      10.0.0.150: database

  management:
    subnet: 10.10.0.0/24
    gateway: 10.10.0.1
    allocated:
      10.10.0.10: proxmox-mgmt
      10.10.0.100: monitoring
&lt;/code&gt;&lt;/pre&gt;
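&lt;p&gt;Because the allocation file is code, duplicate assignments can be caught before deployment. A sketch (hypothetical function, operating on the parsed YAML as a plain dict):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import collections

def find_conflicts(networks):
    # Map each IP to every host that claims it, across all networks.
    owners = collections.defaultdict(list)
    for net in networks.values():
        for ip, host in net.get("allocated", {}).items():
            owners[ip].append(host)
    # Keep only IPs claimed more than once.
    return {ip: hosts for ip, hosts in owners.items() if len(hosts) != 1}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Wire it into a pre-commit hook or CI step and fail the run when it returns anything.&lt;/p&gt;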
&lt;h2&gt;Strategy 3: IPAM Integration&lt;/h2&gt;
&lt;p&gt;For larger environments, use dedicated IPAM (IP Address Management).&lt;/p&gt;
&lt;h3&gt;phpIPAM Integration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Query IPAM for next available IP
curl -X POST &quot;https://ipam.example.com/api/app/addresses/first_free/3/&quot; \
  -H &quot;token: xxx&quot; \
  -d &quot;hostname=new-server&quot;

# Register IP
curl -X POST &quot;https://ipam.example.com/api/app/addresses/&quot; \
  -H &quot;token: xxx&quot; \
  -d &quot;subnetId=3&amp;amp;ip=10.0.0.105&amp;amp;hostname=new-server&amp;amp;description=Web server&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;NetBox Integration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;import pynetbox

nb = pynetbox.api(&apos;https://netbox.example.com&apos;, token=&apos;xxx&apos;)

# Get next available IP
prefix = nb.ipam.prefixes.get(prefix=&apos;10.0.0.0/24&apos;)
next_ip = prefix.available_ips.list()[0]

# Create IP assignment
nb.ipam.ip_addresses.create(
    address=str(next_ip),
    dns_name=&apos;web-server.lab.local&apos;,
    description=&apos;Web server&apos;,
    assigned_object_type=&apos;virtualization.virtualmachine&apos;,
    assigned_object_id=vm_id
)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Collecting VM IPs from Proxmox&lt;/h2&gt;
&lt;h3&gt;Via QEMU Guest Agent&lt;/h3&gt;
&lt;p&gt;Requires qemu-guest-agent installed in VM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Get network info from running VM
qm guest cmd 100 network-get-interfaces

# Output includes IP addresses
# Parse with jq
qm guest cmd 100 network-get-interfaces | jq -r &apos;.[] | select(.name != &quot;lo&quot;) | .[&quot;ip-addresses&quot;][] | select(.[&quot;ip-address-type&quot;] == &quot;ipv4&quot;) | .[&quot;ip-address&quot;]&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Via API&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Get VM status including network
pvesh get /nodes/pve1/qemu/100/agent/network-get-interfaces

# Or for all VMs
for vmid in $(pvesh get /nodes/pve1/qemu --output-format json | jq -r &apos;.[].vmid&apos;); do
  echo &quot;VM ${vmid}:&quot;
  pvesh get /nodes/pve1/qemu/${vmid}/agent/network-get-interfaces 2&amp;gt;/dev/null | jq -r &apos;.result[] | select(.name != &quot;lo&quot;) | &quot;\(.name): \(.[&quot;ip-addresses&quot;][0][&quot;ip-address&quot;])&quot;&apos;
done
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Inventory Script&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# collect-inventory.py

import json
import subprocess
from proxmoxer import ProxmoxAPI

proxmox = ProxmoxAPI(&apos;proxmox.lab.local&apos;, user=&apos;root@pam&apos;, password=&apos;xxx&apos;, verify_ssl=False)

inventory = {}

for node in proxmox.nodes.get():
    node_name = node[&apos;node&apos;]

    for vm in proxmox.nodes(node_name).qemu.get():
        vmid = vm[&apos;vmid&apos;]
        name = vm[&apos;name&apos;]
        status = vm[&apos;status&apos;]

        vm_info = {
            &apos;name&apos;: name,
            &apos;node&apos;: node_name,
            &apos;status&apos;: status,
            &apos;ip_addresses&apos;: []
        }

        if status == &apos;running&apos;:
            try:
                interfaces = proxmox.nodes(node_name).qemu(vmid).agent(&apos;network-get-interfaces&apos;).get()
                for iface in interfaces[&apos;result&apos;]:
                    if iface[&apos;name&apos;] != &apos;lo&apos;:
                        for addr in iface.get(&apos;ip-addresses&apos;, []):
                            if addr[&apos;ip-address-type&apos;] == &apos;ipv4&apos;:
                                vm_info[&apos;ip_addresses&apos;].append(addr[&apos;ip-address&apos;])
            except Exception:
                pass  # Guest agent not available

        inventory[vmid] = vm_info

print(json.dumps(inventory, indent=2))
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Dynamic Ansible Inventory&lt;/h2&gt;
&lt;p&gt;Generate Ansible inventory from Proxmox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
# proxmox_inventory.py

import json
from proxmoxer import ProxmoxAPI

def get_inventory():
    proxmox = ProxmoxAPI(&apos;proxmox.lab.local&apos;,
                         user=&apos;ansible@pve!inventory&apos;,
                         token_name=&apos;inventory&apos;,
                         token_value=&apos;xxx&apos;,
                         verify_ssl=False)

    inventory = {
        &apos;_meta&apos;: {&apos;hostvars&apos;: {}},
        &apos;all&apos;: {&apos;children&apos;: [&apos;proxmox_vms&apos;]},
        &apos;proxmox_vms&apos;: {&apos;hosts&apos;: []}
    }

    for node in proxmox.nodes.get():
        for vm in proxmox.nodes(node[&apos;node&apos;]).qemu.get():
            if vm[&apos;status&apos;] != &apos;running&apos;:
                continue

            vmid = vm[&apos;vmid&apos;]
            name = vm[&apos;name&apos;]

            # Get IP from guest agent
            try:
                interfaces = proxmox.nodes(node[&apos;node&apos;]).qemu(vmid).agent(&apos;network-get-interfaces&apos;).get()
                for iface in interfaces[&apos;result&apos;]:
                    if iface[&apos;name&apos;] != &apos;lo&apos;:
                        for addr in iface.get(&apos;ip-addresses&apos;, []):
                            if addr[&apos;ip-address-type&apos;] == &apos;ipv4&apos;:
                                ip = addr[&apos;ip-address&apos;]
                                inventory[&apos;proxmox_vms&apos;][&apos;hosts&apos;].append(name)
                                inventory[&apos;_meta&apos;][&apos;hostvars&apos;][name] = {
                                    &apos;ansible_host&apos;: ip,
                                    &apos;proxmox_vmid&apos;: vmid,
                                    &apos;proxmox_node&apos;: node[&apos;node&apos;]
                                }
                                break
            except Exception:
                pass  # Guest agent not available

    return inventory

if __name__ == &apos;__main__&apos;:
    print(json.dumps(get_inventory(), indent=2))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Usage:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Use dynamic inventory
ansible -i proxmox_inventory.py all -m ping

# In ansible.cfg
[defaults]
inventory = ./proxmox_inventory.py
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;DNS Integration&lt;/h2&gt;
&lt;p&gt;Automatically register VMs in DNS:&lt;/p&gt;
&lt;h3&gt;With PowerDNS API&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# register-dns.sh

VM_NAME=$1
IP=$2
DOMAIN=&quot;lab.local&quot;
PDNS_API=&quot;http://dns.lab.local:8081/api/v1&quot;
PDNS_KEY=&quot;xxx&quot;

# Add A record
curl -X PATCH &quot;${PDNS_API}/servers/localhost/zones/${DOMAIN}.&quot; \
  -H &quot;X-API-Key: ${PDNS_KEY}&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &quot;{
    \&quot;rrsets\&quot;: [{
      \&quot;name\&quot;: \&quot;${VM_NAME}.${DOMAIN}.\&quot;,
      \&quot;type\&quot;: \&quot;A\&quot;,
      \&quot;ttl\&quot;: 300,
      \&quot;changetype\&quot;: \&quot;REPLACE\&quot;,
      \&quot;records\&quot;: [{\&quot;content\&quot;: \&quot;${IP}\&quot;, \&quot;disabled\&quot;: false}]
    }]
  }&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;With nsupdate (BIND)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# update-dns.sh

VM_NAME=$1
IP=$2
DOMAIN=&quot;lab.local&quot;
DNS_SERVER=&quot;10.0.0.53&quot;
KEY_FILE=&quot;/etc/bind/keys/update.key&quot;

nsupdate -k ${KEY_FILE} &amp;lt;&amp;lt; EOF
server ${DNS_SERVER}
zone ${DOMAIN}
update delete ${VM_NAME}.${DOMAIN} A
update add ${VM_NAME}.${DOMAIN} 300 A ${IP}
send
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Automation Pipeline&lt;/h2&gt;
&lt;p&gt;Complete workflow:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# vm-creation.yml (Ansible)
- name: Create VM with managed IP
  hosts: localhost
  vars:
    vm_name: web-server
    vm_id: 100
    mac_address: &quot;BC:24:11:00:64:00&quot;
    ip_address: &quot;10.0.0.100&quot;

  tasks:
    - name: Create VM in Proxmox
      community.general.proxmox_kvm:
        api_host: proxmox.lab.local
        api_token_id: terraform
        api_token_secret: &quot;{{ vault_proxmox_token }}&quot;
        node: pve1
        vmid: &quot;{{ vm_id }}&quot;
        name: &quot;{{ vm_name }}&quot;
        clone: ubuntu-template
        net:
          net0: &quot;virtio={{ mac_address }},bridge=vmbr0&quot;

    - name: Register DHCP reservation on router
      community.routeros.command:
        commands:
          - /ip dhcp-server lease add address={{ ip_address }} mac-address={{ mac_address }} server=dhcp1 comment=&quot;{{ vm_name }}&quot;
      delegate_to: router

    - name: Register DNS
      community.general.nsupdate:
        server: &quot;10.0.0.53&quot;
        zone: &quot;lab.local&quot;
        record: &quot;{{ vm_name }}&quot;
        type: &quot;A&quot;
        value: &quot;{{ ip_address }}&quot;

    - name: Start VM
      community.general.proxmox_kvm:
        api_host: proxmox.lab.local
        api_token_id: terraform
        api_token_secret: &quot;{{ vault_proxmox_token }}&quot;
        node: pve1
        vmid: &quot;{{ vm_id }}&quot;
        state: started

    - name: Wait for VM to be reachable
      wait_for:
        host: &quot;{{ ip_address }}&quot;
        port: 22
        delay: 10
        timeout: 300

    - name: Update inventory
      lineinfile:
        path: inventory/hosts
        line: &quot;{{ vm_name }} ansible_host={{ ip_address }}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;IP addresses are data. They must be collected automatically.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Manual IP management fails because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Humans forget to update documentation&lt;/li&gt;
&lt;li&gt;Spreadsheets get stale&lt;/li&gt;
&lt;li&gt;&quot;What IP is that?&quot; becomes a daily question&lt;/li&gt;
&lt;li&gt;Conflicts happen because no one checked&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Automated IP management works because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;DHCP reservations are code, versioned and reviewable&lt;/li&gt;
&lt;li&gt;Inventory is generated from actual state&lt;/li&gt;
&lt;li&gt;DNS updates automatically&lt;/li&gt;
&lt;li&gt;Conflicts are detected before deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Choose your strategy based on scale:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Small (1-20 VMs): DHCP reservations, manual tracking&lt;/li&gt;
&lt;li&gt;Medium (20-100 VMs): Cloud-init static IPs, generated inventory&lt;/li&gt;
&lt;li&gt;Large (100+ VMs): IPAM integration (NetBox, phpIPAM)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The goal is always the same: asking &quot;what&apos;s the IP?&quot; should return an answer in seconds, from automation, not from hunting through UIs and logs.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>automation</category><category>networking</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Golden Images Pipeline: Building Templates Like a Factory</title><link>https://ashimov.com/posts/proxmox-golden-images/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-golden-images/</guid><description>Automated VM template creation for Proxmox. Covers Packer integration, cloud-init pipelines, image versioning, testing, and why images must be reproducible or they become unique snowflakes.</description><pubDate>Tue, 16 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Manual template creation works once. Install OS, configure, convert to template. Done. Until you need to update it. Then you do it again manually. And again. Eventually, no one remembers what&apos;s in the template or how it was built.&lt;/p&gt;
&lt;p&gt;A golden image pipeline treats templates like software: version controlled, automatically built, tested before use. When you need an update, you change the code and the pipeline builds a new template.&lt;/p&gt;
&lt;p&gt;Images must be reproducible. If you can&apos;t rebuild an identical image from code, you have a unique snowflake, not a template.&lt;/p&gt;
&lt;h2&gt;What Makes a Golden Image&lt;/h2&gt;
&lt;p&gt;A golden image is a VM template with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Known contents&lt;/strong&gt;: Every package, config, and file is documented&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reproducible build&lt;/strong&gt;: Same inputs = same output&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Versioned&lt;/strong&gt;: v1, v2, v3 with change history&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tested&lt;/strong&gt;: Verified before production use&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Immutable&lt;/strong&gt;: Never modified after creation&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Manual Pipeline (Simple Start)&lt;/h2&gt;
&lt;p&gt;Before Packer, understand the process:&lt;/p&gt;
&lt;h3&gt;1. Download Cloud Image&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Ubuntu cloud image
wget https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img

# Debian cloud image
wget https://cloud.debian.org/images/cloud/bookworm/latest/debian-12-generic-amd64.qcow2

# AlmaLinux cloud image
wget https://repo.almalinux.org/almalinux/9/cloud/x86_64/images/AlmaLinux-9-GenericCloud-latest.x86_64.qcow2
&lt;/code&gt;&lt;/pre&gt;
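&lt;p&gt;Cloud images ship with published checksums; verifying the download is cheap insurance (URL shown for the Ubuntu image above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget https://cloud-images.ubuntu.com/noble/current/SHA256SUMS
sha256sum -c SHA256SUMS --ignore-missing
&lt;/code&gt;&lt;/pre&gt;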
&lt;h3&gt;2. Create VM and Import&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create VM
qm create 9000 --name &quot;ubuntu-2404-base&quot; --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci

# Import cloud image
qm importdisk 9000 noble-server-cloudimg-amd64.img local-zfs

# Attach disk
qm set 9000 --scsi0 local-zfs:vm-9000-disk-0

# Add cloud-init drive
qm set 9000 --ide2 local-zfs:cloudinit

# Boot settings
qm set 9000 --boot c --bootdisk scsi0
qm set 9000 --serial0 socket --vga serial0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Customize (Optional)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Start VM with temporary cloud-init
qm set 9000 --ciuser temp --cipassword temp123
qm start 9000

# SSH in and customize
ssh temp@&amp;lt;ip&amp;gt;
sudo apt update &amp;amp;&amp;amp; sudo apt upgrade -y
sudo apt install -y qemu-guest-agent vim htop curl git

# Clean up
sudo cloud-init clean
sudo rm -rf /var/log/*.log
sudo rm -rf /tmp/*
sudo shutdown -h now
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Convert to Template&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Remove temporary cloud-init settings
qm set 9000 --delete ciuser,cipassword

# Convert to template
qm template 9000
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5. Version and Document&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Rename with version
qm set 9000 --name &quot;ubuntu-2404-v1&quot;

# Add description
qm set 9000 --description &quot;Ubuntu 24.04 LTS
Version: 1
Date: 2025-01-08
Base: noble-server-cloudimg-amd64.img
Packages: qemu-guest-agent, vim, htop, curl, git
Changes: Initial release&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Packer Pipeline (Automated)&lt;/h2&gt;
&lt;p&gt;Packer automates the entire process.&lt;/p&gt;
&lt;h3&gt;Install Packer&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# HashiCorp repository
wget -O - https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo &quot;deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main&quot; | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update &amp;amp;&amp;amp; sudo apt install packer

# Install Proxmox plugin
packer plugins install github.com/hashicorp/proxmox
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Directory Structure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;packer-templates/
├── templates/
│   ├── ubuntu-2404/
│   │   ├── ubuntu-2404.pkr.hcl
│   │   ├── variables.pkr.hcl
│   │   ├── http/
│   │   │   └── user-data
│   │   └── scripts/
│   │       ├── base.sh
│   │       ├── cleanup.sh
│   │       └── cloud-init.sh
│   └── debian-12/
│       └── ...
├── common/
│   ├── scripts/
│   │   └── common-packages.sh
│   └── files/
│       └── motd
└── Makefile
&lt;/code&gt;&lt;/pre&gt;
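&lt;p&gt;The Makefile at the root drives builds so no one types packer commands by hand. A minimal sketch (targets and variable names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;VERSION ?= 1

ubuntu-2404:
	packer init templates/ubuntu-2404
	packer build -var &quot;version=$(VERSION)&quot; templates/ubuntu-2404

debian-12:
	packer init templates/debian-12
	packer build -var &quot;version=$(VERSION)&quot; templates/debian-12
&lt;/code&gt;&lt;/pre&gt;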
&lt;h3&gt;Packer Template&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# templates/ubuntu-2404/ubuntu-2404.pkr.hcl

packer {
  required_plugins {
    proxmox = {
      version = &quot;&amp;gt;= 1.1.0&quot;
      source  = &quot;github.com/hashicorp/proxmox&quot;
    }
  }
}

source &quot;proxmox-iso&quot; &quot;ubuntu-2404&quot; {
  # Proxmox connection
  proxmox_url              = var.proxmox_url
  username                 = var.proxmox_username
  token                    = var.proxmox_token
  node                     = var.proxmox_node
  insecure_skip_tls_verify = true

  # VM settings
  vm_id   = var.vm_id
  vm_name = &quot;ubuntu-2404-v${var.version}&quot;

  # ISO (get checksum from releases.ubuntu.com)
  iso_url          = &quot;https://releases.ubuntu.com/24.04/ubuntu-24.04-live-server-amd64.iso&quot;
  iso_checksum     = &quot;sha256:YOUR_CHECKSUM_HERE&quot;  # Get from SHA256SUMS file
  iso_storage_pool = &quot;local&quot;
  unmount_iso      = true

  # Hardware
  cores    = 2
  memory   = 2048
  cpu_type = &quot;host&quot;

  # Disk
  scsi_controller = &quot;virtio-scsi-pci&quot;
  disks {
    disk_size    = &quot;32G&quot;
    storage_pool = var.storage_pool
    type         = &quot;scsi&quot;
  }

  # Network
  network_adapters {
    model  = &quot;virtio&quot;
    bridge = &quot;vmbr0&quot;
  }

  # Cloud-init
  cloud_init              = true
  cloud_init_storage_pool = var.storage_pool

  # Boot
  boot_command = [
    &quot;&amp;lt;esc&amp;gt;&amp;lt;wait&amp;gt;&quot;,
    &quot;e&amp;lt;wait&amp;gt;&quot;,
    &quot;&amp;lt;down&amp;gt;&amp;lt;down&amp;gt;&amp;lt;down&amp;gt;&amp;lt;end&amp;gt;&quot;,
    &quot; autoinstall ds=nocloud-net;s=http://{{ .HTTPIP }}:{{ .HTTPPort }}/&quot;,
    &quot;&amp;lt;f10&amp;gt;&quot;
  ]

  boot_wait = &quot;5s&quot;

  # HTTP server for autoinstall
  http_directory = &quot;http&quot;

  # SSH
  ssh_username = &quot;packer&quot;
  ssh_password = &quot;packer&quot;
  ssh_timeout  = &quot;20m&quot;

  # Template
  template_name        = &quot;ubuntu-2404-v${var.version}&quot;
  template_description = &quot;Ubuntu 24.04 LTS - Built ${timestamp()}&quot;
}

build {
  sources = [&quot;source.proxmox-iso.ubuntu-2404&quot;]

  # Base configuration
  provisioner &quot;shell&quot; {
    scripts = [
      &quot;scripts/base.sh&quot;,
      &quot;../../common/scripts/common-packages.sh&quot;
    ]
  }

  # Copy files
  provisioner &quot;file&quot; {
    source      = &quot;../../common/files/motd&quot;
    destination = &quot;/tmp/motd&quot;
  }

  provisioner &quot;shell&quot; {
    inline = [&quot;sudo mv /tmp/motd /etc/motd&quot;]
  }

  # Cleanup
  provisioner &quot;shell&quot; {
    scripts = [
      &quot;scripts/cleanup.sh&quot;,
      &quot;scripts/cloud-init.sh&quot;
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Variables&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# templates/ubuntu-2404/variables.pkr.hcl

variable &quot;proxmox_url&quot; {
  type    = string
  default = &quot;https://proxmox.lab.local:8006/api2/json&quot;
}

variable &quot;proxmox_username&quot; {
  type    = string
  default = &quot;packer@pve!automation&quot;
}

variable &quot;proxmox_token&quot; {
  type      = string
  sensitive = true
}

variable &quot;proxmox_node&quot; {
  type    = string
  default = &quot;pve1&quot;
}

variable &quot;vm_id&quot; {
  type    = number
  default = 9000
}

variable &quot;storage_pool&quot; {
  type    = string
  default = &quot;local-zfs&quot;
}

variable &quot;version&quot; {
  type    = string
  default = &quot;1&quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Provisioning Scripts&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# scripts/base.sh
#!/bin/bash
set -ex

# Wait for cloud-init
cloud-init status --wait

# Update system
sudo apt update
sudo apt upgrade -y

# Install packages
sudo apt install -y \
  qemu-guest-agent \
  vim \
  htop \
  curl \
  wget \
  git \
  jq \
  unzip

# Enable guest agent
sudo systemctl enable qemu-guest-agent
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# scripts/cleanup.sh
#!/bin/bash
set -ex

# Clean apt
sudo apt autoremove -y
sudo apt clean

# Clean logs
sudo journalctl --rotate
sudo journalctl --vacuum-time=1s
sudo rm -rf /var/log/*.log
sudo rm -rf /var/log/*.gz

# Clean temp
sudo rm -rf /tmp/*
sudo rm -rf /var/tmp/*

# Clean SSH keys (regenerate on first boot)
sudo rm -f /etc/ssh/ssh_host_*

# Clean machine-id
sudo truncate -s 0 /etc/machine-id
sudo rm -f /var/lib/dbus/machine-id

# Clean history
history -c
sudo rm -f /root/.bash_history
sudo rm -f /home/*/.bash_history
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# scripts/cloud-init.sh
#!/bin/bash
set -ex

# Reset cloud-init
sudo cloud-init clean

# Remove network config (cloud-init will regenerate)
sudo rm -f /etc/netplan/*.yaml

# The template is now ready
echo &quot;Template preparation complete&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Build Template&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Set token
export PKR_VAR_proxmox_token=&quot;xxxx-xxxx-xxxx&quot;

# Initialize plugins, then validate
packer init templates/ubuntu-2404/
packer validate templates/ubuntu-2404/

# Build
packer build -var &quot;version=2&quot; templates/ubuntu-2404/
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Cloud Image Pipeline (Faster)&lt;/h2&gt;
&lt;p&gt;Skip ISO install by starting with cloud images:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# templates/ubuntu-2404-cloud/ubuntu-2404-cloud.pkr.hcl

source &quot;proxmox-clone&quot; &quot;ubuntu-2404&quot; {
  # Clone from uploaded cloud image
  clone_vm = &quot;ubuntu-2404-cloud-base&quot;

  # ... rest of config
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pre-upload cloud image:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Download and upload cloud image to Proxmox
wget https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img

# Create base VM from cloud image (one-time)
qm create 8000 --name &quot;ubuntu-2404-cloud-base&quot; ...
qm importdisk 8000 noble-server-cloudimg-amd64.img local-zfs
qm set 8000 --scsi0 local-zfs:vm-8000-disk-0
qm template 8000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Packer then clones from the base, customizes it, and produces a versioned template, skipping the slow ISO install entirely.&lt;/p&gt;
&lt;h2&gt;CI/CD Integration&lt;/h2&gt;
&lt;h3&gt;GitLab CI&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# .gitlab-ci.yml
stages:
  - validate
  - build
  - test

variables:
  PACKER_VERSION: &quot;1.10.0&quot;

validate:
  stage: validate
  script:
    - packer init templates/ubuntu-2404/
    - packer validate templates/ubuntu-2404/

build:
  stage: build
  script:
    - packer build -var &quot;version=${CI_PIPELINE_IID}&quot; templates/ubuntu-2404/
  only:
    - main

test:
  stage: test
  script:
    - ./scripts/test-template.sh ubuntu-2404-v${CI_PIPELINE_IID}
  only:
    - main
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;GitHub Actions&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# .github/workflows/build-template.yml
name: Build Template

on:
  push:
    branches: [main]
    paths:
      - &apos;templates/**&apos;

jobs:
  build:
    runs-on: self-hosted  # Need access to Proxmox
    steps:
      - uses: actions/checkout@v4

      - name: Setup Packer
        uses: hashicorp/setup-packer@main
        with:
          version: &quot;1.10.0&quot;

      - name: Init Packer
        run: packer init templates/ubuntu-2404/

      - name: Build Template
        env:
          PKR_VAR_proxmox_token: ${{ secrets.PROXMOX_TOKEN }}
        run: |
          packer build -var &quot;version=${{ github.run_number }}&quot; templates/ubuntu-2404/
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Testing Templates&lt;/h2&gt;
&lt;h3&gt;Automated Test Script&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
# scripts/test-template.sh

TEMPLATE=$1
TEST_VM_ID=9999

echo &quot;Testing template: ${TEMPLATE}&quot;

# Clone template
qm clone $(qm list | grep &quot;${TEMPLATE}&quot; | awk &apos;{print $1}&apos;) ${TEST_VM_ID} --name &quot;template-test&quot;

# Configure cloud-init
qm set ${TEST_VM_ID} --ciuser test --sshkeys ~/.ssh/id_ed25519.pub
qm set ${TEST_VM_ID} --ipconfig0 ip=dhcp

# Start VM
qm start ${TEST_VM_ID}

# Wait for boot
sleep 60

# Get IP (interface index varies by template; adjust the jq path if needed)
IP=$(qm guest cmd ${TEST_VM_ID} network-get-interfaces | jq -r &apos;.[1][&quot;ip-addresses&quot;][0][&quot;ip-address&quot;]&apos;)

# Run tests
echo &quot;Testing SSH...&quot;
ssh -o StrictHostKeyChecking=no test@${IP} &quot;echo &apos;SSH OK&apos;&quot;

echo &quot;Testing packages...&quot;
ssh test@${IP} &quot;which vim htop curl git&quot;

echo &quot;Testing guest agent...&quot;
qm agent ${TEST_VM_ID} ping

# Cleanup
qm stop ${TEST_VM_ID}
qm destroy ${TEST_VM_ID}

echo &quot;All tests passed!&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;[ ] VM boots successfully
[ ] Cloud-init completes
[ ] SSH works with key auth
[ ] Required packages installed
[ ] Guest agent responds
[ ] No leftover sensitive data
[ ] Machine-id regenerated
[ ] SSH host keys regenerated
&lt;/code&gt;&lt;/pre&gt;
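&lt;p&gt;The last few checklist items can be spot-checked from inside a freshly cloned VM (a sketch, assuming SSH access as the &lt;code&gt;test&lt;/code&gt; user from the script above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Host keys should exist again (regenerated on first boot)
ssh test@${IP} &apos;ls /etc/ssh/ssh_host_*_key&apos;

# machine-id should be non-empty and unique per clone
ssh test@${IP} &apos;test -s /etc/machine-id &amp;amp;&amp;amp; cat /etc/machine-id&apos;

# No leftover shell history
ssh test@${IP} &apos;! test -s ~/.bash_history&apos;
&lt;/code&gt;&lt;/pre&gt;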
&lt;h2&gt;Versioning Strategy&lt;/h2&gt;
&lt;h3&gt;Semantic Versioning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;ubuntu-2404-v1.0.0    # Major: Breaking changes
ubuntu-2404-v1.1.0    # Minor: New features
ubuntu-2404-v1.1.1    # Patch: Bug fixes
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Date-Based Versioning&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;ubuntu-2404-20250108  # Date of build
ubuntu-2404-202501    # Month of build
&lt;/code&gt;&lt;/pre&gt;
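&lt;p&gt;Date-based names are easy to generate at build time. With the Packer template above, this sketch yields names like &lt;code&gt;ubuntu-2404-v20250108&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Build with a date-stamped version
packer build -var &quot;version=$(date +%Y%m%d)&quot; templates/ubuntu-2404/
&lt;/code&gt;&lt;/pre&gt;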
&lt;h3&gt;Build Number&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;ubuntu-2404-b42       # CI build number
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Tracking Active Version&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create symlink-style alias
qm set 9002 --description &quot;... LATEST: true&quot;

# Or use tags
qm set 9002 --tags &quot;latest,ubuntu,2404&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Images must be reproducible. Otherwise, they&apos;re unique snowflakes.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A template you clicked together manually is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Undocumented (what&apos;s in it?)&lt;/li&gt;
&lt;li&gt;Unreproducible (can you rebuild it exactly?)&lt;/li&gt;
&lt;li&gt;Untested (does it actually work?)&lt;/li&gt;
&lt;li&gt;Unversioned (which version is this?)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A golden image pipeline produces templates that are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Documented (code shows everything)&lt;/li&gt;
&lt;li&gt;Reproducible (same code = same image)&lt;/li&gt;
&lt;li&gt;Tested (automated tests before use)&lt;/li&gt;
&lt;li&gt;Versioned (clear history of changes)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The investment in automation pays off when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Security update needed → rebuild all templates&lt;/li&gt;
&lt;li&gt;New requirement → change code, rebuild&lt;/li&gt;
&lt;li&gt;&quot;What&apos;s in this template?&quot; → read the code&lt;/li&gt;
&lt;li&gt;Audit requirement → show build history&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Start simple: document your manual process. Then automate it. Then add CI/CD. Each step makes your templates more reliable and less mysterious.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>automation</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Infrastructure as Code: Terraform Proxmox Provider — Patterns That Won&apos;t Rot</title><link>https://ashimov.com/posts/proxmox-terraform/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-terraform/</guid><description>Terraform with Proxmox done right. Covers provider configuration, module structure, state management, safe changes, and why IaC is about predictability, not faster clicking.</description><pubDate>Fri, 12 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Clicking through the Proxmox UI works for one VM. It doesn&apos;t work for thirty VMs that need to be consistent. It doesn&apos;t work when you need to recreate an environment. It doesn&apos;t work when &quot;what changed?&quot; matters.&lt;/p&gt;
&lt;p&gt;Terraform brings Infrastructure as Code to Proxmox: define VMs in files, track changes in Git, apply reproducibly. But Terraform with Proxmox has quirks. The provider has limitations. State can drift. Changes can be destructive.&lt;/p&gt;
&lt;p&gt;This is how to use Terraform with Proxmox in patterns that won&apos;t rot.&lt;/p&gt;
&lt;h2&gt;Why Terraform for Proxmox&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Terraform solves:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reproducible environments (dev = staging = prod)&lt;/li&gt;
&lt;li&gt;Change tracking (what changed, when, why)&lt;/li&gt;
&lt;li&gt;Collaboration (PRs, code review for infrastructure)&lt;/li&gt;
&lt;li&gt;Documentation (code is documentation)&lt;/li&gt;
&lt;li&gt;Disaster recovery (rebuild from code)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Terraform doesn&apos;t solve:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Day-2 operations inside VMs (use Ansible)&lt;/li&gt;
&lt;li&gt;Configuration management (use Ansible, Chef, Puppet)&lt;/li&gt;
&lt;li&gt;One-off tasks (just use the UI)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Provider Setup&lt;/h2&gt;
&lt;h3&gt;Install Provider&lt;/h3&gt;
&lt;p&gt;In your Terraform project:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# versions.tf
terraform {
  required_version = &quot;&amp;gt;= 1.0&quot;

  required_providers {
    proxmox = {
      source  = &quot;Telmate/proxmox&quot;
      version = &quot;~&amp;gt; 3.0&quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Provider Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# provider.tf
provider &quot;proxmox&quot; {
  pm_api_url          = &quot;https://proxmox.lab.local:8006/api2/json&quot;
  pm_api_token_id     = &quot;terraform@pve!automation&quot;
  pm_api_token_secret = var.proxmox_api_secret

  # TLS verification
  pm_tls_insecure = false  # Set true only for self-signed certs

  # Parallel operations
  pm_parallel = 4

  # Logging (for debugging)
  pm_log_enable = true
  pm_log_file   = &quot;terraform-plugin-proxmox.log&quot;
  pm_log_levels = {
    _default    = &quot;debug&quot;
    _capturelog = &quot;&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;API Token Creation&lt;/h3&gt;
&lt;p&gt;On Proxmox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create dedicated Terraform user
pveum user add terraform@pve

# Create token with privilege separation disabled
pveum user token add terraform@pve automation --privsep 0

# Grant permissions
pveum acl modify / --users terraform@pve --roles PVEAdmin
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Store the token in an environment variable or a secrets manager. The Telmate provider reads &lt;code&gt;PM_API_TOKEN_SECRET&lt;/code&gt; directly; alternatively, set &lt;code&gt;TF_VAR_proxmox_api_secret&lt;/code&gt; to populate the Terraform variable:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export PM_API_TOKEN_SECRET=&quot;xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# variables.tf
variable &quot;proxmox_api_secret&quot; {
  description = &quot;Proxmox API token secret&quot;
  type        = string
  sensitive   = true
  default     = &quot;&quot; # Use TF_VAR_proxmox_api_secret env var
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Basic VM Resource&lt;/h2&gt;
&lt;h3&gt;Clone from Template&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# vm.tf
resource &quot;proxmox_vm_qemu&quot; &quot;web_server&quot; {
  name        = &quot;web-server-01&quot;
  target_node = &quot;pve1&quot;

  # Clone from template
  clone = &quot;ubuntu-2404-template&quot;
  full_clone = true

  # Hardware
  cores   = 2
  sockets = 1
  memory  = 4096

  # Agent (required for IP retrieval)
  agent = 1

  # Disk
  disks {
    scsi {
      scsi0 {
        disk {
          size    = &quot;32G&quot;
          storage = &quot;local-zfs&quot;
        }
      }
    }
  }

  # Network
  network {
    model  = &quot;virtio&quot;
    bridge = &quot;vmbr0&quot;
    tag    = 10
  }

  # Cloud-init
  os_type    = &quot;cloud-init&quot;
  ciuser     = &quot;admin&quot;
  cipassword = var.vm_password
  sshkeys    = file(&quot;~/.ssh/id_ed25519.pub&quot;)

  ipconfig0 = &quot;ip=10.10.0.100/24,gw=10.10.0.1&quot;

  # Lifecycle
  lifecycle {
    ignore_changes = [
      network,  # Don&apos;t recreate on network changes
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Output VM Info&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# outputs.tf
output &quot;web_server_ip&quot; {
  value       = proxmox_vm_qemu.web_server.default_ipv4_address
  description = &quot;Web server IP address&quot;
}

output &quot;web_server_id&quot; {
  value       = proxmox_vm_qemu.web_server.vmid
  description = &quot;VM ID in Proxmox&quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Module Structure&lt;/h2&gt;
&lt;p&gt;For reusable, maintainable code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;proxmox-terraform/
├── modules/
│   ├── vm/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── lxc/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   └── ...
│   └── prod/
│       └── ...
├── .gitignore
└── README.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VM Module&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# modules/vm/variables.tf
variable &quot;name&quot; {
  description = &quot;VM name&quot;
  type        = string
}

variable &quot;target_node&quot; {
  description = &quot;Proxmox node to create VM on&quot;
  type        = string
  default     = &quot;pve1&quot;
}

variable &quot;template&quot; {
  description = &quot;Template to clone from&quot;
  type        = string
  default     = &quot;ubuntu-2404-template&quot;
}

variable &quot;cores&quot; {
  description = &quot;Number of CPU cores&quot;
  type        = number
  default     = 2
}

variable &quot;memory&quot; {
  description = &quot;Memory in MB&quot;
  type        = number
  default     = 2048
}

variable &quot;disk_size&quot; {
  description = &quot;Disk size&quot;
  type        = string
  default     = &quot;32G&quot;
}

variable &quot;storage&quot; {
  description = &quot;Storage pool&quot;
  type        = string
  default     = &quot;local-zfs&quot;
}

variable &quot;network_bridge&quot; {
  description = &quot;Network bridge&quot;
  type        = string
  default     = &quot;vmbr0&quot;
}

variable &quot;vlan_tag&quot; {
  description = &quot;VLAN tag&quot;
  type        = number
  default     = null
}

variable &quot;ip_address&quot; {
  description = &quot;Static IP address with CIDR&quot;
  type        = string
}

variable &quot;gateway&quot; {
  description = &quot;Default gateway&quot;
  type        = string
}

variable &quot;ssh_keys&quot; {
  description = &quot;SSH public keys&quot;
  type        = string
}

variable &quot;tags&quot; {
  description = &quot;VM tags&quot;
  type        = list(string)
  default     = []
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# modules/vm/main.tf
resource &quot;proxmox_vm_qemu&quot; &quot;vm&quot; {
  name        = var.name
  target_node = var.target_node
  clone       = var.template
  full_clone  = true

  cores   = var.cores
  sockets = 1
  memory  = var.memory
  agent   = 1

  disks {
    scsi {
      scsi0 {
        disk {
          size    = var.disk_size
          storage = var.storage
        }
      }
    }
  }

  network {
    model  = &quot;virtio&quot;
    bridge = var.network_bridge
    tag    = var.vlan_tag
  }

  os_type = &quot;cloud-init&quot;
  sshkeys = var.ssh_keys

  ipconfig0 = &quot;ip=${var.ip_address},gw=${var.gateway}&quot;

  tags = join(&quot;,&quot;, var.tags)

  lifecycle {
    ignore_changes = [network]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;# modules/vm/outputs.tf
output &quot;id&quot; {
  value = proxmox_vm_qemu.vm.vmid
}

output &quot;name&quot; {
  value = proxmox_vm_qemu.vm.name
}

output &quot;ip_address&quot; {
  value = proxmox_vm_qemu.vm.default_ipv4_address
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Using Modules&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# environments/dev/main.tf
module &quot;web_servers&quot; {
  source = &quot;../../modules/vm&quot;

  count = 2

  name        = &quot;web-${count.index + 1}&quot;
  target_node = &quot;pve1&quot;
  template    = &quot;ubuntu-2404-template&quot;

  cores    = 2
  memory   = 4096
  disk_size = &quot;32G&quot;

  ip_address = &quot;10.10.0.${100 + count.index}/24&quot;
  gateway    = &quot;10.10.0.1&quot;

  ssh_keys = file(&quot;~/.ssh/id_ed25519.pub&quot;)

  tags = [&quot;web&quot;, &quot;dev&quot;]
}

module &quot;database&quot; {
  source = &quot;../../modules/vm&quot;

  name        = &quot;db-1&quot;
  target_node = &quot;pve1&quot;
  template    = &quot;ubuntu-2404-template&quot;

  cores    = 4
  memory   = 8192
  disk_size = &quot;100G&quot;

  ip_address = &quot;10.10.0.50/24&quot;
  gateway    = &quot;10.10.0.1&quot;

  ssh_keys = file(&quot;~/.ssh/id_ed25519.pub&quot;)

  tags = [&quot;database&quot;, &quot;dev&quot;]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;State Management&lt;/h2&gt;
&lt;h3&gt;Remote State&lt;/h3&gt;
&lt;p&gt;Never use local state for teams:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# backend.tf
terraform {
  backend &quot;s3&quot; {
    bucket         = &quot;terraform-state&quot;
    key            = &quot;proxmox/dev/terraform.tfstate&quot;
    region         = &quot;us-east-1&quot;
    encrypt        = true
    dynamodb_table = &quot;terraform-locks&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or with Terraform Cloud:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;terraform {
  cloud {
    organization = &quot;my-org&quot;
    workspaces {
      name = &quot;proxmox-dev&quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;State Drift&lt;/h3&gt;
&lt;p&gt;Proxmox changes outside Terraform cause drift:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check for drift
terraform plan

# If drift detected, either:
# 1. Import the change into state
# 2. Revert the change in Proxmox
# 3. Update Terraform to match
&lt;/code&gt;&lt;/pre&gt;
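&lt;p&gt;To accept an out-of-band change into state without touching infrastructure (Terraform 0.15.4 and later), use a refresh-only plan and apply, then update the configuration to match:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Show what drifted
terraform plan -refresh-only

# Write the refreshed values into state
terraform apply -refresh-only
&lt;/code&gt;&lt;/pre&gt;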
&lt;h3&gt;Import Existing Resources&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Import existing VM
terraform import proxmox_vm_qemu.existing &apos;pve1/qemu/100&apos;

# Then add to your .tf file
resource &quot;proxmox_vm_qemu&quot; &quot;existing&quot; {
  name        = &quot;existing-vm&quot;
  target_node = &quot;pve1&quot;
  # ... match existing config
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Safe Changes&lt;/h2&gt;
&lt;h3&gt;Lifecycle Rules&lt;/h3&gt;
&lt;p&gt;Prevent accidental destruction:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;resource &quot;proxmox_vm_qemu&quot; &quot;production_db&quot; {
  name = &quot;prod-db&quot;
  # ...

  lifecycle {
    prevent_destroy = true

    # Don&apos;t recreate for these changes
    ignore_changes = [
      network,
      disk,
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Plan Before Apply&lt;/h3&gt;
&lt;p&gt;Always review:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Generate plan
terraform plan -out=tfplan

# Review plan file
terraform show tfplan

# Only if plan looks good
terraform apply tfplan
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Targeted Changes&lt;/h3&gt;
&lt;p&gt;Limit blast radius:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Only apply to specific resource
terraform apply -target=module.web_servers

# Only apply to specific instance
terraform apply -target=&apos;module.web_servers[0]&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Variables and Environments&lt;/h2&gt;
&lt;h3&gt;Environment-Specific Variables&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# environments/dev/terraform.tfvars
environment = &quot;dev&quot;
vm_count    = 2
vm_size     = &quot;small&quot;

# environments/prod/terraform.tfvars
environment = &quot;prod&quot;
vm_count    = 5
vm_size     = &quot;large&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Variable Validation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;variable &quot;environment&quot; {
  description = &quot;Environment name&quot;
  type        = string

  validation {
    condition     = contains([&quot;dev&quot;, &quot;staging&quot;, &quot;prod&quot;], var.environment)
    error_message = &quot;Environment must be dev, staging, or prod.&quot;
  }
}

variable &quot;vm_size&quot; {
  description = &quot;VM size preset&quot;
  type        = string
  default     = &quot;small&quot;

  validation {
    condition     = contains([&quot;small&quot;, &quot;medium&quot;, &quot;large&quot;], var.vm_size)
    error_message = &quot;VM size must be small, medium, or large.&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Size Presets&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# locals.tf
locals {
  vm_sizes = {
    small = {
      cores  = 2
      memory = 2048
      disk   = &quot;32G&quot;
    }
    medium = {
      cores  = 4
      memory = 4096
      disk   = &quot;64G&quot;
    }
    large = {
      cores  = 8
      memory = 8192
      disk   = &quot;128G&quot;
    }
  }

  selected_size = local.vm_sizes[var.vm_size]
}

# Usage
resource &quot;proxmox_vm_qemu&quot; &quot;vm&quot; {
  cores  = local.selected_size.cores
  memory = local.selected_size.memory
  # ...
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Patterns&lt;/h2&gt;
&lt;h3&gt;Count vs For Each&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Count: Simple numbered resources
resource &quot;proxmox_vm_qemu&quot; &quot;worker&quot; {
  count = 3
  name  = &quot;worker-${count.index + 1}&quot;
  # ...
}

# For_each: Named resources (more stable)
variable &quot;vms&quot; {
  default = {
    web    = { ip = &quot;10.10.0.100&quot;, cores = 2 }
    api    = { ip = &quot;10.10.0.101&quot;, cores = 4 }
    worker = { ip = &quot;10.10.0.102&quot;, cores = 2 }
  }
}

resource &quot;proxmox_vm_qemu&quot; &quot;server&quot; {
  for_each = var.vms

  name   = each.key
  cores  = each.value.cores

  ipconfig0 = &quot;ip=${each.value.ip}/24,gw=10.10.0.1&quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;for_each&lt;/code&gt; is safer: removing a middle item from a &lt;code&gt;count&lt;/code&gt; list shifts every index after it, which Terraform treats as destroy-and-recreate. Keyed resources are unaffected.&lt;/p&gt;
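&lt;p&gt;Migrating existing resources from &lt;code&gt;count&lt;/code&gt; to &lt;code&gt;for_each&lt;/code&gt; doesn&apos;t have to destroy them: Terraform 1.1+ &lt;code&gt;moved&lt;/code&gt; blocks remap state addresses (a sketch using the resource names above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;moved {
  from = proxmox_vm_qemu.worker[0]
  to   = proxmox_vm_qemu.server[&quot;worker&quot;]
}
&lt;/code&gt;&lt;/pre&gt;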
&lt;h3&gt;Dynamic Blocks&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;variable &quot;additional_disks&quot; {
  default = [
    { size = &quot;100G&quot;, storage = &quot;local-zfs&quot; },
    { size = &quot;200G&quot;, storage = &quot;ceph-pool&quot; }
  ]
}

resource &quot;proxmox_vm_qemu&quot; &quot;vm&quot; {
  # ...

  dynamic &quot;disk&quot; {
    for_each = var.additional_disks
    content {
      size    = disk.value.size
      storage = disk.value.storage
      type    = &quot;scsi&quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: the flat &lt;code&gt;disk&lt;/code&gt; block here is the 2.x provider schema. With Telmate 3.x, extra disks live under the nested &lt;code&gt;disks&lt;/code&gt; structure shown earlier, whose fixed slot names don&apos;t map cleanly onto &lt;code&gt;dynamic&lt;/code&gt; blocks.&lt;/p&gt;
&lt;h3&gt;Conditional Resources&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;variable &quot;create_backup_server&quot; {
  default = false
}

resource &quot;proxmox_vm_qemu&quot; &quot;backup&quot; {
  count = var.create_backup_server ? 1 : 0
  name  = &quot;backup-server&quot;
  # ...
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Debugging&lt;/h2&gt;
&lt;h3&gt;Provider Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;provider &quot;proxmox&quot; {
  pm_log_enable = true
  pm_log_file   = &quot;terraform-plugin-proxmox.log&quot;
  pm_log_levels = {
    _default = &quot;debug&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Common Issues&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. Template not found:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Error: 500 Configuration file &apos;nodes/pve1/qemu-server/xyz.conf&apos; does not exist
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fix: Verify template name matches exactly.&lt;/p&gt;
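&lt;p&gt;A quick way to list the exact template names the cluster knows about (a sketch, assuming &lt;code&gt;jq&lt;/code&gt; is installed on the Proxmox node):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pvesh get /cluster/resources --type vm --output-format json \
  | jq -r &apos;.[] | select(.template == 1) | .name&apos;
&lt;/code&gt;&lt;/pre&gt;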
&lt;p&gt;&lt;strong&gt;2. IP not detected:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Output: default_ipv4_address = &quot;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fix: Ensure &lt;code&gt;agent = 1&lt;/code&gt; and qemu-guest-agent installed in template.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Disk changes cause recreation:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Fix: Add &lt;code&gt;disk&lt;/code&gt; to &lt;code&gt;ignore_changes&lt;/code&gt; in the lifecycle block.&lt;/p&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;IaC is about predictability, not faster clicking.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The goal of Terraform isn&apos;t to create VMs faster than the UI. It&apos;s to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Know what exists&lt;/strong&gt;: Code defines reality&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Know what changed&lt;/strong&gt;: Git history shows when and why&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reproduce reliably&lt;/strong&gt;: Same code = same infrastructure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Collaborate safely&lt;/strong&gt;: Code review before apply&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The patterns that survive:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Modules for reusability&lt;/li&gt;
&lt;li&gt;Remote state for teams&lt;/li&gt;
&lt;li&gt;Lifecycle rules for safety&lt;/li&gt;
&lt;li&gt;Variables for flexibility&lt;/li&gt;
&lt;li&gt;Plan before apply always&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Terraform with Proxmox has rough edges. The provider isn&apos;t perfect. But imperfect IaC beats clicking through a UI every time you need to remember &quot;how did I configure that?&quot;&lt;/p&gt;
</content:encoded><category>proxmox</category><category>automation</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Security &amp; Multi-Tenancy: Roles, Pools, API Tokens, and Isolation</title><link>https://ashimov.com/posts/proxmox-multitenancy/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-multitenancy/</guid><description>Building secure multi-tenant Proxmox environments. Covers RBAC configuration, resource pools, API token management, audit logging, and why access control is a product that requires design.</description><pubDate>Tue, 09 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A single admin with root access works for a homelab. It doesn&apos;t work when multiple people or teams share the same Proxmox cluster. Who can see what? Who can modify what? What happens when someone leaves?&lt;/p&gt;
&lt;p&gt;Access control isn&apos;t a feature you enable. It&apos;s a product you design. Every permission is a decision: who needs this access, why, and what&apos;s the blast radius if it&apos;s misused?&lt;/p&gt;
&lt;p&gt;Proxmox has robust RBAC (Role-Based Access Control). The question is whether you use it intentionally or let it grow organically into chaos.&lt;/p&gt;
&lt;h2&gt;Access Control Model&lt;/h2&gt;
&lt;p&gt;Proxmox permissions combine:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Permission = User/Group + Role + Path

Example:
- User: developer@pve
- Role: PVEVMUser
- Path: /pool/dev-team

Result: developer can use VMs in dev-team pool
&lt;/code&gt;&lt;/pre&gt;
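&lt;p&gt;The same triple expressed as a command (PVE 8 syntax; user, role, and pool names from the example above):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pveum acl modify /pool/dev-team --users developer@pve --roles PVEVMUser
&lt;/code&gt;&lt;/pre&gt;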
&lt;h3&gt;Users and Groups&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Users&lt;/strong&gt;: Individual accounts. Can be in multiple groups.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create user in Proxmox realm
pveum user add developer@pve --password &amp;lt;password&amp;gt;

# Create user in PAM realm (Linux user)
pveum user add admin@pam

# List users
pveum user list
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Groups&lt;/strong&gt;: Collections of users. Simplify permission management.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create group
pveum group add developers --comment &quot;Development team&quot;

# Add user to group
pveum user modify developer@pve --groups developers

# List groups
pveum group list
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Authentication Realms&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Realm&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;pam&lt;/td&gt;
&lt;td&gt;Linux admins who need SSH&lt;/td&gt;
&lt;td&gt;System users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pve&lt;/td&gt;
&lt;td&gt;Web UI only users&lt;/td&gt;
&lt;td&gt;Proxmox internal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ldap&lt;/td&gt;
&lt;td&gt;Enterprise integration&lt;/td&gt;
&lt;td&gt;External directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ad&lt;/td&gt;
&lt;td&gt;Active Directory&lt;/td&gt;
&lt;td&gt;Windows integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For multi-tenancy, usually:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Admins: PAM (SSH + Web UI)&lt;/li&gt;
&lt;li&gt;Regular users: PVE realm (Web UI only)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Built-in Roles&lt;/h2&gt;
&lt;p&gt;Proxmox includes these roles:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Permissions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Administrator&lt;/td&gt;
&lt;td&gt;Everything (dangerous)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEAdmin&lt;/td&gt;
&lt;td&gt;Almost everything (no system access)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEAuditor&lt;/td&gt;
&lt;td&gt;Read-only access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEDatastoreAdmin&lt;/td&gt;
&lt;td&gt;Manage datastores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEDatastoreUser&lt;/td&gt;
&lt;td&gt;Use datastores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEPoolAdmin&lt;/td&gt;
&lt;td&gt;Manage pools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEPoolUser&lt;/td&gt;
&lt;td&gt;Use pools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEVMAdmin&lt;/td&gt;
&lt;td&gt;Full VM control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEVMUser&lt;/td&gt;
&lt;td&gt;Use VMs (console, start/stop)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVETemplateUser&lt;/td&gt;
&lt;td&gt;Clone templates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVEUserAdmin&lt;/td&gt;
&lt;td&gt;Manage users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NoAccess&lt;/td&gt;
&lt;td&gt;Explicit deny&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Custom Roles&lt;/h3&gt;
&lt;p&gt;Create roles for specific needs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create role with specific privileges
pveum role add VMOperator --privs &quot;VM.Console VM.PowerMgmt VM.Monitor&quot;

# List roles and the privileges each one carries
pveum role list
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Common custom roles:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Developer: Can create/manage own VMs
pveum role add Developer --privs &quot;VM.Allocate VM.Clone VM.Config.CDROM VM.Config.CPU VM.Config.Cloudinit VM.Config.Disk VM.Config.Memory VM.Config.Network VM.Console VM.Migrate VM.Monitor VM.PowerMgmt VM.Snapshot VM.Snapshot.Rollback Datastore.AllocateSpace&quot;

# Observer: Can view, nothing else
pveum role add Observer --privs &quot;VM.Audit Datastore.Audit&quot;

# Backup Operator: Can backup/restore
pveum role add BackupOperator --privs &quot;VM.Backup VM.Snapshot Datastore.AllocateSpace&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Resource Pools&lt;/h2&gt;
&lt;p&gt;Pools group resources (VMs and storage) for delegation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create pool
pveum pool add dev-team --comment &quot;Development team resources&quot;

# Add existing VM to pool
pveum pool modify dev-team --vms 100

# Add storage to pool
pveum pool modify dev-team --storage local-lvm

# List pools
pveum pool list
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pool-Based Permissions&lt;/h3&gt;
&lt;p&gt;Grant access to pool, not individual VMs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Developers can manage VMs in their pool
pveum acl modify /pool/dev-team --users developer@pve --roles Developer

# Or by group
pveum acl modify /pool/dev-team --groups developers --roles Developer
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;New VMs in the pool automatically inherit permissions.&lt;/p&gt;
&lt;h3&gt;Pool Strategy&lt;/h3&gt;
&lt;p&gt;Organize by:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Option 1: By team
/pool/dev-team
/pool/qa-team
/pool/production

Option 2: By environment
/pool/development
/pool/staging
/pool/production

Option 3: By project
/pool/project-alpha
/pool/project-beta
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Match your organization structure.&lt;/p&gt;
&lt;h2&gt;API Tokens&lt;/h2&gt;
&lt;p&gt;API tokens are better than passwords for automation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Separate from user password&lt;/li&gt;
&lt;li&gt;Can have different permissions&lt;/li&gt;
&lt;li&gt;Easily revoked without changing user password&lt;/li&gt;
&lt;li&gt;Audit trail shows token ID&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Creating Tokens&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create token for user
pveum user token add developer@pve automation --privsep 0

# Output shows token secret (save it!)
# Token: developer@pve!automation
# Secret: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

# With privilege separation (token can have fewer privs than user)
pveum user token add admin@pam ci-cd --privsep 1
pveum acl modify /pool/production --tokens admin@pam!ci-cd --roles BackupOperator
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Token Best Practices&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Good: Specific tokens for specific purposes
admin@pam!terraform      # Infrastructure automation
admin@pam!ansible        # Configuration management
admin@pam!monitoring     # Read-only metrics
developer@pve!ci-build   # CI pipeline builds

# Bad: Generic tokens with admin access
admin@pam!api           # Too broad, no purpose documented
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Using Tokens in Automation&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# API call with token
curl -k -H &quot;Authorization: PVEAPIToken=developer@pve!automation=xxxx-xxxx-xxxx&quot; \
  https://proxmox:8006/api2/json/version

# Terraform provider
provider &quot;proxmox&quot; {
  pm_api_url          = &quot;https://proxmox:8006/api2/json&quot;
  pm_api_token_id     = &quot;terraform@pve!automation&quot;
  pm_api_token_secret = var.proxmox_token
}

# Ansible
proxmox_kvm:
  api_host: proxmox
  api_user: ansible@pve
  api_token_id: automation
  api_token_secret: &quot;{{ vault_proxmox_token }}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Permission Paths&lt;/h2&gt;
&lt;p&gt;Permissions apply to paths in the resource tree:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/                           # Root (everything)
├── /access                 # User/group management
├── /nodes                  # All nodes
│   ├── /nodes/pve1         # Specific node
├── /pool                   # All pools
│   └── /pool/dev-team      # Specific pool
├── /storage                # All storage
│   └── /storage/local      # Specific storage
└── /vms                    # All VMs (by ID)
    └── /vms/100            # Specific VM
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Permission Inheritance&lt;/h3&gt;
&lt;p&gt;Permissions cascade down:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Grant access to all VMs in a pool
pveum acl modify /pool/dev-team --users developer@pve --roles PVEVMUser

# Developer can now access:
# - /pool/dev-team (pool itself)
# - All VMs in that pool
# - Storage assigned to that pool
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Explicit Deny&lt;/h3&gt;
&lt;p&gt;NoAccess role blocks inheritance:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# User has pool access
pveum acl modify /pool/dev-team --users developer@pve --roles Developer

# But NOT this specific VM
pveum acl modify /vms/105 --users developer@pve --roles NoAccess
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Multi-Tenant Architecture&lt;/h2&gt;
&lt;h3&gt;Example: Web Hosting Provider&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Tenants: customer-a, customer-b, customer-c

Structure:
├── /pool/customer-a
│   ├── VMs 100-199
│   └── Storage quota
├── /pool/customer-b
│   ├── VMs 200-299
│   └── Storage quota
└── /pool/customer-c
    ├── VMs 300-399
    └── Storage quota

Users:
- customer-a-admin@pve → /pool/customer-a (PVEVMAdmin)
- customer-a-user@pve  → /pool/customer-a (PVEVMUser)
- customer-b-admin@pve → /pool/customer-b (PVEVMAdmin)
...

Isolation:
- Network: Separate VLANs per customer
- Storage: Pool quotas, separate datastores
- Compute: Resource limits on pools
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Example: Corporate IT&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Departments: dev, qa, production, infrastructure

Structure:
├── /pool/development
│   └── All non-prod VMs
├── /pool/qa
│   └── Test environments
├── /pool/production
│   └── Production workloads (restricted)
└── /pool/infrastructure
    └── DNS, monitoring, etc.

Groups and roles:
- developers     → /pool/development (Developer)
- qa-engineers   → /pool/qa (Developer)
- sre-team       → /pool/production (PVEVMUser)
- sre-leads      → /pool/production (PVEVMAdmin)
- infra-admins   → / (PVEAdmin)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Audit Logging&lt;/h2&gt;
&lt;p&gt;Track who did what:&lt;/p&gt;
&lt;h3&gt;Task History&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Recent tasks (via API)
pvesh get /cluster/tasks

# Node-specific tasks
pvesh get /nodes/pve1/tasks
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Web UI: Datacenter → Tasks → Filter by user&lt;/p&gt;
&lt;h3&gt;System Logs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Auth logs
journalctl -u pveproxy | grep auth

# API access logs
tail -f /var/log/pveproxy/access.log
&lt;/code&gt;&lt;/pre&gt;
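&lt;p&gt;A quick way to answer &quot;who is hitting the API&quot; is to aggregate the access log by user. The helper below is a hypothetical sketch: it assumes pveproxy&apos;s Apache-style log lines where the third whitespace field is the user, so verify that against a real log line before relying on the field position.&lt;/p&gt;

```shell
# Count API requests per user in a pveproxy-style access log read from stdin.
# The field layout (IP, dash, user, timestamp, request, status) is an
# assumption -- check a real /var/log/pveproxy/access.log line first.
api_users_summary() {
  awk '{ count[$3]++ } END { for (u in count) print u, count[u] }' | sort
}

# Example with hypothetical log lines:
printf '%s\n' \
  '10.0.0.5 - root@pam [17/Mar/2026:10:00:01 +0000] "GET /api2/json/version HTTP/1.1" 200 120' \
  '10.0.0.5 - root@pam [17/Mar/2026:10:00:03 +0000] "POST /api2/json/access/users HTTP/1.1" 200 88' \
  '10.0.0.6 - developer@pve [17/Mar/2026:10:00:02 +0000] "GET /api2/json/nodes HTTP/1.1" 200 512' |
api_users_summary
```

&lt;p&gt;On a real node, feed it the live log: &lt;code&gt;api_users_summary &amp;lt; /var/log/pveproxy/access.log&lt;/code&gt;.&lt;/p&gt;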
&lt;h3&gt;External Audit&lt;/h3&gt;
&lt;p&gt;For compliance, forward logs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Syslog forwarding
echo &quot;*.* @syslog-server:514&quot; &amp;gt;&amp;gt; /etc/rsyslog.d/remote.conf
systemctl restart rsyslog
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Quotas and Limits&lt;/h2&gt;
&lt;p&gt;Prevent resource exhaustion:&lt;/p&gt;
&lt;h3&gt;Pool Quotas&lt;/h3&gt;
&lt;p&gt;Not built-in, but enforceable via custom roles and monitoring:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create role without VM.Allocate
pveum role add PoolUser --privs &quot;VM.Console VM.PowerMgmt VM.Monitor&quot;

# Users can use VMs but not create new ones
# Admins create VMs, assign to pool
&lt;/code&gt;&lt;/pre&gt;
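&lt;p&gt;The missing quota enforcement can be approximated with monitoring. The sketch below is hypothetical: it sums allocated disk per pool from simple &quot;vmid pool size-in-GB&quot; lines, which on a real cluster you would generate from &lt;code&gt;pvesh get /cluster/resources&lt;/code&gt;.&lt;/p&gt;

```shell
# Sketch of a pool-quota check: sum allocated disk per pool and flag
# pools over a limit. The function name and input format are made up
# for illustration.
check_pool_quota() {
  limit_gb=$1
  awk -v limit="$limit_gb" '
    { used[$2] += $3 }
    END {
      for (p in used) {
        if (used[p] > limit) { print p, "OVER", used[p] "G" } else { print p, "ok", used[p] "G" }
      }
    }' | sort
}

printf '%s\n' '100 dev-team 50' '101 dev-team 80' '200 qa-team 30' |
check_pool_quota 100
```

&lt;p&gt;Wire the OVER lines into whatever alerting you already run.&lt;/p&gt;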
&lt;h3&gt;VM Resource Limits&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Limit CPU
qm set 100 --cpulimit 2  # Max 2 cores worth

# Limit memory
qm set 100 --memory 4096 --balloon 2048

# Limit disk I/O (throttle options are set on the drive entry itself)
qm set 100 --scsi0 local-lvm:vm-100-disk-0,mbps_rd=100,mbps_wr=100
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Storage Quotas&lt;/h3&gt;
&lt;p&gt;Ceph/ZFS can enforce quotas:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# ZFS quota
zfs set quota=100G rpool/data/customer-a

# Ceph quota
ceph osd pool set-quota customer-pool max_bytes 107374182400
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Security Checklist&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;User Management:
[ ] No shared accounts
[ ] Each person has individual user
[ ] Users in appropriate groups
[ ] Unused users disabled/deleted

Roles:
[ ] Custom roles for common use cases
[ ] No one uses Administrator role directly
[ ] Principle of least privilege applied

Pools:
[ ] Resources organized into pools
[ ] Permissions at pool level (not individual VMs)
[ ] Clear ownership per pool

API Tokens:
[ ] Automation uses tokens, not passwords
[ ] Tokens have specific purposes
[ ] Tokens documented
[ ] Unused tokens revoked

Audit:
[ ] Logs retained appropriately
[ ] Regular review of access
[ ] Alerts on sensitive operations
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Token Rotation&lt;/h2&gt;
&lt;p&gt;Regular token rotation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create new token
pveum user token add admin@pam ansible-v2 --privsep 0

# Update automation to use new token

# Verify new token works

# Remove old token
pveum user token remove admin@pam ansible-v1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule this quarterly or when personnel changes.&lt;/p&gt;
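&lt;p&gt;The steps above can be wrapped in a small helper. This is a dry-run sketch (the function name is made up): it only prints the &lt;code&gt;pveum&lt;/code&gt; commands so you can review them before running anything on a real node.&lt;/p&gt;

```shell
# Dry-run token rotation: print the commands for the rotation steps above.
# Remove the echoes only once you trust the sequence for your setup.
rotate_token() {
  user=$1; old=$2; new=$3
  echo "pveum user token add $user $new --privsep 0"
  echo "# ...update automation secrets, verify the new token works, then:"
  echo "pveum user token remove $user $old"
}

rotate_token admin@pam ansible-v1 ansible-v2
```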
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Access control is a product. It needs to be designed.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The lazy approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Everyone is admin&lt;/li&gt;
&lt;li&gt;One shared account&lt;/li&gt;
&lt;li&gt;Permissions &quot;we&apos;ll figure out later&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No audit trail&lt;/li&gt;
&lt;li&gt;Blast radius is entire cluster&lt;/li&gt;
&lt;li&gt;Personnel change = security nightmare&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The designed approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Users in groups&lt;/li&gt;
&lt;li&gt;Groups have roles&lt;/li&gt;
&lt;li&gt;Roles are minimal&lt;/li&gt;
&lt;li&gt;Resources in pools&lt;/li&gt;
&lt;li&gt;Permissions at pool level&lt;/li&gt;
&lt;li&gt;Automation uses tokens&lt;/li&gt;
&lt;li&gt;Regular access reviews&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Access control isn&apos;t overhead — it&apos;s what makes multi-tenancy possible. Design it upfront, enforce it consistently, and review it regularly.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>automation</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Ceph on Proxmox: Honest Guide (When It&apos;s Worth It, When It&apos;s Pain)</title><link>https://ashimov.com/posts/proxmox-ceph/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-ceph/</guid><description>Real talk about Ceph on Proxmox. Covers minimum requirements, network design, OSD configuration, recovery behavior, performance expectations, and why Ceph is great when you accept its costs.</description><pubDate>Fri, 05 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Ceph is incredible technology. Distributed, self-healing storage that scales horizontally. No single point of failure. Built into Proxmox with a nice UI. Sounds perfect.&lt;/p&gt;
&lt;p&gt;It&apos;s not perfect. Ceph has costs: hardware (more nodes, more disks, more network), complexity (distributed systems are hard), and operational overhead (recovery can saturate your network). These costs are worth it for the right use case. For the wrong use case, Ceph is pain with no benefit.&lt;/p&gt;
&lt;p&gt;This is an honest guide: when Ceph makes sense, what it really requires, and what to expect.&lt;/p&gt;
&lt;h2&gt;When Ceph Makes Sense&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Good fit:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3+ nodes with dedicated storage networks&lt;/li&gt;
&lt;li&gt;Need for truly shared storage (HA, live migration)&lt;/li&gt;
&lt;li&gt;Can accept Ceph&apos;s resource overhead&lt;/li&gt;
&lt;li&gt;Want to scale storage by adding nodes&lt;/li&gt;
&lt;li&gt;No external SAN/NAS available&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Bad fit:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single node (Ceph needs 3+ for reliability)&lt;/li&gt;
&lt;li&gt;Tight hardware budget (Ceph needs resources)&lt;/li&gt;
&lt;li&gt;Simple backup/restore is sufficient&lt;/li&gt;
&lt;li&gt;Already have enterprise SAN&lt;/li&gt;
&lt;li&gt;Can&apos;t dedicate network bandwidth&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Minimum Requirements (Real Minimums)&lt;/h2&gt;
&lt;h3&gt;Nodes&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Minimum: 3 nodes&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ceph uses replication (default 3 copies). With 2 nodes, one failure = data at risk.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;3 nodes: Can lose 1 node
4 nodes: Can lose 1 node (more capacity)
5 nodes: Can lose 2 nodes (one at a time, with recovery in between)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;CPU and RAM&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Per OSD (disk):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1 CPU core minimum&lt;/li&gt;
&lt;li&gt;2GB RAM minimum (4GB recommended)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;Example: 3 nodes × 4 OSDs each = 12 OSDs
Minimum: 12 cores, 24GB RAM just for Ceph
Recommended: 24 cores, 48GB RAM

Plus RAM for VMs, Proxmox, monitors...
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Network&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Minimum: 10GbE dedicated&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;1GbE works for testing but not production. Recovery after disk failure saturates the network:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1TB disk fails, recovery speed:
- 1GbE: ~3 hours (if nothing else uses network)
- 10GbE: ~15 minutes

During recovery, performance is degraded
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Recommended: Separate networks&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────────────────────────────┐
│          Public Network             │
│  (client access, VM traffic)        │
│           10.0.0.0/24               │
└─────────────┬───────────────────────┘
              │
        ┌─────┼─────┐
        │     │     │
     pve1   pve2   pve3
        │     │     │
        └─────┼─────┘
              │
┌─────────────┴───────────────────────┐
│         Cluster Network             │
│  (OSD replication, recovery)        │
│          10.10.0.0/24               │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cluster network handles heavy replication traffic. Public network serves VMs.&lt;/p&gt;
&lt;h3&gt;Storage&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;SSDs strongly recommended&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;HDDs work but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Recovery is slow (hours to days)&lt;/li&gt;
&lt;li&gt;Random I/O performance is poor&lt;/li&gt;
&lt;li&gt;Write latency affects all VMs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Flash for metadata&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If using HDDs, use SSDs for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;WAL (Write-Ahead Log)&lt;/li&gt;
&lt;li&gt;DB (RocksDB metadata)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;HDD OSD with SSD metadata:
Performance: 10x better than HDD-only
Complexity: Higher, more failure modes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Installing Ceph on Proxmox&lt;/h2&gt;
&lt;h3&gt;Initialize Ceph&lt;/h3&gt;
&lt;p&gt;From any node (installs Ceph packages):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pveceph install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: Node → Ceph → Install&lt;/p&gt;
&lt;h3&gt;Create Ceph Cluster&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Initialize on first node
pveceph init --network 10.10.0.0/24

# This sets cluster network
# Default: uses same as public (not recommended)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Create Monitors&lt;/h3&gt;
&lt;p&gt;Each node needs a monitor for quorum:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On each node
pveceph mon create

# Verify
ceph mon stat
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Need at least 3 monitors for quorum (one per node in 3-node cluster).&lt;/p&gt;
&lt;h3&gt;Create Manager Daemons&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On each node (2+ recommended)
pveceph mgr create

# Verify
ceph mgr stat
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Create OSDs&lt;/h3&gt;
&lt;p&gt;Each disk becomes an OSD:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Show disks already claimed by OSDs
ceph-volume lvm list

# Create OSD on /dev/sdb
pveceph osd create /dev/sdb

# With separate WAL/DB device (SSD for HDD OSDs)
pveceph osd create /dev/sdb --wal_dev /dev/nvme0n1 --db_dev /dev/nvme0n1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Via Web UI: Node → Ceph → OSD → Create OSD&lt;/p&gt;
&lt;h3&gt;Create Pool&lt;/h3&gt;
&lt;p&gt;Pools contain data with specific replication rules:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create pool with size 3 (3 replicas), min_size 2
pveceph pool create vmpool --size 3 --min_size 2 --pg_num 128

# Add as Proxmox storage
pvesm add rbd ceph-pool --pool vmpool --content images,rootdir
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Ceph Configuration&lt;/h2&gt;
&lt;h3&gt;Understanding PGs (Placement Groups)&lt;/h3&gt;
&lt;p&gt;PGs distribute data across OSDs. Too few = uneven distribution. Too many = overhead.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Rule of thumb:
Total PGs = (OSDs × 100) / replica count

12 OSDs, 3 replicas:
(12 × 100) / 3 = 400 PGs per pool

Divide among pools based on expected size
&lt;/code&gt;&lt;/pre&gt;
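&lt;p&gt;The rule of thumb translates directly into a calculation; rounding up to a power of two is the usual recommendation for &lt;code&gt;pg_num&lt;/code&gt;. A small sketch (the helper name is made up):&lt;/p&gt;

```shell
# Rule-of-thumb PG count: OSDs * 100 / replicas, rounded up to the next
# power of two (the commonly recommended shape for pg_num).
pg_count() {
  osds=$1; replicas=$2
  target=$(( osds * 100 / replicas ))
  pg=1
  while [ "$pg" -lt "$target" ]; do pg=$(( pg * 2 )); done
  echo "$pg"
}

pg_count 12 3   # 400 rounds up to 512
```

&lt;p&gt;Split that total across your pools in proportion to their expected size.&lt;/p&gt;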
&lt;h3&gt;Pool Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check pool settings
ceph osd pool get vmpool all

# Adjust replication
ceph osd pool set vmpool size 3
ceph osd pool set vmpool min_size 2

# Enable compression (optional)
ceph osd pool set vmpool compression_mode aggressive
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;CRUSH Rules&lt;/h3&gt;
&lt;p&gt;CRUSH determines data placement. Default spreads across hosts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# View CRUSH map
ceph osd crush tree

# Data placement: 1 replica per host
# Protects against single host failure
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For single-node testing (NOT production):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Allow replicas on same host (DANGEROUS)
ceph osd crush rule create-replicated single_host default osd
ceph osd pool set vmpool crush_rule single_host
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Monitoring Ceph Health&lt;/h2&gt;
&lt;h3&gt;Basic Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Overall health
ceph status

# Should show:
# health: HEALTH_OK

# Detailed health
ceph health detail
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;OSD Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# OSD tree
ceph osd tree

# OSD stats
ceph osd stat

# Individual OSD
ceph osd perf
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pool Usage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Pool stats
ceph df

# Detailed pool info
rados df
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Dashboard&lt;/h3&gt;
&lt;p&gt;Enable Ceph dashboard:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Proxmox shows Ceph health natively:
# https://&amp;lt;node&amp;gt;:8006 → Node → Ceph → Status

# The upstream Ceph mgr dashboard is optional (needs ceph-mgr-dashboard):
ceph mgr module enable dashboard
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What to Expect: Performance&lt;/h2&gt;
&lt;h3&gt;Write Performance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Single SSD OSD: ~50-100 MB/s per OSD
Single NVMe OSD: ~200-500 MB/s per OSD
Aggregate: Scales with OSDs

Latency: 1-5ms (SSD), 5-20ms (HDD)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Read Performance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Read from primary OSD, scales with OSDs
Cache helps repeated reads
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Real-World VM Performance&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Random 4K IOPS (single VM):
- Ceph SSD: 5,000-20,000 IOPS
- Local SSD: 50,000-100,000 IOPS

Latency matters more than throughput for VMs
Ceph adds network round-trip to every I/O
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ceph won&apos;t match local NVMe performance. It provides redundancy and shared access, not speed.&lt;/p&gt;
&lt;h2&gt;Recovery Behavior&lt;/h2&gt;
&lt;h3&gt;When an OSD Fails&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Ceph detects OSD down (10-30 seconds)
2. Marks OSD out (mon_osd_down_out_interval, default 10 minutes)
3. Begins recovery (re-replicating data)
4. Recovery uses cluster network bandwidth
5. Performance degraded until complete
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Recovery Impact&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1TB OSD failure:
- Data to re-replicate: 1TB
- 10GbE network: ~15 minutes
- 1GbE network: ~3 hours
- During recovery: Degraded performance
&lt;/code&gt;&lt;/pre&gt;
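&lt;p&gt;The numbers above come from simple arithmetic: data volume in gigabits divided by link speed. A back-of-envelope sketch (helper name made up) that ignores protocol overhead and competing traffic, so real recovery runs longer:&lt;/p&gt;

```shell
# Idealized recovery time in minutes: data_gb * 8 bits / link_gbps / 60.
# Assumes the link is fully dedicated to recovery traffic.
recovery_minutes() {
  data_gb=$1; link_gbps=$2
  echo $(( data_gb * 8 / link_gbps / 60 ))
}

recovery_minutes 1000 10   # 1TB over 10GbE: ~13 minutes
recovery_minutes 1000 1    # 1TB over 1GbE: ~133 minutes
```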
&lt;h3&gt;Tuning Recovery&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Limit recovery bandwidth (default is aggressive)
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep 0.1

# Check recovery status
ceph status
# Should show recovery progress
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Balance: Fast recovery vs. production performance impact.&lt;/p&gt;
&lt;h2&gt;Common Problems&lt;/h2&gt;
&lt;h3&gt;HEALTH_WARN: Too Few PGs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Increase PGs
ceph osd pool set vmpool pg_num 256
ceph osd pool set vmpool pgp_num 256
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;HEALTH_WARN: OSDs Near Full&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check usage
ceph osd df

# Options:
# 1. Add more OSDs
# 2. Delete data
# 3. Rebalance (if uneven)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ceph stops writes at 95% full. Plan capacity.&lt;/p&gt;
&lt;h3&gt;Slow Requests&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check for slow ops
ceph daemon osd.0 ops

# Common causes:
# - HDD latency
# - Network congestion
# - Undersized cluster
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Clock Skew&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Monitors are sensitive to time
# Check NTP
timedatectl status

# Fix: Ensure NTP is working on all nodes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Ceph vs Alternatives&lt;/h2&gt;
&lt;h3&gt;Ceph vs Local ZFS&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Ceph&lt;/th&gt;
&lt;th&gt;Local ZFS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Redundancy&lt;/td&gt;
&lt;td&gt;Across nodes&lt;/td&gt;
&lt;td&gt;Within node (mirror/RAIDZ)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared storage&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (without replication)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Network-bound&lt;/td&gt;
&lt;td&gt;Local disk speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node failure&lt;/td&gt;
&lt;td&gt;VMs continue&lt;/td&gt;
&lt;td&gt;VMs stop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Choose local ZFS&lt;/strong&gt; if you don&apos;t need shared storage.&lt;/p&gt;
&lt;h3&gt;Ceph vs NFS&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Ceph&lt;/th&gt;
&lt;th&gt;NFS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Redundancy&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Requires HA NFS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Parallel access&lt;/td&gt;
&lt;td&gt;Single server bottleneck&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Add nodes&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Choose NFS&lt;/strong&gt; for simpler setups with existing NAS.&lt;/p&gt;
&lt;h3&gt;Ceph vs iSCSI SAN&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Ceph&lt;/th&gt;
&lt;th&gt;SAN&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Hardware only&lt;/td&gt;
&lt;td&gt;Hardware + licensing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Add nodes&lt;/td&gt;
&lt;td&gt;Add shelves/licenses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Self-managed&lt;/td&gt;
&lt;td&gt;Vendor support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Often better&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Choose SAN&lt;/strong&gt; if budget allows and you want vendor support.&lt;/p&gt;
&lt;h2&gt;Sizing Example&lt;/h2&gt;
&lt;h3&gt;Small Production Cluster&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;3 nodes:
- 32GB RAM each (16GB Ceph, 16GB VMs)
- 4-core CPU each
- 4× 1TB SSD per node (12 OSDs total)
- 10GbE cluster network
- 10GbE public network

Usable storage: ~4TB (12TB raw ÷ 3 replicas)
VM capacity: ~20-40 VMs depending on size
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Medium Production Cluster&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;5 nodes:
- 128GB RAM each
- 16-core CPU each
- 8× 2TB NVMe per node (40 OSDs total)
- 25GbE cluster network
- 10GbE public network

Usable storage: ~26TB (80TB raw ÷ 3 replicas)
VM capacity: ~100-200 VMs
&lt;/code&gt;&lt;/pre&gt;
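&lt;p&gt;The usable-capacity math generalizes to any replicated layout. A sketch (helper name made up); remember to keep utilization well below the full ratio so recovery has headroom:&lt;/p&gt;

```shell
# Usable capacity of a replicated Ceph pool: raw capacity / replica count.
usable_tb() {
  nodes=$1; osds_per_node=$2; osd_tb=$3; replicas=$4
  echo $(( nodes * osds_per_node * osd_tb / replicas ))
}

usable_tb 3 4 1 3    # small cluster above: 4 TB
usable_tb 5 8 2 3    # medium cluster above: 26 TB
```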
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Ceph is great when you accept its costs: hardware, network, and operational complexity.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Ceph provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;True shared storage&lt;/li&gt;
&lt;li&gt;Self-healing&lt;/li&gt;
&lt;li&gt;Horizontal scaling&lt;/li&gt;
&lt;li&gt;No single point of failure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Ceph costs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3+ nodes minimum&lt;/li&gt;
&lt;li&gt;Significant RAM (2-4GB per OSD)&lt;/li&gt;
&lt;li&gt;10GbE+ network (dedicated)&lt;/li&gt;
&lt;li&gt;Operational knowledge&lt;/li&gt;
&lt;li&gt;Recovery impacts performance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For a 3-node homelab with 10GbE networking, Ceph is a solid choice. For a single node, Ceph is pointless complexity. For a budget cluster with 1GbE, Ceph will frustrate you.&lt;/p&gt;
&lt;p&gt;Match the tool to the problem. Ceph solves &quot;I need shared, redundant storage across multiple nodes.&quot; If that&apos;s not your problem, Ceph isn&apos;t your solution.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>ha</category><category>storage</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>High Availability: HA Groups, Fencing Mindset, and Failure Testing</title><link>https://ashimov.com/posts/proxmox-ha/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-ha/</guid><description>Proxmox HA done right. Covers HA manager configuration, fencing requirements, groups and priorities, maintenance procedures, failure testing, and why HA without tests is just a checkbox.</description><pubDate>Tue, 02 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;High availability sounds like a feature you enable. Click &quot;HA,&quot; and VMs automatically restart when a node fails. Magic.&lt;/p&gt;
&lt;p&gt;It&apos;s not magic. It&apos;s fencing, quorum, shared storage, and very specific failure handling. Get any of these wrong and HA either doesn&apos;t work, or worse — causes split-brain where VMs run on multiple nodes simultaneously, corrupting data.&lt;/p&gt;
&lt;p&gt;HA without testing is just a checkbox. A checkbox that might destroy your data when you actually need it.&lt;/p&gt;
&lt;h2&gt;HA Prerequisites&lt;/h2&gt;
&lt;p&gt;Before enabling HA, you need:&lt;/p&gt;
&lt;h3&gt;1. Cluster (3+ Nodes Recommended)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check cluster status
pvecm status

# Need quorum for HA decisions
# 2 nodes = no node can fail without losing quorum
# 3 nodes = 1 node can fail
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two-node clusters need a QDevice for HA to work reliably.&lt;/p&gt;
&lt;h3&gt;2. Shared Storage&lt;/h3&gt;
&lt;p&gt;HA VMs must be on storage accessible from all nodes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check shared storage
pvesm status

# Valid for HA:
# - Ceph (RBD)
# - NFS
# - iSCSI
# - GlusterFS

# NOT valid:
# - local
# - local-lvm
# - local-zfs (ZFS replication allows failover, but loses writes since the last sync)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If storage isn&apos;t shared, VM can&apos;t start on another node.&lt;/p&gt;
&lt;h3&gt;3. Fencing Capability&lt;/h3&gt;
&lt;p&gt;Fencing ensures a failed node is truly dead before starting VMs elsewhere. Without fencing, you risk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Node 1: Appears dead (network issue)
Node 2: Starts VM copy
Node 1: Actually alive, VM still running
Result: Two VMs, same disk, corruption
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Fencing (The Critical Part)&lt;/h2&gt;
&lt;h3&gt;What Fencing Does&lt;/h3&gt;
&lt;p&gt;Fencing forces a failed node to stop before HA restarts VMs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Node detected as failed&lt;/li&gt;
&lt;li&gt;HA manager tries to fence (kill) the node&lt;/li&gt;
&lt;li&gt;Only after successful fence → start VMs on other node&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Fencing Methods&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Hardware fencing (recommended):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;IPMI/iLO/DRAC power off&lt;/li&gt;
&lt;li&gt;PDU power cut&lt;/li&gt;
&lt;li&gt;SBD (Storage-Based Death)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Software fencing:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Watchdog timer (self-fence)&lt;/li&gt;
&lt;li&gt;SSH fence (tell node to shutdown)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Configuring Watchdog Fencing&lt;/h3&gt;
&lt;p&gt;Most common in homelab. Node kills itself if it loses quorum:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable the software watchdog (softdog) if no hardware watchdog exists
echo &quot;softdog&quot; &amp;gt;&amp;gt; /etc/modules

# Load module
modprobe softdog

# Verify
ls /dev/watchdog
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Proxmox HA uses watchdog automatically. If node loses quorum and can&apos;t reach cluster, watchdog triggers reboot.&lt;/p&gt;
&lt;h3&gt;IPMI Fencing (Production)&lt;/h3&gt;
&lt;p&gt;For reliable fencing, use IPMI:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Install fence agents
apt install fence-agents

# Test IPMI fencing manually
ipmitool -H 10.0.0.200 -U admin -P password power status
ipmitool -H 10.0.0.200 -U admin -P password power off
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: Proxmox HA doesn&apos;t expose external fence agents in the GUI; its built-in mechanism is the watchdog, and fence devices (historically via &lt;code&gt;/etc/pve/ha/fence.cfg&lt;/code&gt;) are experimental. Treat IPMI as an out-of-band kill switch that you script, document, and test yourself.&lt;/p&gt;
&lt;h3&gt;Storage-Based Fencing (SBD)&lt;/h3&gt;
&lt;p&gt;Nodes write heartbeats to shared storage. Missing heartbeat = fence:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Initialize an SBD device on shared storage with explicit timeouts
# (-1 = watchdog timeout, -4 = msgwait timeout)
sbd -d /dev/sdb -1 60 -4 120 create
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: SBD comes from the Pacemaker stack; Proxmox HA doesn&apos;t integrate it out of the box.&lt;/p&gt;
&lt;h2&gt;Enabling HA for VMs&lt;/h2&gt;
&lt;h3&gt;Add VM to HA&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enable HA for VM 100
ha-manager add vm:100

# With specific group
ha-manager add vm:100 --group production

# Check HA status
ha-manager status
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Via Web UI: Datacenter → HA → Add → Select VM&lt;/p&gt;
&lt;h3&gt;HA States&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;started&lt;/td&gt;
&lt;td&gt;HA will ensure VM is running&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stopped&lt;/td&gt;
&lt;td&gt;HA will ensure VM is stopped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;disabled&lt;/td&gt;
&lt;td&gt;HA ignores this VM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ignored&lt;/td&gt;
&lt;td&gt;Resource left completely unmanaged by HA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;HA Groups&lt;/h3&gt;
&lt;p&gt;Groups define which nodes can run HA VMs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create group preferring pve1 and pve2
ha-manager groupadd production --nodes pve1,pve2

# Add VM to group
ha-manager set vm:100 --group production

# Node priority (higher number = preferred)
ha-manager groupadd production --nodes pve1:3,pve2:2,pve3:1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these priorities, VMs prefer pve1, fail over to pve2, and fall back to pve3 as a last resort.&lt;/p&gt;
&lt;h3&gt;Restricted Groups&lt;/h3&gt;
&lt;p&gt;Only allow VMs on specific nodes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create restricted group
ha-manager groupadd gpu-nodes --nodes pve2,pve3 --restricted 1

# VMs in this group can ONLY run on pve2 or pve3
ha-manager set vm:200 --group gpu-nodes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Useful for VMs needing specific hardware (GPU, special storage).&lt;/p&gt;
&lt;h2&gt;HA Manager Behavior&lt;/h2&gt;
&lt;h3&gt;Node Failure Sequence&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Node stops responding to cluster heartbeats
2. Other nodes detect failure (after timeout)
3. Quorum check: Do remaining nodes have majority?
4. If quorate:
   a. Attempt to fence failed node
   b. Wait for fence confirmation
   c. Start VMs on surviving nodes
5. If not quorate:
   a. Cluster freezes
   b. No HA actions (prevents split-brain)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Failover Timing&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Detection timeout:    30 seconds (default)
Fence attempt:        Variable (IPMI: seconds, watchdog: 60s)
VM startup:           10-60 seconds

Total failover time:  1-3 minutes typical
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For faster failover, tune detection but beware false positives.&lt;/p&gt;
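&lt;p&gt;This budget is simple arithmetic. A quick sketch using the default-ish numbers above (illustrative, not measured):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Rough failover budget
detect=30        # cluster detects the dead node
fence=60         # watchdog worst case
start=45         # VM boot on a surviving node
total=$((detect + fence + start))
echo Expected failover: about $total seconds
&lt;/code&gt;&lt;/pre&gt;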
&lt;h3&gt;Resource Migration&lt;/h3&gt;
&lt;p&gt;When a node comes back online, its VMs don&apos;t automatically migrate back:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# VMs stay on failover node until:
# 1. Manual migration
# 2. Next failure
# 3. Maintenance mode + recovery

# To migrate back manually
qm migrate 100 pve1 --online
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is intentional. Automatic &quot;failback&quot; risks unnecessary disruption.&lt;/p&gt;
&lt;h2&gt;Maintenance Mode&lt;/h2&gt;
&lt;p&gt;Before working on a node, use maintenance mode:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Request maintenance (HA migrates VMs away; needs PVE 7.3+)
ha-manager crm-command node-maintenance enable pve1

# Check status
ha-manager status

# Wait for migrations to complete
# Do maintenance work

# Disable maintenance
ha-manager crm-command node-maintenance disable pve1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gracefully moves VMs, unlike a failure which is disruptive.&lt;/p&gt;
&lt;h3&gt;Manual VM Migration&lt;/h3&gt;
&lt;p&gt;For HA VMs, use:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Request HA to migrate
ha-manager migrate vm:100 pve2

# Or set VM to ignored temporarily
ha-manager set vm:100 --state ignored
qm migrate 100 pve2 --online
ha-manager set vm:100 --state started
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Don&apos;t just &lt;code&gt;qm migrate&lt;/code&gt; an HA VM — HA manager might fight you.&lt;/p&gt;
&lt;h2&gt;Testing HA (Critical)&lt;/h2&gt;
&lt;h3&gt;Test 1: Simulated Node Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On the node to &quot;fail&quot;
# Warning: if this node runs HA resources, it will self-fence
# (watchdog reboot) about a minute after losing quorum
systemctl stop pve-cluster corosync

# Watch from another node
ha-manager status

# VMs should migrate to other nodes
# After 1-2 minutes, check VMs are running elsewhere

# Restore node
systemctl start corosync pve-cluster
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test 2: Hard Power Off&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: These commands immediately crash the node without graceful shutdown.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Physical power button or:
echo b &amp;gt; /proc/sysrq-trigger  # Immediate reboot (no sync)

# Or IPMI (preferred for remote testing):
ipmitool chassis power off

# This tests actual fencing behavior
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test 3: Network Partition&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On node, drop cluster traffic
iptables -A INPUT -p udp --dport 5405:5412 -j DROP
iptables -A OUTPUT -p udp --dport 5405:5412 -j DROP

# Node should fence itself (watchdog) or be fenced (IPMI)
# VMs should migrate

# Restore
iptables -F
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Test 4: Storage Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# If using NFS, unmount it
umount -l /mnt/nfs-storage

# HA behavior depends on configuration
# VMs using that storage should fail
# Other VMs should continue

# Document what happens!
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Document Test Results&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;HA Test Report - 2025-01-08

Test: Node power off (pve2)
Method: IPMI power off
Expected: VMs 100, 101 migrate to pve1 or pve3

Timeline:
- 00:00 Power off pve2
- 00:32 Cluster detects failure
- 00:45 Fence confirmed
- 01:15 VM 100 started on pve1
- 01:28 VM 101 started on pve3

Total failover: 1 minute 28 seconds
Result: PASS

Issues: None
Tested by: Admin
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common HA Problems&lt;/h2&gt;
&lt;h3&gt;&quot;No quorum&quot; — Nothing Happens&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check quorum
pvecm status | grep Quorate

# If &quot;Quorate: No&quot;, cluster can&apos;t make decisions
# Need majority of nodes online
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fix: Add more nodes, add QDevice, or manually set expected votes (dangerous).&lt;/p&gt;
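&lt;p&gt;The majority rule is just integer arithmetic. A sketch of how many votes each cluster size needs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Majority needed for quorum: floor(votes / 2) + 1
for nodes in 2 3 4 5; do
  echo $nodes nodes need $((nodes / 2 + 1)) votes
done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note the 2-node case: it needs 2 votes, so losing either node loses quorum. That is why 2-node clusters need a QDevice.&lt;/p&gt;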
&lt;h3&gt;VMs Won&apos;t Start After Failover&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check HA manager logs
journalctl -u pve-ha-lrm -f

# Common causes:
# - Shared storage not available
# - Resource constraints (RAM, CPU)
# - Start dependencies
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Split-Brain Detected&lt;/h3&gt;
&lt;p&gt;If the same VM somehow ran on two nodes at once:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# IMMEDIATELY stop VMs on one node
qm stop 100 --skiplock

# Check for disk corruption
# Restore from backup if needed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is catastrophic. Prevent with proper fencing.&lt;/p&gt;
&lt;h3&gt;HA Service Stuck&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Restart HA services
systemctl restart pve-ha-crm
systemctl restart pve-ha-lrm

# Check status
ha-manager status
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;HA Architecture&lt;/h2&gt;
&lt;h3&gt;Minimum Viable HA&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;3 nodes minimum (for quorum)
Shared storage (NFS, Ceph, iSCSI)
Fencing (watchdog at minimum)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Production HA&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;3+ nodes
Redundant network (bonding)
Dedicated cluster network
Ceph or enterprise SAN
Hardware fencing (IPMI)
UPS with monitoring
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;HA Network Topology&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;          ┌───────────────────────────────────┐
         │        Cluster Network            │
         │    (Corosync, fencing, HA)        │
         └───────────┬───────────┬───────────┘
                     │           │
         ┌───────────┴───┐   ┌───┴───────────┐
         │     pve1      │   │     pve2      │
         │   (node 1)    │   │   (node 2)    │
         └───────┬───────┘   └───────┬───────┘
                 │                   │
         ┌───────┴───────────────────┴───────┐
         │           Storage Network          │
         │        (Ceph, iSCSI, NFS)          │
         └───────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Separate networks for cluster and storage traffic prevent storage load from delaying Corosync heartbeats and triggering false HA decisions.&lt;/p&gt;
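&lt;p&gt;A dedicated cluster link is declared per node in &lt;code&gt;/etc/pve/corosync.conf&lt;/code&gt;. A minimal sketch (the addresses are examples; remember to bump &lt;code&gt;config_version&lt;/code&gt; in the totem section whenever you edit this file):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # dedicated cluster network
    ring1_addr: 10.0.0.11    # fallback link over the general LAN
  }
}
&lt;/code&gt;&lt;/pre&gt;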
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;HA without tests is just a checkbox.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Enabling HA takes 30 seconds. Testing it takes hours. But that testing is what determines whether HA works when you need it.&lt;/p&gt;
&lt;p&gt;The checkbox says &quot;HA enabled.&quot; The test proves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fencing actually works&lt;/li&gt;
&lt;li&gt;VMs actually migrate&lt;/li&gt;
&lt;li&gt;Storage is actually shared&lt;/li&gt;
&lt;li&gt;Recovery time meets requirements&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every HA setup has edge cases. The node that takes 5 minutes to fence. The VM that won&apos;t start because of resource constraints. The storage path that fails under load.&lt;/p&gt;
&lt;p&gt;You find these in testing, or you find them in production. Testing is cheaper.&lt;/p&gt;
&lt;p&gt;Schedule regular HA tests. Document what happens. Fix what&apos;s broken. That&apos;s how you turn a checkbox into actual high availability.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>ha</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Snapshots vs Backups vs Replication: What Saved Me and What Didn&apos;t</title><link>https://ashimov.com/posts/proxmox-recovery/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-recovery/</guid><description>Understanding data protection layers in Proxmox. Covers snapshots, backups, and replication with real failure scenarios, RPO/RTO planning, and why replication is not a replacement for backups.</description><pubDate>Fri, 29 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve lost data three times in production. Each time taught me something different about what &quot;protected&quot; actually means.&lt;/p&gt;
&lt;p&gt;First time: snapshot on same disk that failed. Snapshot died with the disk.
Second time: backup existed, but retention policy had pruned the version I needed.
Third time: replication was running, but it replicated the corruption.&lt;/p&gt;
&lt;p&gt;Snapshots, backups, and replication are different tools solving different problems. Using the wrong one for your failure scenario means learning the hard way.&lt;/p&gt;
&lt;h2&gt;The Three Protection Layers&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Snapshot&lt;/th&gt;
&lt;th&gt;Backup&lt;/th&gt;
&lt;th&gt;Replication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Location&lt;/td&gt;
&lt;td&gt;Same storage&lt;/td&gt;
&lt;td&gt;Different storage&lt;/td&gt;
&lt;td&gt;Different node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;td&gt;Minutes-hours&lt;/td&gt;
&lt;td&gt;Continuous&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protection&lt;/td&gt;
&lt;td&gt;Human error&lt;/td&gt;
&lt;td&gt;Hardware failure&lt;/td&gt;
&lt;td&gt;Site failure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Point-in-time&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Near-real-time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Survives disk failure&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Depends&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Survives site failure&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;If off-site&lt;/td&gt;
&lt;td&gt;If different site&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each layer protects against different failures. You need all three.&lt;/p&gt;
&lt;h2&gt;Snapshots&lt;/h2&gt;
&lt;p&gt;A snapshot captures VM state at a point in time — disk and optionally RAM.&lt;/p&gt;
&lt;h3&gt;How Snapshots Work&lt;/h3&gt;
&lt;p&gt;ZFS/LVM snapshots are copy-on-write:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Before snapshot:
  Disk blocks: [A][B][C][D][E]

After snapshot:
  Current:     [A][B][C][D][E]
  Snapshot:    → points to same blocks

After modification (block C changed):
  Current:     [A][B][C&apos;][D][E]
  Snapshot:    [A][B][C][D][E]  (old C preserved)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Snapshots are instant because nothing is copied initially. Only changed blocks are preserved.&lt;/p&gt;
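&lt;p&gt;This is why a fresh snapshot consumes no space and grows with churn. A toy model of the accounting (the block size is illustrative; ZFS recordsize and volblocksize vary):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Snapshot space grows only with changed blocks (toy model, not a ZFS query)
block_size_kb=128
blocks_changed=8000
snap_space_mb=$((block_size_kb * blocks_changed / 1024))
echo Snapshot now pins about $snap_space_mb MB of old blocks
&lt;/code&gt;&lt;/pre&gt;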
&lt;h3&gt;Creating Snapshots&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VM snapshot (disk + state)
qm snapshot 100 before-upgrade --description &quot;Before kernel upgrade&quot;

# List snapshots
qm listsnapshot 100

# Rollback
qm rollback 100 before-upgrade

# Delete snapshot
qm delsnapshot 100 before-upgrade
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What Snapshots Are Good For&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Before risky changes&lt;/strong&gt;: Upgrade, config change, experimental work&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quick rollback&lt;/strong&gt;: &quot;Oops, that broke it&quot; → 30-second recovery&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Testing&lt;/strong&gt;: Try something, snapshot, try variations, rollback&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What Snapshots Don&apos;t Protect Against&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Failure scenario: Disk dies
Snapshots: Also dead (same disk)
Result: Total loss

Failure scenario: Storage controller fails
Snapshots: Also dead (same storage)
Result: Total loss

Failure scenario: Ransomware encrypts VM
Snapshots: Might survive if attacker doesn&apos;t find them
Result: Maybe recoverable, maybe not

Failure scenario: Accidental snapshot deletion
Snapshots: Gone
Result: No protection
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Rule: Snapshots are convenience, not protection.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Backups&lt;/h2&gt;
&lt;p&gt;Backups copy data to separate storage.&lt;/p&gt;
&lt;h3&gt;Backup to PBS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Full backup to PBS
vzdump 100 --storage pbs-store --mode snapshot --compress zstd

# Incremental (only changed since last)
# PBS does this automatically
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What Backups Protect Against&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Failure scenario: Primary storage dies
Backups on PBS: Safe
Result: Restore from backup

Failure scenario: Host fails completely
Backups on PBS: Safe (different hardware)
Result: Restore to new host

Failure scenario: Accidental VM deletion
Backups: Safe (separate system)
Result: Restore deleted VM

Failure scenario: Ransomware encrypts VM
Backups: Safe if not mounted/accessible to VM
Result: Restore clean version
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Backup Limitations&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Failure scenario: Backup storage also fails
Result: Both copies lost

Failure scenario: Retention pruned the backup you need
Result: Can&apos;t restore that point in time

Failure scenario: Site-wide disaster (fire, flood)
On-site backups: Also destroyed
Result: Total loss without off-site copy
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;RPO: Recovery Point Objective&lt;/h3&gt;
&lt;p&gt;How much data can you lose?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Daily backups at 1 AM:
- Failure at 11 PM = 22 hours of data loss
- RPO = 24 hours

Hourly backups:
- Maximum 1 hour of data loss
- RPO = 1 hour

Real-time replication:
- Seconds of data loss
- RPO ≈ 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Match backup frequency to acceptable data loss.&lt;/p&gt;
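&lt;p&gt;The 22-hour figure above is just clock arithmetic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Data loss if backups run at 01:00 and a failure hits at 23:00 the same day
backup_hour=1
failure_hour=23
loss_hours=$((failure_hour - backup_hour))
echo Data lost: $loss_hours hours
&lt;/code&gt;&lt;/pre&gt;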
&lt;h3&gt;RTO: Recovery Time Objective&lt;/h3&gt;
&lt;p&gt;How fast must you recover?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Full VM restore from PBS:
- 100GB VM ≈ 10-30 minutes
- RTO ≈ 30 minutes

Restore from off-site:
- Download time + restore time
- RTO = hours

Rebuild from scratch + restore data:
- RTO = hours to days
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Match recovery method to acceptable downtime.&lt;/p&gt;
&lt;h2&gt;Replication&lt;/h2&gt;
&lt;p&gt;Replication continuously copies data to another location.&lt;/p&gt;
&lt;h3&gt;Proxmox Replication&lt;/h3&gt;
&lt;p&gt;Built-in ZFS replication between cluster nodes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create replication job
pvesr create-local-job 100-0 pve2 --schedule &apos;*/15&apos;  # Every 15 min

# Check replication status
pvesr status

# List jobs
pvesr list
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;How Replication Works&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Node 1 (primary)           Node 2 (replica)
┌──────────────┐           ┌──────────────┐
│   VM 100     │           │  VM 100      │
│   (active)   │──────────►│  (standby)   │
│              │  ZFS send │              │
└──────────────┘           └──────────────┘

Every 15 minutes: incremental sync
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What Replication Protects Against&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Failure scenario: Node 1 hardware failure
Replica on Node 2: Ready to start
Result: Activate replica, minimal downtime

Failure scenario: Storage failure on Node 1
Replica on Node 2: Has recent copy
Result: Start replica (with potential 15-min data loss)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What Replication Does NOT Protect Against&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Failure scenario: VM data corruption (application bug)
Replication: Replicates the corruption to Node 2
Result: Both copies corrupted

Failure scenario: Ransomware encrypts VM
Replication: Replicates encrypted data
Result: Both copies encrypted

Failure scenario: Accidental VM deletion
Replication: Deletion replicates
Result: Both copies deleted

Failure scenario: Cluster-wide issue
Replication: Both nodes affected
Result: No protection
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Rule: Replication protects against hardware failure, not data corruption.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;The Three-Layer Strategy&lt;/h2&gt;
&lt;p&gt;For critical VMs, use all three:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Layer 1: Snapshots
- Before changes
- Quick rollback
- Same-disk convenience

Layer 2: Backups (PBS)
- Daily/hourly
- Different storage
- Historical retention

Layer 3: Replication
- Near-real-time
- Different node
- Fast failover
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Example Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# VM 100: Critical web application

# Layer 1: Manual snapshots before changes
qm snapshot 100 pre-upgrade

# Layer 2: Hourly backups to PBS, 30-day retention
# Backup job: hourly to pbs-store
# Retention: keep-hourly=24,keep-daily=30

# Layer 3: 15-minute replication to second node
pvesr create-local-job 100-0 pve2 --schedule &apos;*/15&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Recovery scenarios:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recovery Method&lt;/th&gt;
&lt;th&gt;Data Loss&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bad config change&lt;/td&gt;
&lt;td&gt;Rollback snapshot&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Host hardware failure&lt;/td&gt;
&lt;td&gt;Start replica&lt;/td&gt;
&lt;td&gt;Up to 15 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage failure&lt;/td&gt;
&lt;td&gt;Restore from PBS&lt;/td&gt;
&lt;td&gt;Up to 1 hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data corruption&lt;/td&gt;
&lt;td&gt;Restore from PBS (earlier point)&lt;/td&gt;
&lt;td&gt;Variable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Site disaster&lt;/td&gt;
&lt;td&gt;Restore from off-site PBS&lt;/td&gt;
&lt;td&gt;Up to 24 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Real Failure Scenarios&lt;/h2&gt;
&lt;h3&gt;Scenario 1: Disk Failure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Situation: ZFS pool loses a disk in mirror
Snapshots: Still available (pool degraded but working)
Replication: Working
Backups: Working

Action: Replace disk, resilver, no VM downtime
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 2: Complete Storage Loss&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Situation: Storage controller failure, pool unimportable
Snapshots: Lost
Replication: Available on other node

Action: Start replica, 15 minutes data loss
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 3: Database Corruption&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Situation: App bug corrupts database on Tuesday
Discovered: Thursday
Replication: Has corrupted data
Recent backups: Have corrupted data
Older backup from Monday: Clean

Action: Restore Monday backup, replay transaction logs if possible
Lesson: Longer backup retention matters
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scenario 4: Ransomware&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Situation: Ransomware encrypts VM on Friday night
Replication: Encrypted copy on second node
Snapshots: Might be encrypted (if attacker accessed)
PBS backups: Clean (PBS not mounted inside VM)

Action: Restore from PBS backup before infection
Lesson: Air-gapped backups survive ransomware
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Calculating Your Strategy&lt;/h2&gt;
&lt;h3&gt;For Each VM, Answer:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RPO&lt;/strong&gt;: How much data loss is acceptable?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Minutes → Replication + frequent backups&lt;/li&gt;
&lt;li&gt;Hours → Hourly backups&lt;/li&gt;
&lt;li&gt;Days → Daily backups&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RTO&lt;/strong&gt;: How fast must it recover?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Minutes → Replication + HA&lt;/li&gt;
&lt;li&gt;Hours → Local PBS restore&lt;/li&gt;
&lt;li&gt;Days → Off-site restore okay&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retention&lt;/strong&gt;: How far back might you need?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Days → Short retention&lt;/li&gt;
&lt;li&gt;Months → Longer retention&lt;/li&gt;
&lt;li&gt;Compliance → Years (archive separately)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Example: Different VM Classes&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Class A: Critical (database, ERP)
- RPO: 15 minutes
- RTO: 30 minutes
- Retention: 90 days
Strategy: Replication (15 min) + Hourly PBS + Monthly off-site

Class B: Important (web servers, apps)
- RPO: 1 hour
- RTO: 4 hours
- Retention: 30 days
Strategy: Hourly PBS backup

Class C: Development (test VMs)
- RPO: 24 hours
- RTO: Next business day
- Retention: 7 days
Strategy: Daily PBS backup

Class D: Ephemeral (CI runners)
- RPO: N/A (rebuild from config)
- RTO: Minutes (just recreate)
- Retention: None
Strategy: No backup (infrastructure as code)
&lt;/code&gt;&lt;/pre&gt;
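&lt;p&gt;One possible translation of these classes into PBS keep options (the exact values are a judgment call, not the only valid mapping):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical class-to-retention mapping
class_a=keep-hourly=24,keep-daily=90
class_b=keep-hourly=24,keep-daily=30
class_c=keep-daily=7
echo Class A prune options: $class_a
echo Class B prune options: $class_b
echo Class C prune options: $class_c
&lt;/code&gt;&lt;/pre&gt;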
&lt;h2&gt;Testing Your Strategy&lt;/h2&gt;
&lt;h3&gt;Monthly Tests&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# 1. Snapshot rollback test
qm snapshot 100 test-snap
# Make a change
qm rollback 100 test-snap
# Verify rollback worked

# 2. Backup restore test
qmrestore pbs-store:backup/vm/100/... 900
qm start 900
# Verify VM works
qm destroy 900

# 3. Replication failover test
# Stop source VM
qm stop 100
# Start replica on other node
# Verify it works
# Fail back to primary
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Document Results&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Test Date: 2025-01-08
Tested by: Admin

Snapshot rollback: PASS (30 seconds)
PBS restore (100GB VM): PASS (12 minutes)
Replication failover: PASS (2 minutes)

Issues found: None
Next test: 2025-02-08
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Replication is not a replacement for PBS. It&apos;s a different layer.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Each protection layer handles different failures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snapshots&lt;/strong&gt;: Undo mistakes (same disk)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backups&lt;/strong&gt;: Recover from hardware failure (different storage)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replication&lt;/strong&gt;: Fast failover (different node)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Off-site&lt;/strong&gt;: Survive site disasters (different location)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The failure you&apos;ll have is the one you didn&apos;t plan for. If you only have replication, you&apos;ll face data corruption. If you only have daily backups, you&apos;ll have the failure at 11 PM. If you only have on-site backups, you&apos;ll have the site disaster.&lt;/p&gt;
&lt;p&gt;Layer your protection. Test your recovery. Know exactly what each layer protects against and what it doesn&apos;t.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>backup</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Backups Done Right: Proxmox Backup Server, Schedules, Retention, and Restore Drills</title><link>https://ashimov.com/posts/proxmox-backups/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-backups/</guid><description>Complete guide to Proxmox Backup Server. Covers installation, incremental backups, deduplication, retention policies, verification, and why a backup only exists after a successful restore test.</description><pubDate>Tue, 26 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Everyone has backups until they need to restore. Then they discover: the backup never completed, the retention deleted what they needed, or worse — they&apos;ve never actually tested a restore.&lt;/p&gt;
&lt;p&gt;Proxmox Backup Server (PBS) is excellent backup software. Deduplication, incremental forever, encryption, verification. But software doesn&apos;t matter if your process is broken.&lt;/p&gt;
&lt;p&gt;A backup exists only after a successful restore test. Everything else is hope.&lt;/p&gt;
&lt;h2&gt;Why Proxmox Backup Server&lt;/h2&gt;
&lt;p&gt;PBS is purpose-built for Proxmox VE:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Incremental forever&lt;/strong&gt;: Only changed blocks transfer after first backup&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deduplication&lt;/strong&gt;: Identical data stored once, even across VMs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encryption&lt;/strong&gt;: Client-side encryption, PBS never sees plaintext&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verification&lt;/strong&gt;: Built-in integrity checking&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pruning&lt;/strong&gt;: Automatic retention policies&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Compared to vzdump-to-NFS:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;10x less storage for typical workloads (dedup)&lt;/li&gt;
&lt;li&gt;5x faster backups (incremental)&lt;/li&gt;
&lt;li&gt;Actual verification (not just &quot;file exists&quot;)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Installing PBS&lt;/h2&gt;
&lt;h3&gt;Dedicated Machine (Recommended)&lt;/h3&gt;
&lt;p&gt;PBS should run on separate hardware. If your Proxmox host dies, your backups shouldn&apos;t die with it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Download PBS ISO from proxmox.com
# Install on dedicated hardware or VM (on different host)

# After install, configure network
nano /etc/network/interfaces

# Update repositories (same as PVE)
mv /etc/apt/sources.list.d/pbs-enterprise.list /etc/apt/sources.list.d/pbs-enterprise.list.disabled
echo &quot;deb http://download.proxmox.com/debian/pbs bookworm pbs-no-subscription&quot; &amp;gt; /etc/apt/sources.list.d/pbs-no-subscription.list
apt update &amp;amp;&amp;amp; apt full-upgrade -y
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Access web UI at &lt;code&gt;https://&amp;lt;pbs-ip&amp;gt;:8007&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;PBS as VM on Proxmox&lt;/h3&gt;
&lt;p&gt;Acceptable for homelab, not ideal:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create VM for PBS
qm create 999 --name pbs --memory 4096 --cores 2 \
  --net0 virtio,bridge=vmbr0 \
  --scsi0 local-zfs:32 \
  --cdrom local:iso/proxmox-backup-server.iso
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: Store PBS VM on different storage than what you&apos;re backing up. If your main storage fails, PBS VM should survive.&lt;/p&gt;
&lt;h2&gt;Storage Configuration&lt;/h2&gt;
&lt;h3&gt;Datastore Setup&lt;/h3&gt;
&lt;p&gt;PBS organizes backups into datastores:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create datastore directory
mkdir -p /backup/datastore1

# Via web UI: Administration → Storage/Disks → Directory → Create: Datastore

# Or via CLI
proxmox-backup-manager datastore create store1 /backup/datastore1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Storage Sizing&lt;/h3&gt;
&lt;p&gt;Deduplication means storage math is different:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Without dedup: 10 VMs × 100GB × 30 backups = 30TB
With dedup:    10 VMs × 100GB × 30 backups ≈ 500GB - 2TB
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Actual ratio depends on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How similar VMs are (templates = high dedup)&lt;/li&gt;
&lt;li&gt;How much data changes between backups&lt;/li&gt;
&lt;li&gt;How many retention points&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Start with 2x your VM total size, monitor, adjust.&lt;/p&gt;
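&lt;p&gt;A back-of-the-envelope sizing sketch (the dedup ratio is an assumption; measure your own after a few weeks of real backups):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Estimate datastore need at an assumed dedup ratio
vm_total_gb=1000        # sum of all VM disk usage
retention_points=30
dedup_ratio=15          # assumed; real ratios vary widely
raw_gb=$((vm_total_gb * retention_points))
est_gb=$((raw_gb / dedup_ratio))
echo Raw: ${raw_gb} GB, estimated with dedup: ${est_gb} GB
&lt;/code&gt;&lt;/pre&gt;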
&lt;h2&gt;Connecting Proxmox VE to PBS&lt;/h2&gt;
&lt;h3&gt;Add PBS to Proxmox&lt;/h3&gt;
&lt;p&gt;On Proxmox VE:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Add PBS storage
pvesm add pbs pbs-store \
  --server 10.0.0.50 \
  --datastore store1 \
  --username backup@pbs \
  --password \
  --fingerprint &amp;lt;fingerprint&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Get fingerprint from PBS:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On PBS
proxmox-backup-manager cert info | grep Fingerprint
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: Datacenter → Storage → Add → Proxmox Backup Server&lt;/p&gt;
&lt;h3&gt;Create Backup User&lt;/h3&gt;
&lt;p&gt;On PBS, create dedicated backup user:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create user
proxmox-backup-manager user create backup@pbs

# Create API token (more secure than password)
proxmox-backup-manager user generate-token backup@pbs automation

# Grant the token backup rights on the datastore
proxmox-backup-manager acl update /datastore/store1 DatastoreBackup --auth-id &apos;backup@pbs!automation&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use API token in Proxmox connection.&lt;/p&gt;
&lt;h2&gt;Backup Jobs&lt;/h2&gt;
&lt;h3&gt;Manual Backup&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Backup single VM
vzdump 100 --storage pbs-store --mode snapshot

# Backup multiple VMs
vzdump 100 101 102 --storage pbs-store --mode snapshot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scheduled Backups&lt;/h3&gt;
&lt;p&gt;Via Web UI: Datacenter → Backup → Add&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Storage: pbs-store
Schedule: 01:00 (daily at 1 AM)
Selection mode: Include all (or specific VMs)
Mode: Snapshot
Compression: ZSTD
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create backup job (schedule is a systemd-style calendar event, not cron)
pvesh create /cluster/backup --storage pbs-store --schedule &quot;01:00&quot; --all 1 --mode snapshot --compress zstd
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Backup Modes&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Downtime&lt;/th&gt;
&lt;th&gt;Consistency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Snapshot&lt;/td&gt;
&lt;td&gt;Atomic snapshot, VM keeps running&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Suspend&lt;/td&gt;
&lt;td&gt;Pause VM, backup, resume&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;td&gt;Better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stop&lt;/td&gt;
&lt;td&gt;Shutdown, backup, start&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Recommendation&lt;/strong&gt;: Use snapshot mode. If you need perfect consistency, use application-level tools (database dumps, etc.) before backup.&lt;/p&gt;
&lt;h2&gt;Retention Policies&lt;/h2&gt;
&lt;p&gt;Retention determines how many backups to keep. PBS supports GFS (Grandfather-Father-Son):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Keep last:    3      # Always keep last 3 backups
Keep hourly:  24     # Keep 24 hourly backups
Keep daily:   7      # Keep 7 daily backups
Keep weekly:  4      # Keep 4 weekly backups
Keep monthly: 6      # Keep 6 monthly backups
Keep yearly:  2      # Keep 2 yearly backups
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Recent: Multiple restore points&lt;/li&gt;
&lt;li&gt;Medium-term: Daily granularity&lt;/li&gt;
&lt;li&gt;Long-term: Monthly/yearly for compliance&lt;/li&gt;
&lt;/ul&gt;
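&lt;p&gt;The example policy above retains at most:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Upper bound on retained restore points
# (categories can overlap, so the real count is usually lower)
total=$((3 + 24 + 7 + 4 + 6 + 2))
echo At most $total restore points
&lt;/code&gt;&lt;/pre&gt;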
&lt;h3&gt;Configure Retention on Proxmox&lt;/h3&gt;
&lt;p&gt;In backup job configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Edit backup job
pvesh set /cluster/backup/&amp;lt;jobid&amp;gt; --prune-backups keep-last=3,keep-daily=7,keep-weekly=4,keep-monthly=6
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prune Schedule on PBS&lt;/h3&gt;
&lt;p&gt;PBS runs pruning automatically. Check/configure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On PBS
proxmox-backup-manager prune-job list
proxmox-backup-manager prune-job create prune-store1 --store store1 --schedule daily --keep-daily 7
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Verification&lt;/h2&gt;
&lt;p&gt;Backups without verification are hopes, not backups.&lt;/p&gt;
&lt;h3&gt;Automatic Verification&lt;/h3&gt;
&lt;p&gt;PBS can verify backups automatically:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On PBS - create verification job
proxmox-backup-manager verify-job create weekly-verify \
  --store store1 \
  --schedule &quot;weekly&quot; \
  --outdated-after 7
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This reads all backup chunks and verifies checksums. Catches:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bit rot&lt;/li&gt;
&lt;li&gt;Storage corruption&lt;/li&gt;
&lt;li&gt;Incomplete backups&lt;/li&gt;
&lt;/ul&gt;
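&lt;p&gt;At its core, verification is &quot;re-read the data, re-hash it, compare against the recorded digest.&quot; A minimal sketch of that idea using coreutils &lt;code&gt;sha256sum&lt;/code&gt; (not the actual PBS chunk format):&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Re-hash a file and compare to a previously recorded digest.
# Returns nonzero if the data no longer matches (bit rot, truncation).
verify_file() {
  local file=$1 expected=$2
  [ "$(sha256sum "$file" | cut -d' ' -f1)" = "$expected" ]
}
```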
&lt;h3&gt;Manual Verification&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify all backups in a datastore (run on PBS)
proxmox-backup-manager verify store1

# Individual snapshots can be verified from the PBS web UI
# (Datastore → Content → Verify)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Restore Testing&lt;/h2&gt;
&lt;p&gt;Verification proves data integrity. Restore testing proves you can actually recover.&lt;/p&gt;
&lt;h3&gt;Schedule Regular Restore Tests&lt;/h3&gt;
&lt;p&gt;Monthly restore drill:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pick a random VM backup&lt;/li&gt;
&lt;li&gt;Restore to temporary VM&lt;/li&gt;
&lt;li&gt;Boot it, verify it works&lt;/li&gt;
&lt;li&gt;Delete temporary VM&lt;/li&gt;
&lt;li&gt;Document the test&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;# Restore to new VM
qmrestore pbs-store:backup/vm/100/2025-01-08T01:00:00Z 900 --storage local-zfs

# Boot and verify
qm start 900

# After verification
qm stop 900
qm destroy 900
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Restore Test Checklist&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Date: 2025-01-08
Backup tested: vm/100/2025-01-08T01:00:00Z
Original VM: web-server (100)
Restored as: test-restore (900)

[ ] Restore completed without errors
[ ] VM boots successfully
[ ] Services start (nginx, database, etc.)
[ ] Application responds (curl localhost)
[ ] Data integrity (sample data check)
[ ] Time to restore: 5 minutes

Tested by: Admin
Result: PASS
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Encryption&lt;/h2&gt;
&lt;p&gt;For off-site backups, enable encryption:&lt;/p&gt;
&lt;h3&gt;Generate Key&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On Proxmox
proxmox-backup-client key create /etc/pve/priv/backup-key.enc

# Protect the key!
cp /etc/pve/priv/backup-key.enc /secure/location/
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure Encrypted Backups&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Add PBS storage with encryption
pvesm add pbs pbs-encrypted \
  --server 10.0.0.50 \
  --datastore store1 \
  --username backup@pbs \
  --encryption-key /etc/pve/priv/backup-key.enc \
  --fingerprint &amp;lt;fingerprint&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: If you lose the encryption key, backups are unrecoverable. Store key securely, separately from backups.&lt;/p&gt;
&lt;h2&gt;Monitoring Backups&lt;/h2&gt;
&lt;h3&gt;Check Backup Status&lt;/h3&gt;
&lt;p&gt;On Proxmox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# List recent backups
pvesh get /nodes/pve1/storage/pbs-store/content

# Check backup job status
pvesh get /cluster/backup
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On PBS:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Datastore status
proxmox-backup-manager datastore list

# Recent backup tasks
proxmox-backup-manager task list
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Alerting&lt;/h3&gt;
&lt;p&gt;Configure email alerts on PBS:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Set an email on the user that should receive notifications,
# then point the datastore at that user
proxmox-backup-manager user update root@pam --email admin@example.com
proxmox-backup-manager datastore update store1 --notify-user root@pam
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key alerts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Backup job failures&lt;/li&gt;
&lt;li&gt;Verification failures&lt;/li&gt;
&lt;li&gt;Datastore space warnings&lt;/li&gt;
&lt;li&gt;Pruning issues&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Disaster Scenarios&lt;/h2&gt;
&lt;h3&gt;Scenario: VM Accidentally Deleted&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# List available backups
pvesh get /nodes/pve1/storage/pbs-store/content --vmid 100

# Restore
qmrestore pbs-store:backup/vm/100/2025-01-08T01:00:00Z 100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Recovery time: Minutes.&lt;/p&gt;
&lt;h3&gt;Scenario: Proxmox Host Failed&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On new/rebuilt host, add PBS storage
pvesm add pbs pbs-store ...

# List available backups
pvesh get /nodes/pve2/storage/pbs-store/content

# Restore all VMs
pvesh get /nodes/pve2/storage/pbs-store/content --output-format json | \
  jq -r &apos;.[] | &quot;\(.volid) \(.vmid)&quot;&apos; | \
  while read volid vmid; do
    qmrestore &quot;$volid&quot; &quot;$vmid&quot;
  done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Recovery time: Hours (depends on VM count and size).&lt;/p&gt;
&lt;h3&gt;Scenario: PBS Server Lost&lt;/h3&gt;
&lt;p&gt;This is why you need off-site copies:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# If you have replication to second PBS
# Use backup PBS as primary

# If no replication... restore from:
# - Off-site tape/cloud
# - Secondary backup system
# - Old-school vzdump files
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Off-Site Backups&lt;/h2&gt;
&lt;p&gt;PBS-to-PBS replication:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On source PBS, create sync job
proxmox-backup-manager sync-job create remote-sync \
  --store store1 \
  --remote pbs-remote \
  --remote-store offsite \
  --schedule &quot;daily&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This syncs deduplicated data to remote PBS. Only changed chunks transfer.&lt;/p&gt;
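&lt;p&gt;The &quot;only changed chunks&quot; behavior falls out of content addressing: a chunk&apos;s ID is its hash, so identical data maps to the same ID and never transfers twice. A toy illustration with &lt;code&gt;sha256sum&lt;/code&gt; standing in for the real chunk digest:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Content addressing in miniature: the chunk ID is the hash of its data,
# so duplicate chunks collapse to one stored (and one transferred) object.
chunk_id() { printf '%s' "$1" | sha256sum | cut -d' ' -f1; }

# How many distinct chunks would actually be stored or synced?
unique_chunks() {
  for c in "$@"; do chunk_id "$c"; done | sort -u | wc -l | tr -d ' '
}
```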
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A backup exists only after a successful restore test.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Having backup software running is step one. Having backups completing is step two. But the backup doesn&apos;t exist until you&apos;ve proven you can restore from it.&lt;/p&gt;
&lt;p&gt;The process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Configure&lt;/strong&gt;: PBS, jobs, retention&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automate&lt;/strong&gt;: Scheduled backups, verification&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test&lt;/strong&gt;: Monthly restore drills&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document&lt;/strong&gt;: What was tested, what was the result&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improve&lt;/strong&gt;: Fix issues found during tests&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The companies that recover from disasters aren&apos;t the ones with the best backup software. They&apos;re the ones who practiced recovery before they needed it.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>backup</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Cluster Setup: Joining Nodes, Quorum, and Corosync Realities</title><link>https://ashimov.com/posts/proxmox-cluster/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-cluster/</guid><description>Building a Proxmox cluster correctly. Covers node joining, quorum mechanics, split-brain prevention, Corosync networking, and why clustering is network discipline, not just a button.</description><pubDate>Fri, 22 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A Proxmox cluster looks simple: join nodes, share configuration, migrate VMs between them. Click a button, cluster created. The web UI makes it seem like magic.&lt;/p&gt;
&lt;p&gt;It&apos;s not magic. It&apos;s distributed systems, and distributed systems fail in ways that single nodes don&apos;t. Split-brain scenarios, quorum loss, network partitions — these aren&apos;t theoretical. They happen, and when they do, your VMs stop or corrupt.&lt;/p&gt;
&lt;p&gt;Clustering is not a button. It&apos;s network discipline and failure planning.&lt;/p&gt;
&lt;h2&gt;What a Cluster Actually Is&lt;/h2&gt;
&lt;p&gt;A Proxmox cluster is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Shared configuration&lt;/strong&gt;: &lt;code&gt;/etc/pve&lt;/code&gt; is replicated across all nodes via pmxcfs (a cluster filesystem)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Corosync&lt;/strong&gt;: Cluster communication layer handling membership and messaging&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quorum&lt;/strong&gt;: Voting system to prevent split-brain&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optional&lt;/strong&gt;: Shared storage, HA, live migration&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────┐    Corosync    ┌─────────────┐
│    Node 1   │◄──────────────►│    Node 2   │
│  (vote: 1)  │                │  (vote: 1)  │
└──────┬──────┘                └──────┬──────┘
       │                              │
       │         Corosync             │
       │◄────────────────────────────►│
       │                              │
       ▼                              ▼
┌─────────────┐
│    Node 3   │
│  (vote: 1)  │
└─────────────┘

Quorum: 2 of 3 votes required (majority)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Before You Cluster&lt;/h2&gt;
&lt;h3&gt;Network Requirements&lt;/h3&gt;
&lt;p&gt;Corosync needs reliable, low-latency networking:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dedicated network recommended&lt;/strong&gt;: Separate from VM traffic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Same subnet recommended&lt;/strong&gt;: Keep all Corosync links in one L2 segment; knet can route, but every hop adds latency and failure modes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Low latency&lt;/strong&gt;: Under 2ms round-trip ideally&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redundant links&lt;/strong&gt;: For production, use bonding or multiple Corosync rings&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Bad ideas:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Corosync over WAN (latency kills it)&lt;/li&gt;
&lt;li&gt;Corosync over congested VM network&lt;/li&gt;
&lt;li&gt;Single network link (any failure = cluster issues)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Hostname and DNS&lt;/h3&gt;
&lt;p&gt;Before clustering, every node needs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Correct hostname
hostnamectl set-hostname pve1.lab.local

# /etc/hosts must resolve all cluster nodes
cat /etc/hosts
127.0.0.1 localhost
10.0.0.10 pve1.lab.local pve1
10.0.0.11 pve2.lab.local pve2
10.0.0.12 pve3.lab.local pve3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;: Hostnames cannot change after clustering. Get them right now.&lt;/p&gt;
&lt;h3&gt;Time Synchronization&lt;/h3&gt;
&lt;p&gt;All nodes must have synchronized time:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check time sync
timedatectl status

# Should show &quot;System clock synchronized: yes&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Time drift causes certificate issues and cluster instability.&lt;/p&gt;
&lt;h2&gt;Creating the Cluster&lt;/h2&gt;
&lt;h3&gt;On First Node (pve1)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create cluster
pvecm create my-cluster

# Verify
pvecm status
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Output shows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Cluster information
-------------------
Name:             my-cluster
Config Version:   1
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             ...
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Joining Additional Nodes (pve2, pve3)&lt;/h3&gt;
&lt;p&gt;From each node to join:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Join cluster (run on pve2)
pvecm add 10.0.0.10

# Enter root password for pve1 when prompted
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After joining:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check status from any node
pvecm status

# Should show all nodes
pvecm nodes
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Quorum: Why It Matters&lt;/h2&gt;
&lt;p&gt;Quorum prevents split-brain — where two halves of a cluster both think they&apos;re in charge, making conflicting decisions.&lt;/p&gt;
&lt;h3&gt;How Quorum Works&lt;/h3&gt;
&lt;p&gt;Each node has votes (default: 1). Quorum requires majority:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Nodes&lt;/th&gt;
&lt;th&gt;Votes&lt;/th&gt;
&lt;th&gt;Quorum Needed&lt;/th&gt;
&lt;th&gt;Can Lose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0 nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0 nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1 node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1 node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2 nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
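&lt;p&gt;The table is just strict-majority arithmetic, which you can sanity-check in shell:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Quorum is a strict majority: floor(votes/2) + 1.
quorum_needed() { echo $(( $1 / 2 + 1 )); }

# Tolerable failures = total votes minus the quorum threshold.
nodes_can_lose() { echo $(( $1 - ($1 / 2 + 1) )); }
```

&lt;p&gt;Note that 4 nodes tolerate no more failures than 3 — even node counts buy nothing for quorum.&lt;/p&gt;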
&lt;p&gt;&lt;strong&gt;Two-node problem&lt;/strong&gt;: With 2 nodes, losing either means no quorum. Both nodes freeze.&lt;/p&gt;
&lt;h3&gt;Two-Node Cluster Solutions&lt;/h3&gt;
&lt;p&gt;Option 1: &lt;strong&gt;QDevice (recommended)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;External quorum device provides tie-breaking vote:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On a separate lightweight VM/LXC (not on cluster nodes!)
apt install corosync-qnetd

# On each cluster node
apt install corosync-qdevice
pvecm qdevice setup 10.0.0.100  # QDevice IP
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you have 2 nodes + 1 QDevice = 3 votes. Can survive 1 node failure.&lt;/p&gt;
&lt;p&gt;Option 2: &lt;strong&gt;Expected votes override (dangerous)&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On surviving node during split
pvecm expected 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tells the node &quot;expect only 1 vote for quorum.&quot; &lt;strong&gt;Dangerous&lt;/strong&gt; — only use when you&apos;re certain the other node is truly dead.&lt;/p&gt;
&lt;h3&gt;Checking Quorum Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Detailed quorum info
pvecm status

# Is cluster quorate?
pvecm status | grep Quorate
# Quorate: Yes  (means cluster can operate)
# Quorate: No   (means cluster is frozen)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Corosync Configuration&lt;/h2&gt;
&lt;h3&gt;View Current Config&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;cat /etc/pve/corosync.conf
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Redundant Corosync Links&lt;/h3&gt;
&lt;p&gt;For production, use multiple networks:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# View current links
pvecm status

# Specify both links when creating the cluster
pvecm create my-cluster --link0 10.0.0.10 --link1 10.10.0.10

# And when joining
pvecm add 10.0.0.10 --link0 10.0.0.11 --link1 10.10.0.11

# For an existing cluster: add ring1_addr entries in
# /etc/pve/corosync.conf and bump config_version
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now Corosync uses both networks. If one fails, the other maintains cluster.&lt;/p&gt;
&lt;h3&gt;Network Interface Configuration&lt;/h3&gt;
&lt;p&gt;Ensure Corosync interfaces are correctly configured:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check which interfaces Corosync uses
corosync-cfgtool -s

# Should show ring status for each link
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Common Cluster Operations&lt;/h2&gt;
&lt;h3&gt;Node Maintenance&lt;/h3&gt;
&lt;p&gt;Before working on a node:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Migrate all VMs off the node
# Via Web UI or:
for vmid in $(qm list | awk &apos;NR&amp;gt;1 {print $1}&apos;); do
    qm migrate $vmid pve2 --online
done

# If using HA, disable it temporarily
ha-manager set vm:100 --state disabled
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Removing a Node&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# On node being removed - stop cluster services
systemctl stop pve-cluster corosync

# On remaining node
pvecm delnode pve3

# On removed node - clean up
rm -rf /etc/pve/nodes/pve3
rm /etc/corosync/*
rm /var/lib/corosync/*
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Adding Node Back After Removal&lt;/h3&gt;
&lt;p&gt;The node must be completely clean:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On the node to re-add
systemctl stop pve-cluster corosync
rm -rf /etc/pve/*
rm -rf /etc/corosync/*
rm -rf /var/lib/corosync/*

# Then join fresh
pvecm add 10.0.0.10
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Split-Brain Scenarios&lt;/h2&gt;
&lt;h3&gt;What Happens&lt;/h3&gt;
&lt;p&gt;Network partition between nodes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;┌─────────┐         X         ┌─────────┐
│  pve1   │─────────X─────────│  pve2   │
│ (alone) │         X         │ (alone) │
└─────────┘   (network cut)   └─────────┘

Both nodes think: &quot;Is the other dead, or just unreachable?&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Without quorum:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Neither can be sure the other is truly dead&lt;/li&gt;
&lt;li&gt;Both freeze rather than risk conflicting operations&lt;/li&gt;
&lt;li&gt;VMs stop (better than corruption)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With quorum (3+ nodes or QDevice):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Majority side continues operating&lt;/li&gt;
&lt;li&gt;Minority side freezes&lt;/li&gt;
&lt;li&gt;Clear decision, no conflict&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Recovering from Split-Brain&lt;/h3&gt;
&lt;p&gt;If both sides made changes (shouldn&apos;t happen with proper quorum):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check pmxcfs status
cat /etc/pve/.members

# Force resync (DANGEROUS - data loss possible)
systemctl stop pve-cluster
pmxcfs -l  # Local mode
# Review /etc/pve, fix conflicts manually
systemctl start pve-cluster
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is why you prevent split-brain rather than recover from it.&lt;/p&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Cluster Won&apos;t Form&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check Corosync status
systemctl status corosync

# Check logs
journalctl -u corosync -f

# Common issues:
# - Firewall blocking ports 5405-5412/udp
# - Hostname mismatch
# - Time drift
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Node Shows as Offline&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check from &quot;offline&quot; node
pvecm status

# Check network connectivity
ping pve1
ping pve2

# Check Corosync communication
corosync-cfgtool -s
# Ring should show &quot;no faults&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&quot;Cluster not quorate&quot; Error&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check how many nodes are visible
pvecm nodes

# If nodes are missing, check network
# If all nodes present but not quorate, check vote count
pvecm status | grep -E &quot;Expected|Total&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Network Design for Clusters&lt;/h2&gt;
&lt;h3&gt;Minimum (Lab)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;                    ┌─────────────┐
All traffic ───────►│   Switch    │
                    └──────┬──────┘
              ┌────────────┼────────────┐
              ▼            ▼            ▼
           pve1         pve2         pve3
        10.0.0.10    10.0.0.11    10.0.0.12
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Single network for everything. Works, but any network issue affects cluster.&lt;/p&gt;
&lt;h3&gt;Recommended (Production)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Corosync Network (dedicated)
          ┌─────────────┐
          │  Switch A   │
          └──────┬──────┘
    ┌────────────┼────────────┐
    ▼            ▼            ▼
 pve1         pve2         pve3
10.10.0.10  10.10.0.11  10.10.0.12

Management + VM Network
          ┌─────────────┐
          │  Switch B   │
          └──────┬──────┘
    ┌────────────┼────────────┐
    ▼            ▼            ▼
 pve1         pve2         pve3
10.0.0.10   10.0.0.11   10.0.0.12
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Separate networks. Corosync traffic isolated from VM traffic.&lt;/p&gt;
&lt;h3&gt;Best (Production + Redundancy)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Corosync Ring 0          Corosync Ring 1
    Switch A                 Switch B
       │                        │
   ┌───┼───┐                ┌───┼───┐
   ▼   ▼   ▼                ▼   ▼   ▼
 pve1 pve2 pve3           pve1 pve2 pve3

Both rings active. Either can fail without cluster impact.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A cluster is not a button. It&apos;s network discipline and failure planning.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Clicking &quot;Create Cluster&quot; is the easy part. The hard part is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Network reliability (Corosync needs it)&lt;/li&gt;
&lt;li&gt;Quorum planning (how many nodes can you lose?)&lt;/li&gt;
&lt;li&gt;Split-brain prevention (QDevice for 2 nodes)&lt;/li&gt;
&lt;li&gt;Failure testing (does it actually fail over?)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A cluster that hasn&apos;t been failure-tested is a cluster that will surprise you. Test node failures. Test network partitions. Know what happens before production depends on it.&lt;/p&gt;
&lt;p&gt;The goal isn&apos;t &quot;we have a cluster.&quot; The goal is &quot;we understand how our cluster fails and have planned for it.&quot;&lt;/p&gt;
</content:encoded><category>proxmox</category><category>ha</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>LXC vs VM: When Containers Are a Gift (and When They Bite)</title><link>https://ashimov.com/posts/proxmox-lxc-vs-vm/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-lxc-vs-vm/</guid><description>Practical guide to choosing between LXC containers and VMs on Proxmox. Covers performance differences, security boundaries, use cases, and why containers offer speed but not always isolation.</description><pubDate>Tue, 19 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Proxmox gives you two virtualization options: KVM virtual machines and LXC containers. Both run workloads. Both appear as separate systems. But under the hood, they&apos;re fundamentally different — and that difference matters more than most people realize.&lt;/p&gt;
&lt;p&gt;Containers are fast. Boot in seconds, minimal overhead, efficient resource use. VMs are slower to start, use more memory, but provide real isolation.&lt;/p&gt;
&lt;p&gt;The question isn&apos;t &quot;which is better&quot; — it&apos;s &quot;which is appropriate.&quot; And getting that wrong means either wasting resources or creating security problems.&lt;/p&gt;
&lt;h2&gt;How They Actually Work&lt;/h2&gt;
&lt;h3&gt;KVM Virtual Machines&lt;/h3&gt;
&lt;p&gt;Each VM runs its own kernel. Complete isolation from the host.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Host Kernel (Proxmox)
     │
     └── QEMU/KVM Hypervisor
              │
              ├── VM1 (Linux kernel) ─── Processes
              ├── VM2 (Windows kernel) ── Processes
              └── VM3 (Linux kernel) ─── Processes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The hypervisor virtualizes hardware. Each VM thinks it has its own CPU, memory, disk. Complete isolation — a bug in VM1&apos;s kernel can&apos;t affect VM2.&lt;/p&gt;
&lt;h3&gt;LXC Containers&lt;/h3&gt;
&lt;p&gt;All containers share the host kernel. Isolation via namespaces and cgroups.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Host Kernel (Proxmox)
     │
     ├── Container 1 (namespace) ─── Processes
     ├── Container 2 (namespace) ─── Processes
     └── Container 3 (namespace) ─── Processes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Containers are isolated userspace instances. They share the host kernel, just in different namespaces. Faster, lighter — but a kernel vulnerability affects everything.&lt;/p&gt;
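&lt;p&gt;You can see this shared-kernel model from any Linux shell: namespaces are ordinary kernel objects listed under &lt;code&gt;/proc&lt;/code&gt;, and a container is just a process whose namespace IDs differ from the host:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Print the namespace ID this process belongs to for a given namespace type
# (pid, net, uts, mnt, ...). Two processes in the same container report the
# same IDs; the host reports different ones. Linux only.
ns_id() { readlink /proc/self/ns/"$1"; }
```

&lt;p&gt;Compare &lt;code&gt;readlink /proc/1/ns/pid&lt;/code&gt; on the host with the same path inside a container to see the isolation directly.&lt;/p&gt;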
&lt;h2&gt;Performance Comparison&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;LXC Container&lt;/th&gt;
&lt;th&gt;KVM VM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Boot time&lt;/td&gt;
&lt;td&gt;1-5 seconds&lt;/td&gt;
&lt;td&gt;15-60 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory overhead&lt;/td&gt;
&lt;td&gt;~10-50MB&lt;/td&gt;
&lt;td&gt;~200-500MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU overhead&lt;/td&gt;
&lt;td&gt;Near zero&lt;/td&gt;
&lt;td&gt;2-5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk I/O&lt;/td&gt;
&lt;td&gt;Native speed&lt;/td&gt;
&lt;td&gt;Near-native (virtio)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network I/O&lt;/td&gt;
&lt;td&gt;Native speed&lt;/td&gt;
&lt;td&gt;Near-native (virtio)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Density&lt;/td&gt;
&lt;td&gt;50-100+ per host&lt;/td&gt;
&lt;td&gt;10-30 per host&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For equivalent workload, containers use significantly fewer resources.&lt;/p&gt;
&lt;h2&gt;Creating Containers&lt;/h2&gt;
&lt;h3&gt;Download Template&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# List available templates
pveam available

# Download Ubuntu template
pveam download local ubuntu-24.04-standard_24.04-2_amd64.tar.zst
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: select the node&apos;s local storage → CT Templates → Templates&lt;/p&gt;
&lt;h3&gt;Create Container&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;pct create 200 local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst \
  --hostname web-container \
  --memory 1024 \
  --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=10.0.0.200/24,gw=10.0.0.1 \
  --storage local-zfs \
  --rootfs local-zfs:8 \
  --password &quot;temporary&quot; \
  --unprivileged 1

# Start container
pct start 200

# Enter container
pct enter 200
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Via Web UI: Create CT → follow wizard.&lt;/p&gt;
&lt;h3&gt;Unprivileged vs Privileged&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Unprivileged (default, recommended):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pct create 200 ... --unprivileged 1
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Container root (UID 0) maps to unprivileged user on host (UID 100000+)&lt;/li&gt;
&lt;li&gt;Even if container is compromised, attacker can&apos;t escalate to host root&lt;/li&gt;
&lt;li&gt;Some things don&apos;t work (NFS mounts, raw disk access)&lt;/li&gt;
&lt;/ul&gt;
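&lt;p&gt;The mapping is a fixed offset — 100000 by default, taken from &lt;code&gt;/etc/subuid&lt;/code&gt; — so host-side UIDs are easy to predict:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Default PVE unprivileged mapping: container UID N appears on the host
# as 100000 + N. (100000 is the stock offset; check /etc/subuid.)
host_uid() { echo $(( 100000 + $1 )); }
```

&lt;p&gt;This is why files created by a container&apos;s &lt;code&gt;www-data&lt;/code&gt; (UID 33) show up as UID 100033 on the host filesystem.&lt;/p&gt;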
&lt;p&gt;&lt;strong&gt;Privileged:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pct create 200 ... --unprivileged 0
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Container root is host root (UID 0)&lt;/li&gt;
&lt;li&gt;Container escape = host root access&lt;/li&gt;
&lt;li&gt;Needed for: NFS, some filesystems, hardware passthrough&lt;/li&gt;
&lt;li&gt;Use only when necessary, with additional security&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Security Boundaries&lt;/h2&gt;
&lt;p&gt;This is where the choice matters most.&lt;/p&gt;
&lt;h3&gt;Container Security Reality&lt;/h3&gt;
&lt;p&gt;Containers share the kernel. This means:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Kernel vulnerability
     ↓
Affects host AND all containers
     ↓
Container escape possible
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Real examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dirty COW (CVE-2016-5195)&lt;/strong&gt;: Write to read-only memory. Container escape.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dirty Pipe (CVE-2022-0847)&lt;/strong&gt;: Overwrite files. Container escape.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Various cgroup escapes&lt;/strong&gt;: Break out of isolation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Containers are NOT a security boundary against malicious actors. They&apos;re a convenience boundary for trusted workloads.&lt;/p&gt;
&lt;h3&gt;VM Security Reality&lt;/h3&gt;
&lt;p&gt;VMs have their own kernel. Attacker must escape hypervisor:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Guest kernel vulnerability
     ↓
Only affects that VM
     ↓
Hypervisor escape required for host access
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Hypervisor escapes exist (Spectre, Meltdown, VENOM) but are rarer and usually patched quickly. VMs are a real security boundary.&lt;/p&gt;
&lt;h3&gt;When Security Matters&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use VMs when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Running untrusted code&lt;/li&gt;
&lt;li&gt;Multi-tenant (different customers)&lt;/li&gt;
&lt;li&gt;Security-critical workloads&lt;/li&gt;
&lt;li&gt;Compliance requirements (PCI-DSS, HIPAA often require VMs)&lt;/li&gt;
&lt;li&gt;Windows workloads&lt;/li&gt;
&lt;li&gt;Different OS requirements&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use containers when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single-tenant (all your own workloads)&lt;/li&gt;
&lt;li&gt;Trusted code only&lt;/li&gt;
&lt;li&gt;Resource efficiency matters more than perfect isolation&lt;/li&gt;
&lt;li&gt;Linux-only workloads&lt;/li&gt;
&lt;li&gt;Development environments&lt;/li&gt;
&lt;/ul&gt;
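&lt;p&gt;Condensed into a decision rule — a deliberate simplification to three coarse inputs (OS, trust, tenancy):&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Rule of thumb from the lists above: container only when the workload is
# Linux, trusted, and single-tenant; everything else gets a VM.
# Inputs: os (linux|windows|other), trusted (yes|no), tenancy (single|multi)
choose_runtime() {
  case "$1/$2/$3" in
    linux/yes/single) echo container ;;
    *)                echo vm ;;
  esac
}
```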
&lt;h2&gt;Practical Use Cases&lt;/h2&gt;
&lt;h3&gt;Container-Appropriate&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Pi-hole DNS:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pct create 201 local:vztmpl/debian-12-standard_12.5-1_amd64.tar.zst \
  --hostname pihole \
  --memory 512 \
  --cores 1 \
  --net0 name=eth0,bridge=vmbr0,ip=10.0.0.53/24,gw=10.0.0.1 \
  --unprivileged 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;DNS is trusted, internal, lightweight. Perfect for container.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reverse proxy (nginx/traefik):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pct create 202 ... --hostname proxy --memory 256
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Forwards traffic, minimal state. Container is ideal.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Internal monitoring (Prometheus, Grafana):&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Internal tools, trusted environment. Containers save resources.&lt;/p&gt;
&lt;h3&gt;VM-Appropriate&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Database server:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm create 300 --name db-server --memory 8192 --cores 4 ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Critical data. If something breaks, don&apos;t risk it affecting other workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Customer-facing web application:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Untrusted input from internet. VM provides real isolation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Windows anything:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm create 301 --name windows-server --ostype win11 ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Windows doesn&apos;t run in LXC. VMs only.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Kubernetes nodes:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Docker-in-LXC works but is fragile. VMs are more reliable for k8s.&lt;/p&gt;
&lt;h2&gt;Container Features&lt;/h2&gt;
&lt;h3&gt;Bind Mounts&lt;/h3&gt;
&lt;p&gt;Share host directories with container:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pct set 200 --mp0 /data/shared,mp=/mnt/shared
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Useful for config management, but widens attack surface.&lt;/p&gt;
&lt;h3&gt;Device Passthrough&lt;/h3&gt;
&lt;p&gt;For hardware access (requires privileged or specific config):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# GPU passthrough (NVIDIA character devices use major number 195)
echo &quot;lxc.cgroup2.devices.allow: c 195:* rwm&quot; &amp;gt;&amp;gt; /etc/pve/lxc/200.conf
echo &quot;lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file&quot; &amp;gt;&amp;gt; /etc/pve/lxc/200.conf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Complex and fragile. Consider VM for hardware passthrough.&lt;/p&gt;
&lt;h3&gt;Nesting (Docker in LXC)&lt;/h3&gt;
&lt;p&gt;Run Docker inside container:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pct set 200 --features nesting=1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Works for simple cases. For production Docker workloads, use VMs.&lt;/p&gt;
&lt;h2&gt;Resource Limits&lt;/h2&gt;
&lt;h3&gt;Container Limits&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Memory (hard limit)
pct set 200 --memory 2048

# CPU cores
pct set 200 --cores 2

# CPU limit (in cores; fractional values allowed)
pct set 200 --cpulimit 1.5  # At most 1.5 cores worth of CPU time

# Disk quota
pct resize 200 rootfs 20G
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VM Limits&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Memory
qm set 100 --memory 4096

# Ballooning (dynamic memory)
qm set 100 --balloon 2048  # Minimum, can grow to --memory

# CPU
qm set 100 --cores 4 --sockets 1

# CPU type
qm set 100 --cpu host  # Pass through host CPU features
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Migration&lt;/h2&gt;
&lt;h3&gt;Container Migration&lt;/h3&gt;
&lt;p&gt;Fast — only filesystem moves:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Offline (stopped)
pct migrate 200 pve2

# Online (running) - requires shared storage
pct migrate 200 pve2 --online
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VM Migration&lt;/h3&gt;
&lt;p&gt;Slower — memory state must transfer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Offline
qm migrate 100 pve2

# Live migration - requires shared storage
qm migrate 100 pve2 --online
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Backup and Restore&lt;/h2&gt;
&lt;p&gt;Both work similarly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Backup container
vzdump 200 --storage backup --mode snapshot

# Backup VM
vzdump 100 --storage backup --mode snapshot

# Restore
pct restore 200 /backup/vzdump-lxc-200-*.tar.zst
qmrestore /backup/vzdump-qemu-100-*.vma.zst 100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Containers back up faster (smaller, no memory state).&lt;/p&gt;
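&lt;p&gt;Backup archives follow the vzdump naming scheme (&lt;code&gt;vzdump-TYPE-VMID-TIMESTAMP.ext&lt;/code&gt;), which makes the newest archive per guest easy to find programmatically. A small sketch, assuming that standard naming:&lt;/p&gt;

```python
# Sketch: pick the newest vzdump archive per guest from a list of filenames.
# Assumes standard vzdump naming: vzdump-TYPE-VMID-YYYY_MM_DD-HH_MM_SS.ext
import re

PATTERN = re.compile(r"vzdump-(qemu|lxc)-(\d+)-(\d{4}_\d{2}_\d{2}-\d{2}_\d{2}_\d{2})")

def latest_backups(filenames):
    newest = {}
    for name in filenames:
        m = PATTERN.search(name)
        if not m:
            continue
        vmid, stamp = int(m.group(2)), m.group(3)
        # Timestamps sort lexicographically (YYYY_MM_DD-HH_MM_SS)
        if vmid not in newest or stamp > newest[vmid][1]:
            newest[vmid] = (name, stamp)
    return {vmid: name for vmid, (name, _) in newest.items()}
```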
&lt;h2&gt;Hybrid Approach&lt;/h2&gt;
&lt;p&gt;In practice, use both:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Production Layout:
├── LXC Containers (internal services)
│   ├── 200: pihole (DNS)
│   ├── 201: nginx (reverse proxy)
│   ├── 202: prometheus (monitoring)
│   └── 203: grafana (dashboards)
│
└── VMs (security-sensitive)
    ├── 100: web-app (internet-facing)
    ├── 101: database (critical data)
    ├── 102: backup-server (recovery)
    └── 103: windows-dc (Active Directory)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Containers for internal, trusted, lightweight workloads.
VMs for anything external, critical, or Windows.&lt;/p&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;Container Won&apos;t Start&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check logs
pct start 200 --debug

# Common issues:
# - AppArmor blocking: check /var/log/kern.log
# - Disk full: check storage
# - Network collision: check IP conflicts
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Container Networking Issues&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Enter container
pct enter 200

# Check interface
ip a

# Check gateway
ip route

# From host, check the bridge (brctl is deprecated)
bridge link show master vmbr0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Unprivileged Container Limitations&lt;/h3&gt;
&lt;p&gt;If something fails in an unprivileged container:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Try enabling features
pct set 200 --features nesting=1,keyctl=1

# If still fails, might need privileged
# Consider: is this really appropriate for a container?
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Containers buy you speed, but not always isolation.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The temptation is to use containers everywhere — they&apos;re faster, lighter, easier. But containers share a kernel. That kernel is your security boundary. A container escape becomes a host compromise.&lt;/p&gt;
&lt;p&gt;When to contain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Trusted workloads&lt;/li&gt;
&lt;li&gt;Single-owner environment&lt;/li&gt;
&lt;li&gt;Resource efficiency priority&lt;/li&gt;
&lt;li&gt;Linux services&lt;/li&gt;
&lt;li&gt;Internal tools&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When to virtualize:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Untrusted inputs&lt;/li&gt;
&lt;li&gt;Multi-tenant&lt;/li&gt;
&lt;li&gt;Security-critical&lt;/li&gt;
&lt;li&gt;Windows&lt;/li&gt;
&lt;li&gt;Compliance requirements&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The hybrid approach works best: containers for the lightweight stuff, VMs for what matters. Don&apos;t let container efficiency seduce you into container-izing everything. Sometimes the VM overhead is the security boundary you need.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>security</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Templates &amp; Cloud-Init: Faster VMs Without Chaos</title><link>https://ashimov.com/posts/proxmox-templates/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-templates/</guid><description>Creating and using VM templates with cloud-init on Proxmox. Covers template creation workflow, cloud-init configuration, customization, and why a template is a contract that must stay stable.</description><pubDate>Fri, 15 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Installing an OS from ISO takes 15-30 minutes. Do that for every VM and you&apos;ll spend more time waiting for installers than doing actual work. Templates solve this: install once, clone many times. A new VM in seconds instead of minutes.&lt;/p&gt;
&lt;p&gt;But templates have a hidden cost. When the template changes, everything cloned from it is different. When the template is inconsistent, every VM is a surprise. The template is a contract — if it floats, everything downstream breaks.&lt;/p&gt;
&lt;p&gt;This is how to build templates that work reliably.&lt;/p&gt;
&lt;h2&gt;The Basic Workflow&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Install OS from ISO (once)&lt;/li&gt;
&lt;li&gt;Configure base system (packages, settings)&lt;/li&gt;
&lt;li&gt;Add cloud-init for per-VM customization&lt;/li&gt;
&lt;li&gt;Convert to template&lt;/li&gt;
&lt;li&gt;Clone template for new VMs&lt;/li&gt;
&lt;li&gt;Cloud-init configures hostname, network, SSH keys&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   ISO       │───▶│  Base VM    │───▶│  Template   │
│   Install   │    │  Configure  │    │  (frozen)   │
└─────────────┘    └─────────────┘    └──────┬──────┘
                                             │
                   ┌─────────────────────────┼─────────────────────────┐
                   ▼                         ▼                         ▼
            ┌─────────────┐           ┌─────────────┐           ┌─────────────┐
            │   Clone 1   │           │   Clone 2   │           │   Clone 3   │
            │  web-server │           │  db-server  │           │  app-server │
            └─────────────┘           └─────────────┘           └─────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Creating a Base VM&lt;/h2&gt;
&lt;p&gt;Start with minimal install:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Download cloud image (Ubuntu example)
wget https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img

# Create VM
qm create 9000 --name &quot;ubuntu-24.04-template&quot; --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0

# Import cloud image as disk
qm importdisk 9000 noble-server-cloudimg-amd64.img local-zfs

# Attach disk
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-zfs:vm-9000-disk-0

# Add cloud-init drive
qm set 9000 --ide2 local-zfs:cloudinit

# Set boot order
qm set 9000 --boot c --bootdisk scsi0

# Enable serial console (for cloud-init)
qm set 9000 --serial0 socket --vga serial0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create VM with name like &lt;code&gt;ubuntu-2404-template&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;OS: Do not use any media (we&apos;ll import cloud image)&lt;/li&gt;
&lt;li&gt;System: SCSI Controller = VirtIO SCSI&lt;/li&gt;
&lt;li&gt;Disks: Delete default disk&lt;/li&gt;
&lt;li&gt;After creation: Hardware → Add → CloudInit Drive&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Using ISO Instead of Cloud Image&lt;/h3&gt;
&lt;p&gt;If you prefer traditional install:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create VM with ISO
qm create 9001 --name &quot;debian-12-template&quot; --memory 2048 --cores 2 --cdrom local:iso/debian-12.iso --net0 virtio,bridge=vmbr0

# Add disk
qm set 9001 --scsihw virtio-scsi-pci --scsi0 local-zfs:32

# Start and install via console
qm start 9001
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install OS, then prepare for templating (see next section).&lt;/p&gt;
&lt;h2&gt;Preparing for Template&lt;/h2&gt;
&lt;p&gt;Before converting to template, clean up the VM:&lt;/p&gt;
&lt;h3&gt;On Debian/Ubuntu&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Update everything
apt update &amp;amp;&amp;amp; apt full-upgrade -y

# Install cloud-init and QEMU guest agent
apt install -y cloud-init qemu-guest-agent

# Enable guest agent
systemctl enable qemu-guest-agent

# Clean package cache
apt clean
apt autoremove -y

# Remove machine-specific data
truncate -s 0 /etc/machine-id
rm -f /var/lib/dbus/machine-id

# Remove SSH host keys (regenerate on first boot)
rm -f /etc/ssh/ssh_host_*

# Remove cloud-init state (so it runs fresh on clone)
cloud-init clean

# Clear logs
journalctl --rotate
journalctl --vacuum-time=1s
rm -rf /var/log/*.log
rm -rf /var/log/*.gz

# Clear bash history
history -c
rm -f /root/.bash_history
rm -f /home/*/.bash_history

# Shutdown
shutdown -h now
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;On RHEL/AlmaLinux/Rocky&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Update
dnf update -y

# Install cloud-init and guest agent
dnf install -y cloud-init qemu-guest-agent

# Enable services
systemctl enable qemu-guest-agent
systemctl enable cloud-init

# Clean
dnf clean all
rm -rf /var/cache/dnf/*

# Same cleanup as Debian
rm -f /etc/machine-id
rm -f /etc/ssh/ssh_host_*
cloud-init clean
# ... etc
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Cloud-Init Configuration&lt;/h2&gt;
&lt;p&gt;Cloud-init reads metadata at boot and configures the VM. Proxmox provides this via a special drive.&lt;/p&gt;
&lt;h3&gt;Proxmox Cloud-Init Settings&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Set cloud-init options
qm set 9000 --ciuser admin
qm set 9000 --cipassword &apos;temporary-password&apos;
qm set 9000 --sshkeys ~/.ssh/id_ed25519.pub
qm set 9000 --ipconfig0 ip=dhcp
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: VM → Cloud-Init tab:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;User: admin&lt;/li&gt;
&lt;li&gt;Password: (set or leave empty for SSH-only)&lt;/li&gt;
&lt;li&gt;SSH public key: paste your key&lt;/li&gt;
&lt;li&gt;IP Config: DHCP or static&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Custom Cloud-Init&lt;/h3&gt;
&lt;p&gt;For advanced configuration, use snippets:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create snippets storage if needed
pvesm add dir snippets --path /var/lib/vz/snippets --content snippets

# Create custom cloud-init config
cat &amp;gt; /var/lib/vz/snippets/custom-user-data.yaml &amp;lt;&amp;lt; &apos;EOF&apos;
#cloud-config
package_update: true
package_upgrade: true
packages:
  - vim
  - htop
  - curl
  - git

users:
  - name: admin
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... your-key

write_files:
  - path: /etc/motd
    content: |
      Welcome to the VM
      Provisioned by cloud-init

runcmd:
  - systemctl enable --now qemu-guest-agent
  - timedatectl set-timezone UTC
EOF
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply to VM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm set 9000 --cicustom &quot;user=snippets:snippets/custom-user-data.yaml&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Converting to Template&lt;/h2&gt;
&lt;p&gt;Once the VM is prepared:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Convert to template
qm template 9000
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: Right-click VM → Convert to template&lt;/p&gt;
&lt;p&gt;The VM icon changes to indicate it&apos;s a template. Templates cannot be started — only cloned.&lt;/p&gt;
&lt;h2&gt;Cloning VMs&lt;/h2&gt;
&lt;h3&gt;Full Clone&lt;/h3&gt;
&lt;p&gt;Creates independent copy. Disk is duplicated.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm clone 9000 100 --name &quot;web-server&quot; --full
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Linked Clone&lt;/h3&gt;
&lt;p&gt;Shares base disk with template. Uses less space but depends on template.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qm clone 9000 101 --name &quot;test-server&quot; --full 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: If you delete the template, linked clones break. Use full clones for production.&lt;/p&gt;
&lt;h3&gt;Post-Clone Configuration&lt;/h3&gt;
&lt;p&gt;After cloning, customize via cloud-init:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Set hostname (cloud-init will apply on boot)
qm set 100 --name &quot;web-server&quot;

# Set static IP
qm set 100 --ipconfig0 ip=10.0.0.100/24,gw=10.0.0.1

# Start VM
qm start 100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cloud-init runs on first boot, setting hostname, network, and SSH keys.&lt;/p&gt;
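&lt;p&gt;The clone-and-configure steps above are easy to script for a batch of VMs. A dry-run sketch that only builds the command lines, without running them (the template ID, naming scheme, and subnet are illustrative assumptions):&lt;/p&gt;

```python
# Sketch: build the qm command lines for cloning N web servers from a template.
# Dry-run only -- prints commands instead of executing them.

def clone_commands(template_id, count, base_vmid=100, subnet="10.0.0", gw="10.0.0.1"):
    cmds = []
    for i in range(count):
        vmid = base_vmid + i
        host_ip = f"{subnet}.{vmid}"  # VMID doubles as the host octet here
        cmds.append(f"qm clone {template_id} {vmid} --name web-{i + 1} --full")
        cmds.append(f"qm set {vmid} --ipconfig0 ip={host_ip}/24,gw={gw}")
        cmds.append(f"qm start {vmid}")
    return cmds

for cmd in clone_commands(9000, 3):
    print(cmd)
```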
&lt;h2&gt;Template Versioning&lt;/h2&gt;
&lt;p&gt;Templates evolve. Kernel updates, package changes, security patches. Track versions:&lt;/p&gt;
&lt;h3&gt;Naming Convention&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;ubuntu-2404-v1       # Initial release
ubuntu-2404-v2       # Security update
ubuntu-2404-v3       # Added monitoring agent
&lt;/code&gt;&lt;/pre&gt;
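&lt;p&gt;With a consistent &lt;code&gt;-vN&lt;/code&gt; suffix, tooling can pick the newest version automatically instead of relying on memory. A sketch assuming that suffix convention:&lt;/p&gt;

```python
# Sketch: pick the newest template version from names like "ubuntu-2404-v3".
# Assumes the "-vN" naming convention shown above.
import re

def newest_template(names):
    versioned = []
    for name in names:
        m = re.fullmatch(r"(.+)-v(\d+)", name)
        if m:
            versioned.append((int(m.group(2)), name))
    return max(versioned)[1] if versioned else None

print(newest_template(["ubuntu-2404-v1", "ubuntu-2404-v3", "ubuntu-2404-v2"]))
# ubuntu-2404-v3
```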
&lt;h3&gt;Version in Description&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;qm set 9000 --description &quot;Ubuntu 24.04 LTS
Version: 3
Date: 2025-01-08
Changes:
- Added node_exporter
- Updated to kernel 6.8
- Fixed cloud-init network&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Golden Image Process&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;1. Monthly: Create a new template from a fresh image
2. Per update: Clone the template, patch the clone, convert the clone to the next version
   (a template cannot be converted back into a regular VM)
3. Document: What changed, why, who approved

Template lifecycle:
  new-template → testing → production → deprecated → deleted
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Multiple Templates&lt;/h2&gt;
&lt;p&gt;Different workloads need different templates:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;9000: ubuntu-2404-minimal     # Base, SSH only
9001: ubuntu-2404-web         # + nginx, certbot
9002: ubuntu-2404-docker      # + docker, compose
9003: debian-12-minimal       # Different OS
9004: almalinux-9-minimal     # RHEL-compatible
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Build specialized templates from minimal:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Clone minimal as base for web template
qm clone 9000 9100 --name &quot;ubuntu-2404-web-prep&quot; --full
qm start 9100

# SSH in, install web packages
ssh admin@&amp;lt;ip&amp;gt;
sudo apt install -y nginx certbot python3-certbot-nginx
# ... configure ...
sudo cloud-init clean
sudo shutdown -h now

# Convert to template
qm template 9100
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Troubleshooting Cloud-Init&lt;/h2&gt;
&lt;h3&gt;Cloud-Init Not Running&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check if cloud-init ran
cloud-init status

# View logs
cat /var/log/cloud-init.log
cat /var/log/cloud-init-output.log

# Re-run cloud-init from a clean state
cloud-init clean
reboot  # all stages run on next boot; &apos;cloud-init init&apos; alone re-runs only the init stage
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Network Not Configured&lt;/h3&gt;
&lt;p&gt;Check Proxmox cloud-init settings:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# View current cloud-init config
qm cloudinit dump 100 user
qm cloudinit dump 100 network
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Inside VM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check cloud-init network config
cat /etc/netplan/*.yaml  # Ubuntu
cat /etc/sysconfig/network-scripts/*  # RHEL
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;SSH Key Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Verify key was injected
cat /home/admin/.ssh/authorized_keys

# Check cloud-init log for errors
grep -i ssh /var/log/cloud-init.log
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Hostname Not Set&lt;/h3&gt;
&lt;p&gt;Cloud-init sets hostname early. If it&apos;s still &quot;localhost&quot;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check cloud-init status
cloud-init status --long

# Force hostname update
hostnamectl set-hostname web-server
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Automation with Templates&lt;/h2&gt;
&lt;p&gt;Combine templates with automation:&lt;/p&gt;
&lt;h3&gt;Terraform&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;resource &quot;proxmox_vm_qemu&quot; &quot;web_servers&quot; {
  count       = 3
  name        = &quot;web-${count.index + 1}&quot;
  target_node = &quot;pve1&quot;
  clone       = &quot;ubuntu-2404-minimal&quot;
  full_clone  = true

  cores   = 2
  memory  = 4096

  network {
    model  = &quot;virtio&quot;
    bridge = &quot;vmbr0&quot;
    tag    = 20
  }

  ipconfig0 = &quot;ip=10.20.0.${count.index + 10}/24,gw=10.20.0.1&quot;

  ciuser  = &quot;admin&quot;
  sshkeys = file(&quot;~/.ssh/id_ed25519.pub&quot;)
}
&lt;/code&gt;&lt;/pre&gt;
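&lt;p&gt;The &lt;code&gt;count.index&lt;/code&gt; interpolation above hands out sequential host addresses. A quick sketch that reproduces the same addressing with the standard library and sanity-checks each address against the subnet (values mirror the example):&lt;/p&gt;

```python
# Sketch: reproduce the Terraform count-based addressing and verify that every
# generated address falls inside the intended subnet.
import ipaddress

subnet = ipaddress.ip_network("10.20.0.0/24")

def ipconfigs(count, offset=10):
    out = []
    for i in range(count):
        ip = ipaddress.ip_address(f"10.20.0.{i + offset}")
        assert ip in subnet, f"{ip} is outside {subnet}"
        out.append(f"ip={ip}/24,gw=10.20.0.1")
    return out

print(ipconfigs(3))
```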
&lt;h3&gt;Ansible&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;- name: Create VM from template
  community.general.proxmox_kvm:
    api_host: pve1.lab.local
    api_user: admin@pam
    api_token_id: ansible
    api_token_secret: &quot;{{ vault_proxmox_token }}&quot;
    node: pve1
    name: &quot;web-server&quot;
    clone: &quot;ubuntu-2404-minimal&quot;
    full: yes
    ciuser: admin
    sshkeys: &quot;{{ lookup(&apos;file&apos;, &apos;~/.ssh/id_ed25519.pub&apos;) }}&quot;
    ipconfig:
      ipconfig0: &quot;ip=10.0.0.100/24,gw=10.0.0.1&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A template is a contract. If it floats, everything downstream breaks.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The template defines what every cloned VM starts with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Installed packages&lt;/li&gt;
&lt;li&gt;Security configuration&lt;/li&gt;
&lt;li&gt;User accounts&lt;/li&gt;
&lt;li&gt;Base services&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When you change the template, new VMs get the change. Existing VMs don&apos;t — they&apos;re already deployed. This creates drift.&lt;/p&gt;
&lt;p&gt;Treat templates like production artifacts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Version them (ubuntu-2404-v3, not just ubuntu-2404)&lt;/li&gt;
&lt;li&gt;Document changes (what, when, why)&lt;/li&gt;
&lt;li&gt;Test before promoting (clone, verify, then use for production)&lt;/li&gt;
&lt;li&gt;Retire old versions (don&apos;t let 5 versions of &quot;ubuntu template&quot; accumulate)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A stable template means predictable deployments. An unstable template means debugging why &quot;this VM is different&quot; every time something breaks.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>automation</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Networking Baseline: Bridges, VLANs, Bonding — and the Mistakes I Made</title><link>https://ashimov.com/posts/proxmox-networking/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-networking/</guid><description>Proxmox networking fundamentals and common pitfalls. Covers Linux bridges, VLAN configuration, bonding modes, network isolation, and why 99% of virtualization network problems are inconsistent Layer 2.</description><pubDate>Tue, 12 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Networking in Proxmox breaks more setups than storage and compute combined. Not because it&apos;s complicated — it&apos;s actually simpler than most people expect. It breaks because virtualization networking requires consistency at Layer 2, and inconsistency is invisible until nothing works.&lt;/p&gt;
&lt;p&gt;I&apos;ve debugged countless &quot;my VM has no network&quot; issues. 99% were: wrong VLAN tag, wrong bridge, or physical switch misconfiguration. This is how to get networking right from the start.&lt;/p&gt;
&lt;h2&gt;The Default Setup&lt;/h2&gt;
&lt;p&gt;After a clean Proxmox install, you have:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Physical NIC (eno1/eth0)
    └── vmbr0 (Linux bridge)
            ├── Proxmox host (management IP)
            └── VMs connect here
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is correct and works. Don&apos;t overcomplicate it until you need to.&lt;/p&gt;
&lt;p&gt;Check your current config:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat /etc/network/interfaces
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Default looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;eno1&lt;/code&gt; has no IP (manual) — it&apos;s just a bridge port&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vmbr0&lt;/code&gt; has the IP — this is your management address&lt;/li&gt;
&lt;li&gt;VMs attach to &lt;code&gt;vmbr0&lt;/code&gt; and share the same network&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Linux Bridges Explained&lt;/h2&gt;
&lt;p&gt;A bridge is a virtual switch. Physical NICs and virtual NICs connect to it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                    ┌─────────────────────────────┐
   Physical Network │        vmbr0 (bridge)       │ Virtual Network
   ─────────────────│                             │──────────────────
        eno1 ───────│ port                   port │─── VM1 (tap100i0)
                    │                        port │─── VM2 (tap101i0)
                    │                        port │─── Proxmox host
                    └─────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All devices on the bridge see each other at Layer 2. Same broadcast domain, same VLAN (unless you add tagging).&lt;/p&gt;
&lt;h3&gt;Creating Additional Bridges&lt;/h3&gt;
&lt;p&gt;For network isolation, create multiple bridges:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Edit network config
nano /etc/network/interfaces
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Add:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Management network (existing)
auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0

# DMZ network (new bridge, second NIC)
auto vmbr1
iface vmbr1 inet manual
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0

# Internal-only network (no physical port)
auto vmbr2
iface vmbr2 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ifreload -a
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;vmbr0&lt;/code&gt;: Management + production VMs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vmbr1&lt;/code&gt;: DMZ VMs (different physical NIC, isolated)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vmbr2&lt;/code&gt;: Internal-only (VMs can talk to each other, no outside access)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;VLANs&lt;/h2&gt;
&lt;p&gt;VLANs separate traffic on the same physical network. Essential when you have one physical NIC but need multiple isolated networks.&lt;/p&gt;
&lt;h3&gt;VLAN-Aware Bridge (Recommended)&lt;/h3&gt;
&lt;p&gt;Modern approach. One bridge handles multiple VLANs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
    bridge-pvid 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Key settings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;bridge-vlan-aware yes&lt;/code&gt;: Enable VLAN tagging&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bridge-vids 2-4094&lt;/code&gt;: Allow these VLANs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bridge-pvid 1&lt;/code&gt;: Native VLAN (untagged traffic)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VMs specify their VLAN when connecting:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# In VM config (/etc/pve/qemu-server/100.conf)
net0: virtio,bridge=vmbr0,tag=100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: VM → Hardware → Network → VLAN Tag: 100&lt;/p&gt;
&lt;h3&gt;Traditional VLAN Interfaces (Older Method)&lt;/h3&gt;
&lt;p&gt;Create a sub-interface for each VLAN:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auto eno1.100
iface eno1.100 inet manual

auto vmbr100
iface vmbr100 inet manual
    bridge-ports eno1.100
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This works but creates more interfaces. VLAN-aware bridges are cleaner.&lt;/p&gt;
&lt;h3&gt;Common VLAN Mistakes&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;1. Physical switch not configured for VLANs&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Your Proxmox config is perfect, but the switch port is access-mode VLAN 1. Nothing works.&lt;/p&gt;
&lt;p&gt;Fix: Configure switch port as trunk allowing your VLANs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. VLAN tag mismatch&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;VM is tagged VLAN 100, but there&apos;s no VLAN 100 on the switch.&lt;/p&gt;
&lt;p&gt;Fix: Verify VLANs exist end-to-end: switch, router, Proxmox.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Native VLAN confusion&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Management traffic is untagged (PVID), VM traffic is tagged. If PVID doesn&apos;t match switch native VLAN, management breaks.&lt;/p&gt;
&lt;p&gt;Fix: Be explicit about native VLAN on both sides.&lt;/p&gt;
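&lt;p&gt;All three mistakes are mechanical to catch once you list what each layer allows. A sketch of such a check (the inputs are placeholders you would fill in from your switch, bridge, and VM configs):&lt;/p&gt;

```python
# Sketch: flag VM VLAN tags that are not allowed somewhere along the path
# (switch trunk or bridge vids). Inputs are illustrative placeholders.

def vlan_mismatches(switch_trunk_vlans, bridge_vids, vm_tags):
    problems = {}
    for vm, tag in vm_tags.items():
        missing = []
        if tag not in switch_trunk_vlans:
            missing.append("switch trunk")
        if tag not in bridge_vids:
            missing.append("bridge vids")
        if missing:
            problems[vm] = (tag, missing)
    return problems

issues = vlan_mismatches(
    switch_trunk_vlans={10, 20, 30},
    bridge_vids=set(range(2, 4095)),   # bridge-vids 2-4094
    vm_tags={"web": 20, "dmz": 30, "storage": 100},
)
print(issues)  # the storage VLAN (100) is not trunked on the switch
```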
&lt;h2&gt;Bonding (Link Aggregation)&lt;/h2&gt;
&lt;p&gt;Multiple NICs acting as one for redundancy or throughput.&lt;/p&gt;
&lt;h3&gt;Bonding Modes&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;balance-rr&lt;/td&gt;
&lt;td&gt;Round-robin, requires switch support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;active-backup&lt;/td&gt;
&lt;td&gt;Failover, no switch config needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;balance-xor&lt;/td&gt;
&lt;td&gt;XOR hash, requires switch support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;broadcast&lt;/td&gt;
&lt;td&gt;Send on all, niche uses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;802.3ad (LACP)&lt;/td&gt;
&lt;td&gt;Dynamic aggregation, requires switch LACP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;balance-tlb&lt;/td&gt;
&lt;td&gt;Adaptive transmit, no switch config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;balance-alb&lt;/td&gt;
&lt;td&gt;Adaptive load balancing, no switch config&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Recommended:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;mode 1 (active-backup)&lt;/strong&gt;: Simplest, works everywhere, true redundancy&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;mode 4 (LACP)&lt;/strong&gt;: Best throughput, but requires switch configuration&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Active-Backup Bond (Easy)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode active-backup
    bond-primary eno1

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Behavior: eno1 is active, eno2 is standby. If eno1 fails, eno2 takes over within roughly 100ms (the &lt;code&gt;bond-miimon&lt;/code&gt; polling interval).&lt;/p&gt;
&lt;h3&gt;LACP Bond (Best Performance)&lt;/h3&gt;
&lt;p&gt;Requires switch LACP configuration first:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-miimon 100
    bond-mode 802.3ad
    bond-lacp-rate fast
    bond-xmit-hash-policy layer3+4

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check bond status:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;cat /proc/net/bonding/bond0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Bonding Gotchas&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Single flow doesn&apos;t benefit&lt;/strong&gt;: A single TCP connection uses one link. Bonding helps aggregate throughput, not single-connection speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Switch must match&lt;/strong&gt;: LACP requires switch-side configuration. Mismatched settings = no connectivity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MII monitoring&lt;/strong&gt;: 100ms (&lt;code&gt;bond-miimon 100&lt;/code&gt;) is standard. Lower = faster failover but more CPU.&lt;/p&gt;
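&lt;p&gt;The single-flow limitation follows directly from how transmit hashing works: the flow&apos;s addresses and ports are hashed to select one slave, so every packet of that flow uses the same link. A toy illustration (the real kernel hash differs):&lt;/p&gt;

```python
# Toy sketch of the layer3+4 transmit-hash idea: a flow tuple hashes to one
# slave index, so a single flow is capped at one link's speed.

def pick_link(src_ip, dst_ip, src_port, dst_port, n_links=2):
    flow = (src_ip, dst_ip, src_port, dst_port)
    return hash(flow) % n_links

# Same flow, same link -- every time within a run.
link = pick_link("10.0.0.5", "10.0.0.9", 40000, 443)
assert all(pick_link("10.0.0.5", "10.0.0.9", 40000, 443) == link for _ in range(100))
```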
&lt;h2&gt;Network Isolation&lt;/h2&gt;
&lt;p&gt;Keeping networks separate is as important as connecting them.&lt;/p&gt;
&lt;h3&gt;Isolated Internal Network&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;auto vmbr2
iface vmbr2 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;VMs on vmbr2 can only talk to each other. No physical network, no internet.&lt;/p&gt;
&lt;p&gt;Use cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Database servers that only backends should reach&lt;/li&gt;
&lt;li&gt;Development environments&lt;/li&gt;
&lt;li&gt;Testing isolated from production&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;VMs Accessing Multiple Networks&lt;/h3&gt;
&lt;p&gt;A VM can have multiple NICs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# VM config
net0: virtio,bridge=vmbr0,tag=10      # Production VLAN
net1: virtio,bridge=vmbr2             # Internal only
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Inside VM, configure each interface appropriately.&lt;/p&gt;
&lt;h2&gt;Proxmox Host Networking&lt;/h2&gt;
&lt;p&gt;The host itself needs network access for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Management (web UI, SSH)&lt;/li&gt;
&lt;li&gt;Corosync (clustering)&lt;/li&gt;
&lt;li&gt;Storage (NFS, Ceph, iSCSI)&lt;/li&gt;
&lt;li&gt;Updates&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Management Network&lt;/h3&gt;
&lt;p&gt;Always use static IP for management:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auto vmbr0
iface vmbr0 inet static
    address 10.0.0.10/24
    gateway 10.0.0.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Never DHCP for a hypervisor. You need to know where it is.&lt;/p&gt;
&lt;h3&gt;Separate Storage Network (If Needed)&lt;/h3&gt;
&lt;p&gt;For high-performance storage (Ceph, iSCSI), dedicate a network:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auto vmbr1
iface vmbr1 inet static
    address 10.10.0.10/24
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Configure storage to use this network, keeping management traffic separate.&lt;/p&gt;
&lt;h2&gt;Troubleshooting&lt;/h2&gt;
&lt;h3&gt;VM Has No Network&lt;/h3&gt;
&lt;p&gt;Check in order:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1. Is the bridge up?
ip link show vmbr0

# 2. Is the physical port up?
ip link show eno1

# 3. Is the VM&apos;s tap interface in the bridge?
bridge link show

# 4. Inside VM, is the interface up?
# (via console, not SSH since network is broken)
ip a

# 5. Can VM ping gateway?
ping 10.0.0.1

# 6. Check for VLAN issues
tcpdump -i vmbr0 -n icmp
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Traffic Not Reaching VM&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check bridge forwarding
sysctl net.bridge.bridge-nf-call-iptables
# Should be 0, or iptables might interfere

# Check VM&apos;s interface is in bridge
bridge link show master vmbr0
# Should list tap100i0, tap101i0, etc.

# Capture on bridge
tcpdump -i vmbr0 host 10.0.0.100 -n
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;VLAN Traffic Not Working&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Check VLAN-aware bridge
bridge vlan show

# Should show VLANs per port:
# vmbr0    1 PVID Egress Untagged
# eno1     1 PVID Egress Untagged
#          100
#          200
# tap100i0 100 PVID Egress Untagged

# If VMs VLAN not listed, check VM config
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;My Network Layout&lt;/h2&gt;
&lt;h3&gt;Simple Homelab&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;ISP Router (10.0.0.1)
     │
     └── eno1
          │
     ┌────┴────┐
     │  vmbr0  │ 10.0.0.10 (Proxmox)
     │ (bridge)│
     ├─────────┤
     │ VM1     │ 10.0.0.101 (untagged)
     │ VM2     │ 10.0.0.102 (untagged)
     └─────────┘
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Flat network. Simple. Everything on same subnet.&lt;/p&gt;
&lt;h3&gt;Production with VLANs&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Core Switch (trunk port)
     │ VLANs: 10, 20, 30, 100
     │
     └── eno1
          │
     ┌────┴────────────────────┐
     │        vmbr0            │ VLAN-aware
     │ (PVID 10 = management)  │
     ├─────────────────────────┤
     │ tag=10: Management VMs  │
     │ tag=20: Production VMs  │
     │ tag=30: DMZ             │
     │ tag=100: Storage        │
     └─────────────────────────┘

Proxmox management: 10.10.0.10/24 (VLAN 10, untagged on bridge)
Production VMs:     10.20.0.0/24 (VLAN 20)
DMZ VMs:           10.30.0.0/24 (VLAN 30)
Storage network:    10.100.0.0/24 (VLAN 100)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One physical NIC, multiple isolated networks.&lt;/p&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;99% of virtualization network problems are inconsistent Layer 2.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The config looks right. Proxmox is configured. VMs have IPs. But nothing works. Why?&lt;/p&gt;
&lt;p&gt;Because somewhere in the chain — switch port, VLAN configuration, bridge settings, VM tag — something doesn&apos;t match.&lt;/p&gt;
&lt;p&gt;Virtualization networking requires consistency:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Switch port must trunk the right VLANs&lt;/li&gt;
&lt;li&gt;Bridge must be VLAN-aware if you&apos;re tagging&lt;/li&gt;
&lt;li&gt;VM must use the correct tag&lt;/li&gt;
&lt;li&gt;Physical network must route between VLANs (if needed)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When it breaks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start at the physical layer (is the cable plugged in?)&lt;/li&gt;
&lt;li&gt;Check switch configuration (is VLAN allowed?)&lt;/li&gt;
&lt;li&gt;Check bridge configuration (is VLAN-aware enabled?)&lt;/li&gt;
&lt;li&gt;Check VM configuration (is tag correct?)&lt;/li&gt;
&lt;li&gt;Check inside VM (is interface up?)&lt;/li&gt;
&lt;/ol&gt;
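&lt;p&gt;Steps 2 through 4 all reduce to the same question: is this VM&apos;s tag in the trunk&apos;s allow-list? The comparison is trivial to script; a sketch with a made-up helper name:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical helper: first argument is the VM tag,
# the rest is the list of VLANs allowed on the trunk
check_vlan() {
  tag=$1; shift
  for v in $*; do
    if [ $v = $tag ]; then
      echo OK: VLAN $tag is trunked
      return 0
    fi
  done
  echo MISMATCH: VLAN $tag is not on the trunk
  return 1
}

check_vlan 20 10 20 30 100   # prints: OK: VLAN 20 is trunked
check_vlan 40 10 20 30 100   # prints: MISMATCH: VLAN 40 is not on the trunk
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On a real host the inputs come from &lt;code&gt;bridge vlan show&lt;/code&gt; on the Proxmox side and the trunk configuration on the switch side.&lt;/p&gt;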
&lt;p&gt;The fix is almost always a mismatch somewhere. Find it, fix it, document it so you don&apos;t repeat it.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>networking</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Storage 101: Local, ZFS, LVM-thin — What I Actually Use and Why</title><link>https://ashimov.com/posts/proxmox-storage/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-storage/</guid><description>Practical guide to Proxmox storage options. Covers local directory, LVM-thin, ZFS pools, when to use each, snapshot limitations, and why fast storage is often fragile storage.</description><pubDate>Fri, 08 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Storage decisions in Proxmox affect everything downstream. Choose wrong and you&apos;re either rebuilding later or living with limitations. Choose right and you forget storage exists — it just works.&lt;/p&gt;
&lt;p&gt;The problem is that &quot;right&quot; depends on your use case. ZFS is amazing until your 8GB RAM server starts swapping. LVM-thin is fast until you need to migrate VMs. Directory storage is simple until you want snapshots.&lt;/p&gt;
&lt;p&gt;This is what I actually use and why, after trying all of them.&lt;/p&gt;
&lt;h2&gt;Storage Types in Proxmox&lt;/h2&gt;
&lt;p&gt;Proxmox supports multiple storage backends. Each has trade-offs:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Snapshots&lt;/th&gt;
&lt;th&gt;Live Backup&lt;/th&gt;
&lt;th&gt;Thin Provisioning&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Directory&lt;/td&gt;
&lt;td&gt;No*&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Simple, on any filesystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LVM&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Block storage, no snapshots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LVM-thin&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Block storage with thin volumes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZFS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Best features, needs RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ceph&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Distributed, complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NFS/CIFS&lt;/td&gt;
&lt;td&gt;No*&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Depends&lt;/td&gt;
&lt;td&gt;Network storage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;*Can use qcow2 format for snapshots, but slower&lt;/p&gt;
&lt;h2&gt;What Gets Stored Where&lt;/h2&gt;
&lt;p&gt;Before diving into backends, understand what you&apos;re storing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;VM Disks&lt;/strong&gt;: The actual virtual hard drives. Performance critical.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ISO Images&lt;/strong&gt;: Installation media. Read-once, performance doesn&apos;t matter.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Container Templates&lt;/strong&gt;: LXC images. Small, read occasionally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backups&lt;/strong&gt;: Compressed VM snapshots. Large, written sequentially.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snippets&lt;/strong&gt;: Cloud-init configs, hook scripts. Tiny files.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Not everything needs fast storage. Putting ISOs on your NVMe ZFS pool wastes space.&lt;/p&gt;
&lt;h2&gt;Directory Storage&lt;/h2&gt;
&lt;p&gt;The simplest option. Just a folder on a filesystem.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Default directories after install
/var/lib/vz/template/iso      # ISO images
/var/lib/vz/template/cache    # Container templates
/var/lib/vz/dump              # Backups
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use Directory Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;ISO images (read once during install)&lt;/li&gt;
&lt;li&gt;Container templates (read once during creation)&lt;/li&gt;
&lt;li&gt;Backups (sequential writes, then archive)&lt;/li&gt;
&lt;li&gt;Small deployments where simplicity matters&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Directory Limitations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;No atomic snapshots (unless using qcow2 format, which is slower)&lt;/li&gt;
&lt;li&gt;No thin provisioning — disk images use actual space&lt;/li&gt;
&lt;li&gt;Performance depends entirely on underlying filesystem&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Adding Directory Storage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create directory
mkdir -p /mnt/storage/proxmox

# Add to Proxmox
pvesm add dir backup-storage --path /mnt/storage/proxmox --content backup,iso,vztmpl
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: Datacenter → Storage → Add → Directory&lt;/p&gt;
&lt;h2&gt;LVM-thin&lt;/h2&gt;
&lt;p&gt;LVM with thin provisioning. You allocate a pool, then create thin volumes that share space.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Physical Disk (500GB)
└── Volume Group: pve
    └── Thin Pool: data (400GB allocated)
        ├── VM 100 disk (100GB virtual, 20GB actual)
        ├── VM 101 disk (100GB virtual, 35GB actual)
        └── VM 102 disk (100GB virtual, 15GB actual)
        → Total actual usage: 70GB in 400GB pool
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;LVM-thin Advantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Thin provisioning&lt;/strong&gt;: Allocate more than you have, pay for what you use&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshots&lt;/strong&gt;: LVM snapshots work (with caveats)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Speed&lt;/strong&gt;: Direct block access, no filesystem overhead&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Low memory&lt;/strong&gt;: No significant RAM overhead&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;LVM-thin Disadvantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No checksums&lt;/strong&gt;: Data corruption is silent&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshot overhead&lt;/strong&gt;: Snapshots slow down writes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pool can fill&lt;/strong&gt;: Over-provisioning requires monitoring&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Migration complexity&lt;/strong&gt;: Moving thin volumes isn&apos;t trivial&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Default LVM-thin Setup&lt;/h3&gt;
&lt;p&gt;Proxmox installer creates this automatically:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check LVM-thin pool
lvs
# NAME    VG  Attr       LSize
# data    pve twi-aotz-- 400g

# Check thin pool usage
lvs -o+data_percent
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When LVM-thin Pool Fills&lt;/h3&gt;
&lt;p&gt;This is the danger zone. If your thin pool hits 100%, VMs pause or corrupt.&lt;/p&gt;
&lt;p&gt;Monitor it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check usage
lvs -o name,size,data_percent pve/data

# Set up alert (add to cron)
USAGE=$(lvs --noheadings -o data_percent pve/data | tr -d &apos; %&apos;)
USAGE=${USAGE%%.*}   # data_percent reports decimals; integer compare needs them gone
if [ &quot;$USAGE&quot; -gt 80 ]; then
    echo &quot;LVM thin pool at ${USAGE}%&quot; | mail -s &quot;Storage Alert&quot; admin@example.com
fi
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;ZFS&lt;/h2&gt;
&lt;p&gt;My preferred choice for most deployments. ZFS is a filesystem and volume manager combined.&lt;/p&gt;
&lt;h3&gt;ZFS Advantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Checksums&lt;/strong&gt;: Every block is verified, silent corruption is detected&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshots&lt;/strong&gt;: Instant, cheap, no performance penalty during creation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compression&lt;/strong&gt;: lz4 compression is basically free (often faster than uncompressed!)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Send/Receive&lt;/strong&gt;: Efficient replication to another system&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Self-healing&lt;/strong&gt;: With redundancy, bad blocks are automatically repaired&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;ZFS Disadvantages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RAM hungry&lt;/strong&gt;: Wants 1GB+ per TB of storage for optimal ARC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU for compression&lt;/strong&gt;: Minimal with lz4, noticeable with zstd&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complexity&lt;/strong&gt;: More knobs to understand&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No shrink&lt;/strong&gt;: Can&apos;t reduce pool size&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;ZFS Pool Status&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Pool health
zpool status
#   pool: rpool
#   state: ONLINE
#   config:
#     NAME         STATE     READ WRITE CKSUM
#     rpool        ONLINE       0     0     0
#       nvme0n1p3  ONLINE       0     0     0

# Space usage
zfs list
# NAME                     USED  AVAIL  REFER  MOUNTPOINT
# rpool                    120G   280G    96K  /rpool
# rpool/ROOT               50G    280G    96K  /rpool/ROOT
# rpool/data               70G    280G    96K  /rpool/data

# Check compression ratio
zfs get compressratio rpool/data
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Tuning ZFS for Proxmox&lt;/h3&gt;
&lt;p&gt;Limit ARC to leave RAM for VMs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check current ARC size
arc_summary | grep &quot;ARC size&quot;

# Limit to 4GB (adjust based on your RAM)
echo &quot;options zfs zfs_arc_max=4294967296&quot; &amp;gt; /etc/modprobe.d/zfs.conf
update-initramfs -u
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rule of thumb: Give ARC 1GB per TB of storage, minimum 1GB, maximum 50% of RAM.&lt;/p&gt;
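&lt;p&gt;The rule of thumb is easy to express as arithmetic. A sketch with a made-up function name, working in whole GiB:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1 GiB of ARC per TiB of storage, minimum 1 GiB, capped at half of RAM
arc_gib() {
  storage_tb=$1
  ram_gb=$2
  want=$storage_tb
  half=$((ram_gb / 2))
  if [ $want -lt 1 ]; then want=1; fi
  if [ $want -gt $half ]; then want=$half; fi
  echo $want
}

arc_gib 4 16                       # prints 4
echo $((4 * 1024 * 1024 * 1024))   # 4294967296, the zfs_arc_max value for 4 GiB
&lt;/code&gt;&lt;/pre&gt;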
&lt;h3&gt;Adding ZFS Storage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Create new pool on separate disk
zpool create -o ashift=12 tank /dev/sdb

# Enable compression
zfs set compression=lz4 tank

# Create dataset for VMs
zfs create tank/vms

# Add to Proxmox
pvesm add zfspool tank-vms --pool tank/vms --content images,rootdir
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;My Actual Setup&lt;/h2&gt;
&lt;h3&gt;Single Node Homelab&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;NVMe 500GB (rpool)
├── rpool/ROOT       # Proxmox OS (50GB)
└── rpool/data       # VM disks (ZFS, compression, snapshots)

SATA SSD 1TB (tank)
└── tank/backups     # Backups (directory storage)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Why:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ZFS for VM disks — I want snapshots and checksums&lt;/li&gt;
&lt;li&gt;Separate disk for backups — if rpool dies, backups survive&lt;/li&gt;
&lt;li&gt;Compression saves 20-40% space on typical workloads&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Production Cluster&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;NVMe 256GB (rpool)
└── rpool/ROOT       # OS only, small and fast

2x SATA SSD 1TB (mirror, vmpool)
└── vmpool/data      # VM disks with redundancy

2x HDD 4TB (mirror, backup)
└── backup/proxmox   # Proxmox Backup Server storage
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Why:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Separate OS from VMs — OS disk failure doesn&apos;t lose VMs&lt;/li&gt;
&lt;li&gt;Mirrors for redundancy — single disk failure = no downtime&lt;/li&gt;
&lt;li&gt;HDDs for backups — capacity over speed, write once read rarely&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Snapshots vs Backups&lt;/h2&gt;
&lt;p&gt;This is where people get confused.&lt;/p&gt;
&lt;h3&gt;Snapshots Are Not Backups&lt;/h3&gt;
&lt;p&gt;A snapshot is a point-in-time view stored on the same disk:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create ZFS snapshot
zfs snapshot rpool/data/vm-100-disk-0@before-upgrade

# List snapshots
zfs list -t snapshot

# Rollback (needs -r if newer snapshots exist; they get destroyed)
zfs rollback rpool/data/vm-100-disk-0@before-upgrade
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Snapshots are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Instant&lt;/strong&gt;: No performance penalty to create&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Same disk&lt;/strong&gt;: If disk dies, snapshots die too&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For rollback&lt;/strong&gt;: Made a bad change? Roll back in seconds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Snapshots are NOT:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Off-site&lt;/strong&gt;: They&apos;re on the same physical disk&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disaster recovery&lt;/strong&gt;: Disk failure loses everything&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-term retention&lt;/strong&gt;: Too many snapshots = space + performance issues&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Backups Are Copies Elsewhere&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Proxmox backup (stores on backup storage)
vzdump 100 --storage backup-storage --mode snapshot

# ZFS send to another system
zfs send rpool/data/vm-100-disk-0@backup | ssh backup-server zfs recv tank/backups/vm-100
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Backups are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Off-system&lt;/strong&gt;: Different disk, different machine, different building&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disaster recovery&lt;/strong&gt;: Original dies, restore from backup&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Long-term&lt;/strong&gt;: Keep 30 days, 12 weeks, whatever you need&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use both.&lt;/strong&gt; Snapshots for quick rollbacks (before upgrades, config changes). Backups for disaster recovery.&lt;/p&gt;
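&lt;p&gt;For the before-upgrade habit, timestamped snapshot names keep rollback targets unambiguous. A small sketch; the helper name is made up:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Build a name like dataset@before-upgrade-20250808-0930
snapname() {
  echo $1@before-$2-$(date +%Y%m%d-%H%M)
}

# zfs snapshot $(snapname rpool/data/vm-100-disk-0 upgrade)
&lt;/code&gt;&lt;/pre&gt;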
&lt;h2&gt;Storage Performance&lt;/h2&gt;
&lt;h3&gt;Testing Your Storage&lt;/h3&gt;
&lt;p&gt;Before putting workloads on storage, benchmark it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Install fio
apt install fio

# Random 4K writes (database-like)
fio --name=rand-write --ioengine=libaio --iodepth=32 --rw=randwrite --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=60 --group_reporting --filename=/rpool/data/test.fio

# Sequential writes (backup-like)
fio --name=seq-write --ioengine=libaio --iodepth=1 --rw=write --bs=1m --direct=1 --size=4G --numjobs=1 --runtime=60 --group_reporting --filename=/rpool/data/test.fio

# Cleanup
rm /rpool/data/test.fio
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Typical numbers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;NVMe: 500K+ IOPS random, 3GB/s+ sequential&lt;/li&gt;
&lt;li&gt;SATA SSD: 50K IOPS random, 500MB/s sequential&lt;/li&gt;
&lt;li&gt;HDD: 150 IOPS random, 150MB/s sequential&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Fast Often Means Fragile&lt;/h3&gt;
&lt;p&gt;High-performance storage often sacrifices safety:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;NVMe without power loss protection&lt;/strong&gt;: Data corruption on power loss&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write caching without battery backup&lt;/strong&gt;: Same problem&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consumer SSDs&lt;/strong&gt;: Not designed for write-heavy workloads&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For VMs that matter, use enterprise SSDs with power loss protection or ZFS with a proper setup (mirrors, proper RAM, UPS).&lt;/p&gt;
&lt;h2&gt;Storage Migration&lt;/h2&gt;
&lt;p&gt;Need to move VMs between storage backends?&lt;/p&gt;
&lt;h3&gt;Online Migration (VM Running)&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Move disk to different storage
qm move_disk 100 scsi0 target-storage

# Or via Web UI: VM → Hardware → Disk → Move Storage
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Offline Migration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;# Stop VM
qm stop 100

# Back up to a file, then restore it onto the target storage
# (qm has no export/import subcommands; vzdump + qmrestore is the supported path)
vzdump 100 --dumpdir /tmp --compress zstd
qmrestore /tmp/vzdump-qemu-100-*.vma.zst 100 --storage target-storage --force
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;ZFS Send/Receive&lt;/h3&gt;
&lt;p&gt;For ZFS-to-ZFS, this is most efficient:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Send to remote
zfs send rpool/data/vm-100-disk-0@migrate | ssh target-host zfs recv tank/data/vm-100-disk-0
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Snapshots are not backups. And fast often means fragile.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Storage is where data lives. Get it wrong and you lose everything. The temptation is to optimize for speed — NVMe everything, no redundancy, maximum performance.&lt;/p&gt;
&lt;p&gt;Then a disk fails. Or worse, corrupts silently. And you discover that your snapshots were on the same disk that died.&lt;/p&gt;
&lt;p&gt;My approach:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;ZFS for data that matters&lt;/strong&gt; — checksums catch corruption&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mirrors for production&lt;/strong&gt; — single disk failure = no panic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Separate backup storage&lt;/strong&gt; — not on the same disk, not on the same host&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test restores&lt;/strong&gt; — a backup you haven&apos;t restored is a backup you hope works&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The boring, redundant setup survives. The fast, minimal setup survives until it doesn&apos;t.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>homelab</category><category>storage</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Post-Install Baseline: Users, SSH, Firewall, Updates, and Hardening</title><link>https://ashimov.com/posts/proxmox-security/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-security/</guid><description>Essential Proxmox security hardening after installation. Covers user management, SSH key-only access, host firewall configuration, automatic updates, and why security is easier to implement now than later.</description><pubDate>Tue, 05 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A fresh Proxmox install works. You can create VMs, manage storage, access the web UI. But &quot;works&quot; isn&apos;t &quot;secure.&quot; The default configuration prioritizes convenience over hardening. That&apos;s fine for the installer — you need access to finish setup. It&apos;s not fine for production.&lt;/p&gt;
&lt;p&gt;Security is easier to implement now, in the first hour, than &quot;someday later.&quot; Later never comes. And when it does, it&apos;s usually because something bad happened.&lt;/p&gt;
&lt;p&gt;This is the post-install hardening I do on every Proxmox host before it runs any workload.&lt;/p&gt;
&lt;h2&gt;User Management&lt;/h2&gt;
&lt;h3&gt;Stop Using Root for Everything&lt;/h3&gt;
&lt;p&gt;The web UI logs you in as root. SSH defaults to root. This works, but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Root has no audit trail (who did what?)&lt;/li&gt;
&lt;li&gt;Root can destroy everything with one typo&lt;/li&gt;
&lt;li&gt;Shared root password = no accountability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Create personal admin accounts:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create user with admin access
useradd -m -s /bin/bash admin
passwd admin

# Add to sudo group (a minimal Proxmox install ships without sudo)
apt install sudo
usermod -aG sudo admin
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now add this user to Proxmox:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Create Proxmox user (realm = pam for Linux users)
pveum user add admin@pam

# Grant Administrator role
pveum acl modify / --users admin@pam --roles Administrator
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can now log into the web UI as &lt;code&gt;admin@pam&lt;/code&gt; instead of root.&lt;/p&gt;
&lt;h3&gt;Proxmox Authentication Realms&lt;/h3&gt;
&lt;p&gt;Proxmox has multiple authentication realms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;pam&lt;/strong&gt;: Linux system users. Best for admins who also need SSH.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pve&lt;/strong&gt;: Proxmox internal users. Web UI only, no SSH access.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LDAP/AD&lt;/strong&gt;: Enterprise directory integration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For small deployments, PAM users are simplest. One password for SSH and web UI.&lt;/p&gt;
&lt;h3&gt;Two-Factor Authentication&lt;/h3&gt;
&lt;p&gt;Enable 2FA for all admin accounts:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Web UI → Datacenter → Permissions → Two Factor&lt;/li&gt;
&lt;li&gt;Add TOTP for your user&lt;/li&gt;
&lt;li&gt;Scan QR code with authenticator app&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This protects against password compromise. Even if someone gets your password, they need the TOTP code.&lt;/p&gt;
&lt;h2&gt;SSH Hardening&lt;/h2&gt;
&lt;p&gt;Default SSH is password authentication as root. Every botnet on the internet is scanning for this.&lt;/p&gt;
&lt;h3&gt;Generate SSH Keys&lt;/h3&gt;
&lt;p&gt;On your workstation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ssh-keygen -t ed25519 -C &quot;admin@proxmox&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy to the server:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ssh-copy-id admin@pve1.lab.local
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Harden sshd_config&lt;/h3&gt;
&lt;p&gt;Edit &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Disable root login
PermitRootLogin no

# Disable password authentication
PasswordAuthentication no

# Only allow specific users
AllowUsers admin

# Use only strong algorithms
KexAlgorithms curve25519-sha256@libssh.org,diffie-hellman-group16-sha512
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com

# Reduce login grace time
LoginGraceTime 30

# Limit authentication attempts
MaxAuthTries 3

# Disable unused features
X11Forwarding no
AllowTcpForwarding no
AllowAgentForwarding no
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply changes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;systemctl restart sshd
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Test before disconnecting.&lt;/strong&gt; Open a new terminal, verify you can log in with your key. Don&apos;t lock yourself out.&lt;/p&gt;
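&lt;p&gt;&lt;code&gt;sshd -t&lt;/code&gt; validates the syntax before you restart. Beyond syntax, the two settings that matter most can be audited with grep; a sketch with made-up helper names:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Succeed if FILE ($2) sets DIRECTIVE ($1) to the word no
ok_no() {
  grep -i ^$1 $2 | grep -qiw no
}

audit_sshd() {
  cfg=$1
  bad=0
  if ! ok_no PermitRootLogin $cfg; then
    echo FAIL: root login still permitted
    bad=1
  fi
  if ! ok_no PasswordAuthentication $cfg; then
    echo FAIL: password auth still enabled
    bad=1
  fi
  if [ $bad -eq 0 ]; then echo PASS; fi
  return $bad
}

# audit_sshd /etc/ssh/sshd_config
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run it against &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt; after the restart; anything other than PASS means the hardening did not land.&lt;/p&gt;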
&lt;h3&gt;Fail2Ban (Optional but Recommended)&lt;/h3&gt;
&lt;p&gt;Even with key-only auth, bots still try. Fail2ban reduces log noise:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apt install fail2ban

# Create local config
cat &amp;gt; /etc/fail2ban/jail.local &amp;lt;&amp;lt; &apos;EOF&apos;
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600
EOF

systemctl enable --now fail2ban
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check banned IPs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;fail2ban-client status sshd
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Host Firewall&lt;/h2&gt;
&lt;p&gt;Proxmox VMs get their own firewall (we&apos;ll cover that later). But the host itself needs protection too.&lt;/p&gt;
&lt;h3&gt;Proxmox Built-in Firewall&lt;/h3&gt;
&lt;p&gt;Proxmox has a built-in firewall. Enable it for the datacenter and node:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Enable firewall at datacenter level
pvesh set /cluster/firewall/options --enable 1

# Enable for this node
pvesh set /nodes/pve1/firewall/options --enable 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or via Web UI: Datacenter → Firewall → Options → Enable: Yes&lt;/p&gt;
&lt;h3&gt;Default Policies&lt;/h3&gt;
&lt;p&gt;Allow what you need first, then set the default to drop. In that order: flip the policy before the allow rules exist and you cut yourself off.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Allow SSH
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 22 --proto tcp --comment &quot;SSH&quot;

# Allow Proxmox web UI
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 8006 --proto tcp --comment &quot;Proxmox Web UI&quot;

# Allow ICMP (ping)
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --proto icmp --comment &quot;ICMP&quot;

# Established and related traffic is allowed automatically: the firewall is stateful

# Finally, set the default input policy to DROP
pvesh set /cluster/firewall/options --policy_in DROP
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Management Network Restriction&lt;/h3&gt;
&lt;p&gt;Better: restrict management access to specific subnets:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Only allow SSH from management network
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 22 --proto tcp --source 10.0.0.0/24 --comment &quot;SSH from mgmt&quot;

# Only allow web UI from management network
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 8006 --proto tcp --source 10.0.0.0/24 --comment &quot;Web UI from mgmt&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Cluster Communication&lt;/h3&gt;
&lt;p&gt;If you&apos;re clustering, allow inter-node traffic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Corosync
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 5405:5412 --proto udp --source 10.0.0.0/24 --comment &quot;Corosync&quot;

# Live migration
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 60000:60050 --proto tcp --source 10.0.0.0/24 --comment &quot;Migration&quot;

# Proxmox API from cluster peers (pvedaemon on port 85 binds to localhost only)
pvesh create /cluster/firewall/rules --action ACCEPT --type in --enable 1 --dport 8006 --proto tcp --source 10.0.0.0/24 --comment &quot;API&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Automatic Updates&lt;/h2&gt;
&lt;p&gt;Security updates shouldn&apos;t wait for you to remember. Automate them.&lt;/p&gt;
&lt;h3&gt;Unattended Upgrades&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;apt install unattended-upgrades apt-listchanges

# Enable automatic updates
dpkg-reconfigure -plow unattended-upgrades
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Configure &lt;code&gt;/etc/apt/apt.conf.d/50unattended-upgrades&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Unattended-Upgrade::Allowed-Origins {
    &quot;${distro_id}:${distro_codename}&quot;;
    &quot;${distro_id}:${distro_codename}-security&quot;;
    &quot;${distro_id}:${distro_codename}-updates&quot;;
    &quot;Proxmox:${distro_codename}&quot;;
};

// Email notification
Unattended-Upgrade::Mail &quot;admin@example.com&quot;;
Unattended-Upgrade::MailReport &quot;on-change&quot;;

// Don&apos;t auto-reboot (hypervisor needs planned reboots)
Unattended-Upgrade::Automatic-Reboot &quot;false&quot;;

// Remove unused dependencies
Unattended-Upgrade::Remove-Unused-Dependencies &quot;true&quot;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Manual Update Workflow&lt;/h3&gt;
&lt;p&gt;For major updates or kernel changes, manual process is safer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check what will be updated
apt update
apt list --upgradable

# Apply updates
apt full-upgrade

# Check if reboot needed
cat /var/run/reboot-required 2&amp;gt;/dev/null &amp;amp;&amp;amp; echo &quot;Reboot required&quot;

# Check Proxmox version
pveversion -v
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule maintenance windows. Don&apos;t reboot during business hours if you can avoid it.&lt;/p&gt;
&lt;h2&gt;Additional Hardening&lt;/h2&gt;
&lt;h3&gt;Disable Unused Services&lt;/h3&gt;
&lt;p&gt;Check what&apos;s listening:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ss -tlnp
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you&apos;re not using Spice or VNC consoles directly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Don&apos;t disable if you use them!
# systemctl disable --now spiceproxy
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Kernel Parameters&lt;/h3&gt;
&lt;p&gt;Add to &lt;code&gt;/etc/sysctl.d/99-security.conf&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Disable IP forwarding (enable if needed for routing VMs)
# net.ipv4.ip_forward = 0

# Ignore ICMP redirects
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0

# Don&apos;t send ICMP redirects
net.ipv4.conf.all.send_redirects = 0

# Enable SYN flood protection
net.ipv4.tcp_syncookies = 1

# Log martian packets
net.ipv4.conf.all.log_martians = 1

# Ignore broadcast pings
net.ipv4.icmp_echo_ignore_broadcasts = 1

# Restrict dmesg
kernel.dmesg_restrict = 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sysctl -p /etc/sysctl.d/99-security.conf
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Filesystem Hardening&lt;/h3&gt;
&lt;p&gt;Mount options for security:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Check current mounts
mount | grep -E &apos;ext4|zfs&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For non-ZFS systems, consider noexec on /tmp. For ZFS, Proxmox handles mount options appropriately.&lt;/p&gt;
&lt;h3&gt;Audit Logging&lt;/h3&gt;
&lt;p&gt;Install and configure auditd for compliance requirements:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apt install auditd audispd-plugins

# Basic rules
cat &amp;gt;&amp;gt; /etc/audit/rules.d/audit.rules &amp;lt;&amp;lt; &apos;EOF&apos;
# Monitor sudo usage
-w /etc/sudoers -p wa -k sudoers
-w /etc/sudoers.d/ -p wa -k sudoers

# Monitor SSH config
-w /etc/ssh/sshd_config -p wa -k sshd

# Monitor user/group changes
-w /etc/passwd -p wa -k passwd
-w /etc/group -p wa -k group
EOF

systemctl restart auditd
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Backup Your Config&lt;/h2&gt;
&lt;p&gt;Before you forget what you configured:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Backup host config
tar -czf /root/pve-config-$(date +%Y%m%d).tar.gz /etc/pve /etc/ssh /etc/apt

# Copy off-host
scp /root/pve-config-*.tar.gz backup-server:/backups/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Do this after any significant configuration change.&lt;/p&gt;
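&lt;p&gt;To keep this from depending on memory, the same command can run from a cron drop-in. A sketch assuming the paths above; note the backslash-escaped &lt;code&gt;%&lt;/code&gt;, which cron otherwise treats as a newline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /etc/cron.d/pve-config-backup: weekly host config backup, Sundays 03:00
0 3 * * 0 root tar -czf /root/pve-config-$(date +\%Y\%m\%d).tar.gz /etc/pve /etc/ssh /etc/apt
&lt;/code&gt;&lt;/pre&gt;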
&lt;h2&gt;Security Checklist&lt;/h2&gt;
&lt;p&gt;Run through this after every install:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[ ] Non-root admin user created
[ ] Admin user has 2FA enabled
[ ] SSH key-only authentication
[ ] Root SSH login disabled
[ ] Host firewall enabled
[ ] Management access restricted to trusted networks
[ ] Unattended security updates configured
[ ] Fail2ban installed (optional)
[ ] Initial config backed up
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What I Don&apos;t Do&lt;/h2&gt;
&lt;p&gt;Some hardening guides go overboard. Here&apos;s what I skip and why:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Change SSH port&lt;/strong&gt;: Security through obscurity. Attackers scan all ports. Fail2ban is more effective.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Disable IPv6&lt;/strong&gt;: If you&apos;re not using it, fine. But disabling often breaks things in unexpected ways.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Complex MAC/SELinux policies&lt;/strong&gt;: On a hypervisor, the VMs are the workload. Host runs minimal services. Default policies are usually sufficient.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Intrusion detection (OSSEC, etc.)&lt;/strong&gt;: Good for compliance. For homelab, log monitoring and backups are more practical.&lt;/p&gt;
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Security is easier to do now than &quot;someday later.&quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A fresh install is a blank slate. Every change you make is documented in your head. Wait a month, and you&apos;ve forgotten what&apos;s default and what you configured. Wait a year, and the system is running workloads you&apos;re afraid to touch.&lt;/p&gt;
&lt;p&gt;The first hour after install is when hardening is easiest:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You haven&apos;t forgotten what&apos;s there&lt;/li&gt;
&lt;li&gt;No workloads depend on insecure defaults&lt;/li&gt;
&lt;li&gt;Changes don&apos;t require maintenance windows&lt;/li&gt;
&lt;li&gt;Documentation is fresh&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do it now. These configurations survive upgrades. The 30 minutes you spend today prevent the 3 AM incident next year.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>firewall</category><category>security</category><author>berik@ashimov.com (Berik Ashimov)</author></item><item><title>Why I Chose Proxmox (and How to Install It the Boring, Correct Way)</title><link>https://ashimov.com/posts/proxmox-install/</link><guid isPermaLink="true">https://ashimov.com/posts/proxmox-install/</guid><description>Proxmox VE installation done right. Covers disk layout decisions, ZFS vs LVM vs ext4, network configuration, repository setup, and why the boring install is the one that survives upgrades.</description><pubDate>Fri, 01 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve run ESXi, Hyper-V, oVirt, and plain KVM with libvirt. They all work. But when Broadcom acquired VMware and started the licensing chaos, I moved everything to Proxmox. Not because it&apos;s trendy — because it&apos;s boring in the best way.&lt;/p&gt;
&lt;p&gt;Proxmox is Debian with a web UI and good defaults. When things break (and they will), you&apos;re debugging Linux, not a proprietary hypervisor. The skills transfer. The logs make sense. The community has seen your problem before.&lt;/p&gt;
&lt;p&gt;This is how to install Proxmox in a way that doesn&apos;t create pain later.&lt;/p&gt;
&lt;h2&gt;Why Proxmox&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;It&apos;s just Debian.&lt;/strong&gt; Under the web UI, it&apos;s apt, systemd, and standard Linux networking. When the UI doesn&apos;t do what you need, drop to the shell.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ZFS first-class.&lt;/strong&gt; Built-in ZFS support with proper integration. Snapshots, replication, compression — all accessible from the UI.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No licensing games.&lt;/strong&gt; The &quot;enterprise&quot; repository requires a subscription, but the no-subscription repository works fine. You&apos;re not crippled without paying.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Clustering is free.&lt;/strong&gt; Three nodes, shared storage, HA — no extra licenses.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Both VMs and containers.&lt;/strong&gt; KVM for full VMs, LXC for lightweight containers. Same management interface.&lt;/p&gt;
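&lt;p&gt;Both live behind the same CLI family, too. A minimal sketch (the VM ID, container ID, and template filename are placeholders — adjust to what&apos;s actually on your node):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# KVM virtual machine
qm create 100 --name test-vm --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0

# LXC container (template must already be downloaded to local storage)
pct create 200 local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst \
  --hostname test-ct --memory 512 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp
&lt;/code&gt;&lt;/pre&gt;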
&lt;h2&gt;Before You Install&lt;/h2&gt;
&lt;h3&gt;Hardware Considerations&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;CPU:&lt;/strong&gt; Intel or AMD with virtualization extensions (VT-x/AMD-V). Check BIOS — these are sometimes disabled by default.&lt;/p&gt;
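&lt;p&gt;You can confirm the extensions are visible before installing — boot any Linux live USB and check (a count of 0 means they&apos;re absent or disabled in BIOS):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Count CPU threads advertising VT-x (vmx) or AMD-V (svm)
grep -Ec &apos;(vmx|svm)&apos; /proc/cpuinfo
&lt;/code&gt;&lt;/pre&gt;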
&lt;p&gt;&lt;strong&gt;RAM:&lt;/strong&gt; Minimum 8GB for the host, but realistically 32GB+ for anything useful. ECC is recommended for ZFS but not required.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Storage:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Boot drive: SSD, 32GB minimum (128GB comfortable)&lt;/li&gt;
&lt;li&gt;VM storage: Separate drive(s), SSD strongly preferred&lt;/li&gt;
&lt;li&gt;ZFS: Wants multiple drives for redundancy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Network:&lt;/strong&gt; Dedicated NIC for management, additional for VM traffic if you&apos;re serious.&lt;/p&gt;
&lt;h3&gt;The Decision: ZFS vs LVM vs ext4&lt;/h3&gt;
&lt;p&gt;This is the first fork in the road. Choose wrong and you&apos;ll reinstall later.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ZFS (my choice for most cases):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Built-in checksums, catches silent corruption&lt;/li&gt;
&lt;li&gt;Snapshots are instant and cheap&lt;/li&gt;
&lt;li&gt;Compression saves space with minimal CPU overhead&lt;/li&gt;
&lt;li&gt;Replication to another node is trivial&lt;/li&gt;
&lt;li&gt;Requires more RAM (1GB per TB of storage, roughly)&lt;/li&gt;
&lt;li&gt;Single-disk ZFS works fine, just no redundancy&lt;/li&gt;
&lt;/ul&gt;
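&lt;p&gt;The snapshot and replication claims are easy to see at the command line. A sketch — dataset names are illustrative, and the receive side assumes a node &lt;code&gt;pve2&lt;/code&gt; with a matching pool:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Instant snapshot of the VM dataset
zfs snapshot rpool/data@before-change

# Roll back if the change went badly
zfs rollback rpool/data@before-change

# Replicate to another node over SSH
zfs send rpool/data@before-change | ssh pve2 zfs recv -F rpool/data
&lt;/code&gt;&lt;/pre&gt;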
&lt;p&gt;&lt;strong&gt;LVM-thin:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Less RAM overhead&lt;/li&gt;
&lt;li&gt;Snapshots work, but they&apos;re less elegant than ZFS snapshots&lt;/li&gt;
&lt;li&gt;No checksums&lt;/li&gt;
&lt;li&gt;Familiar if you know LVM&lt;/li&gt;
&lt;li&gt;Good choice for simple setups or low-RAM systems&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;ext4/XFS on raw disk:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simplest&lt;/li&gt;
&lt;li&gt;No snapshots without external tools&lt;/li&gt;
&lt;li&gt;Fine for the boot drive if VMs live elsewhere&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;My recommendation:&lt;/strong&gt; ZFS unless you have less than 16GB RAM or specific reasons not to. The data integrity alone is worth it.&lt;/p&gt;
&lt;h2&gt;Installation&lt;/h2&gt;
&lt;p&gt;Download the ISO from proxmox.com. Write it to USB with &lt;code&gt;dd&lt;/code&gt;, Rufus, or Etcher.&lt;/p&gt;
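&lt;p&gt;If you use &lt;code&gt;dd&lt;/code&gt;, identify the target device first — writing to the wrong disk is unrecoverable. &lt;code&gt;/dev/sdX&lt;/code&gt; and the ISO filename below are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Identify the USB stick by size and model
lsblk -o NAME,SIZE,MODEL

# Write the ISO (use the whole device, not a partition)
dd if=proxmox-ve.iso of=/dev/sdX bs=4M status=progress conv=fsync
&lt;/code&gt;&lt;/pre&gt;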
&lt;h3&gt;Boot and Initial Screens&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Boot from USB&lt;/li&gt;
&lt;li&gt;Select &quot;Install Proxmox VE&quot;&lt;/li&gt;
&lt;li&gt;Accept EULA&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Target Disk Selection&lt;/h3&gt;
&lt;p&gt;This is where most people make mistakes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Single disk:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Target Harddisk: /dev/sda
Filesystem: zfs (RAID0)   # Yes, RAID0 for single disk
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;RAID0 on one disk sounds wrong, but it&apos;s just &quot;use this one disk with ZFS.&quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Multiple disks for redundancy:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Filesystem: zfs (RAID1)   # Mirror, needs 2+ disks
# or
Filesystem: zfs (RAIDZ-1) # Needs 3+ disks
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Advanced options (click &quot;Options&quot; button):&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ashift: 12           # Correct for most SSDs (4K sectors)
compress: lz4        # Basically free compression
checksum: on         # Never turn this off
copies: 1            # 2 for paranoid, uses 2x space
hdsize: &amp;lt;leave blank or set limit&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you&apos;re using NVMe, &lt;code&gt;ashift=12&lt;/code&gt; is still correct for most drives.&lt;/p&gt;
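&lt;p&gt;After the install you can confirm what ashift the pool actually got (&lt;code&gt;zpool get ashift&lt;/code&gt; works on recent OpenZFS; older versions need &lt;code&gt;zdb&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;zpool get ashift rpool

# On older ZFS versions:
zdb -C rpool | grep ashift
&lt;/code&gt;&lt;/pre&gt;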
&lt;h3&gt;Network Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;Management Interface: eno1 (or whatever your NIC is)
Hostname (FQDN): pve1.lab.local
IP Address: 192.168.1.10
Netmask: 255.255.255.0
Gateway: 192.168.1.1
DNS Server: 192.168.1.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Use a static IP.&lt;/strong&gt; DHCP for a hypervisor is asking for trouble.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FQDN matters.&lt;/strong&gt; Clustering uses hostnames. Get it right now or fix it painfully later.&lt;/p&gt;
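&lt;p&gt;Worth verifying right after first boot — Proxmox expects the hostname to resolve to the management IP, not to a loopback address (values below match the example config):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;hostname --fqdn     # should print pve1.lab.local
getent hosts pve1   # should resolve to 192.168.1.10
cat /etc/hosts      # the entry should map the real IP, not 127.0.1.1
&lt;/code&gt;&lt;/pre&gt;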
&lt;h3&gt;Timezone and Password&lt;/h3&gt;
&lt;p&gt;Set your timezone. Set a strong root password. You&apos;ll create non-root users later.&lt;/p&gt;
&lt;h3&gt;Installation Completes&lt;/h3&gt;
&lt;p&gt;Remove USB, reboot. Access web UI at &lt;code&gt;https://&amp;lt;ip&amp;gt;:8006&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Post-Install: Repository Configuration&lt;/h2&gt;
&lt;p&gt;The default install points to the enterprise repository, which requires a subscription. Without one, you&apos;ll see errors during updates. Fix this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Disable enterprise repository
mv /etc/apt/sources.list.d/pve-enterprise.list /etc/apt/sources.list.d/pve-enterprise.list.disabled

# Add no-subscription repository
echo &quot;deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription&quot; &amp;gt; /etc/apt/sources.list.d/pve-no-subscription.list

# Update
apt update &amp;amp;&amp;amp; apt full-upgrade -y
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For Ceph (if you&apos;re using it):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Same pattern
mv /etc/apt/sources.list.d/ceph.list /etc/apt/sources.list.d/ceph.list.disabled
echo &quot;deb http://download.proxmox.com/debian/ceph-quincy bookworm no-subscription&quot; &amp;gt; /etc/apt/sources.list.d/ceph-no-subscription.list
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Verify Installation&lt;/h2&gt;
&lt;h3&gt;Check ZFS&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;zpool status
# Should show your rpool, healthy

zfs list
# Should show rpool and rpool/data
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check VM Storage&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;pvesm status
# Should show local, local-lvm (or local-zfs)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Networking&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;ip a
# Should show vmbr0 bridge with your management IP
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Check Services&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;systemctl status pvedaemon
systemctl status pveproxy
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Initial Configuration via Web UI&lt;/h2&gt;
&lt;p&gt;Navigate to &lt;code&gt;https://&amp;lt;ip&amp;gt;:8006&lt;/code&gt;, login as root.&lt;/p&gt;
&lt;h3&gt;Datacenter → Storage&lt;/h3&gt;
&lt;p&gt;You&apos;ll see:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;local&lt;/code&gt;: ISO images, container templates, backups (directory storage)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;local-lvm&lt;/code&gt; or &lt;code&gt;local-zfs&lt;/code&gt;: VM disks (block storage)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is fine to start. We&apos;ll discuss storage architecture later.&lt;/p&gt;
&lt;h3&gt;Node → System → DNS&lt;/h3&gt;
&lt;p&gt;Verify DNS is correct. Add a search domain if needed.&lt;/p&gt;
&lt;h3&gt;Node → System → Time&lt;/h3&gt;
&lt;p&gt;Verify timezone. NTP is configured by default (systemd-timesyncd).&lt;/p&gt;
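&lt;p&gt;A quick way to confirm time sync is actually working:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;timedatectl status
# Look for: &quot;System clock synchronized: yes&quot; and &quot;NTP service: active&quot;
&lt;/code&gt;&lt;/pre&gt;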
&lt;h3&gt;Subscription Nag&lt;/h3&gt;
&lt;p&gt;You&apos;ll see a subscription popup on login. This is expected without a subscription. It&apos;s just a nag, not a limitation.&lt;/p&gt;
&lt;p&gt;To remove it (optional, slightly hacky):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# This modifies the JS file - breaks on updates, needs reapply
sed -Ezi.bak &quot;s/(Ext\.Msg\.show\(\{[^}]+title: gettext\(&apos;No valid sub)/void\(\{ \/\/ \1/g&quot; /usr/share/javascript/proxmox-widget-toolkit/proxmoxlib.js
systemctl restart pveproxy
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I don&apos;t bother. It&apos;s one click to dismiss.&lt;/p&gt;
&lt;h2&gt;The Disk Layout I Actually Use&lt;/h2&gt;
&lt;p&gt;For a single-node homelab:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/dev/nvme0n1 (500GB NVMe)
  └── ZFS: rpool
      ├── rpool/ROOT/pve-1    # Proxmox OS
      └── rpool/data          # VM disks

/dev/sda (2TB SATA SSD) - optional
  └── ZFS: tank
      └── tank/backups        # Backup storage
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For a production cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/dev/nvme0n1 (256GB NVMe)
  └── ZFS: rpool              # OS only, small and fast

/dev/sda, /dev/sdb (2x 1TB SSD)
  └── ZFS mirror: vmpool
      └── vmpool/data         # VM disks with redundancy

/dev/sdc, /dev/sdd (2x 4TB HDD)
  └── ZFS mirror: backup
      └── backup/pbs          # Proxmox Backup Server storage
&lt;/code&gt;&lt;/pre&gt;
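&lt;p&gt;The installer only creates &lt;code&gt;rpool&lt;/code&gt; — you build the extra pools afterwards. A sketch for the mirrored VM pool above (device names and the storage ID are examples; prefer &lt;code&gt;/dev/disk/by-id/&lt;/code&gt; paths in real use):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Mirrored SSD pool for VM disks
zpool create -o ashift=12 vmpool mirror /dev/sda /dev/sdb
zfs set compression=lz4 vmpool

# Register it with Proxmox as VM disk storage
pvesm add zfspool vmpool-storage --pool vmpool --content images,rootdir
&lt;/code&gt;&lt;/pre&gt;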
&lt;p&gt;Key principle: &lt;strong&gt;Separate OS from VM storage.&lt;/strong&gt; If your VM pool fills up, your host still boots.&lt;/p&gt;
&lt;h2&gt;Updates: The Boring Part That Matters&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;# Regular updates
apt update &amp;amp;&amp;amp; apt full-upgrade

# Check for kernel updates
pveversion -v

# Reboot if kernel updated
reboot
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before major version upgrades (7.x → 8.x):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read the official upgrade guide completely&lt;/li&gt;
&lt;li&gt;Backup everything&lt;/li&gt;
&lt;li&gt;Test on non-production first&lt;/li&gt;
&lt;li&gt;Run the upgrade checklist script&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;pve7to8 --full  # For 7→8 upgrade, shows potential issues
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What I Wish I Knew&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;ZFS RAM usage.&lt;/strong&gt; ZFS wants RAM for ARC (adaptive replacement cache). Default is up to 50% of RAM. For a VM host, you might want to limit it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Limit ARC to 4GB
echo &quot;options zfs zfs_arc_max=4294967296&quot; &amp;gt; /etc/modprobe.d/zfs.conf
update-initramfs -u
&lt;/code&gt;&lt;/pre&gt;
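&lt;p&gt;The modprobe option only applies after the initramfs rebuild and a reboot. You can inspect (and change) the limit at runtime via sysfs — changes made this way don&apos;t persist:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Current limit in bytes (0 means the built-in default)
cat /sys/module/zfs/parameters/zfs_arc_max

# Apply 4GB immediately, without rebooting
echo 4294967296 &amp;gt; /sys/module/zfs/parameters/zfs_arc_max
&lt;/code&gt;&lt;/pre&gt;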
&lt;p&gt;&lt;strong&gt;Enterprise vs no-subscription repo.&lt;/strong&gt; They&apos;re nearly identical. Enterprise gets updates slightly earlier, that&apos;s all. No-subscription is fine for production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Clustering from the start.&lt;/strong&gt; If you might cluster later, plan for it now. Same network segment, unique hostnames, Corosync-compatible setup.&lt;/p&gt;
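&lt;p&gt;When you do cluster, the commands themselves are short — the planning above is the hard part. A sketch (cluster name and IP are examples):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# On the first node
pvecm create mycluster

# On each additional node, join via the first node&apos;s IP
pvecm add 192.168.1.10

# Verify membership and quorum
pvecm status
&lt;/code&gt;&lt;/pre&gt;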
&lt;p&gt;&lt;strong&gt;Backups are separate.&lt;/strong&gt; Proxmox creates VMs. Proxmox Backup Server (PBS) backs them up. They&apos;re different products that work together. Plan your backup storage accordingly.&lt;/p&gt;
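&lt;p&gt;Even without PBS, &lt;code&gt;vzdump&lt;/code&gt; ships with Proxmox and covers basic backups. A minimal example (the VM ID and storage name are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Snapshot-mode backup of VM 100 to the &apos;local&apos; storage, zstd-compressed
vzdump 100 --mode snapshot --storage local --compress zstd
&lt;/code&gt;&lt;/pre&gt;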
&lt;h2&gt;The Lesson&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The most important thing isn&apos;t &apos;install&apos; — it&apos;s laying the foundation for upgrades.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A Proxmox install takes 10 minutes. The choices you make during those 10 minutes affect the next 5 years. Wrong disk layout? Reinstall. Wrong hostname? Pain when clustering. Wrong storage? Juggling VMs later.&lt;/p&gt;
&lt;p&gt;The boring install is the one that survives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ZFS for data integrity&lt;/li&gt;
&lt;li&gt;Static IP, proper FQDN&lt;/li&gt;
&lt;li&gt;Repository configured correctly&lt;/li&gt;
&lt;li&gt;Separate OS and VM storage&lt;/li&gt;
&lt;li&gt;Documentation of what you did&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Proxmox will upgrade through multiple major versions if you don&apos;t make weird choices at install time. That&apos;s the goal: a hypervisor you forget is there because it just works.&lt;/p&gt;
</content:encoded><category>proxmox</category><category>homelab</category><category>virtualization</category><author>berik@ashimov.com (Berik Ashimov)</author></item></channel></rss>